Slides from the second meeting of the Toronto High Scalability Meetup @ http://www.meetup.com/toronto-high-scalability/
-Basics of High Scalability and High Availability
-Using a CDN to Achieve 99% Offload
-Caching at the Code Layer
2. Who am I?
• Jonathan Keebler @keebler keebler.net
• Built video player for all CTV properties
• Worked on news sites like CP24, CTV, TSN
• CTO, Founder of ScribbleLive
• Bootstrapped a high scalability startup
– Credit card limit wasn’t that high, had to find cheap
ways to handle the load of top tier news sites
3. Sample load test
17 x Windows Server 2008, 2 x Varnish, 4 x nginx, 1 x SQL Server 2008
4. Scalability vs Availability
• Often talked about separately
• Can’t have one without the other
• Let’s talk about the basic building blocks
5. Building blocks
• Content Distribution Network (CDN)
• Load-balancer
• Reverse proxy
• Caching server
• Origin server
9. Monitor or die
• If you aren’t monitoring your stack, you
have NO IDEA what’s going on
• Pingdom/WatchMouse/Gomez not enough
– Don’t help you when you’re trying to figure out
what’s going wrong
– You need actionable metrics
10. Monitor or die
• Outside monitoring e.g. Pingdom, Gomez
– DNS problems, localized problem, SLA
• Inside monitoring e.g. New Relic, CloudWatch,
Server Density
– High latency, CPU spikes, memory crunch, peek-a-boo servers, rogue processes, SQL queries per second, SQL wait time, SQL locks, disk usage, disk IO performance, page file usage, network traffic, requests per second, …
14. Load-balancers
• Bandwidth limits on dedicated boxes are harder to work around
• F5s are great boxes, but have lousy live reporting = you can get into trouble quickly
• Adding/removing servers sucks
• DNS load-balancing sucks for everyone
15. nginx
• Fantastic at handling a massive number of requests (low CPU, low memory)
• Easy to configure and change on-the-fly
• Gzip, modify headers, host names
• Proxy with error intercept
• No query string or IF-statement* support
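The proxy-with-error-intercept setup above can be sketched as a config. This is a minimal sketch: the upstream addresses, ports, and paths are illustrative, not from the talk.

```nginx
# Illustrative front-end config: gzip, header tweaks, and a proxy
# that intercepts origin errors instead of passing them through.
upstream origin {
    server 10.0.0.10:8080;
    server 10.0.0.11:8080;
}

server {
    listen 80;
    gzip on;                                # compress responses on the way out

    location / {
        proxy_pass http://origin;
        proxy_set_header Host $host;        # preserve the original host name
        proxy_intercept_errors on;          # nginx handles origin errors itself
        error_page 500 502 503 504 /maintenance.html;
    }

    location = /maintenance.html {
        root /var/www/static;               # a static page users see on failure
    }
}
```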
16. Varnish
• Caching server but so much more
• Fantastic at handling a massive number of requests (low CPU, low memory)
• Easy to configure and change on-the-fly
• Protect your origin servers
• Deals with errors from origin servers
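A rough VCL sketch of Varnish shielding the origin and absorbing its errors. The backend address, TTL, and grace period here are illustrative assumptions, not values from the talk.

```vcl
vcl 4.0;

backend origin {
    .host = "10.0.0.10";
    .port = "8080";
}

sub vcl_backend_response {
    set beresp.ttl = 60s;      # serve from cache for a minute
    set beresp.grace = 1h;     # keep serving stale objects if the origin is sick
}

sub vcl_backend_error {
    # Origin is down or returned garbage: hand back a short-lived synthetic
    # error instead of hammering the origin on every request.
    set beresp.status = 503;
    set beresp.ttl = 5s;
    return (deliver);
}
```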
17. Origin servers
• Whatever tweaks you make will never help
enough
– e.g. If your disk IO is becoming a problem, it’s
already too late to save you
• Keep them stock so you don’t blow your mind,
easier to deploy
• Handle any query string hacking in Varnish
18. Databases
• No silver bullet
• Two options:
– Shard (split your data between servers)
– Cluster (many boxes working together as one)
• Shards commonly used today
– Lots of work at the code level, no incremental IDs
• Clusters have a single point of failure
– Try upgrading one and tell me they don’t
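Sharding as described above can be sketched as simple hash-based routing. The shard names and the choice of MD5 are illustrative; the point about incremental IDs is that an auto-increment column can't be the shard key, since each shard would mint its own overlapping sequence.

```python
import hashlib

# Illustrative shard pool; in practice these would be connection strings.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(user_id: str) -> str:
    """Deterministically map a key to one shard by hashing it."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

The same key always lands on the same shard, which is what makes lookups possible without a central index.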
21. Basics
• Worldwide network of DNS load-balanced
reverse proxies
• Not magic
• Can achieve 99% offload if you do it right
• Have to understand your requests
22. Market leaders
• Akamai: market leader, $$$, most options, yearly
contracts, pay for GB + request headers
• CloudFront: built on AWS, cheaper, pay-as-you-go, fewer features, new features coming quickly, GB + pay-per-request
• EdgeCast (pay-as-you-go through GoGrid),
CloudFlare (optimizer, security, easy!)
23. Tiered distribution
• More points-of-presence (POPs) = less caching if
your traffic is global
• Need to put a layer of servers between POPs
and your servers
• Sophisticated setups throttle requests
– if 100 come in at same time, only 1 gets
through
24. Cache keys
• Need to have same query string to get cached
result
• Some CDNs can ignore params
– important if you need a random number on the
query string to prevent browser caching
• Cool options: case sensitive/insensitive, cache
differently based on cookie, headers
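If your CDN can't ignore a parameter, you can normalize the cache key yourself at your own proxy layer. A rough VCL sketch, assuming a cache-busting parameter named `rnd` (the name is made up for illustration):

```vcl
sub vcl_recv {
    # Strip the "rnd" cache-buster so all variants share one cache key.
    set req.url = regsuball(req.url, "(\?|&)rnd=[^&]*", "");
    # If the stripped parameter was first, the next one now starts with "&";
    # turn that back into "?".
    set req.url = regsub(req.url, "^([^?]*)&", "\1?");
}
```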
25. Invalidations suck
• Trying to get CDN to drop its cache is hard
– takes a long time to reach all POPs
– triggers thundering herd
– takes out all caching for a bit
• Build the ability to change query strings at the code layer
– e.g. add a version number to JS/CSS URLs; when you roll out, the new URLs break the cache
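The version-on-the-query-string trick can be sketched like this; the function name and version constant are made up for illustration (in practice the version would come from a build number or git hash bumped on every deploy):

```python
# Bust CDN and browser caches by versioning asset URLs at render time.
ASSET_VERSION = "2024.06.01"  # illustrative; set per deploy

def asset_url(path: str) -> str:
    """Append the current deploy version so a rollout yields brand-new URLs."""
    sep = "&" if "?" in path else "?"
    return f"{path}{sep}v={ASSET_VERSION}"

print(asset_url("/js/app.js"))  # /js/app.js?v=2024.06.01
```

No invalidation is needed: old URLs simply stop being requested, and the new ones miss the cache exactly once per POP.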
26. How long to cache for?
• As long as you need, but no longer
• Make sure you think about the error case, i.e. what if an error gets cached
– Some CDNs let you set your own rules for that
– Remember, invalidations suck
28. Thundering herds
• When you roll out or have high latency, all your timeouts align
– Origins get slammed at regular interval by POPs
• Random TTLs are your friend
– Just +/- a few minutes can be a big help
– TIP: drop into inline C in Varnish to do this
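The random-TTL idea is tiny but worth making concrete. A minimal sketch, with the base TTL and jitter window chosen for illustration:

```python
import random

BASE_TTL = 300   # five minutes, illustrative
JITTER = 120     # +/- two minutes of spread

def jittered_ttl() -> int:
    """Spread expiries out so cached objects don't all time out in lockstep."""
    return BASE_TTL + random.randint(-JITTER, JITTER)
```

Every object expires at a slightly different moment, so the origin sees a trickle of refreshes instead of a synchronized stampede.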
29. Don’t build your own*
• You will never be as smart as Akamai/Amazon
• You will never be able to bring on new servers
fast enough to scale
• Spend your time building awesome software
• Build your own caching layer for the POPs (and, just in case, to protect your origin servers)
32. Why do I need this?
• You can’t cache every request
• You can’t cache POST requests
• Protect the database!
• The longer you can go before you have to
shard your database, the better
33. What is it?
• In-process, in-memory caching
• Static variables work great
– TIP: .NET: static variables are scoped in the
thread, WHY?!
• Custom memory stores
• Whatever you want, just not the disk
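The in-process, in-memory idea can be sketched with a module-level dict standing in for the "static variable" store from the slide. All names are illustrative, and this version is deliberately not thread-safe (the later slides add that):

```python
import time

# Module-level store: lives in the process, no network hop, gone on restart.
_cache: dict[str, tuple[float, object]] = {}

def cache_get(key: str):
    """Return the cached value, or None if missing or expired."""
    entry = _cache.get(key)
    if entry is None:
        return None
    expires_at, value = entry
    if time.monotonic() >= expires_at:
        del _cache[key]
        return None
    return value

def cache_set(key: str, value, ttl: float = 60.0) -> None:
    """Store a value with an absolute expiry time."""
    _cache[key] = (time.monotonic() + ttl, value)
```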
34. Isn’t that what Memcached is for?
• Memcached is in-memory BUT so is your database
– Advantages of Memcached over your database:
• Cheaper to replicate
• Fast lookups...if your db sucks
– Disadvantages:
• Still has network latency, higher than db lookup (unless
your db sucks)
• IT’S NOT A DATABASE!
35. Getting started
• Think about your data + classes
• TTLs based on knowledge of your data
• Random TTLs (avoid the thundering herd)
• Use concurrent, thread-safe objects
• Wrap your code in try-catch
– Caching isn’t worth breaking your site for
36. Updating cache
• Use semaphores (that Comp Sci degree is finally going to come in handy)
• Semaphores should always unlock on their own
– Your thread could die/timeout at any time. You don’t want to lock forever
• Use a separate thread for the lookup. Why should one user suffer?
• Using a datetime semaphore is usually the best
– keep a time when the next update will take place
– 1st thread to hit that time, immediately adds some seconds to the time.
Buys itself enough time to do lookup
– Any blocked thread gets cached data. DON’T LOCK
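The datetime-semaphore pattern above can be sketched roughly as follows. Names, the grace window, and the refresh TTL are all made up for illustration; the essential moves are that the first thread past the deadline bumps it forward (buying itself time), refreshes on a separate thread, and every caller, blocked or not, gets the cached value immediately:

```python
import threading
import time

_value = "stale"            # whatever was cached last
_next_update = 0.0          # epoch seconds when a refresh is due
_guard = threading.Lock()   # protects only the check-and-bump, held briefly
REFRESH_GRACE = 30.0        # seconds the refreshing thread buys itself

def _expensive_lookup() -> str:
    time.sleep(0.05)        # stand-in for a slow database query
    return "fresh"

def _refresh() -> None:
    global _value, _next_update
    _value = _expensive_lookup()
    _next_update = time.time() + 60.0   # normal TTL after a good refresh

def get_value() -> str:
    global _next_update
    with _guard:
        due = time.time() >= _next_update
        if due:
            # Push the deadline forward first so no other thread starts
            # a refresh; the lock auto-releases, so nothing can stay stuck.
            _next_update = time.time() + REFRESH_GRACE
    if due:
        threading.Thread(target=_refresh, daemon=True).start()
    return _value           # never block: stale data beats a stalled page
```

If the refreshing thread dies, the bumped deadline simply expires and the next caller takes over, which is the "semaphores should always unlock on their own" property.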
37. Populating cache for first time
• How do you prevent thundering herd before
cache?
• Ok, you may have to lock. But be smart about it.
• Are you sure your database can’t handle it?
• This is where other caching layers help: CDN
throttling, Varnish throttling, Memcached, read-
only databases
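"Lock, but be smart about it" usually means double-checked locking: only one thread pays for the first load, latecomers wait briefly instead of stampeding the database. A minimal sketch with illustrative names:

```python
import threading

_lock = threading.Lock()
_cached = None

def get_or_load(load):
    """Populate the cache exactly once; all other callers reuse the result."""
    global _cached
    if _cached is not None:          # fast path: already populated, no lock
        return _cached
    with _lock:
        if _cached is None:          # re-check: another thread may have won
            _cached = load()
    return _cached
```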
38. Garbage collection
• Keep counters for metrics e.g. how many hits to the cached
object, datetime of last request for that object
• Every X requests (or seconds), run your garbage collection
– Use semaphores
– Don’t get rid of the most used objects
• You are going to collide with running code
– try-catch is your friend
• Don’t be afraid to dump the cache and start over
39. Watch out for references
• If you are storing something in a cache object, you can save a lot of memory by passing a reference to the object
• Don’t forget about the reference
• Watch out for garbage collection trying to destroy it
• Updating cache operation might involve updating an
existing object
40. The curse
• More servers = more caches = less
efficient
• Discipline: can’t throw more servers at the
problem