Slides from the second meeting of the Toronto High Scalability Meetup @ http://www.meetup.com/toronto-high-scalability/
-Basics of High Scalability and High Availability
-Using a CDN to Achieve 99% Offload
-Caching at the Code Layer
2. Who am I?
• Jonathan Keebler @keebler keebler.net
• Built video player for all CTV properties
• Worked on news sites like CP24, CTV, TSN
• CTO, Founder of ScribbleLive
• Bootstrapped a high scalability startup
– Credit card limit wasn’t that high, had to find cheap
ways to handle the load of top tier news sites
3. Sample load test
17 x Windows Server 2008, 2 x Varnish, 4 x nginx, 1 x SQL Server 2008
4. Scalability vs Availability
• Often talked about separately
• Can’t have one without the other
• Let’s talk about the basic building blocks
5. Building blocks
• Content Distribution Network (CDN)
• Load-balancer
• Reverse proxy
• Caching server
• Origin server
9. Monitor or die
• If you aren’t monitoring your stack, you
have NO IDEA what’s going on
• Pingdom/WatchMouse/Gomez not enough
– Don’t help you when you’re trying to figure out
what’s going wrong
– You need actionable metrics
10. Monitor or die
• Outside monitoring e.g. Pingdom, Gomez
– DNS problems, localized problem, SLA
• Inside monitoring e.g. New Relic, CloudWatch,
Server Density
– High latency, CPU spikes, memory crunch, peek-a-boo servers, rogue processes, SQL queries per second, SQL wait time, SQL locks, disk usage, disk IO performance, page file usage, network traffic, requests per second, …
14. Load-balancers
• Bandwidth limits on dedicated boxes are harder to work around
• F5s are great boxes, but have lousy live reporting = you can get into trouble quickly
• Adding/removing servers sucks
• DNS load-balancing sucks for everyone
15. nginx
• Fantastic at handling a massive number of requests (low CPU, low memory)
• Easy to configure and change on-the-fly
• Gzip, modify headers, host names
• Proxy with error intercept
• No query string or IF-statement* support
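The proxy-with-error-intercept setup above can be sketched as a config. This is a minimal sketch: the upstream addresses, ports, and paths are illustrative, not from the talk.

```nginx
# Illustrative front-end config: gzip, header tweaks, and a proxy
# that intercepts origin errors instead of passing them through.
upstream origin {
    server 10.0.0.10:8080;
    server 10.0.0.11:8080;
}

server {
    listen 80;
    gzip on;                                # compress responses on the way out

    location / {
        proxy_pass http://origin;
        proxy_set_header Host $host;        # preserve the original host name
        proxy_intercept_errors on;          # nginx handles origin errors itself
        error_page 500 502 503 504 /maintenance.html;
    }

    location = /maintenance.html {
        root /var/www/static;               # a static page users see on failure
    }
}
```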
16. Varnish
• Caching server but so much more
• Fantastic at handling a massive number of requests (low CPU, low memory)
• Easy to configure and change on-the-fly
• Protect your origin servers
• Deals with errors from origin servers
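A rough VCL sketch of Varnish shielding the origin and absorbing its errors. The backend address, TTL, and grace period here are illustrative assumptions, not values from the talk.

```vcl
vcl 4.0;

backend origin {
    .host = "10.0.0.10";
    .port = "8080";
}

sub vcl_backend_response {
    set beresp.ttl = 60s;      # serve from cache for a minute
    set beresp.grace = 1h;     # keep serving stale objects if the origin is sick
}

sub vcl_backend_error {
    # Origin is down or returned garbage: hand back a short-lived synthetic
    # error instead of hammering the origin on every request.
    set beresp.status = 503;
    set beresp.ttl = 5s;
    return (deliver);
}
```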
17. Origin servers
• Whatever tweaks you make will never help
enough
– e.g. If your disk IO is becoming a problem, it’s
already too late to save you
• Keep them stock so you don’t blow your mind,
easier to deploy
• Handle any query string hacking in Varnish
18. Databases
• No silver bullet
• Two options:
– Shard (split your data between servers)
– Cluster (many boxes working together as one)
• Shards commonly used today
– Lots of work at the code level, no incremental IDs
• Clusters have a single point of failure
– Try upgrading one and tell me they don’t
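Sharding as described above can be sketched as simple hash-based routing. The shard names and the choice of MD5 are illustrative; the point about incremental IDs is that an auto-increment column can't be the shard key, since each shard would mint its own overlapping sequence.

```python
import hashlib

# Illustrative shard pool; in practice these would be connection strings.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(user_id: str) -> str:
    """Deterministically map a key to one shard by hashing it."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

The same key always lands on the same shard, which is what makes lookups possible without a central index.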
21. Basics
• Worldwide network of DNS load-balanced
reverse proxies
• Not magic
• Can achieve 99% offload if you do it right
• Have to understand your requests
22. Market leaders
• Akamai: market leader, $$$, most options, yearly
contracts, pay for GB + request headers
• CloudFront: built on AWS, cheaper, pay-as-you-go, fewer features, new features coming quickly, GB + pay-per-request
• EdgeCast (pay-as-you-go through GoGrid),
CloudFlare (optimizer, security, easy!)
23. Tiered distribution
• More points-of-presence (POPs) = less caching if
your traffic is global
• Need to put a layer of servers between POPs
and your servers
• Sophisticated setups throttle requests
– if 100 come in at same time, only 1 gets
through
24. Cache keys
• Need to have same query string to get cached
result
• Some CDNs can ignore params
– important if you need a random number on the
query string to prevent browser caching
• Cool options: case sensitive/insensitive, cache
differently based on cookie, headers
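If your CDN can't ignore a parameter, you can normalize the cache key yourself at your own proxy layer. A rough VCL sketch, assuming a cache-busting parameter named `rnd` (the name is made up for illustration):

```vcl
sub vcl_recv {
    # Strip the "rnd" cache-buster so all variants share one cache key.
    set req.url = regsuball(req.url, "(\?|&)rnd=[^&]*", "");
    # If the stripped parameter was first, the next one now starts with "&";
    # turn that back into "?".
    set req.url = regsub(req.url, "^([^?]*)&", "\1?");
}
```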
25. Invalidations suck
• Trying to get CDN to drop its cache is hard
– takes a long time to reach all POPs
– triggers thundering herd
– takes out all caching for a bit
• Build the ability to change query strings at the code layer
– e.g. add a version number to JS/CSS URLs; when you roll out, the new URLs break the cache
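The version-on-the-query-string trick can be sketched like this; the function name and version constant are made up for illustration (in practice the version would come from a build number or git hash bumped on every deploy):

```python
# Bust CDN and browser caches by versioning asset URLs at render time.
ASSET_VERSION = "2024.06.01"  # illustrative; set per deploy

def asset_url(path: str) -> str:
    """Append the current deploy version so a rollout yields brand-new URLs."""
    sep = "&" if "?" in path else "?"
    return f"{path}{sep}v={ASSET_VERSION}"

print(asset_url("/js/app.js"))  # /js/app.js?v=2024.06.01
```

No invalidation is needed: old URLs simply stop being requested, and the new ones miss the cache exactly once per POP.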
26. How long to cache for?
• As long as you need, but no longer
• Make sure you think about the error case, i.e. what if an error gets cached
– Some CDNs let you set your own rules for that
– Remember, invalidations suck
28. Thundering herds
• When you roll out or have high latency, all your timeouts align
– Origins get slammed at regular interval by POPs
• Random TTLs are your friend
– Just +/- a few minutes can be a big help
– TIP: drop into inline C in Varnish to do this
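The random-TTL idea is tiny but worth making concrete. A minimal sketch, with the base TTL and jitter window chosen for illustration:

```python
import random

BASE_TTL = 300   # five minutes, illustrative
JITTER = 120     # +/- two minutes of spread

def jittered_ttl() -> int:
    """Spread expiries out so cached objects don't all time out in lockstep."""
    return BASE_TTL + random.randint(-JITTER, JITTER)
```

Every object expires at a slightly different moment, so the origin sees a trickle of refreshes instead of a synchronized stampede.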
29. Don’t build your own*
• You will never be as smart as Akamai/Amazon
• You will never be able to bring on new servers
fast enough to scale
• Spend your time building awesome software
• Build your own caching layer for the POPs (and, just in case, to protect your origin servers)
32. Why do I need this?
• You can’t cache every request
• You can’t cache POST requests
• Protect the database!
• The longer you can go before you have to
shard your database, the better
33. What is it?
• In-process, in-memory caching
• Static variables work great
– TIP: .NET: static variables are scoped in the
thread, WHY?!
• Custom memory stores
• Whatever you want, just not the disk
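The in-process, in-memory idea can be sketched with a module-level dict standing in for the "static variable" store from the slide. All names are illustrative, and this version is deliberately not thread-safe (the later slides add that):

```python
import time

# Module-level store: lives in the process, no network hop, gone on restart.
_cache: dict[str, tuple[float, object]] = {}

def cache_get(key: str):
    """Return the cached value, or None if missing or expired."""
    entry = _cache.get(key)
    if entry is None:
        return None
    expires_at, value = entry
    if time.monotonic() >= expires_at:
        del _cache[key]
        return None
    return value

def cache_set(key: str, value, ttl: float = 60.0) -> None:
    """Store a value with an absolute expiry time."""
    _cache[key] = (time.monotonic() + ttl, value)
```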
34. Isn’t that what Memcached is for?
• Memcached is in-memory BUT so is your database
– Advantages of Memcached over your database:
• Cheaper to replicate
• Fast lookups...if your db sucks
– Disadvantages:
• Still has network latency, higher than db lookup (unless
your db sucks)
• IT’S NOT A DATABASE!
35. Getting started
• Think about your data + classes
• TTLs based on knowledge of your data
• Random TTLs (avoid the thundering herd)
• Use concurrent, thread-safe objects
• Wrap your code in try-catch
– Caching isn’t worth breaking your site for
36. Updating cache
• Use semaphores (that Comp Sci degree is finally going to come in handy)
• Semaphores should always unlock on their own
– Your thread could die/timeout at any time. You don’t want to lock forever
• Use a separate thread for the lookup. Why should one user suffer?
• Using a datetime semaphore is usually the best
– keep a time when the next update will take place
– 1st thread to hit that time, immediately adds some seconds to the time.
Buys itself enough time to do lookup
– Any blocked thread gets cached data. DON’T LOCK
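The datetime-semaphore pattern above can be sketched roughly as follows. Names, the grace window, and the refresh TTL are all made up for illustration; the essential moves are that the first thread past the deadline bumps it forward (buying itself time), refreshes on a separate thread, and every caller, blocked or not, gets the cached value immediately:

```python
import threading
import time

_value = "stale"            # whatever was cached last
_next_update = 0.0          # epoch seconds when a refresh is due
_guard = threading.Lock()   # protects only the check-and-bump, held briefly
REFRESH_GRACE = 30.0        # seconds the refreshing thread buys itself

def _expensive_lookup() -> str:
    time.sleep(0.05)        # stand-in for a slow database query
    return "fresh"

def _refresh() -> None:
    global _value, _next_update
    _value = _expensive_lookup()
    _next_update = time.time() + 60.0   # normal TTL after a good refresh

def get_value() -> str:
    global _next_update
    with _guard:
        due = time.time() >= _next_update
        if due:
            # Push the deadline forward first so no other thread starts
            # a refresh; the lock auto-releases, so nothing can stay stuck.
            _next_update = time.time() + REFRESH_GRACE
    if due:
        threading.Thread(target=_refresh, daemon=True).start()
    return _value           # never block: stale data beats a stalled page
```

If the refreshing thread dies, the bumped deadline simply expires and the next caller takes over, which is the "semaphores should always unlock on their own" property.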
37. Populating cache for first time
• How do you prevent thundering herd before
cache?
• Ok, you may have to lock. But be smart about it.
• Are you sure your database can’t handle it?
• This is where other caching layers help: CDN
throttling, Varnish throttling, Memcached, read-
only databases
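"Lock, but be smart about it" usually means double-checked locking: only one thread pays for the first load, latecomers wait briefly instead of stampeding the database. A minimal sketch with illustrative names:

```python
import threading

_lock = threading.Lock()
_cached = None

def get_or_load(load):
    """Populate the cache exactly once; all other callers reuse the result."""
    global _cached
    if _cached is not None:          # fast path: already populated, no lock
        return _cached
    with _lock:
        if _cached is None:          # re-check: another thread may have won
            _cached = load()
    return _cached
```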
38. Garbage collection
• Keep counters for metrics e.g. how many hits to the cached
object, datetime of last request for that object
• Every X requests (or seconds), run your garbage collection
– Use semaphores
– Don’t get rid of the most used objects
• You are going to collide with running code
– try-catch is your friend
• Don’t be afraid to dump the cache and start over
39. Watch out for references
• If you are storing something in a cache object, you can save a lot of memory by passing a reference to the object
• Don’t forget about the reference
• Watch out for garbage collection trying to destroy it
• Updating cache operation might involve updating an
existing object
40. The curse
• More servers = more caches = less
efficient
• Discipline: can’t throw more servers at the
problem