This document discusses strategies for handling large amounts of data in web applications. It begins by providing examples of how much data some large websites contain, ranging from terabytes to petabytes. It then covers various techniques for scaling data handling capabilities including vertical and horizontal scaling, replication, partitioning, consistency models, normalization, caching, and using different data engine types beyond relational databases. The key lessons are that data volumes continue growing rapidly, and a variety of techniques are needed to scale across servers, datacenters, and provide high performance and availability.
5. How Big Does it Get 22M+ users Dozens of DB servers Dozens of Web servers Six specialized graph database servers to run the recommendations engine Source: http://highscalability.com/digg-architecture
6. How Big Does it Get 1 TB / Day 100 M blogs indexed / day 10 B objects indexed / day 0.5 B photos and videos Data doubles in 6 months Users double in 6 months Source: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/
7. How Big Does it Get 2 PB Raw Storage 470 M photos, 4-5 sizes each 400 k photos added / day 35 M photos in Squid cache (total) 2 M photos in Squid RAM 38k requests / sec to Memcached 4 B queries / day Source: http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html
8. How Big Does it Get Virtualized database spans 600 production instances residing in 100+ server clusters distributed over 8 datacenters 2 PB of data 26 B SQL queries / day 1 B page views / day 3 B API calls / month 15,000 App servers Source: http://highscalability.com/ebay-architecture/
9. How Big Does it Get 450,000 low-cost commodity servers in 2006 Indexed 8 B web pages in 2005 200 GFS clusters (1 cluster = 1,000 – 5,000 machines) Read / write throughput = 40 GB / sec across a cluster Map-Reduce: 100k jobs / day 20 PB of data processed / day 10k MapReduce programs Source: http://highscalability.com/google-architecture/
10. Key Trends Data Size ~ PB Data Growth ~ TB / day No of servers – 10s to 10,000 No of datacenters – 1 to 10 Queries – B+ / day Specialized needs – more / other than RDBMS
12. Vertical Scaling (Scaling Up) [diagram: a single host running the App Server and DB Server, scaled up by adding CPU and RAM]
13. Big Irons Sunfire E20k: 36x 1.8 GHz processors, $450,000 – $2,500,000 PowerEdge SC1435: dual-core 1.8 GHz processor, around $1,500
14. Vertical Scaling (Scaling Up) Increasing the hardware resources on a host Pros Simple to implement Fast turnaround time Cons Finite limit Hardware does not scale linearly (diminishing returns for each incremental unit) Requires downtime Increases Downtime Impact Incremental costs increase exponentially
15. Vertical Partitioning of Services [diagram: the App Server and DB Server split onto separate hosts]
16. Vertical Partitioning of Services Split services on separate nodes Each node performs different tasks Pros Increases per-application availability Task-based specialization, optimization and tuning possible Reduces context switching Simple to implement for out-of-band processes No changes to App required Flexibility increases Cons Sub-optimal resource utilization May not increase overall availability Finite Scalability
17. Horizontal Scaling of App Server [diagram: a Load Balancer distributing requests across multiple Web Servers that share one DB Server]
18. Horizontal Scaling of App Server Add more nodes for the same service Identical, doing the same task Load Balancing Hardware balancers are faster Software balancers are more customizable
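A minimal Python sketch of the "identical nodes behind a balancer" idea; the backend names and the round-robin policy are illustrative and not tied to any particular hardware or software balancer.

```python
from itertools import cycle

# Hypothetical pool of identical app-server nodes behind the balancer.
backends = cycle(["app1:8080", "app2:8080", "app3:8080"])

def pick_backend():
    """Round-robin: each request goes to the next node in the pool."""
    return next(backends)

# Every node runs the same code, so any node can serve any request --
# which is exactly why per-user state becomes the problem on the next slides.
for _ in range(5):
    print(pick_backend())
```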
19. The Problem - State [diagram: User 1 and User 2 routed through the Load Balancer to different Web Servers in front of one DB Server; state held in memory on one web server does not follow the user]
20. Sticky Sessions [diagram: the Load Balancer pins User 1 and User 2 each to a single Web Server] Cons: asymmetrical load distribution; downtime
21. Central Session Store [diagram: Web Servers behind the Load Balancer all read and write session data from a shared Session Store] Cons: SPOF; reads and writes generate network + disk IO
23. Clustered Sessions Pros No SPOF Easier to set up Fast Reads Cons n x Writes Increase in network IO with increase in nodes Stale data (rare)
24. Sticky Sessions with Central Store [diagram: User 1 and User 2 pinned to specific Web Servers by the Load Balancer, with sessions also written to a central store]
25. More Session Management No Sessions Stuff state in a cookie and sign it! Cookie is sent with every request / response Super Slim Sessions Keep a small amount of frequently used data in the cookie Pull the rest from the DB (or central session store)
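A small sketch of the signed-cookie ("no sessions") idea, using Python's standard hmac / base64 modules; the secret, field names, and encoding are illustrative choices, not a prescribed format.

```python
import base64, hashlib, hmac, json

SECRET = b"server-side-secret"  # known only to the web servers

def make_cookie(state: dict) -> str:
    """Serialize the state and append an HMAC so clients cannot tamper with it."""
    payload = base64.urlsafe_b64encode(json.dumps(state).encode()).decode()
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + sig

def read_cookie(cookie: str):
    """Verify the signature before trusting anything inside the cookie."""
    payload, _, sig = cookie.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered or corrupted
    return json.loads(base64.urlsafe_b64decode(payload))

cookie = make_cookie({"user_id": 42, "theme": "dark"})
print(read_cookie(cookie))  # {'user_id': 42, 'theme': 'dark'}
```

The trade-off from the slide still applies: the cookie travels with every request and response, so keep what you put in it tiny.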
26. Sessions - Recommendation Bad Sticky sessions Good Clustered sessions for small number of nodes and / or small write volume Central sessions for large number of nodes or large write volume Great No Sessions!
27. App Tier Scaling - More HTTP Accelerators / Reverse Proxy Static content caching, redirect to lighter HTTP Async NIO on user-side, Keep-alive connection pool CDN Get closer to your user Akamai, Limelight IP Anycasting Async NIO
28. Scaling a Web App App-Layer Add more nodes and load balance! Avoid Sticky Sessions Avoid Sessions!! Data Store Tricky! Very Tricky!!!
31. Replication = Scaling by Duplication [diagram: the App Layer in front of several nodes, each holding a full copy of tables T1, T2, T3, T4] Each node has its own copy of data Shared Nothing Cluster
32. Replication Read : Write = 4:1 Scale reads at cost of writes! Duplicate Data – each node has its own copy Master Slave Writes sent to one node, cascaded to others Multi-Master Writes can be sent to multiple nodes Can lead to deadlocks Requires conflict management
33. Master-Slave [diagram: the App Layer writes to a single Master, which replicates to several Slaves] n x Writes – Async vs. Sync SPOF Async - Critical Reads from Master!
34. Multi-Master [diagram: the App Layer writes to two Masters, which replicate to several Slaves] n x Writes – Async vs. Sync No SPOF Conflicts!
35. Replication Considerations Asynchronous Guaranteed, but out-of-band replication from Master to Slave Master updates its own db and returns a response to client Replication from Master to Slave takes place asynchronously Faster response to a client Slave data is marginally behind the Master Requires modification to App to send critical reads and writes to master, and load balance all other reads Synchronous Guaranteed, in-band replication from Master to Slave Master updates its own db, and confirms all slaves have updated their db before returning a response to client Slower response to a client Slaves have the same data as the Master at all times Requires modification to App to send writes to master and load balance all reads
36. Replication Considerations Replication at RDBMS level Support may exist in the RDBMS or through a 3rd-party tool Faster and more reliable App must send writes to the Master, reads to any db, and critical reads to the Master Replication at Driver / DAO level Driver / DAO layer ensures writes are performed on all connected DBs Reads are load balanced Critical reads are sent to a Master In most cases RDBMS agnostic Slower and in some cases less reliable
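A minimal Python sketch of routing at the Driver / DAO level as described above; the connection objects are placeholders (anything with an execute method), and the policy is the one from these slides: writes and critical reads go to the master, other reads are load balanced across slaves.

```python
import random

class ReplicatedDAO:
    """Toy read/write splitter for a master-slave setup (connections are stand-ins)."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def write(self, sql, params=()):
        # All writes go to the master; replication fans them out to the slaves.
        return self.master.execute(sql, params)

    def read(self, sql, params=(), critical=False):
        # Critical reads must see the latest write, so with asynchronous
        # replication they are sent to the master as well.
        conn = self.master if critical else random.choice(self.slaves)
        return conn.execute(sql, params)
```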
39. Partitioning = Scaling by Division Vertical Partitioning Divide data on tables / columns Scale to as many boxes as there are tables or columns Finite Horizontal Partitioning Divide data on rows Scale to as many boxes as there are rows! Limitless scaling
40. Vertical Partitioning [diagram: the App Layer over a single node holding tables T1, T2, T3, T4, T5] Note: A node here typically represents a shared nothing cluster
41. Vertical Partitioning [diagram: the App Layer over separate nodes, each holding one of the tables T1–T5] Facebook - User table, posts table can be on separate nodes Joins need to be done in code (Why have them?)
42. Horizontal Partitioning [diagram: the App Layer over three nodes, each holding all tables T1–T5 for the first, second, and third million rows respectively]
43. Horizontal Partitioning Schemes Value Based Split on timestamp of posts Split on the first letter of the user name Hash Based Use a hash function to determine the cluster Lookup Map First Come First Serve Round Robin
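A small Python sketch of two of these schemes; the shard names and the choice of MD5 are illustrative. The only real requirement is that the routing function is stable, so a given key always lands on the same node.

```python
import hashlib

SHARDS = ["db0", "db1", "db2", "db3"]  # illustrative cluster names

def shard_by_hash(key: str) -> str:
    """Hash-based scheme: a stable hash of the key picks the shard."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

def shard_by_letter(user_name: str) -> str:
    """Value-based scheme: route on the first letter of the user name."""
    return SHARDS[(ord(user_name[0].lower()) - ord("a")) % len(SHARDS)]

print(shard_by_hash("user:42"), shard_by_letter("Alice"))
```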
46. Transactions Transactions make you feel alone No one else manipulates the data when you are Transactional serializability The behavior is as if a serial order exists Source: http://blogs.msdn.com/pathelland/
47. Life in the “Now” Transactions live in the “now” inside services Time marches forward Transactions commit, advancing time Transactions see the committed transactions A service's business logic lives in the “now” Source: http://blogs.msdn.com/pathelland/
60. In Einstein’s world, everything is “relative” to one’s perspective
61. Today: No attempt to blur the boundary Source: http://blogs.msdn.com/pathelland/
62. Versions and Distributed Systems Can't have "the same" data at many locations Unless it is a snapshot Changing distributed data needs versions Creates a snapshot… Source: http://blogs.msdn.com/pathelland/
63. Subjective Consistency Given what I know here and now, make a decision Remember the versions of all the data used to make this decision Record the decision as being predicated on these versions Other copies of the object may make divergent decisions Try to sort out conflicts within the family If necessary, programmatically apologize Very rarely, whine and fuss for human help Subjective Consistency: Given the information I have at hand, make a decision and act on it! Remember the information at hand! Ambassadors Had Authority: Back before radio, it could be months between communication with the king. Ambassadors would make treaties and much more... They had binding authority. The mess was sorted out later! Source: http://blogs.msdn.com/pathelland/
65. Everyone sharing their knowledge leads to the same result... This is NOT magic; it is a design requirement! Idempotence, commutativity, and associativity of the operations (decisions made) are all implied by this requirement Source: http://blogs.msdn.com/pathelland/
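A tiny Python illustration of why those three properties let replicas converge; the "observations as a set, merge as union" model is an assumed example (in the spirit of a grow-only set), not something prescribed by the slides.

```python
# Each replica keeps a set of observations; merging is plain set union.
# Union is idempotent, commutative, and associative, so replicas can
# exchange state in any order, any number of times, and still agree.

def merge(a: set, b: set) -> set:
    return a | b

replica1 = {("order-17", "shipped")}
replica2 = {("order-17", "shipped"), ("order-18", "placed")}
replica3 = {("order-18", "placed")}

# Any gossip order yields the same merged result.
assert merge(merge(replica1, replica2), replica3) == merge(replica3, merge(replica2, replica1))
print(merge(replica1, merge(replica2, replica3)))
```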
67. Why Normalize? Classic problem with de-normalization Can't update Sam's phone # since there are many copies
Emp #  Emp Name  Mgr #  Mgr Name  Emp Phone  Mgr Phone
47     Joe       13     Sam       5-1234     6-9876
18     Sally     38     Harry     3-3123     5-6782
91     Pete      13     Sam       2-1112     6-9876
66     Mary      02     Betty     5-7349     4-0101
Normalization's Goal Is Eliminating Update Anomalies Can Be Changed Without "Funny Behavior" Each Data Item Lives in One Place De-normalization is OK if you aren't going to update! Source: http://blogs.msdn.com/pathelland/
69. Eliminate Joins 6 joins for 1 query! Do you think FB would do this? And how would you do joins with partitioned data? De-normalization removes joins But increases data volume But disk is cheap and getting cheaper And can lead to inconsistent data If you are lazy However this is not really an issue
70. “Append-Only” Data Many Kinds of Computing are “Append-Only” Lots of observations are made about the world Debits, credits, Purchase-Orders, Customer-Change-Requests, etc As time moves on, more observations are added You can’t change the history but you can add new observations Derived Results May Be Calculated Estimate of the “current” inventory Frequently inaccurate Historic Rollups Are Calculated Monthly bank statements
71. Databases and Transaction Logs Transaction Logs Are the Truth High-performance & write-only Describe ALL the changes to the data Database – the Current Opinion Describes the latest value of the data as perceived by the application [diagram: the log feeding the DB] The Database Is a Caching of the Transaction Log! It is the subset of the latest committed values represented in the transaction log… Source: http://blogs.msdn.com/pathelland/
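A toy Python sketch of the "database as a cache of the log" view: an append-only list of entries plus a fold that derives the current values. The account/deposit example is invented purely for illustration.

```python
# Append-only transaction log: never rewritten, only appended to.
log = [
    ("acct-1", "deposit", 100),
    ("acct-1", "withdraw", 30),
    ("acct-2", "deposit", 50),
    ("acct-1", "deposit", 10),
]

def replay(entries):
    """Fold the log into the 'current opinion' -- the latest value per key."""
    balances = {}
    for acct, op, amount in entries:
        delta = amount if op == "deposit" else -amount
        balances[acct] = balances.get(acct, 0) + delta
    return balances

print(replay(log))  # {'acct-1': 80, 'acct-2': 50}
```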
72. We Are Swimming in a Sea of Immutable Data Source:http://blogs.msdn.com/pathelland/
74. Caching Makes scaling easier (cheaper) Core Idea Read data from persistent store into memory Store in a hash-table Read from the cache first; on a miss, load from the persistent store
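A minimal read-through-cache sketch in Python; the in-process dict stands in for a cache tier like memcached, and load_from_db is a hypothetical placeholder for the expensive persistent-store read.

```python
cache = {}  # in-memory hash table standing in for the cache tier

def load_from_db(key):
    """Placeholder for the expensive read from the persistent store."""
    return f"row-for-{key}"

def get(key):
    """Read-through: try the cache first, fall back to the store, then populate."""
    if key in cache:
        return cache[key]
    value = load_from_db(key)
    cache[key] = value
    return value

print(get("user:42"))  # miss -> loads from the store and caches it
print(get("user:42"))  # hit  -> served from memory
```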
79. How does it work? In-memory Distributed Hash Table A memcached instance manifests as a process (often on the same machine as the web server) Memcached Client maintains a hash table Which item is stored on which instance Memcached Server maintains a hash table Which item is stored in which memory location
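A toy Python model of those two hash tables: the client hashes a key to pick an instance, and each "instance" here is just a dict standing in for that server's memory. Real memcached clients use smarter (often consistent) hashing; this only shows the division of responsibility.

```python
import hashlib

class TinyMemcacheClient:
    """Client-side table: key -> instance. Each instance's dict: key -> value."""

    def __init__(self, servers):
        self.servers = servers
        self.storage = {s: {} for s in servers}  # stands in for each server's RAM

    def _server_for(self, key):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.servers[h % len(self.servers)]

    def set(self, key, value):
        self.storage[self._server_for(key)][key] = value

    def get(self, key):
        return self.storage[self._server_for(key)].get(key)

client = TinyMemcacheClient(["cache1:11211", "cache2:11211"])
client.set("user:42", {"name": "Joe"})
print(client.get("user:42"))
```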
81. It's not all Relational! Amazon - S3, SimpleDB, Dynamo Google - App Engine Datastore, BigTable Microsoft – SQL Data Services, Azure Storage Facebook – Cassandra LinkedIn - Project Voldemort Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, HBase, Hypertable
82. Tuplespaces Basic Concepts No tables - containers and entities instead No schema - each tuple has its own set of properties Amazon SimpleDB – strings only Microsoft Azure SQL Data Services Strings, blob, datetime, bool, int, double, etc. No cross-container joins as of now Google App Engine Datastore Strings, blob, datetime, bool, int, double, etc.
83. Key-Value Stores Google BigTable Sparse, distributed, multi-dimensional sorted map Indexed by row key, column key, timestamp Each value is an un-interpreted array of bytes Amazon Dynamo Data partitioned and replicated using consistent hashing Decentralized replica sync protocol Consistency through versioning Facebook Cassandra Used for Inbox search Open Source Scalaris Keys stored in lexicographical order Improved Paxos to provide ACID Memory resident, no persistence
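Since consistent hashing comes up for Dynamo (and differs from the simple modulo hashing sketched earlier), here is a minimal Python ring with invented node names; real systems layer replication and failure handling on top of this.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Nodes and keys share one hash space; a key belongs to the first node
    clockwise from its hash, so adding/removing a node only moves nearby keys."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):  # virtual nodes smooth out the load
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        idx = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["nodeA", "nodeB", "nodeC"])
print(ring.node_for("user:42"))
```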
84. In Summary Real-life scaling requires trade-offs No Silver Bullet Need to learn new things Need to un-learn Balance!