4. A High-Level Look at RTB
1. Browsers visit Publishers and create impressions.
2. Publishers sell impressions via Exchanges.
3. Exchanges serve as auction houses for the impressions.
4. On behalf of the marketer, m6d bids on the impressions via the
auction house. If m6d wins, we display our ad to the
browser.
5. Performance and Data
• Billions and billions of bid requests a day
• A single request can result in multiple
Cassandra Operations!
• One cluster is just under 10TB and growing
• Low latency requirement: typically below 120 ms
• Limited data available to m6d via the exchange
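To put "billions a day" into per-second terms, here is a back-of-the-envelope sketch. The daily total of 1 billion and the fan-out of 3 Cassandra operations per request are assumed figures for illustration only; the slides give neither number.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(requests_per_day: float) -> float:
    """Average request rate implied by a daily total (ignores daily peaks)."""
    return requests_per_day / SECONDS_PER_DAY


def cassandra_ops_per_second(requests_per_day: float, ops_per_request: int) -> float:
    """Average Cassandra operation rate when each request fans out."""
    return per_second(requests_per_day) * ops_per_request


# Assumed figures, not taken from the slides.
daily_requests = 1_000_000_000  # one billion, the floor of "billions"
ops_per_request = 3             # hypothetical fan-out per bid request

print(round(per_second(daily_requests)), "requests/second on average")
print(round(cassandra_ops_per_second(daily_requests, ops_per_request)),
      "Cassandra operations/second on average")
```

These are daily averages; traffic at peak hours will be higher.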
6. Segment Data
Segments are how we assign product or service
affinity to a group of users. Users we consider to be
like-minded with respect to a given brand will be
placed in the same segment.
Segment Data is just one component of our
overarching data model.
Segments help to reduce the number of calculations
we do in real time.
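A minimal sketch of why precomputed segments cut real-time work: at bid time the question becomes a set lookup instead of a calculation. The class and method names (SegmentStore, assign, matches) are hypothetical and an in-memory dict stands in for the real data store; this is not m6d's data model.

```python
from collections import defaultdict


class SegmentStore:
    """Hypothetical in-memory stand-in for a user -> segments mapping."""

    def __init__(self):
        self._segments_by_user = defaultdict(set)

    def assign(self, user_id: str, segment: str) -> None:
        """Place a user in a segment (done ahead of time, not at bid time)."""
        self._segments_by_user[user_id].add(segment)

    def segments_for(self, user_id: str) -> frozenset:
        """Return the user's segments without creating an entry for unknown users."""
        return frozenset(self._segments_by_user.get(user_id, ()))

    def matches(self, user_id: str, campaign_segments: set) -> bool:
        """True if the user belongs to at least one segment the campaign targets."""
        return not self.segments_for(user_id).isdisjoint(campaign_segments)


store = SegmentStore()
store.assign("user-123", "brand-a-affinity")
print(store.matches("user-123", {"brand-a-affinity", "brand-b-affinity"}))
print(store.matches("user-999", {"brand-a-affinity"}))
```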
7. Old Approach for Segment Data
[Diagram: Application Nodes (Tomcat + MySQL), MySQL Data Push, Event Logs, Aggregation, Hadoop]
Limitations
• Periodically updated.
• Only a subsection of the data.
• Cluster performance is affected during a data push.
8. Cassandra Approach
for Segment Data
[Diagram: Application Nodes (Tomcat + Less MySQL Usage), Cassandra]
Better!
• Updating in real time now possible
• Distributed, not duplicated
• Less complexity to manage
• Storing more information
• We can now bid on users sooner!
10. During waking hours: Dr. Realtime
• User traffic is at peak
• Applications need low latency operations
• High volume of read and write operations
• Desire high cache hit rate to limit disk IO
• Dr. Realtime conducts 'experiments' on
optimization
11. Experiment: Active Set, VFS, cache
size tuning
• Cluster optimization is a topic that must be
revisited periodically
• User base and requests are perpetually growing
• Amount of physical data stored grows
• New features typically result in new data and
more requests
• How to tune your environment is application
and hardware dependent
12. Physical data directory
• sstable holds data
• Index holds offsets to
avoid disk seeks
• Bloom filter probabilistic
lookup system
– (also a stat table)
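To illustrate what "probabilistic lookup" means, here is a minimal Bloom filter sketch. It can answer "definitely not present" or "possibly present", which is how a read can skip an sstable that does not hold a key. The sizes and the use of SHA-256 are assumptions chosen for a self-contained example; this is not Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means the key was never added; True means "maybe".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))  # an added key is always reported as possibly present
```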
13. When RAM > Data Size
• If you can afford to keep
your data set in RAM:
• It is fast from VFS cache
• That's it. You're optimized.
• However, you do not
usually need this much
RAM
14. When RAM < Data Size
• The OS will cache the most
active portions of disk
• The write/compact model
causes the cache to churn
• User requests cause the
cache to churn
15. Understanding Active set with a
hypothetical example
Webmail service (Coldmail):
• I have had an account for 10 years, but I never log in
more than twice a month
• I have 1,000,000 items in my inbox
• Not in the active set
Social networking (chirper):
• I am logged in every day
• I commonly read updates from my friends
• In the active set
16. $60,000 Question
How do you determine what the
active set of your application and
user base is?
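One offline way to approach this question is to replay a request trace against a simulated cache and see where the hit rate stops improving. The sketch below assumes an LRU policy and a tiny made-up trace; it is an illustration, not the method used in the talk.

```python
from collections import OrderedDict


def lru_hit_rate(keys, cache_size: int) -> float:
    """Replay `keys` through an LRU cache of `cache_size` entries; return hit rate."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for key in keys:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0


# Made-up trace: "a" and "b" are active users, "x1" and "x2" are long-tail visitors.
trace = ["a", "b", "a", "b", "x1", "a", "b", "x2", "a", "b"]

for size in (2, 3, 10):
    print(size, lru_hit_rate(trace, size))
```

On this trace a cache of 3 entries does as well as a cache of 10: once the active set fits, extra entries only hold the long tail.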
18. Turn on a cache
• JMX allows you to tune only a single node
for side by side comparisons
• Set the size very large for key cache (be
more careful with row cache)
19. Analysis
• 8:30: hit rate 91%, 1.2 million entries
• 10:30: hit rate ~93%, 1.7 million entries
• Past 1.2 million entries, cache memory might be better spent elsewhere
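The diminishing return can be made explicit with a little arithmetic. This sketch assumes the 1.2 and 1.7 million figures are cache entry counts and treats ~93% as exactly 93.

```python
def marginal_gain_per_million(entries_a, rate_a, entries_b, rate_b):
    """Hit rate percentage points gained per extra million cache entries."""
    return (rate_b - rate_a) / ((entries_b - entries_a) / 1_000_000)


# Figures read off the slide; the unit (entries) is an assumption.
average_first = 91 / 1.2  # points per million entries for the first 1.2 million
marginal_next = marginal_gain_per_million(1_200_000, 91, 1_700_000, 93)

print(round(average_first, 1), "points per million entries up to 1.2 million")
print(marginal_next, "points per million entries from 1.2 to 1.7 million")
```

The first 1.2 million entries buy roughly 76 points per million; the next half million buy about 4 points per million.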
20. Active set conclusions
• Determine sweet spot for hit rate and cache size
• Do not try to cache long tail of requests
• When all other things are equal, dedicate more
cache to the most-read column family
• Use row cache only if rows are a predictable size
• Large row caches cannot be saved, so they are cold on
restart
21. read_repair_chance – Cassandra's
version of an ethical dilemma
• Read Repair generates additional reads across the cluster
for each user read
• Read Repair Chance controls the probability of Read Repair
occurring.
• If data is write-once or write-rarely Read Repair may be
unnecessary
– data read ratio much larger than write ratio
– data that does not need strict consistency
• In 1.0, hinted handoff no longer needs to wait on the failure
detector. The Read Repair Chance default has been lowered from
100% to 10%.
– CASSANDRA-2045, thanks to ntelford and co!
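A rough way to see the cost side of this trade-off is to count replica reads. The sketch below uses a deliberately simplified model: each user read touches the replicas needed for its consistency level, and with probability read_repair_chance it touches all remaining replicas too. The replication factor of 3 and consistency level of ONE are assumptions for the example.

```python
def expected_replica_reads(user_reads: int,
                           replication_factor: int,
                           replicas_for_consistency: int,
                           read_repair_chance: float) -> float:
    """Expected replica reads under a simplified read repair model."""
    if not 0.0 <= read_repair_chance <= 1.0:
        raise ValueError("read_repair_chance must be between 0 and 1")
    # Extra replicas contacted only when read repair is triggered.
    extra_per_read = read_repair_chance * (replication_factor - replicas_for_consistency)
    return user_reads * (replicas_for_consistency + extra_per_read)


# Assumed cluster settings: RF = 3, reads at consistency level ONE.
for chance in (1.0, 0.1, 0.0):
    print(chance, expected_replica_reads(1000, 3, 1, chance))
```

Under these assumptions, dropping the chance from 100% to 10% cuts replica reads from about 3 per user read to about 1.2.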
22. Analysis for RRC 'test subjects'
Candidate: many reads, few writes
Inside story: This data used to take 2 days. A few ms... Come on man!
Candidate ?: many writes
Inside story: This is used for frequency capping; a higher % is justified
23. Experiment: Test the limits of NoSQL
science with YCSB
YCSB is a distributed load generator
that comes in handy!
• Before our upgrade from 0.6.X->0.7.X
– All the benchmarks were better
– But good to kick the tires
• Prototyping new Column Family
– Time to write 500 million records
– How many reads/second on 50GB of data
25. Round 1 Results
RunTime: 410 Seconds
Throughput: 2437 Operations/Second
Shared the results on #cassandra IRC.
Suggestion! Try: -threads 30
26. Trying it again…
Original Results:
-threads 10
RunTime: 410 Seconds
Throughput: 2437 Operations/Second
New Results:
-threads 30
RunTime: 196 Seconds
Throughput: 5088 Operations/Second
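A quick check of the two rounds, using only the numbers on the slide: both runs performed roughly the same total work (about one million operations), and tripling the client threads roughly doubled throughput.

```python
def total_operations(runtime_seconds: int, ops_per_second: int) -> int:
    """Approximate operations completed in a run."""
    return runtime_seconds * ops_per_second


def speedup(before_ops_per_second: float, after_ops_per_second: float) -> float:
    """Throughput ratio between two runs."""
    return after_ops_per_second / before_ops_per_second


round1 = {"threads": 10, "runtime_s": 410, "throughput": 2437}
round2 = {"threads": 30, "runtime_s": 196, "throughput": 5088}

for run in (round1, round2):
    print(run["threads"], "threads:",
          total_operations(run["runtime_s"], run["throughput"]), "operations")

print("speedup:", round(speedup(round1["throughput"], round2["throughput"]), 2))
```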
27. Cassandra writes fast! (duh)
• Read path
– Row, Key, and VFS caches
– With enough data and read ops disks bottleneck
• Write path
– structured log writes go to disk linearly (sequentially) and are fast
– compaction merges sstables in background
• Many threads maximize write capability
• Many threads also stop a read blocking on IO
from limiting write potential
28. Night falls and Dr. Realtime
transforms...
/etc/cron.d/mr_batch_dr_realtime
# turn into Mr. batch at night
0 0 * * * root nodetool -h `hostname` setcompactionthroughput 999
# turn back into Dr. Realtime for day
0 6 * * * root nodetool -h `hostname` setcompactionthroughput 16
Setting throughput ensures:
• During the day most iops are free to serve traffic
• At night we can rip through compactions
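The same day/night rule can be written as a small script, for example when a single job should pick the right value for whatever hour it runs. This is a sketch that only builds and prints the nodetool command; the function names and the idea of driving it from one script are assumptions, while the hours and throughput values mirror the cron entries above.

```python
import socket
from datetime import datetime

NIGHT_START_HOUR = 0       # Mr. Batch takes over at midnight
NIGHT_END_HOUR = 6         # Dr. Realtime returns at 06:00
NIGHT_THROUGHPUT_MB = 999  # effectively unthrottled compaction
DAY_THROUGHPUT_MB = 16     # leave most iops for live traffic


def throughput_for_hour(hour: int) -> int:
    """Compaction throughput (MB/s) to apply for a given hour of the day."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if NIGHT_START_HOUR <= hour < NIGHT_END_HOUR:
        return NIGHT_THROUGHPUT_MB
    return DAY_THROUGHPUT_MB


def nodetool_command(host: str, hour: int) -> list:
    """Build (but do not run) the nodetool invocation for this host and hour."""
    return ["nodetool", "-h", host,
            "setcompactionthroughput", str(throughput_for_hour(hour))]


if __name__ == "__main__":
    print(" ".join(nodetool_command(socket.gethostname(), datetime.now().hour)))
```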
29. Mr Batch ravages data creating
tombstones
• If a user clears cookies, they vanish forever
• In actuality they return as a new user
• Data has very high turnover
• We need to enforce retention policy on data
• TTL columns do not meet our requirements :(
• Cleanup daemon is a throttled range scanner
• Cleanup daemon also produces histograms
every cycle
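A sketch of what a throttled range scanner with a retention policy could look like. Everything here is hypothetical: a dict of row key to last-seen timestamp stands in for the column family, sorted key order stands in for token ranges, and the batch size and pause are made-up knobs. The real daemon is not shown in the slides.

```python
import time
from collections import Counter

SECONDS_PER_DAY = 86_400


def cleanup_cycle(rows, now, retention_seconds,
                  batch_size=100, pause_seconds=0.0, sleep=time.sleep):
    """Scan all rows in batches, delete expired ones, return (expired, histogram).

    rows: dict of row_key -> last_seen (unix seconds); mutated in place.
    histogram: Counter of row age in whole days -> number of rows seen.
    """
    histogram = Counter()
    expired = []
    keys = sorted(rows)  # stand-in for walking the ring range by range
    for start in range(0, len(keys), batch_size):
        for key in keys[start:start + batch_size]:
            age = now - rows[key]
            histogram[int(age // SECONDS_PER_DAY)] += 1
            if age > retention_seconds:
                expired.append(key)
        sleep(pause_seconds)  # throttle: yield IO to live traffic between batches
    for key in expired:
        del rows[key]         # in Cassandra each delete would write a tombstone
    return expired, histogram


rows = {"u1": 0, "u2": 9 * SECONDS_PER_DAY, "u3": 10 * SECONDS_PER_DAY}
expired, histogram = cleanup_cycle(rows, now=10 * SECONDS_PER_DAY,
                                   retention_seconds=5 * SECONDS_PER_DAY)
print(expired, dict(histogram))
```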
31. A note about different workloads
• Structured log format of C* has deep implications
• Many factors affect performance and disk size:
• Write once data
• Wide rows (many columns)
• Wide rows over time (fragmented)
• Application read write profile
• Deletion/update percentage
• LevelDB-inspired compaction in 1.0 has a different profile than the
current tiered compaction
32. Tombstones have costs
• Physically live on disk
• Bloat data, index, and
bloom filters
• Tombstones live for a grace
period and then are
eligible to be removed
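The grace period rule can be sketched as a single comparison. The grace period corresponds to the column family setting gc_grace_seconds, which defaults to 864000 seconds (10 days). The function name is hypothetical, and eligibility only means a later compaction may drop the tombstone, not that it disappears at that moment.

```python
GC_GRACE_SECONDS = 864_000  # default gc_grace_seconds: 10 days


def tombstone_is_purgeable(deleted_at: int, now: int,
                           gc_grace_seconds: int = GC_GRACE_SECONDS) -> bool:
    """True once the tombstone is older than the grace period.

    deleted_at and now are unix timestamps in seconds.
    """
    return now - deleted_at > gc_grace_seconds


deleted_at = 1_000_000
print(tombstone_is_purgeable(deleted_at, deleted_at + 3_600))    # one hour later
print(tombstone_is_purgeable(deleted_at, deleted_at + 900_000))  # past the grace period
```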
33. Caching after (major) compaction
• Our case (lots of churn)
major compaction shrinks
data significantly
• Rows fragmented over
many sstables are joined
• Tombstones and related
data columns removed
• All files should be smaller
• Smaller files mean better
VFS caching
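Why the data shrinks can be shown with a toy merge. In this sketch each sstable is a dict of (row, column) to (timestamp, value), a value of None marks a tombstone, the newest timestamp wins, and tombstones past the grace period are dropped together with the data they shadow. The representation and function name are assumptions for illustration; real compaction works on sorted files, not dicts.

```python
def major_compact(sstables, now, gc_grace_seconds):
    """Merge all sstables into one, keeping only the newest cell per key.

    sstables: list of dicts mapping (row, column) -> (timestamp, value).
    A value of None is a tombstone. Timestamps and `now` share one unit (seconds).
    """
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            # Last write wins: keep the cell with the highest timestamp.
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # Drop tombstones that have outlived the grace period.
    return {
        key: cell for key, cell in merged.items()
        if not (cell[1] is None and now - cell[0] > gc_grace_seconds)
    }


older = {("u1", "seg"): (100, "a"), ("u2", "seg"): (100, "b")}
newer = {("u1", "seg"): (200, "c"), ("u2", "seg"): (150, None)}  # u2 was deleted

compacted = major_compact([older, newer], now=1000, gc_grace_seconds=500)
print(len(older) + len(newer), "cells before,", len(compacted), "after")
```

Four cells across two files collapse to one cell in one file, which is the effect that helps VFS caching.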