Cassandra in Online Advertising: Real Time Bidding




the prospect engine for brands.
Who are we?
Costa Sevdinoglou & Edward Capriolo
Impressions look like…
A High Level look at RTB




1. Browsers visit Publishers and create impressions.
2. Publishers sell impressions via Exchanges.
3. Exchanges serve as auction houses for the impressions.
4. On behalf of the marketer, m6d bids on the impressions via the
   auction house. If m6d wins, we display our ad to the
   browser.
Performance and Data
• Billions and billions of bid requests a day
  • A single request can result in multiple Cassandra operations!
  • One cluster is just under 10 TB and growing
• Low latency requirement: typically below 120 ms
• Limited data available to m6d via the exchange
Segment Data

Segments are how we assign product or service
affinity to a group of users. Users we consider to be
like-minded with respect to a given brand will be
placed in the same segment.

Segment Data is just one component of our
overarching data model.

Segments help to reduce the number of calculations
we do in real time.
Old Approach for Segment Data

Diagram components: Application Nodes (Tomcat + MySQL), Event Logs, Hadoop, Aggregation, MySQL Data Push

Limitations
• Periodically updated.
• Only a subsection of the data.
• Cluster performance is affected during a data push.
Cassandra Approach for Segment Data

Diagram components: Application Nodes (Tomcat + Less MySQL Usage), Cassandra

Better!
• Updating in real time is now possible
• Distributed, not duplicated
• Less complexity to manage
• Storing more information
• We can now bid on users sooner!
During waking hours: Dr. Realtime

•   User traffic is at peak
•   Applications need low latency operations
•   High volume of read and write operations
•   Desire high cache hit rate to limit disk IO
•   Dr. Realtime conducts 'experiments' on
    optimization
Experiment: Active Set, VFS, cache size tuning
• Cluster optimization is a topic that must be
  revisited periodically
• User base and requests are perpetually growing
• Amount of physical data stored grows
• New features typically result in new data and
  more requests
• How to tune your environment is application
  and hardware dependent
Physical data directory
• sstable holds data
• Index holds offsets to avoid disk seeks
• Bloom filter: probabilistic lookup system
   – (also a stat table)
When RAM > Data Size
• If you can afford to keep your data set in RAM:
• It is fast from the VFS cache
• That's it. You're optimized.
• However, you do not usually need this much RAM
When RAM < Data Size
• The OS will cache the most active portions of disk
• The write/compact model causes the cache to churn
• User requests cause the cache to churn
Understanding Active Set with a hypothetical example
Webmail service (Coldmail):
  • I have had an account for 10 years, but I never log in
    more than twice a month
  • I have 1,000,000 items in my inbox
  • Not in the active set
Social networking (chirper):
  • I am logged in every day
  • I commonly read updates from my friends
  • In the active set
$60,000 Question


How do you determine what the active set of your application and user base is?
Set up instruments for testing
Turn on a cache




• JMX allows you to tune only a single node
  for side-by-side comparisons
• Set the size very large for the key cache (be
  more careful with the row cache)
Analysis
    • 8:30: hit rate 91% with 1.2 million cache entries
    • 10:30: hit rate ~93% with 1.7 million cache entries
    • Past 1.2 million entries, cache memory might
      be better spent elsewhere
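
The reasoning behind that last bullet can be made explicit with a little arithmetic. The sketch below is only an illustration: it takes the two readings from the slide at face value and compares the hit rate gained by the extra 500,000 entries with the average gain across the first 1.2 million. The function name and the "points per 100k entries" unit are assumptions made for this example.

```python
# Readings copied from the slide: (time, cache entries, hit rate).
observations = [
    ("8:30", 1_200_000, 0.91),
    ("10:30", 1_700_000, 0.93),
]


def points_per_100k(entries_before, rate_before, entries_after, rate_after):
    """Hit-rate percentage points gained per extra 100,000 cache entries."""
    extra_entries = entries_after - entries_before
    gained_points = (rate_after - rate_before) * 100
    return gained_points / (extra_entries / 100_000)


(_, first_entries, first_rate), (_, second_entries, second_rate) = observations

# Average payoff of the first 1.2 million entries (starting from an empty cache).
average_gain = points_per_100k(0, 0.0, first_entries, first_rate)
# Payoff of the entries added between the two readings.
marginal_gain = points_per_100k(first_entries, first_rate, second_entries, second_rate)

print(f"first 1.2M entries: {average_gain:.2f} points per 100k entries")
print(f"next 0.5M entries:  {marginal_gain:.2f} points per 100k entries")
```

The steep drop between the two figures is the diminishing return the slide points at.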
Active set conclusions
• Determine the sweet spot for hit rate and cache size
• Do not try to cache the long tail of requests
• When all other things are equal, dedicate more
  cache to the most-read column family
• Use the row cache only if rows are a predictable size
• Large row caches cannot be saved, so they are cold on
  restart
read_repair_chance – Cassandra's version of an ethical dilemma
• Read Repair generates additional reads across the cluster
  for each user read
• Read Repair Chance controls the probability of Read Repair
  occurring.
• If data is write-once or write-rarely, Read Repair may be
  unnecessary
   – data whose read ratio is much larger than its write ratio
   – data that does not need strict consistency
• In 1.0, hinted handoff no longer needs to wait on the failure
  detector. The Read Repair Chance default has been lowered
  from 100% to 10%.
   – CASSANDRA-2045, thanks to ntelford and co!
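
To get a feel for the cost side of this trade-off, here is a minimal sketch under a simplified model that is an assumption of this example rather than something stated in the deck: with probability read_repair_chance a read is sent to every replica, and otherwise only to the replicas the consistency level requires. The function and parameter names are hypothetical.

```python
def expected_replica_reads(replication_factor, replicas_for_consistency, read_repair_chance):
    """Expected number of replicas contacted per user read (simplified model).

    Assumption: with probability read_repair_chance the read goes to all
    replicas; otherwise it goes only to the replicas needed for the
    consistency level.
    """
    if not 0.0 <= read_repair_chance <= 1.0:
        raise ValueError("read_repair_chance must be between 0 and 1")
    if not 1 <= replicas_for_consistency <= replication_factor:
        raise ValueError("replicas_for_consistency must be between 1 and replication_factor")
    extra_replicas = replication_factor - replicas_for_consistency
    return replicas_for_consistency + read_repair_chance * extra_replicas


# Example: replication factor 3, reads that need a single replica.
for chance in (1.0, 0.1, 0.0):
    reads = expected_replica_reads(3, 1, chance)
    print(f"read_repair_chance={chance}: {reads:.1f} replica reads per user read")
```

Under these assumptions, dropping the chance from 100% to 10% takes a single-replica read from three replica reads to about 1.2 on average.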
Analysis for RRC 'test subjects'
Candidate: Many reads, few writes
Inside story: This data used to take 2 days. A few ms... Come on man!

Candidate ?: Many writes
Inside story: This is used for frequency capping; a higher % is justified
Experiment: Test the limits of NoSQL science with YCSB
YCSB is a distributed load generator that comes in handy!
• Before our upgrade from 0.6.X->0.7.X
  – All the benchmarks were better
  – But good to kick the tires
• Prototyping new Column Family
  – Time to write 500 million records
  – How many reads/second on 50GB of data
Create a mixed workload
java -cp $CP com.yahoo.ycsb.Client \
   -db com.yahoo.ycsb.db.CassandraClient7 \
   -P workloads/workloadb -t \
   -threads 10 \
   -p recordcount=75000000 \
   -p operationcount=1000000 \
   -p readproportion=0.33 \
   -p updateproportion=0.33 \
   -p scanproportion=0 \
   -p insertproportion=0.33
Round 1 Results

RunTime: 410 Seconds
Throughput: 2437 Operations/Second

Shared the results on the #cassandra IRC channel.
Suggestion! Try: -threads 30
Trying it again…
Original Results:
  -threads 10
  RunTime: 410 Seconds
  Throughput: 2437 Operations/Second

New Results:
  -threads 30
  RunTime: 196 Seconds
  Throughput: 5088 Operations/Second
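
As a sanity check on these numbers, the sketch below recomputes throughput from the operationcount of 1,000,000 in the command and the two run times, then compares the runs. The small gap between the derived and reported figures is left unexplained here; the assumption is simply that YCSB times the run slightly differently than a plain operations / runtime division.

```python
OPERATION_COUNT = 1_000_000  # operationcount from the YCSB command

# Run time and reported throughput from the slides, keyed by thread count.
runs = {
    10: {"runtime_s": 410, "reported_ops_s": 2437},
    30: {"runtime_s": 196, "reported_ops_s": 5088},
}


def derived_throughput(operations, runtime_s):
    """Operations per second implied by the operation count and the run time."""
    return operations / runtime_s


for threads, run in runs.items():
    derived = derived_throughput(OPERATION_COUNT, run["runtime_s"])
    print(f"{threads} threads: derived {derived:.0f} ops/s, reported {run['reported_ops_s']} ops/s")

speedup = runs[30]["reported_ops_s"] / runs[10]["reported_ops_s"]
print(f"3x the threads gave {speedup:.2f}x the throughput")
```

Tripling the client threads roughly doubled throughput, which suggests the first run was limited by the load generator rather than by the cluster.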
Cassandra writes fast! (duh)
• Read path
  – Row, Key, and VFS caches
  – With enough data and read ops, disks become the bottleneck
• Write path
  – structured log writes go to disk sequentially and are fast
  – compaction merges sstables in the background
• Many threads maximize write capability
• Many threads also stop a read that is blocking on IO
  from limiting write potential
Night falls and Dr. Realtime transforms...
/etc/cron.d/mr_batch_dr_realtime
# turn into Mr. Batch at night
0 0 * * * root nodetool -h `hostname` setcompactionthroughput 999
# turn back into Dr. Realtime for the day
0 6 * * * root nodetool -h `hostname` setcompactionthroughput 16

Setting throughput ensures
         •    During the day, most IOPS are free to serve traffic
         •    At night, compactions can rip through the data
Mr. Batch ravages data, creating tombstones
•   If a user clears cookies, they vanish forever
•   In actuality, they return as a new user
•   Data has very high turnover
•   We need to enforce a retention policy on data
•   TTL columns do not meet our requirements :(
•   Cleanup daemon is a throttled range scanner
•   Cleanup daemon also produces histograms
    every cycle
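
The deck does not show the cleanup daemon itself, so the following is only a sketch of what a throttled range scanner with a per-cycle histogram could look like. Everything in it is hypothetical: the function name, the parameters, the in-memory dict standing in for a real range scan over Cassandra, and the choice to bucket the histogram by row age in days.

```python
import time
from collections import Counter

SECONDS_PER_DAY = 86_400


def cleanup_cycle(rows, now, retention_s, page_size=100, max_rows_per_s=1000.0, sleep=time.sleep):
    """Scan rows page by page, drop expired ones, return (deleted_keys, age_histogram).

    rows maps a row key to its last-seen timestamp in seconds. It is a
    hypothetical in-memory stand-in for a range scan over a column family.
    """
    keys = sorted(rows)
    deleted = []
    histogram = Counter()
    for start in range(0, len(keys), page_size):
        page = keys[start:start + page_size]
        for key in page:
            age_s = now - rows[key]
            histogram[int(age_s // SECONDS_PER_DAY)] += 1  # bucket by age in whole days
            if age_s > retention_s:
                deleted.append(key)
        # Throttle: pause after each page so the scan stays under max_rows_per_s.
        sleep(len(page) / max_rows_per_s)
    for key in deleted:
        del rows[key]  # in Cassandra, each delete would write a tombstone
    return deleted, histogram


# Tiny example: three users last seen 10, 5, and 1 days ago; retention is 7 days.
users = {"user-a": 0, "user-b": 5 * SECONDS_PER_DAY, "user-c": 9 * SECONDS_PER_DAY}
removed, ages = cleanup_cycle(users, now=10 * SECONDS_PER_DAY, retention_s=7 * SECONDS_PER_DAY)
print(removed, dict(ages))
```

Passing the sleep function as a parameter keeps the throttle easy to test; a real daemon would page through token ranges instead of a sorted key list.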
Mr. Batch 'kills' rows while you sleep
A note about different workloads

• Structured log format of C* has deep implications
• Many factors affect performance and disk size:
     • Write once data
     • Wide rows (many columns)
     • Wide rows over time (fragmented)
     • Application read write profile
     • Deletion/update percentage
• LevelDB-inspired compaction in 1.0 has a different profile than
  the current tiered compaction
Tombstones have costs

• Physically live on disk
• Bloat data, index, and bloom filters
• Tombstones live for a grace period and then are
  eligible to be removed
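
The grace-period rule can be sketched as a simple time comparison. The example assumes Cassandra's default gc_grace_seconds of 864,000 (10 days) and ignores the other conditions compaction checks before it actually drops a tombstone, such as older data for the same row living in sstables outside the compaction. The function name is hypothetical.

```python
GC_GRACE_SECONDS = 864_000  # 10 days, the default gc_grace_seconds


def tombstone_is_purgeable(deleted_at_s, now_s, gc_grace_s=GC_GRACE_SECONDS):
    """True once the tombstone has outlived the grace period.

    Simplification: only the age check is modelled here.
    """
    return now_s - deleted_at_s > gc_grace_s


day = 86_400
print(tombstone_is_purgeable(deleted_at_s=0, now_s=3 * day))   # still inside the grace period
print(tombstone_is_purgeable(deleted_at_s=0, now_s=11 * day))  # eligible at the next compaction
```

Until that check passes, the tombstone keeps taking up space in the data file, index, and bloom filter, which is the cost the slide describes.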
Caching after (major) compaction
• In our case (lots of churn), major compaction shrinks
  data significantly
• Rows fragmented over many sstables are joined
• Tombstones and related data columns are removed
• All files should be smaller
• Smaller files mean better VFS caching
Simple compaction scheduler
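# spool.sh: run major compactions on one host after another.
# $1 is a space-separated list of host suffixes, e.g. "01 02 03 04".
HOSTS=$1
for i in $HOSTS ; do
  nodetool -h cassandra${i} -p 8585 compact KS1
  nodetool -h cassandra${i} -p 8585 compact KS2
  nodetool -h cassandra${i} -p 8585 compact KS3
done

# crontab entries: both host groups start at 23:30 and run in parallel
30 23 * * * /root/compacto/spool.sh "01 02 03 04"
30 23 * * * /root/compacto/spool.sh "05 06 07 08"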
Questions?
Contenu connexe

Tendances

C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag JambhekarC* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag JambhekarDataStax Academy
 
Load testing Cassandra applications
Load testing Cassandra applicationsLoad testing Cassandra applications
Load testing Cassandra applicationsBen Slater
 
Micro-batching: High-performance writes
Micro-batching: High-performance writesMicro-batching: High-performance writes
Micro-batching: High-performance writesInstaclustr
 
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016DataStax
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionDataStax Academy
 
An Overview of Apache Cassandra
An Overview of Apache CassandraAn Overview of Apache Cassandra
An Overview of Apache CassandraDataStax
 
Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...
Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...
Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...DataStax
 
Cassandra CLuster Management by Japan Cassandra Community
Cassandra CLuster Management by Japan Cassandra CommunityCassandra CLuster Management by Japan Cassandra Community
Cassandra CLuster Management by Japan Cassandra CommunityHiromitsu Komatsu
 
Cassandra TK 2014 - Large Nodes
Cassandra TK 2014 - Large NodesCassandra TK 2014 - Large Nodes
Cassandra TK 2014 - Large Nodesaaronmorton
 
Intro to cassandra
Intro to cassandraIntro to cassandra
Intro to cassandraAaron Ploetz
 
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016DataStax
 
Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach...
Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach...Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach...
Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach...Instaclustr
 
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayCassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayDataStax Academy
 
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...What We Learned About Cassandra While Building go90 (Christopher Webster & Th...
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...DataStax
 
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...DataStax
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...DataStax
 
Everyday I’m scaling... Cassandra
Everyday I’m scaling... CassandraEveryday I’m scaling... Cassandra
Everyday I’m scaling... CassandraInstaclustr
 
Instaclustr webinar 2017 feb 08 japan
Instaclustr webinar 2017 feb 08   japanInstaclustr webinar 2017 feb 08   japan
Instaclustr webinar 2017 feb 08 japanHiromitsu Komatsu
 

Tendances (19)

C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag JambhekarC* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar
 
Load testing Cassandra applications
Load testing Cassandra applicationsLoad testing Cassandra applications
Load testing Cassandra applications
 
Micro-batching: High-performance writes
Micro-batching: High-performance writesMicro-batching: High-performance writes
Micro-batching: High-performance writes
 
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
 
An Overview of Apache Cassandra
An Overview of Apache CassandraAn Overview of Apache Cassandra
An Overview of Apache Cassandra
 
Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...
Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...
Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny...
 
Cassandra CLuster Management by Japan Cassandra Community
Cassandra CLuster Management by Japan Cassandra CommunityCassandra CLuster Management by Japan Cassandra Community
Cassandra CLuster Management by Japan Cassandra Community
 
Cassandra TK 2014 - Large Nodes
Cassandra TK 2014 - Large NodesCassandra TK 2014 - Large Nodes
Cassandra TK 2014 - Large Nodes
 
Intro to cassandra
Intro to cassandraIntro to cassandra
Intro to cassandra
 
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016
 
Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach...
Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach...Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach...
Instaclustr Webinar 50,000 Transactions Per Second with Apache Spark on Apach...
 
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayCassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
 
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...What We Learned About Cassandra While Building go90 (Christopher Webster & Th...
What We Learned About Cassandra While Building go90 (Christopher Webster & Th...
 
Apache Cassandra at Macys
Apache Cassandra at MacysApache Cassandra at Macys
Apache Cassandra at Macys
 
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
Apache Cassandra Multi-Datacenter Essentials (Julien Anguenot, iLand Internet...
 
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ...
 
Everyday I’m scaling... Cassandra
Everyday I’m scaling... CassandraEveryday I’m scaling... Cassandra
Everyday I’m scaling... Cassandra
 
Instaclustr webinar 2017 feb 08 japan
Instaclustr webinar 2017 feb 08   japanInstaclustr webinar 2017 feb 08   japan
Instaclustr webinar 2017 feb 08 japan
 

En vedette

The Film Industry
The Film IndustryThe Film Industry
The Film IndustryNShuttle
 
Nationale EuroCloud Monitor 2015 "Tussen Trotski en Troelstra"
Nationale EuroCloud Monitor 2015 "Tussen Trotski en Troelstra"Nationale EuroCloud Monitor 2015 "Tussen Trotski en Troelstra"
Nationale EuroCloud Monitor 2015 "Tussen Trotski en Troelstra"Peter Vermeulen
 
Report on OAS Round Table on Indigenous Trade and Development: Case Study of...
Report on OAS Round Table on Indigenous Trade and Development:  Case Study of...Report on OAS Round Table on Indigenous Trade and Development:  Case Study of...
Report on OAS Round Table on Indigenous Trade and Development: Case Study of...Wayne Dunn
 
CONCESSÕES E PPPs NO GOVERNO TEMER:ARTIGO MAURICIO PORTUGAL RIBEIRO
CONCESSÕES  E PPPs  NO GOVERNO TEMER:ARTIGO  MAURICIO  PORTUGAL RIBEIROCONCESSÕES  E PPPs  NO GOVERNO TEMER:ARTIGO  MAURICIO  PORTUGAL RIBEIRO
CONCESSÕES E PPPs NO GOVERNO TEMER:ARTIGO MAURICIO PORTUGAL RIBEIROPLANORS
 
MBA724 s6 w1 experimental design
MBA724 s6 w1 experimental designMBA724 s6 w1 experimental design
MBA724 s6 w1 experimental designRachel Chung
 
Convince your CEO to go digital
Convince your CEO to go digitalConvince your CEO to go digital
Convince your CEO to go digitalCraig Skipsey
 
Vayana dinam quiz shajal kakkodi master vol 1
Vayana dinam quiz shajal kakkodi master   vol 1Vayana dinam quiz shajal kakkodi master   vol 1
Vayana dinam quiz shajal kakkodi master vol 1Subhash Soman
 
Definition of Matter Lab + Phase Change- Day 2
Definition of Matter Lab + Phase Change- Day 2Definition of Matter Lab + Phase Change- Day 2
Definition of Matter Lab + Phase Change- Day 2jmori1
 
Myppt 100624015031-phpapp02
Myppt 100624015031-phpapp02Myppt 100624015031-phpapp02
Myppt 100624015031-phpapp02Bhagabat Barik
 

En vedette (20)

Leisure time
Leisure timeLeisure time
Leisure time
 
The Film Industry
The Film IndustryThe Film Industry
The Film Industry
 
อติมา อุ่นจิตร
อติมา  อุ่นจิตรอติมา  อุ่นจิตร
อติมา อุ่นจิตร
 
Comicus-TheGreatest-2016
Comicus-TheGreatest-2016Comicus-TheGreatest-2016
Comicus-TheGreatest-2016
 
C 2
C 2C 2
C 2
 
Nationale EuroCloud Monitor 2015 "Tussen Trotski en Troelstra"
Nationale EuroCloud Monitor 2015 "Tussen Trotski en Troelstra"Nationale EuroCloud Monitor 2015 "Tussen Trotski en Troelstra"
Nationale EuroCloud Monitor 2015 "Tussen Trotski en Troelstra"
 
Report on OAS Round Table on Indigenous Trade and Development: Case Study of...
Report on OAS Round Table on Indigenous Trade and Development:  Case Study of...Report on OAS Round Table on Indigenous Trade and Development:  Case Study of...
Report on OAS Round Table on Indigenous Trade and Development: Case Study of...
 
Los videojuegos
Los videojuegosLos videojuegos
Los videojuegos
 
CONCESSÕES E PPPs NO GOVERNO TEMER:ARTIGO MAURICIO PORTUGAL RIBEIRO
CONCESSÕES  E PPPs  NO GOVERNO TEMER:ARTIGO  MAURICIO  PORTUGAL RIBEIROCONCESSÕES  E PPPs  NO GOVERNO TEMER:ARTIGO  MAURICIO  PORTUGAL RIBEIRO
CONCESSÕES E PPPs NO GOVERNO TEMER:ARTIGO MAURICIO PORTUGAL RIBEIRO
 
บทที่ 11
บทที่ 11บทที่ 11
บทที่ 11
 
Pt 2
Pt 2Pt 2
Pt 2
 
MBA724 s6 w1 experimental design
MBA724 s6 w1 experimental designMBA724 s6 w1 experimental design
MBA724 s6 w1 experimental design
 
Convince your CEO to go digital
Convince your CEO to go digitalConvince your CEO to go digital
Convince your CEO to go digital
 
Global warming
Global warmingGlobal warming
Global warming
 
Vayana dinam quiz shajal kakkodi master vol 1
Vayana dinam quiz shajal kakkodi master   vol 1Vayana dinam quiz shajal kakkodi master   vol 1
Vayana dinam quiz shajal kakkodi master vol 1
 
Manal p.
Manal p.Manal p.
Manal p.
 
Jdbc 3
Jdbc 3Jdbc 3
Jdbc 3
 
Definition of Matter Lab + Phase Change- Day 2
Definition of Matter Lab + Phase Change- Day 2Definition of Matter Lab + Phase Change- Day 2
Definition of Matter Lab + Phase Change- Day 2
 
Myppt 100624015031-phpapp02
Myppt 100624015031-phpapp02Myppt 100624015031-phpapp02
Myppt 100624015031-phpapp02
 
W.cholamjiak
W.cholamjiakW.cholamjiak
W.cholamjiak
 

Similaire à M6d cassandrapresentation

Real World Cassandra
Real World CassandraReal World Cassandra
Real World CassandraGiltTech
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchJoe Alex
 
Everyday I'm Scaling... Cassandra (Ben Bromhead, Instaclustr) | C* Summit 2016
Everyday I'm Scaling... Cassandra (Ben Bromhead, Instaclustr) | C* Summit 2016Everyday I'm Scaling... Cassandra (Ben Bromhead, Instaclustr) | C* Summit 2016
Everyday I'm Scaling... Cassandra (Ben Bromhead, Instaclustr) | C* Summit 2016DataStax
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924Amazon Web Services
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Anubhav Kale
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware ProvisioningMongoDB
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics PlatformSantanu Dey
 
Storage Systems For Scalable systems
Storage Systems For Scalable systemsStorage Systems For Scalable systems
Storage Systems For Scalable systemselliando dias
 
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Amazon Web Services
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseDataStax
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Work with hundred of hot terabytes in JVMs
Work with hundred of hot terabytes in JVMsWork with hundred of hot terabytes in JVMs
Work with hundred of hot terabytes in JVMsMalin Weiss
 
C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...
C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...
C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...DataStax Academy
 
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-FinalSizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-FinalVigyan Jain
 
Leveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data WarehouseLeveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data WarehouseAmazon Web Services
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...DataStax
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at PollfishPollfish
 
Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2Marco Tusa
 
(DAT202) Managed Database Options on AWS
(DAT202) Managed Database Options on AWS(DAT202) Managed Database Options on AWS
(DAT202) Managed Database Options on AWSAmazon Web Services
 

Similaire à M6d cassandrapresentation (20)

Real World Cassandra
Real World CassandraReal World Cassandra
Real World Cassandra
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using Elasticsearch
 
Everyday I'm Scaling... Cassandra (Ben Bromhead, Instaclustr) | C* Summit 2016
Everyday I'm Scaling... Cassandra (Ben Bromhead, Instaclustr) | C* Summit 2016Everyday I'm Scaling... Cassandra (Ben Bromhead, Instaclustr) | C* Summit 2016
Everyday I'm Scaling... Cassandra (Ben Bromhead, Instaclustr) | C* Summit 2016
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics Platform
 
Storage Systems For Scalable systems
Storage Systems For Scalable systemsStorage Systems For Scalable systems
Storage Systems For Scalable systems
 
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
 
Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax Enterprise
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Work with hundred of hot terabytes in JVMs
Work with hundred of hot terabytes in JVMsWork with hundred of hot terabytes in JVMs
Work with hundred of hot terabytes in JVMs
 
C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...
C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...
C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...
 
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-FinalSizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
 
Leveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data WarehouseLeveraging Amazon Redshift for Your Data Warehouse
Leveraging Amazon Redshift for Your Data Warehouse
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 
Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2
 
(DAT202) Managed Database Options on AWS
(DAT202) Managed Database Options on AWS(DAT202) Managed Database Options on AWS
(DAT202) Managed Database Options on AWS
 

Plus de Edward Capriolo

Nibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL storeNibiru: Building your own NoSQL store
Nibiru: Building your own NoSQL storeEdward Capriolo
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchEdward Capriolo
 
Intravert Server side processing for Cassandra
Intravert Server side processing for CassandraIntravert Server side processing for Cassandra
Intravert Server side processing for CassandraEdward Capriolo
 
Cassandra NoSQL Lan party
Cassandra NoSQL Lan partyCassandra NoSQL Lan party
Cassandra NoSQL Lan partyEdward Capriolo
 
Breaking first-normal form with Hive
Breaking first-normal form with HiveBreaking first-normal form with Hive
Breaking first-normal form with HiveEdward Capriolo
 
Hadoop Monitoring best Practices
Hadoop Monitoring best PracticesHadoop Monitoring best Practices
Hadoop Monitoring best PracticesEdward Capriolo
 
Whirlwind tour of Hadoop and HIve
Whirlwind tour of Hadoop and HIveWhirlwind tour of Hadoop and HIve
Whirlwind tour of Hadoop and HIveEdward Capriolo
 
Counters for real-time statistics
Counters for real-time statisticsCounters for real-time statistics
Counters for real-time statisticsEdward Capriolo
 


M6d cassandrapresentation

  • 1. Cassandra in Online Advertising: Real Time Bidding. The prospect engine for brands.
  • 2. Who are we? Costa Sevdinoglou & Edward Capriolo
  • 4. A High Level look at RTB 1. Browsers visit Publishers and create impressions. 2. Publishers sell impressions via Exchanges. 3. Exchanges serve as auction houses for the impressions. 4. On behalf of the marketer, m6d bids on the impressions via the auction house. If m6d wins, we display our ad to the browser.
  • 5. Performance and Data • Billions and billions of bid requests a day • A single request can result in multiple Cassandra operations! • One cluster is just under 10TB and growing • Low latency requirement: typically below 120 ms • Limited data available to m6d via the exchange
  • 6. Segment Data Segments are how we assign product or service affinity to a group of users. Users we consider to be like-minded with respect to a given brand will be placed in the same segment. Segment Data is just one component of our overarching data model. Segments help to reduce the number of calculations we do in real time.
  • 7. Old Approach for Segment Data [Diagram: Application Nodes (Tomcat + MySQL), MySQL Data Push, Event Logs, Aggregation, Hadoop] Limitations • Periodically updated. • Only a subsection of the data. • Cluster performance is affected during a data push.
  • 8. Cassandra Approach for Segment Data [Diagram: Application Nodes (Tomcat + Less MySQL Usage), Cassandra] Better! • Updating in real time now possible • Distributed, not duplicated • Less complexity to manage • Storing more information • We can now bid on users sooner!
  • 10. During waking hours: Dr. Realtime • User traffic is at peak • Applications need low latency operations • High volume of read and write operations • Desire high cache hit rate to limit disk IO • Dr. Realtime conducts 'experiments' on optimization
  • 11. Experiment: Active Set, VFS, cache size tuning • Cluster optimization is a topic that must be revisited periodically • User base and requests are perpetually growing • Amount of physical data stored grows • New features typically result in new data and more requests • How to tune your environment is application and hardware dependent
  • 12. Physical data directory • sstable holds data • Index holds offsets to avoid disk seeks • Bloom filter probabilistic lookup system – (also a stat table)
  • 13. When RAM > Data Size • If you can afford to keep your data set in RAM: • It is fast from VFS cache • That's it. You're optimized. • However, you do not usually need this much RAM
  • 14. When RAM < Data Size • The OS will cache the most active portions of disk • The write/compact model causes the cache to churn • User requests cause the cache to churn
  • 15. Understanding Active set with a hypothetical example Webmail service (Coldmail): • I have had an account for 10 years, and I never log in more than twice a month • I have 1,000,000 items in my inbox • Not in the active set Social networking (chirper): • I am logged in every day • I commonly read updates from my friends • In the active set
  • 16. $60,000 Question How do you determine what the active set of your application and user base is?
  • 18. Turn on a cache • JMX allows you to tune only a single node for side by side comparisons • Set the size very large for key cache (be more careful with row cache)
  • 19. Analysis • 8:30: hit rate 91% at 1.2 million entries • 10:30: hit rate ~93% at 1.7 million entries • Past a 1.2 million entry cache, memory might be better spent elsewhere
  • 20. Active set conclusions • Determine the sweet spot for hit rate and cache size • Do not try to cache the long tail of requests • When all other things are equal, dedicate more cache to the most-read column family • Use row cache only if rows are a predictable size • Large row caches cannot be saved, so they are cold on restart
  • 21. read_repair_chance – Cassandra's version of an ethical dilemma • Read Repair generates additional reads across the cluster for each user read • Read Repair Chance controls the probability of Read Repair occurring. • If data is write-once or write-rarely, Read Repair may be unnecessary – data whose read ratio is much larger than its write ratio – data that does not need strict consistency • In 1.0, hinted handoff no longer needs to wait on the failure detector. The Read Repair Chance default has been lowered from 100% to 10%. – Cassandra-2045. Thanks, ntelford and co!
  • 22. Analysis for RRC 'test subjects' Candidate: Many reads few writes Inside story: This data used to take 2 days. A few ms... Come on man! Candidate ?: Many writes Inside story: This is used for frequency capping, higher % justified
  • 23. Experiment: Test the limits of NoSQL science with YCSB YCSB is a distributed load generator that comes in handy! • Before our upgrade from 0.6.X->0.7.X – All the benchmarks were better – But good to kick the tires • Prototyping new Column Family – Time to write 500 million records – How many reads/second on 50GB of data
  • 24. Create a mixed workload
      java -cp $CP com.yahoo.ycsb.Client -db com.yahoo.ycsb.db.CassandraClient7 \
        -P workloads/workloadb -t -threads 10 \
        -p recordcount=75000000 -p operationcount=1000000 \
        -p readproportion=0.33 -p updateproportion=0.33 \
        -p scanproportion=0 -p insertproportion=0.33
  • 25. Round 1 Results RunTime: 410 Seconds Throughput: 2437 Operations/Second Shared the results on #cassandra IRC. Suggestion! Try: -threads 30
  • 26. Trying it again… Original Results: -threads 10 RunTime: 410 Seconds Throughput: 2437 Operations/Second New Results: -threads 30 RunTime: 196 Seconds Throughput: 5088 Operations/Second
  • 27. Cassandra writes fast! (duh) • Read path – Row, Key, and VFS caches – With enough data and read ops, disks become the bottleneck • Write path – structured log writes are linear to disk-wide and fast – compaction merges sstables in the background • Many threads maximize write capability • Many threads also stop a read blocking on IO from limiting write potential
  • 28. Night falls and Dr. Realtime transforms...
      /etc/cron.d/mr_batch_dr_realtime
      # turn into Mr. Batch at night
      0 0 * * * root nodetool -h `hostname` setcompactionthroughput 999
      # turn back into Dr. Realtime for the day
      0 6 * * * root nodetool -h `hostname` setcompactionthroughput 16
    Setting throughput ensures • During the day most iops are free to serve traffic • At night we can rip through compactions
  • 29. Mr. Batch ravages data, creating tombstones • If a user clears cookies, they vanish forever • In actuality they return as a new user • Data has very high turnover • We need to enforce a retention policy on data • TTL columns do not meet our requirements :( • Cleanup daemon is a throttled range scanner • Cleanup daemon also produces histograms every cycle
  • 30. Mr. Batch 'kills' rows while you sleep
  • 31. A note about different workloads • Structured log format of C* has deep implications • Many factors affect performance and disk size: • Write once data • Wide rows (many columns) • Wide rows over time (fragmented) • Application read write profile • Deletion/update percentage • LevelDB-inspired compaction in 1.0 has a different profile than the current tiered compaction
  • 32. Tombstones have costs • Physically live on disk • Bloat data, index, and bloom filters • Tombstones live for a grace period and are then eligible to be removed
  • 33. Caching after (major) compaction • In our case (lots of churn), major compaction shrinks data significantly • Rows fragmented over many sstables are joined • Tombstones and related data columns are removed • All files should be smaller • Smaller files mean better VFS caching
  • 34. Simple compaction scheduler
      # /root/compacto/spool.sh: compact each keyspace on every host passed in $1
      HOSTS=$1
      for i in $HOSTS ; do
        nodetool -h cassandra${i} -p 8585 compact KS1
        nodetool -h cassandra${i} -p 8585 compact KS2
        nodetool -h cassandra${i} -p 8585 compact KS3
      done
      # crontab entries: one batch of hosts per line
      30 23 * * * /root/compacto/spool.sh "01 02 03 04"
      30 23 * * * /root/compacto/spool.sh "05 06 07 08"
  • 35. Questions ? ? ?
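
The numbers on slides 19 and 26 can be checked with a little arithmetic. The sketch below is not from the deck: the function names are hypothetical, it assumes the "1.2 mil" and "1.7 mil" figures on slide 19 are key cache entry counts, and it treats the "~93%" hit rate as exactly 93%. It uses the operation count of 1,000,000 from the YCSB command on slide 24.

```python
def marginal_hit_gain(small, large):
    """Hit-rate percentage points gained per 100,000 extra cache entries.

    Each argument is an (entries, hit_rate_percent) pair.
    """
    (entries_a, hit_a), (entries_b, hit_b) = small, large
    extra_blocks = (entries_b - entries_a) / 100_000
    return (hit_b - hit_a) / extra_blocks


def speedup(operations, slow_seconds, fast_seconds):
    """Ratio of throughputs for the same operation count at two run times."""
    return (operations / fast_seconds) / (operations / slow_seconds)


# Slide 19: 91% at 1.2 million entries, ~93% at 1.7 million entries.
gain = marginal_hit_gain((1_200_000, 91.0), (1_700_000, 93.0))
print(f"{gain:.2f} hit-rate points per 100k extra entries")

# Slide 26: 1,000,000 operations in 410 s (10 threads) vs 196 s (30 threads).
ratio = speedup(1_000_000, 410, 196)
print(f"{ratio:.2f}x throughput with 30 threads")
```

Under these assumptions, the extra 500,000 entries buy about 0.4 hit-rate points per 100,000 entries, which is the diminishing return behind the "better spent elsewhere" remark, and tripling the client threads roughly doubles throughput. The throughputs computed this way (about 2439 and 5102 operations/second) differ slightly from the reported 2437 and 5088, presumably because the slides round the run times to whole seconds.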