SlideShare une entreprise Scribd logo
1  sur  52
Télécharger pour lire hors ligne
CASSANDRA & SOLID
STATE DRIVES
Rick Branson, DataStax
FACT

CASSANDRA’S STORAGE
ENGINE WAS OPTIMIZED
 FOR SPINNING DISKS
LSM-TREES
WRITE PATH
insert({ cf1: { row1: { col3: foo } } })




      Client                         Cassandra




     On-Disk Node Commit Log

{ cf1: { row1: { col1: abc } } }
                                                   In-Memory Memtable for “cf1”
{ cf1: { row1: { col2: def } } }

{ cf1: { row1: { col1: <del> } } }
                                                 row1   col1: [del]   col2: “def”   col3: “foo”

{ cf1: { row2: { col1: xyz } } }
                                                 row2   col1: “xyz”
{ cf1: { row1: { col3: foo } } }




                                                              COMMIT
In-Memory Memtable for “cf1”

                    row1         col1: [del]       col2: “def”   col3: “foo”


                    row2         col1: “xyz”




SSTable   SSTable      SSTable             SSTable




 1         2               3                   4

                                                                               FLUSH
SSTable   SSTable             SSTable   SSTable




        1         2                    3         4


                           SSTable




SSTables are merged to maintain read performance


                                           COMPACT
X X X X
  SSTable   SSTable         SSTable   SSTable




SSTable
                      New SSTable is streamed
                      to disk and old SSTables
                             are erased
TAKEAWAYS
• All disk writes are sequential, append-
  only operations
• On-disk tables (SSTables) are written in
  sorted order, so compaction is linear
  complexity O(N)
• SSTables are completely immutable
TAKEAWAYS
• All disk writes are sequential, append-
  only operations
• On-disk tables (SSTables) are written in
  sorted order, so compaction is linear
                    IMPORTANT
  complexity O(N)
• SSTables are completely immutable
COMPARED
• Most popular data storage engines
  rewrite modified data in-place: MySQL
  (InnoDB), PostgreSQL, Oracle,
  MongoDB, Membase, BerkeleyDB, etc
• Most perform similar buffering of
  writes before flushing to disk
• ... but flushes are RANDOM writes
SPINNING DISKS
• Dirt cheap: $0.08/GB
• Seek time limited by time it takes for drive
  to rotate: IOPS = RPM/60
• 7,200 RPM = ~120 IOPS
• 15,000 RPM has been the max for decades
• Sequential operations are best: 125MB/
  sec for modern SATA drives
THAT WAS THE WORLD
IN WHICH CASSANDRA
     WAS BUILT
2012: MLC NAND FLASH*
            • Affordable: ~$1.75/GB street
            • Massive IOPS: 39,500/sec read, 23,000/
                  sec write
            • Latency of less than 100µs
            • Good sequential throughput: 270MB/
                  sec read, 205MB/sec write
            • Way cheaper per IOPS: $0.02 vs $1.25
* based on specifications provided by Intel for 300GB Intel 320 drive
WITH RANDOM ACCESS
STORAGE, ARE CASSANDRA’S
  LSM-TREES OBSOLETE?
SOLID STATE HAS
SOME MAJOR BUTS...
... BUT
• Cannot overwrite directly: must erase
  first, then write
• Can write in small increments (4KB),
  but only erase in ~512KB blocks
• Latency: write is ~100µs, erase is ~2ms
• Limited durability: ~5,000 cycles (MLC)
  for each erase block
WEAR LEVELING is used
to reduce the number of
 total erase operations
WEAR LEVELING
WEAR LEVELING
Erase Block
WEAR LEVELING
WEAR LEVELING
WEAR LEVELING

 Disk Page
WEAR LEVELING

  Write 1
WEAR LEVELING

  Write 1
  Write 2
WEAR LEVELING

  Write 1
  Write 2
  Write 3
Remember: the whole block must be erased


         Write 1
         Write 2
         Write 3

                   How is data from only
                     Write 2 modified?
Mark Garbage
Empty Block




Mark Garbage    Append
                Modified
                 Data
Wait... GARBAGE?
THAT MEANS...
... fragmentation,
WHICH MEANS...
Garbage Collection!
GARBAGE COLLECTION
• Compacts fragmented disk blocks
• Erase operations drag on performance
• Modern SSDs do this in the
  background... as much as possible
• If no empty blocks are available, GC
  must be done before ANY writes can
  complete
WRITE AMPLIFICATION
• When only a few kilobytes are written,
  but fragmentation causes a whole
  block to be rewritten
• The smaller & more random the writes,
  the worse this gets
• Modern “mark and sweep” GC reduces
  it, but cannot eliminate it
Torture test shows massive
             write performance drop-off
            for heavily fragmented drive

Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6
Some poorly designed drives
      COMPLETELY fall apart


Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6
Even a well-behaved drive
suffers significantly from the
         torture test

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
Post-torture, all disk blocks
were marked empty, and the
   “fast” comes back...

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
“TRIM”
• Filesystems don’t typically immediately
  erase data when files are deleted, they just
  mark them as deleted and erase later
• TRIM allows the OS to actively tell the drive
  when a region of disk is no longer used
• If an entire erase block is marked as
  unused, GC is avoided, otherwise TRIM
  just hastens the collection process
TRIM only reduces the
write amplification effect,
   it can’t eliminate it.
THEN THERE’S
 LIFETIME...
AnandTech estimates that modern MLC SSDs
only last about 1.5 years under heavy MySQL load,
   which causes around 10x write amplification
REMEMBER THIS?
TAKEAWAYS
• All disk writes are sequential, append-
  only operations
• On-disk tables (SSTables) are written in
  sorted order, so compaction is linear
  complexity O(N)
• SSTables are completely immutable
CASSANDRA
 ONLY WRITES
SEQUENTIALLY
“For a sequential write workload,
     write amplification is equal to 1,
          i.e., there is no write
              amplification.”


Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write
        Performance: Understanding, Analysis, and Performance Modeling”
THANK YOU.
     ~ @rbranson

Contenu connexe

Tendances

TechTalk v2.0 - Performance tuning Cassandra + AWS
TechTalk v2.0 - Performance tuning Cassandra + AWSTechTalk v2.0 - Performance tuning Cassandra + AWS
TechTalk v2.0 - Performance tuning Cassandra + AWSPythian
 
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...DataStax Academy
 
Cassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra
Cassandra Day SV 2014: Designing Commodity Storage in Apache CassandraCassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra
Cassandra Day SV 2014: Designing Commodity Storage in Apache CassandraDataStax Academy
 
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1DataStax Academy
 
Performance tuning - A key to successful cassandra migration
Performance tuning - A key to successful cassandra migrationPerformance tuning - A key to successful cassandra migration
Performance tuning - A key to successful cassandra migrationRamkumar Nottath
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...ScyllaDB
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Odinot Stanislas
 
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast Enough
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast EnoughScylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast Enough
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast EnoughScyllaDB
 
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance BarriersCeph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance BarriersCeph Community
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashCeph Community
 
Compaction, Compaction Everywhere
Compaction, Compaction EverywhereCompaction, Compaction Everywhere
Compaction, Compaction EverywhereDataStax Academy
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...JAXLondon2014
 
SSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonSSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonUri Cohen
 
Managing Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al TobeyManaging Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al TobeyDataStax Academy
 
Ceph on All Flash Storage -- Breaking Performance Barriers
Ceph on All Flash Storage -- Breaking Performance BarriersCeph on All Flash Storage -- Breaking Performance Barriers
Ceph on All Flash Storage -- Breaking Performance BarriersCeph Community
 

Tendances (18)

TechTalk v2.0 - Performance tuning Cassandra + AWS
TechTalk v2.0 - Performance tuning Cassandra + AWSTechTalk v2.0 - Performance tuning Cassandra + AWS
TechTalk v2.0 - Performance tuning Cassandra + AWS
 
Stabilizing Ceph
Stabilizing CephStabilizing Ceph
Stabilizing Ceph
 
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
Cassandra Day Chicago 2015: DataStax Enterprise & Apache Cassandra Hardware B...
 
Cassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra
Cassandra Day SV 2014: Designing Commodity Storage in Apache CassandraCassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra
Cassandra Day SV 2014: Designing Commodity Storage in Apache Cassandra
 
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
Cassandra Summit 2014: Lesser Known Features of Cassandra 2.1
 
Performance tuning - A key to successful cassandra migration
Performance tuning - A key to successful cassandra migrationPerformance tuning - A key to successful cassandra migration
Performance tuning - A key to successful cassandra migration
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast Enough
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast EnoughScylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast Enough
Scylla Summit 2018: In-Memory Scylla - When Fast Storage is Not Fast Enough
 
ceph-barcelona-v-1.2
ceph-barcelona-v-1.2ceph-barcelona-v-1.2
ceph-barcelona-v-1.2
 
Bluestore
BluestoreBluestore
Bluestore
 
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance BarriersCeph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on Flash
 
Compaction, Compaction Everywhere
Compaction, Compaction EverywhereCompaction, Compaction Everywhere
Compaction, Compaction Everywhere
 
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
How to randomly access data in close-to-RAM speeds but a lower cost with SSD’...
 
SSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax LondonSSDs, IMDGs and All the Rest - Jax London
SSDs, IMDGs and All the Rest - Jax London
 
Managing Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al TobeyManaging Cassandra at Scale by Al Tobey
Managing Cassandra at Scale by Al Tobey
 
Ceph on All Flash Storage -- Breaking Performance Barriers
Ceph on All Flash Storage -- Breaking Performance BarriersCeph on All Flash Storage -- Breaking Performance Barriers
Ceph on All Flash Storage -- Breaking Performance Barriers
 

Similaire à Cassandra and Solid State Drives

Cassandra and Solid State Drives
Cassandra and Solid State DrivesCassandra and Solid State Drives
Cassandra and Solid State DrivesDataStax Academy
 
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...Flink Forward
 
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016Tomas Vondra
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptxJawaharPrasad3
 
FlashSQL 소개 & TechTalk
FlashSQL 소개 & TechTalkFlashSQL 소개 & TechTalk
FlashSQL 소개 & TechTalkI Goo Lee
 
Some analysis of BlueStore and RocksDB
Some analysis of BlueStore and RocksDBSome analysis of BlueStore and RocksDB
Some analysis of BlueStore and RocksDBXiao Yan Li
 
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt AhrensOpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt AhrensMatthew Ahrens
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localyticsandrew311
 
Design Tradeoffs for SSD Performance
Design Tradeoffs for SSD PerformanceDesign Tradeoffs for SSD Performance
Design Tradeoffs for SSD Performancejimmytruong
 
Storage structure
Storage structureStorage structure
Storage structureMohd Arif
 
20111026 optimal-usage-of-ssds-under-linux-updated
20111026 optimal-usage-of-ssds-under-linux-updated20111026 optimal-usage-of-ssds-under-linux-updated
20111026 optimal-usage-of-ssds-under-linux-updatedWerner Fischer
 
SAOUG - Connect 2014 - Flex Cluster and Flex ASM
SAOUG - Connect 2014 - Flex Cluster and Flex ASMSAOUG - Connect 2014 - Flex Cluster and Flex ASM
SAOUG - Connect 2014 - Flex Cluster and Flex ASMAlex Zaballa
 
My talk from PgConf.Russia 2016
My talk from PgConf.Russia 2016My talk from PgConf.Russia 2016
My talk from PgConf.Russia 2016Alex Chistyakov
 
Exploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient WorkflowsExploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient Workflowsjasonajohnson
 

Similaire à Cassandra and Solid State Drives (20)

Cassandra and Solid State Drives
Cassandra and Solid State DrivesCassandra and Solid State Drives
Cassandra and Solid State Drives
 
MyRocks Deep Dive
MyRocks Deep DiveMyRocks Deep Dive
MyRocks Deep Dive
 
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data s...
 
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
PostgreSQL na EXT4, XFS, BTRFS a ZFS / FOSDEM PgDay 2016
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptx
 
9_Storage_Devices.pptx
9_Storage_Devices.pptx9_Storage_Devices.pptx
9_Storage_Devices.pptx
 
FlashSQL 소개 & TechTalk
FlashSQL 소개 & TechTalkFlashSQL 소개 & TechTalk
FlashSQL 소개 & TechTalk
 
Some analysis of BlueStore and RocksDB
Some analysis of BlueStore and RocksDBSome analysis of BlueStore and RocksDB
Some analysis of BlueStore and RocksDB
 
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt AhrensOpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localytics
 
Cassandra compaction
Cassandra compactionCassandra compaction
Cassandra compaction
 
Design Tradeoffs for SSD Performance
Design Tradeoffs for SSD PerformanceDesign Tradeoffs for SSD Performance
Design Tradeoffs for SSD Performance
 
Storage structure
Storage structureStorage structure
Storage structure
 
20111026 optimal-usage-of-ssds-under-linux-updated
20111026 optimal-usage-of-ssds-under-linux-updated20111026 optimal-usage-of-ssds-under-linux-updated
20111026 optimal-usage-of-ssds-under-linux-updated
 
SAOUG - Connect 2014 - Flex Cluster and Flex ASM
SAOUG - Connect 2014 - Flex Cluster and Flex ASMSAOUG - Connect 2014 - Flex Cluster and Flex ASM
SAOUG - Connect 2014 - Flex Cluster and Flex ASM
 
SSD PPT BY SAURABH
SSD PPT BY SAURABHSSD PPT BY SAURABH
SSD PPT BY SAURABH
 
My talk from PgConf.Russia 2016
My talk from PgConf.Russia 2016My talk from PgConf.Russia 2016
My talk from PgConf.Russia 2016
 
Memoryhierarchy
MemoryhierarchyMemoryhierarchy
Memoryhierarchy
 
RocksDB meetup
RocksDB meetupRocksDB meetup
RocksDB meetup
 
Exploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient WorkflowsExploiting Your File System to Build Robust & Efficient Workflows
Exploiting Your File System to Build Robust & Efficient Workflows
 

Dernier

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 

Dernier (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 

Cassandra and Solid State Drives

  • 1. CASSANDRA & SOLID STATE DRIVES Rick Branson, DataStax
  • 2. FACT CASSANDRA’S STORAGE ENGINE WAS OPTIMIZED FOR SPINNING DISKS
  • 5. insert({ cf1: { row1: { col3: foo } } }) Client Cassandra On-Disk Node Commit Log { cf1: { row1: { col1: abc } } } In-Memory Memtable for “cf1” { cf1: { row1: { col2: def } } } { cf1: { row1: { col1: <del> } } } row1 col1: [del] col2: “def” col3: “foo” { cf1: { row2: { col1: xyz } } } row2 col1: “xyz” { cf1: { row1: { col3: foo } } } COMMIT
  • 6. In-Memory Memtable for “cf1” row1 col1: [del] col2: “def” col3: “foo” row2 col1: “xyz” SSTable SSTable SSTable SSTable 1 2 3 4 FLUSH
  • 7. SSTable SSTable SSTable SSTable 1 2 3 4 SSTable SSTables are merged to maintain read performance COMPACT
  • 8. X X X X SSTable SSTable SSTable SSTable SSTable New SSTable is streamed to disk and old SSTables are erased
  • 9. TAKEAWAYS • All disk writes are sequential, append- only operations • On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N) • SSTables are completely immutable
  • 10. TAKEAWAYS • All disk writes are sequential, append- only operations • On-disk tables (SSTables) are written in sorted order, so compaction is linear IMPORTANT complexity O(N) • SSTables are completely immutable
  • 11. COMPARED • Most popular data storage engines rewrite modified data in-place: MySQL (InnoDB), PostgreSQL, Oracle, MongoDB, Membase, BerkeleyDB, etc • Most perform similar buffering of writes before flushing to disk • ... but flushes are RANDOM writes
  • 12. SPINNING DISKS • Dirt cheap: $0.08/GB • Seek time limited by time it takes for drive to rotate: IOPS = RPM/60 • 7,200 RPM = ~120 IOPS • 15,000 RPM has been the max for decades • Sequential operations are best: 125MB/ sec for modern SATA drives
  • 13. THAT WAS THE WORLD IN WHICH CASSANDRA WAS BUILT
  • 14. 2012: MLC NAND FLASH* • Affordable: ~$1.75/GB street • Massive IOPS: 39,500/sec read, 23,000/ sec write • Latency of less than 100µs • Good sequential throughput: 270MB/ sec read, 205MB/sec write • Way cheaper per IOPS: $0.02 vs $1.25 * based on specifications provided by Intel for 300GB Intel 320 drive
  • 15. WITH RANDOM ACCESS STORAGE, ARE CASSANDRA’S LSM-TREES OBSOLETE?
  • 16.
  • 17. SOLID STATE HAS SOME MAJOR BUTS...
  • 18. ... BUT • Cannot overwrite directly: must erase first, then write • Can write in small increments (4KB), but only erase in ~512KB blocks • Latency: write is ~100µs, erase is ~2ms • Limited durability: ~5,000 cycles (MLC) for each erase block
  • 19. WEAR LEVELING is used to reduce the number of total erase operations
  • 25. WEAR LEVELING Write 1
  • 26. WEAR LEVELING Write 1 Write 2
  • 27. WEAR LEVELING Write 1 Write 2 Write 3
  • 28. Remember: the whole block must be erased Write 1 Write 2 Write 3 How is data from only Write 2 modified?
  • 30. Empty Block Mark Garbage Append Modified Data
  • 35. GARBAGE COLLECTION • Compacts fragmented disk blocks • Erase operations drag on performance • Modern SSDs do this in the background... as much as possible • If no empty blocks are available, GC must be done before ANY writes can complete
  • 36. WRITE AMPLIFICATION • When only a few kilobytes are written, but fragmentation causes a whole block to be rewritten • The smaller & more random the writes, the worse this gets • Modern “mark and sweep” GC reduces it, but cannot eliminate it
  • 37. Torture test shows massive write performance drop-off for heavily fragmented drive Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6
  • 38. Some poorly designed drives COMPLETELY fall apart Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6
  • 39. Even a well-behaved drive suffers significantly from the torture test Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
  • 40. Post-torture, all disk blocks were marked empty, and the “fast” comes back... Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11
  • 41.
  • 42. “TRIM” • Filesystems don’t typically immediately erase data when files are deleted, they just mark them as deleted and erase later • TRIM allows the OS to actively tell the drive when a region of disk is no longer used • If an entire erase block is marked as unused, GC is avoided, otherwise TRIM just hastens the collection process
  • 43. TRIM only reduces the write amplification effect, it can’t eliminate it.
  • 45.
  • 46.
  • 47. AnandTech estimates that modern MLC SSDs only last about 1.5 years under heavy MySQL load, which causes around 10x write amplification
  • 49. TAKEAWAYS • All disk writes are sequential, append- only operations • On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N) • SSTables are completely immutable
  • 51. “For a sequential write workload, write amplification is equal to 1, i.e., there is no write amplification.” Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write Performance: Understanding, Analysis, and Performance Modeling”
  • 52. THANK YOU. ~ @rbranson