Phoenix Cassandra Users Meetup
January 26th, 2015
Narasimhan Sampath
Choice Hotels International
Cassandra Internals
 What is Cassandra
 SEDA
 Data Placement, Replication and Partition Aware Drivers
 Read and Write Path
 Merkle Trees, SSTables, Read Repair and Compaction
 Single and Multi-threaded Operations
 Demo
Agenda
 Cassandra is a decentralized distributed database
 No master or slave nodes
 No single point of failure
 Peer-Peer architecture
 Read / write to any available node
 Replication and data redundancy built into the architecture
 Data is eventually consistent across all cluster nodes
 Linearly (and massively) scalable
 Multiple Data Center support built in – a single cluster can span geo locations
 Adding or removing nodes / data centers is easy and does not require down time
 Data redistribution / rebalance seamless and non blocking
 Runs on commodity hardware
 Hardware failure is expected and factored into the Architecture
 Internal architecture more complex than non-distributed databases
Cassandra
 Automatic Sharding (partitioning)
 Total data to be managed by the cluster is (ideally) divided equally among the cluster nodes
 Each node is responsible for a subset of the data
 Copies of that subset are stored on other nodes for high availability and redundancy
 Data placement design determines node balancing (token assignment, adding and removing nodes)
 Data Synchronization within the decentralized cluster is complex, but implementation mostly hidden from the
users
 Availability and Partition Tolerance given precedence over Consistency (CAP – Data is eventually consistent)
 Consistency (all nodes see the same data at the same time)
 Availability (a guarantee that every request receives a response about whether it succeeded or failed)
 Partition tolerance (the system continues to operate despite a part of the system failing)
 Brewer’s CAP theorem (For further reading)
 Staged Event Driven Architecture – framework for achieving high concurrency and load
 Uses events, messages and queues to process tasks
 Decouples the request and response from the worker threads
Cassandra
 Ring – Visual representation of data managed by Cassandra
 Node – Individual machine in the ring
 Data Center – A collection of related nodes
 Cluster – Collection of (geographically separated) data centers
 Commitlog – The equivalent of a transaction log file for Durability
 Memtable – In Memory structures to store data (per column family)
 Keyspace – Container for application data (Analogous to schema)
 Table – Structure that holds data in rows and columns
 SSTable – An immutable file (for each table) on disk to which data structures in memory are
dumped periodically
Cassandra Terminology
 Gossip – Peer to Peer protocol to discover and share location and state information on
nodes
 Token – A number used to assign a range of data to a node within a data center
 Partitioner – A hashing function for deriving the token
 Snitch – Informs Cassandra about the network topology
 Replica – A copy of data stored on another node for redundancy and fault tolerance
 Replication Factor – The total number of copies of each piece of data in the cluster
Terminology
 Cassandra is linearly (horizontal) and massively scalable
 Just add or remove nodes to the cluster as load increases or decreases
 There is no down time required for this
 SEDA – Staged Event Driven Architecture helps maintain consistent throughput under load
Core Strength - Scalability
Quantifying Massive
 Avoids the pitfalls of Client Server based design
 Eliminates storage bottlenecks
 No single data repository
 Redundancy built in
 All nodes participate (whether they have the requested data or not)
 Shared nothing
 Transparently add / remove nodes as necessary without downtime
 Comes with a trade-off – eventual consistency (CAP)
 Newer Staged Event Driven Architecture
How does it Scale?
 Legacy systems typically use thread based concurrency models
 Programming traditional multi-threaded applications is hard
 Distributed multithreaded applications are even harder
 Leads to severe scalability bottlenecks
 A new thread or process is usually created for each request
 There is a maximum number of threads a system can support
 Challenges with thread execution model
 Deadlocks
 Livelocks (waste CPU cycles)
 Starvation (waiting on resources)
 Overheads – Context switching, synchronization and data movement
 Request and response typically handled by the same thread
 Sequential execution
Legacy Systems
Threads
Event Driven Architecture
 Evolution of Event Driven Architecture (EDA)
 An EDA consists of a set of loosely coupled software components and services
 An Event is something that an application can act upon
 A hotel booking event
 A check-in event
 A listener can pick up a check-in event and act on it
 In-room entertainment system displays a personalized greeting
 Partners may get notified and can send personalized offers (Spa / massage/ restaurant
discounts)
 This is much more scalable than thread based concurrency models
 SEDA is an Architectural approach
 An application is broken down into a set of logical stages
 These stages are loosely coupled and connected via queues
 Decouples event and thread scheduling from DB Engine logic
 Prevents resources from being overcommitted under high load
 Enables modularity and code reuse
SEDA Explained
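To make the stage-and-queue idea concrete, here is a minimal sketch of a SEDA-style stage in Python. The stage names, handlers and pipeline are illustrative assumptions, not Cassandra's actual (Java) stage implementation.

import queue
import threading
import time

class Stage:
    """A SEDA stage: an event queue drained by a small, bounded thread pool."""
    def __init__(self, name, handler, workers=2, next_stage=None):
        self.name = name
        self.handler = handler          # work applied to each event
        self.next_stage = next_stage    # downstream stage, if any
        self.events = queue.Queue()     # decouples producers from workers
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def submit(self, event):
        self.events.put(event)          # enqueue and return immediately

    def _run(self):
        while True:
            event = self.events.get()
            result = self.handler(event)
            if self.next_stage is not None:
                self.next_stage.submit(result)

# Illustrative two-stage pipeline: a "mutation" stage feeds a "response" stage.
respond = Stage("response", lambda r: print("ack:", r))
mutate = Stage("mutation", lambda e: e.upper(), next_stage=respond)
mutate.submit("write key=42")
time.sleep(0.5)                         # let the daemon workers drain the queues

The point to notice is that submit() returns immediately: the thread that accepted the request is never the thread that executes it, which is exactly the decoupling described above.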
Understanding Stage (SEDA)
Understanding Stage
 SEDA enables Massive Concurrency
 No thread deadlocks, livelocks or starvation to worry about (for the most part)
 Thread Scheduling and Resource Management abstracted
 Supports self tuning / resource allocation / management
 Easier to debug and monitor application performance at scale
 Distributed debugging / tracing easier
 Graceful degradation under excessive load
 Maintains throughput at the expense of latency
Why SEDA matters
Examples of Stages
Data Placement
Facebook’s DC
Why is data placement important
 Each Cassandra node has a listen and a broadcast IP address
 The Snitch maps IP addresses to racks and data centers
 Gossip uses this information to help Cassandra build a node location map
 The Snitch helps Cassandra with replica placement
 This helps Cassandra minimize cross data center latency
Role of Snitch
 Once built and configured, a cluster is ready to store data
 Each node owns a Token Range
 Can be manually assigned in YAML file
 Or Cassandra can manage token assignment - a concept called vNodes
 A Keyspace needs to be created with replication options
 CREATE KEYSPACE "Choice"
WITH REPLICATION =
{'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};
 Cassandra Schema objects are replicated globally to all nodes
 This enables each node in the cluster to act as a coordinator node
Data Placement
 Data gets replicated as defined in the Keyspace
 Within a data center, the Murmur3 hash of the partition key decides which node owns the data
 Replication Strategy determines which nodes contain replicas
 SimpleStrategy – replicas are placed on succeeding nodes around the ring
 NetworkTopologyStrategy – walks the ring clockwise and places each copy on the first node on a successive rack (see the sketch below)
 Asymmetric replica groupings are possible (DR / Analytics etc.)
Data Placement
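A rough sketch, under simplifying assumptions, of the rack-aware placement just described: walk the ring clockwise from the token's position and take the first node on each not-yet-used rack. Real NetworkTopologyStrategy also handles the case where the replication factor exceeds the number of racks, which this toy version ignores.

from bisect import bisect_right

def place_replicas(token, ring, racks, rf):
    """ring: sorted list of (token, node); racks: node -> rack name.
    Walk clockwise from the owning node, preferring unseen racks."""
    tokens = [t for t, _ in ring]
    start = bisect_right(tokens, token) % len(ring)
    replicas, seen_racks = [], set()
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if racks[node] not in seen_racks:
            replicas.append(node)
            seen_racks.add(racks[node])
        if len(replicas) == rf:
            break
    return replicas

ring = [(-100, "n1"), (0, "n2"), (100, "n3"), (200, "n4")]
racks = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r3"}
print(place_replicas(42, ring, racks, rf=3))   # ['n3', 'n4', 'n1']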
empID empName deptID deptName hiredate
22 Sam 12 Finance 1/22/1996
33 Scott 18 Human Resources 12/8/2006
44 Walter 24 Shipping 11/20/2009
55 Bianca 30 Marketing 1/1/2015
Data Placement
Partition          Sample Hash
Finance            -2245462676723220000
Human Resources     7723358927203680000
Shipping           -6723372854036780000
Marketing           1168604627387940000
Data Placement
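The sample hashes above come from Cassandra's Murmur3Partitioner. A hedged approximation in Python uses the third-party mmh3 package; the exact token values will not match Cassandra's, whose Java port of MurmurHash3 handles bytes slightly differently, but the idea (a signed 64-bit token derived from the partition key) is the same.

import mmh3  # pip install mmh3 (third-party MurmurHash3 bindings)

def sample_token(partition_key: str) -> int:
    # Take one 64-bit half of the 128-bit Murmur3 hash as a signed value,
    # mirroring the idea behind Cassandra's Murmur3Partitioner tokens.
    low64, _high64 = mmh3.hash64(partition_key.encode("utf-8"))
    return low64

for dept in ("Finance", "Human Resources", "Shipping", "Marketing"):
    print(f"{dept:16s} {sample_token(dept)}")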
Data Access
 Cassandra’s location independent Architecture means a user can connect to any node of
the cluster, which then acts as coordinator node
 Schemas get replicated globally – even to nodes that do not contain a copy of the data
 Cassandra offers tunable consistency – an extension of eventual consistency
 Clients determine how consistent the data should be
 They can choose between high availability (CL ONE) and high safety (CL ALL) among other options
 Further reading
 Request goes through stages – the thread that received the initial request will insert the
request into a queue and wait for the next user request
 Partition aware drivers help route traffic to the nearest node
 Hinted Hand-offs – store and forward write requests
Data Access
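A minimal sketch of these ideas with the DataStax Python driver (cassandra-driver): a token-aware, DC-aware load balancing policy routes each request toward a replica, and the consistency level is tuned per statement. The contact points, keyspace and table names are placeholders.

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra.query import SimpleStatement

# Token awareness lets the driver send a request straight to a replica,
# which then acts as the coordinator for that partition.
cluster = Cluster(
    contact_points=["10.0.0.1", "10.0.0.2"],           # placeholder IPs
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="dc1")),
)
session = cluster.connect("choice")                     # placeholder keyspace

# Tunable consistency: ONE favors availability, ALL favors safety.
query = SimpleStatement(
    "SELECT * FROM employees WHERE deptname = %s",      # placeholder table
    consistency_level=ConsistencyLevel.ONE)
rows = session.execute(query, ("Finance",))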
 Memtables
 Commitlog
 SSTables
 Tombstones
 Compaction
 Repair
Reads and Writes
Reads & Writes
Reads and Writes
Write Process
 Write requests written to a MemTable
 When Memtable is full, contents get queued to be flushed to disk
 Writes are also simultaneously persisted on Disk to a CommitLog file
 This helps achieve durable writes
 CommitLog entries are purged after MemTable is flushed to disk
 MemTables and SSTables are created on a per table basis
 Tunable consistency determines how many replicas' MemTables and CommitLogs the row has to be written to
 SSTables are immutable and cannot be modified once written to
 Compaction consolidates SSTables and removes tombstones
 SizeTiered Compaction
 Leveled Compaction
 Repair is a process that synchronizes copies located in different nodes
 Uses Merkle Trees to make this more efficient
Write Path
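A toy sketch of the write path above: append to a commit log first for durability, apply the write to an in-memory memtable, and flush the memtable to an immutable on-disk segment once it grows past a threshold. The file format and threshold are stand-ins, not Cassandra's.

import json
import time

class ToyTable:
    def __init__(self, name, flush_at=4):
        self.name, self.flush_at = name, flush_at
        self.memtable = {}                 # in-memory, one per table
        self.sstables = []                 # immutable flushed segments
        self.commitlog = open(f"{name}.commitlog", "a")

    def write(self, key, columns):
        entry = {"key": key, "cols": columns, "ts": time.time()}
        self.commitlog.write(json.dumps(entry) + "\n")   # durability first
        self.commitlog.flush()
        self.memtable[key] = entry                       # then the memtable
        if len(self.memtable) >= self.flush_at:
            self.flush()

    def flush(self):
        # Dump the memtable to an immutable segment, then purge the log:
        # once data is safely in an "SSTable", the commit log entries go.
        self.sstables.append(dict(self.memtable))
        self.memtable.clear()
        self.commitlog.truncate(0)

table = ToyTable("employees")
table.write("Finance", {"empName": "Sam"})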
 Is a feature that enables high write availability
 Has to be enabled / disabled in the YAML file (excerpt below)
 When a replica node is down
 A hint is stored in the coordinator node
 Hints are stored for three hours (default)
 Hinted writes do not count towards CL
 Hint replay is throttled to limit its impact on system performance
Hinted Hand-off
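For reference, the relevant cassandra.yaml settings look roughly like this (key names as of the Cassandra 2.x releases this deck targets; check your version's yaml before relying on them):

hinted_handoff_enabled: true        # turn the feature on or off
max_hint_window_in_ms: 10800000     # stop storing hints after 3 hours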
Read Path
 A row of data will likely exist in multiple locations
 Unflushed Memtable
 Un-compacted and compacted SSTables
 Tunable consistency determines how many nodes have to respond
 Cassandra does not rewrite entire row to new file on update
 No read before writes
 Updated / new columns exist in a new file
 Unmodified columns exist in old file
 The timestamped version of the row can be different in each location
 All these must be retrieved, reconstructed and processed based on timestamp
 Uses Bloom filters to make key lookups more efficient
 Row fragments may exist in multiple SSTables
 May exist in Memtable as well
 Bloom filters speed lookups
Read Path
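A sketch of the timestamp reconciliation described above: fragments of a row may live in the memtable and several SSTables, and the read path merges them column by column, newest timestamp winning. The fragment layout here is illustrative.

def reconcile(fragments):
    """fragments: one {column: (value, timestamp)} dict per memtable or
    SSTable holding a piece of the row; the newest timestamp wins."""
    row = {}
    for frag in fragments:
        for col, (val, ts) in frag.items():
            if col not in row or ts > row[col][1]:
                row[col] = (val, ts)
    return {col: val for col, (val, _) in row.items()}

old_sstable = {"empName": ("Sam", 100), "deptName": ("Finance", 100)}
new_sstable = {"deptName": ("Marketing", 200)}   # only the updated column
print(reconcile([old_sstable, new_sstable]))
# {'empName': 'Sam', 'deptName': 'Marketing'}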
 A Bloom filter is a probabilistic bit-vector data structure
 Supports two operations – Test and Add
 Cassandra uses Bloom filters to reduce disk I/O during key lookups
 Each SSTable has a Bloom filter associated with it
 A Bloom filter is used to test whether an element is a member of a set
 False positives are possible, but false negatives are not
 Means a key is “possibly in set” or “definitely not in set”
 Check out JasonDavies.com for a cool interactive demo
 http://www.jasondavies.com/bloomfilter/
 http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html
Bloom Filters
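A toy Bloom filter in Python that shows the Test/Add contract: a negative answer is definitive, a positive answer only probable. Cassandra's real filters are sized from a configured false-positive chance; the sizes here are arbitrary.

import hashlib

class BloomFilter:
    """k hash positions over an m-bit vector; arbitrary toy sizes."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def test(self, key):
        # True  -> key is "possibly in set" (could be a false positive)
        # False -> key is "definitely not in set" (never a false negative)
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("Finance")
print(bf.test("Finance"), bf.test("Shipping"))   # True False (almost surely)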
 Deletes are handled differently than in a traditional RDBMS
 Data to be deleted is marked using Tombstones (using a write operation)
 Actual removal takes place later during compaction
 Run Repair on each node within 10 days (the default gc_grace_seconds)
 Repair removes inconsistencies between replicas
 Inconsistencies happen because nodes can be down for longer than the hinted hand-off window, thereby missing deletes/updates
 Distributed deletes are hard in a peer to peer system that has no SPOF
Deletes
 Distributed Systems are eventually consistent
 Only a small number of nodes have to respond for a successful (delete) operation
 As the delete command propagates through the system, some nodes may be unavailable
 The commands are stored (as hinted hand-offs) and will be delivered when the downed
node comes online
 The delete command may be “lost” if the downed node does not come back within the
hinted hand-off window (default 3 hours)
Why are Distributed Deletes hard?
 Cassandra does not support in-row updates
 Updates are implemented as a delete and an insert
 Updated values are written to a new file
 Unmodified columns of the original row exist in old file
 Compaction consolidates all values and writes row to new file
Updates
 Cassandra does not perform in-place updates or deletes
 Instead the new data is written to a new SSTable file
 Cassandra marks data to be deleted using markers called Tombstones
 Tombstones exist for the time period defined by GC_GRACE_SECONDS
 Compaction merges data in each SSTable by partition key
 Evicts tombstones, deletes data and consolidates SSTables into a single SSTable
 Old SSTables are deleted as soon as existing reads complete
Compaction
Compaction
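A toy compaction pass over the merged view of several SSTables: group by partition key, keep the newest version of each column, and evict tombstones whose grace period has elapsed. Timestamp units and gc_grace handling are simplified assumptions.

import time

TOMBSTONE = object()   # sentinel marking a deleted column

def compact(sstables, gc_grace_seconds=864000, now=None):
    """sstables: list of {key: {column: (value, ts)}} dicts.
    Newest timestamp wins; expired tombstones are evicted."""
    now = now if now is not None else time.time()
    merged = {}
    for table in sstables:
        for key, cols in table.items():
            row = merged.setdefault(key, {})
            for col, (val, ts) in cols.items():
                if col not in row or ts > row[col][1]:
                    row[col] = (val, ts)
    # Drop tombstones older than gc_grace from the single output SSTable.
    for row in merged.values():
        expired = [c for c, (v, ts) in row.items()
                   if v is TOMBSTONE and now - ts > gc_grace_seconds]
        for col in expired:
            del row[col]
    return merged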
 Read Repair and Node Repair
 Read Repair synchronizes data requested in a read operation
 Node repair synchronizes all data (for a range) in a node with
all replicas
 Node repair needs to be scheduled to run at least once within
the GC_GRACE_SECONDS Window (default 10 days)
Repair
 There are two stages to the repair process
 Build a Merkle Tree
 Each replica compares its tree with the others to find differences
 Once the comparison completes, the differing ranges are streamed over
 Streams are written to new SSTables
 Repair is a resource intensive operation
 Read up on Advanced Repair techniques
Repair Process
 The distributed, decentralized nature of Cassandra requires repair operations
 Repair involves comparing all data elements in each replica and updating the data
 This happens asynchronously and in the background
 Cassandra uses Merkle Trees to detect data inconsistencies more quickly and minimize the data transferred between nodes
 A Merkle Tree is an inverted hash tree structure
 Used to compare data stored in different nodes
 Partial branches of tree can be compared
 Minimizes repair time and traffic between nodes
Merkle Trees
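A sketch of Merkle-tree comparison over token ranges: each leaf hashes the data in one range, parents hash their children, and two replicas locate out-of-sync ranges by descending only into branches whose hashes disagree. The hash function and range labels are illustrative.

import hashlib

def h(*parts):
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

class Node:
    def __init__(self, label, hash_, left=None, right=None):
        self.label, self.hash = label, hash_
        self.left, self.right = left, right

def build(leaves):
    """leaves: (token_range_label, data_hash) pairs; count a power of two."""
    nodes = [Node(lbl, hsh) for lbl, hsh in leaves]
    while len(nodes) > 1:
        nodes = [Node(l.label + "+" + r.label, h(l.hash, r.hash), l, r)
                 for l, r in zip(nodes[::2], nodes[1::2])]
    return nodes[0]

def diff(a, b):
    """Return the leaf ranges that differ, pruning branches that match."""
    if a.hash == b.hash:
        return []                      # whole branch in sync: skip it
    if a.left is None:                 # a differing leaf: stream this range
        return [a.label]
    return diff(a.left, b.left) + diff(a.right, b.right)

replica1 = build([("t0-t1", h("x")), ("t1-t2", h("y"))])
replica2 = build([("t0-t1", h("x")), ("t1-t2", h("z"))])
print(diff(replica1, replica2))        # ['t1-t2']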
Single threaded Operations
 Some Examples of Single threaded operations:
 Merkle Tree Comparison
 Triggering Repair
 Deleting files
 Obsolete SSTables
 Commitlog segments
 Gossip
 Hinted Handoff (default value = 1)
 Message Streaming
 This demo is to help get a better understanding on:
 Gossip
 Replication
 Data Manipulation (Inserts, Updates, Deletes)
 Role of Memtable, CommitLog and Tombstones
 Compaction
Demo
Demo - Steps
 Modify core cluster and table settings
 Insert Data in one node
 Verify Replication
 Shut down one node
 Continue DML operations
 Start the downed node
 Understand Outcome
 Let’s see it!
Demo Time
 Commands issued to Cassandra when one node was
down
Demo commands
 Expected results
 Actual Results
Results
Demo Recap
 What just happened?
 Inserts disappeared
 Updates rolled back
 Deletes reappeared
 What happened to Durability?
 And this thing called eventual consistency?
 All nodes were up and running
 Initial writes came in, got persisted and replicated
 All nodes have received the data and are in sync.
 Memtable Flush, Compaction and SSTables Consolidation
 This clears the memory and the commit log
 None of the 3 nodes have any entries in the commit log for these
rows
 Data exists in SSTables and so query returns data back to user
What really happened?
 One node is brought down
 The state is preserved in that node
 Inserts / Updates and Deletes continue in other nodes
 Replication and Synchronization happens
 Consolidation and Compaction happens on the other 2 nodes
 Every time this happens, commit log is cleared and tombstones evicted
 gc_grace_seconds & hinted_handoff play a critical role for this demo to work
 3rd node that was down is brought up and it starts synchronizing
 It still has the original state preserved and sends that copy to the other 2 nodes
 Other 2 nodes receive the data and look for commit log entries and Tombstones locally
When the nodes do not find the entries, they apply that change (as new data) and the system reverts to its pre-outage state
What really happened?
 http://www.Datastax.com
 http://www.eecs.harvard.edu/~mdw/papers/seda-sosp01.pdf
 http://berb.github.io/diploma-thesis/original/052_threads.html
 Choice Hotels is hiring!
 Please contact Jeremiah Anderson for details.
 Jeremiah_Anderson@choicehotels.com
References