Phoenix Cassandra Users Meetup
January 26th, 2015
Narasimhan Sampath
Choice Hotels International
Cassandra Internals
 What is Cassandra
 SEDA
 Data Placement, Replication and Partition Aware Drivers
 Read and Write Path
 Merkle Trees, SSTables, Read Repair and Compaction
 Single and Multi-threaded Operations
 Demo
Agenda
 Cassandra is a decentralized distributed database
 No master or slave nodes
 No single point of failure
 Peer-Peer architecture
 Read / write to any available node
 Replication and data redundancy built into the architecture
 Data is eventually consistent across all cluster nodes
 Linearly (and massively) scalable
 Multiple Data Center support built in – a single cluster can span geo locations
 Adding or removing nodes / data centers is easy and does not require down time
 Data redistribution / rebalance seamless and non blocking
 Runs on commodity hardware
 Hardware failure is expected and factored into the Architecture
 Internal architecture more complex than non-distributed databases
Cassandra
 Automatic Sharding (partitioning)
 Total data to be managed by the cluster is (ideally) divided equally among the cluster nodes
 Each node is responsible for a subset of the data
 Copies of that subset are stored on other nodes for high availability and redundancy
 Data placement design determines node balancing (token assignment, adding and removing nodes)
 Data Synchronization within the decentralized cluster is complex, but implementation mostly hidden from the
users
 Availability and Partition Tolerance given precedence over Consistency (CAP – Data is eventually consistent)
 Consistency (all nodes see the same data at the same time)
 Availability (a guarantee that every request receives a response about whether it succeeded or failed)
 Partition tolerance (the system continues to operate despite a part of the system failing)
 Brewer’s CAP theorem (For further reading)
 Staged Event Driven Architecture – framework for achieving high concurrency and load
 Uses events, messages and queues to process tasks
 Decouples the request and response from the worker threads
Cassandra
 Ring – Visual representation of data managed by Cassandra
 Node – Individual machine in the ring
 Data Center – A collection of related nodes
 Cluster – Collection of (geographically separated) data centers
 Commitlog – The equivalent of a transaction log file for Durability
 Memtable – In Memory structures to store data (per column family)
 Keyspace – Container for application data (Analogous to schema)
 Table – Structure that holds data in rows and columns
 SSTable – An immutable file (for each table) on disk to which data structures in memory are
dumped periodically
Cassandra Terminology
 Gossip – Peer to Peer protocol to discover and share location and state information on
nodes
 Token – A number used to assign a range of data to a node within a data center
 Partitioner – A hashing function for deriving the token
 Snitch – Informs Cassandra about the network topology
 Replica – A copy of data stored on another node for redundancy and fault tolerance
 Replication Factor – The total number of copies of each piece of data in the cluster
Terminology
 Cassandra is linearly (horizontal) and massively scalable
 Just add or remove nodes to the cluster as load increases or decreases
 There is no down time required for this
 SEDA – Staged Event Driven Architecture helps maintain consistent throughput under load
Core Strength - Scalability
Quantifying Massive
 Avoids the pitfalls of Client Server based design
 Eliminates storage bottlenecks
 No single data repository
 Redundancy built in
 All nodes participate (whether they have the requested data or not)
 Shared nothing
 Transparently add / remove nodes as necessary without downtime
 Comes with a trade-off – eventual consistency (CAP)
 Newer Staged Event Driven Architecture
How does it Scale?
 Legacy systems typically use thread based concurrency models
 Programming traditional multi-threaded applications is hard
 Distributed multithreaded applications are even harder
 Leads to severe scalability bottlenecks
 A new thread or process is usually created for each request
 There is a maximum number of threads a system can support
 Challenges with thread execution model
 Deadlocks
 Livelocks (waste CPU cycles)
 Starvation (waiting on resources)
 Overheads – Context switching, synchronization and data movement
 Request and response typically handled by the same thread
 Sequential execution
Legacy Systems
Threads
Event Driven Architecture
 Evolution of Event Driven Architecture (EDA)
 An EDA consists of a set of loosely coupled software components and services
 An Event is something that an application can act upon
 A hotel booking event
 A check-in event
 A listener can pick up a check-in event and act on it
 In-room entertainment system displays a personalized greeting
 Partners may get notified and can send personalized offers (Spa / massage/ restaurant
discounts)
 This is much more scalable than thread based concurrency models
 SEDA is an Architectural approach
 An application is broken down into a set of logical stages
 These stages are loosely coupled and connected via queues
 Decouples event and thread scheduling from DB Engine logic
 Prevents resources from being overcommitted under high load
 Enables modularity and code reuse
SEDA Explained
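To make the stage-and-queue idea concrete, here is a minimal sketch of a SEDA-style stage in Python. The stage names, handlers and pipeline are illustrative assumptions, not Cassandra's actual (Java) stage implementation.

import queue
import threading
import time

class Stage:
    """A SEDA stage: an event queue drained by a small, bounded thread pool."""
    def __init__(self, name, handler, workers=2, next_stage=None):
        self.name = name
        self.handler = handler          # work applied to each event
        self.next_stage = next_stage    # downstream stage, if any
        self.events = queue.Queue()     # decouples producers from workers
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def submit(self, event):
        self.events.put(event)          # enqueue and return immediately

    def _run(self):
        while True:
            event = self.events.get()
            result = self.handler(event)
            if self.next_stage is not None:
                self.next_stage.submit(result)

# Illustrative two-stage pipeline: a "mutation" stage feeds a "response" stage.
respond = Stage("response", lambda r: print("ack:", r))
mutate = Stage("mutation", lambda e: e.upper(), next_stage=respond)
mutate.submit("write key=42")
time.sleep(0.5)                         # let the daemon workers drain the queues

The point to notice is that submit() returns immediately: the thread that accepted the request is never the thread that executes it, which is exactly the decoupling described above.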
Understanding Stage (SEDA)
Understanding Stage
 SEDA enables Massive Concurrency
 No thread deadlocks, livelocks or starvation to worry about (for the most part)
 Thread Scheduling and Resource Management abstracted
 Supports self tuning / resource allocation / management
 Easier to debug and monitor application performance at scale
 Distributed debugging / tracing easier
 Graceful degradation under excessive load
 Maintains throughput at the expense of latency
Why SEDA matters
Examples of Stages
Data Placement
Facebook’s DC
Why is data placement important
 Each Cassandra node has a listen and a broadcast IP address
 The Snitch maps IP addresses to racks and data centers
 Gossip uses this information to help Cassandra build a node location map
 The Snitch helps Cassandra with replica placement
 This helps Cassandra minimize cross data center latency
Role of Snitch
 Once built and configured, a cluster is ready to store data
 Each node owns a Token Range
 Can be manually assigned in YAML file
 Or Cassandra can manage token assignment - a concept called vNodes
 A Keyspace needs to be created with replication options
 CREATE KEYSPACE "Choice"
WITH REPLICATION =
{'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};
 Cassandra Schema objects are replicated globally to all nodes
 This enables each node in the cluster to act as a coordinator node
Data Placement
 Data gets replicated as defined in the Keyspace
 Within a data center, the Murmur3 hash of the partition key decides which node owns the data
 Replication Strategy determines which nodes contain replicas
 SimpleStrategy – replicas are placed on succeeding nodes around the ring
 NetworkTopologyStrategy – walks the ring clockwise and places each copy on the first node on a successive rack (see the sketch below)
 Asymmetric replica groupings are possible (DR / Analytics etc.)
Data Placement
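A rough sketch, under simplifying assumptions, of the rack-aware placement just described: walk the ring clockwise from the token's position and take the first node on each not-yet-used rack. Real NetworkTopologyStrategy also handles the case where the replication factor exceeds the number of racks, which this toy version ignores.

from bisect import bisect_right

def place_replicas(token, ring, racks, rf):
    """ring: sorted list of (token, node); racks: node -> rack name.
    Walk clockwise from the owning node, preferring unseen racks."""
    tokens = [t for t, _ in ring]
    start = bisect_right(tokens, token) % len(ring)
    replicas, seen_racks = [], set()
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if racks[node] not in seen_racks:
            replicas.append(node)
            seen_racks.add(racks[node])
        if len(replicas) == rf:
            break
    return replicas

ring = [(-100, "n1"), (0, "n2"), (100, "n3"), (200, "n4")]
racks = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r3"}
print(place_replicas(42, ring, racks, rf=3))   # ['n3', 'n4', 'n1']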
empID empName deptID deptName hiredate
22 Sam 12 Finance 1/22/1996
33 Scott 18 Human Resources 12/8/2006
44 Walter 24 Shipping 11/20/2009
55 Bianca 30 Marketing 1/1/2015
Data Placement
Partition          Sample Hash
Finance            -2245462676723220000
Human Resources     7723358927203680000
Shipping           -6723372854036780000
Marketing           1168604627387940000
Data Placement
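The sample hashes above come from Cassandra's Murmur3Partitioner. A hedged approximation in Python uses the third-party mmh3 package; the exact token values will not match Cassandra's, whose Java port of MurmurHash3 handles bytes slightly differently, but the idea (a signed 64-bit token derived from the partition key) is the same.

import mmh3  # pip install mmh3 (third-party MurmurHash3 bindings)

def sample_token(partition_key: str) -> int:
    # Take one 64-bit half of the 128-bit Murmur3 hash as a signed value,
    # mirroring the idea behind Cassandra's Murmur3Partitioner tokens.
    low64, _high64 = mmh3.hash64(partition_key.encode("utf-8"))
    return low64

for dept in ("Finance", "Human Resources", "Shipping", "Marketing"):
    print(f"{dept:16s} {sample_token(dept)}")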
Data Access
 Cassandra’s location independent Architecture means a user can connect to any node of
the cluster, which then acts as coordinator node
 Schemas get replicated globally – even to nodes that do not contain a copy of the data
 Cassandra offers tunable consistency – an extension of eventual consistency
 Clients determine how consistent the data should be
 They can choose between high availability (CL ONE) and high safety (CL ALL) among other options
 Further reading
 Request goes through stages – the thread that received the initial request will insert the
request into a queue and wait for the next user request
 Partition aware drivers help route traffic to the nearest node
 Hinted Hand-offs – store and forward write requests
Data Access
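A minimal sketch of these ideas with the DataStax Python driver (cassandra-driver): a token-aware, DC-aware load balancing policy routes each request toward a replica, and the consistency level is tuned per statement. The contact points, keyspace and table names are placeholders.

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra.query import SimpleStatement

# Token awareness lets the driver send a request straight to a replica,
# which then acts as the coordinator for that partition.
cluster = Cluster(
    contact_points=["10.0.0.1", "10.0.0.2"],           # placeholder IPs
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="dc1")),
)
session = cluster.connect("choice")                     # placeholder keyspace

# Tunable consistency: ONE favors availability, ALL favors safety.
query = SimpleStatement(
    "SELECT * FROM employees WHERE deptname = %s",      # placeholder table
    consistency_level=ConsistencyLevel.ONE)
rows = session.execute(query, ("Finance",))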
 Memtables
 Commitlog
 SSTables
 Tombstones
 Compaction
 Repair
Reads and Writes
Reads & Writes
Reads and Writes
Write Process
 Write requests written to a MemTable
 When Memtable is full, contents get queued to be flushed to disk
 Writes are also simultaneously persisted on Disk to a CommitLog file
 This helps achieve durable writes
 CommitLog entries are purged after MemTable is flushed to disk
 MemTables and SSTables are created on a per table basis
 Tunable consistency determines how many replicas' MemTables and CommitLogs the row has to be written to
 SSTables are immutable and cannot be modified once written to
 Compaction consolidates SSTables and removes tombstones
 SizeTiered Compaction
 Leveled Compaction
 Repair is a process that synchronizes copies located in different nodes
 Uses Merkle Trees to make this more efficient
Write Path
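A toy sketch of the write path above: append to a commit log first for durability, apply the write to an in-memory memtable, and flush the memtable to an immutable on-disk segment once it grows past a threshold. The file format and threshold are stand-ins, not Cassandra's.

import json
import time

class ToyTable:
    def __init__(self, name, flush_at=4):
        self.name, self.flush_at = name, flush_at
        self.memtable = {}                 # in-memory, one per table
        self.sstables = []                 # immutable flushed segments
        self.commitlog = open(f"{name}.commitlog", "a")

    def write(self, key, columns):
        entry = {"key": key, "cols": columns, "ts": time.time()}
        self.commitlog.write(json.dumps(entry) + "\n")   # durability first
        self.commitlog.flush()
        self.memtable[key] = entry                       # then the memtable
        if len(self.memtable) >= self.flush_at:
            self.flush()

    def flush(self):
        # Dump the memtable to an immutable segment, then purge the log:
        # once data is safely in an "SSTable", the commit log entries go.
        self.sstables.append(dict(self.memtable))
        self.memtable.clear()
        self.commitlog.truncate(0)

table = ToyTable("employees")
table.write("Finance", {"empName": "Sam"})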
 Is a feature that enables high write availability
 Has to be enabled / disabled in the YAML file (excerpt below)
 When a replica node is down
 A hint is stored in the coordinator node
 Hints are stored for three hours (default)
 Hinted writes do not count towards CL
 Hint replay is throttled to limit its impact on system performance
Hinted Hand-off
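For reference, the relevant cassandra.yaml settings look roughly like this (key names as of the Cassandra 2.x releases this deck targets; check your version's yaml before relying on them):

hinted_handoff_enabled: true        # turn the feature on or off
max_hint_window_in_ms: 10800000     # stop storing hints after 3 hours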
Read Path
 A row of data will likely exist in multiple locations
 Unflushed Memtable
 Un-compacted and compacted SSTables
 Tunable consistency determines how many nodes have to respond
 Cassandra does not rewrite entire row to new file on update
 No read before writes
 Updated / new columns exist in a new file
 Unmodified columns exist in old file
 The timestamped version of the row can be different in each location
 All these must be retrieved, reconstructed and processed based on timestamp
 Uses Bloom filters to make key lookups more efficient
 Row fragments may exist in multiple SSTables
 May exist in Memtable as well
 Bloom filters speed lookups
Read Path
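A sketch of the timestamp reconciliation described above: fragments of a row may live in the memtable and several SSTables, and the read path merges them column by column, newest timestamp winning. The fragment layout here is illustrative.

def reconcile(fragments):
    """fragments: one {column: (value, timestamp)} dict per memtable or
    SSTable holding a piece of the row; the newest timestamp wins."""
    row = {}
    for frag in fragments:
        for col, (val, ts) in frag.items():
            if col not in row or ts > row[col][1]:
                row[col] = (val, ts)
    return {col: val for col, (val, _) in row.items()}

old_sstable = {"empName": ("Sam", 100), "deptName": ("Finance", 100)}
new_sstable = {"deptName": ("Marketing", 200)}   # only the updated column
print(reconcile([old_sstable, new_sstable]))
# {'empName': 'Sam', 'deptName': 'Marketing'}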
 A Bloom filter is a probabilistic bit-vector data structure
 Supports two operations – Test and Add
 Cassandra uses Bloom filters to reduce disk I/O during key lookups
 Each SSTable has a Bloom filter associated with it
 A Bloom filter is used to test whether an element is a member of a set
 False positives are possible, but false negatives are not
 Means a key is “possibly in set” or “definitely not in set”
 Check out JasonDavies.com for a cool interactive demo
 http://www.jasondavies.com/bloomfilter/
 http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html
Bloom Filters
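A toy Bloom filter in Python that shows the Test/Add contract: a negative answer is definitive, a positive answer only probable. Cassandra's real filters are sized from a configured false-positive chance; the sizes here are arbitrary.

import hashlib

class BloomFilter:
    """k hash positions over an m-bit vector; arbitrary toy sizes."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def test(self, key):
        # True  -> key is "possibly in set" (could be a false positive)
        # False -> key is "definitely not in set" (never a false negative)
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("Finance")
print(bf.test("Finance"), bf.test("Shipping"))   # True False (almost surely)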
 Deletes are handled differently than in a traditional RDBMS
 Data to be deleted is marked using Tombstones (using a write operation)
 Actual removal takes place later during compaction
 Run Repair on each node within 10 days (the default gc_grace_seconds)
 Repair removes inconsistencies between replicas
 Inconsistencies happen because nodes can be down for longer than the hinted hand-off window, thereby missing deletes/updates
 Distributed deletes are hard in a peer to peer system that has no SPOF
Deletes
 Distributed Systems are eventually consistent
 Only a small number of nodes have to respond for a successful (delete) operation
 As the delete command propagates through the system, some nodes may be unavailable
 The commands are stored (as hinted hand-offs) and will be delivered when the downed
node comes online
 The delete command may be “lost” if the downed node does not come back within the
hinted hand-off window (default 3 hours)
Why are Distributed Deletes hard?
 Cassandra does not support in-row updates
 Updates are implemented as a delete and an insert
 Updated values are written to a new file
 Unmodified columns of the original row exist in old file
 Compaction consolidates all values and writes row to new file
Updates
 Cassandra does not perform in-place updates or deletes
 Instead the new data is written to a new SSTable file
 Cassandra marks data to be deleted using markers called Tombstones
 Tombstones exist for the time period defined by GC_GRACE_SECONDS
 Compaction merges data in each SSTable by partition key
 Evicts tombstones, deletes data and consolidates SSTables into a single SSTable
 Old SSTables are deleted as soon as existing reads complete
Compaction
Compaction
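A toy compaction pass over the merged view of several SSTables: group by partition key, keep the newest version of each column, and evict tombstones whose grace period has elapsed. Timestamp units and gc_grace handling are simplified assumptions.

import time

TOMBSTONE = object()   # sentinel marking a deleted column

def compact(sstables, gc_grace_seconds=864000, now=None):
    """sstables: list of {key: {column: (value, ts)}} dicts.
    Newest timestamp wins; expired tombstones are evicted."""
    now = now if now is not None else time.time()
    merged = {}
    for table in sstables:
        for key, cols in table.items():
            row = merged.setdefault(key, {})
            for col, (val, ts) in cols.items():
                if col not in row or ts > row[col][1]:
                    row[col] = (val, ts)
    # Drop tombstones older than gc_grace from the single output SSTable.
    for row in merged.values():
        expired = [c for c, (v, ts) in row.items()
                   if v is TOMBSTONE and now - ts > gc_grace_seconds]
        for col in expired:
            del row[col]
    return merged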
 Read Repair and Node Repair
 Read Repair synchronizes data requested in a read operation
 Node repair synchronizes all data (for a range) in a node with
all replicas
 Node repair needs to be scheduled to run at least once within
the GC_GRACE_SECONDS Window (default 10 days)
Repair
 There are two stages to the repair process
 Build a Merkle Tree
 Each replica compares its tree with the others to find differences
 Once the comparison completes, the differing ranges are streamed over
 Streams are written to new SSTables
 Repair is a resource intensive operation
 Read up on Advanced Repair techniques
Repair Process
 The distributed, decentralized nature of Cassandra requires repair operations
 Repair involves comparing all data elements in each replica and updating the data
 This happens asynchronously and in the background
 Cassandra uses Merkle Trees to detect data inconsistencies more quickly and minimize the data transferred between nodes
 A Merkle Tree is an inverted hash tree structure
 Used to compare data stored in different nodes
 Partial branches of tree can be compared
 Minimizes repair time and traffic between nodes
Merkle Trees
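A sketch of Merkle-tree comparison over token ranges: each leaf hashes the data in one range, parents hash their children, and two replicas locate out-of-sync ranges by descending only into branches whose hashes disagree. The hash function and range labels are illustrative.

import hashlib

def h(*parts):
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

class Node:
    def __init__(self, label, hash_, left=None, right=None):
        self.label, self.hash = label, hash_
        self.left, self.right = left, right

def build(leaves):
    """leaves: (token_range_label, data_hash) pairs; count a power of two."""
    nodes = [Node(lbl, hsh) for lbl, hsh in leaves]
    while len(nodes) > 1:
        nodes = [Node(l.label + "+" + r.label, h(l.hash, r.hash), l, r)
                 for l, r in zip(nodes[::2], nodes[1::2])]
    return nodes[0]

def diff(a, b):
    """Return the leaf ranges that differ, pruning branches that match."""
    if a.hash == b.hash:
        return []                      # whole branch in sync: skip it
    if a.left is None:                 # a differing leaf: stream this range
        return [a.label]
    return diff(a.left, b.left) + diff(a.right, b.right)

replica1 = build([("t0-t1", h("x")), ("t1-t2", h("y"))])
replica2 = build([("t0-t1", h("x")), ("t1-t2", h("z"))])
print(diff(replica1, replica2))        # ['t1-t2']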
Single threaded Operations
 Some Examples of Single threaded operations:
 Merkle Tree Comparison
 Triggering Repair
 Deleting files
 Obsolete SSTables
 Commitlog segments
 Gossip
 Hinted Handoff (default value = 1)
 Message Streaming
 This demo is to help get a better understanding on:
 Gossip
 Replication
 Data Manipulation (Inserts, Updates, Deletes)
 Role of Memtable, CommitLog and Tombstones
 Compaction
Demo
Demo - Steps
 Modify core cluster and table settings
 Insert Data in one node
 Verify Replication
 Shut down one node
 Continue DML operations
 Start the downed node
 Understand Outcome
 Let’s see it!
Demo Time
 Commands issued to Cassandra when one node was
down
Demo commands
 Expected results
 Actual Results
Results
Demo Recap
 What just happened?
 Inserts disappeared
 Updates rolled back
 Deletes reappeared
 What happened to Durability?
 And this thing called eventual consistency?
 All nodes were up and running
 Initial writes came in, got persisted and replicated
 All nodes have received the data and are in sync.
 Memtable Flush, Compaction and SSTables Consolidation
 This clears the memory and the commit log
 None of the 3 nodes have any entries in the commit log for these
rows
 Data exists in SSTables and so query returns data back to user
What really happened?
 One node is brought down
 The state is preserved in that node
 Inserts / Updates and Deletes continue in other nodes
 Replication and Synchronization happens
 Consolidation and Compaction happens on the other 2 nodes
 Every time this happens, commit log is cleared and tombstones evicted
 gc_grace_seconds & hinted_handoff play a critical role for this demo to work
 3rd node that was down is brought up and it starts synchronizing
 It still has the original state preserved and sends that copy to the other 2 nodes
 Other 2 nodes receive the data and look for commit log entries and Tombstones locally
When the nodes do not find the entries, they apply that change (as new data) and the system reverts to its pre-outage state
What really happened?
 http://www.Datastax.com
 http://www.eecs.harvard.edu/~mdw/papers/seda-sosp01.pdf
 http://berb.github.io/diploma-thesis/original/052_threads.html
 Choice Hotels is hiring!
 Please contact Jeremiah Anderson for details.
 Jeremiah_Anderson@choicehotels.com
References