Ricon2013 preso upload

Dynamic Dynamos:
Riak and Cassandra
Jason Brown, @jasobrown
Senior Software Engineer, Netflix
#riconwest 2013-Oct-30

Choice
The whole "NoSQL movement" is really about
choice. At scale there will never be a single
solution that is best for everyone.
@jtuple, “Absolute Consistency”, Riak ML, 2012-Jan-11
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2012-January/007157.html

whoami
● @Netflix, > 5 years
● Apache Cassandra committer
● wannabe distributed systems geek

Netflix and Cassandra
●
●
●
●

long-time Oracle shop
Aug 2008
needed new db for cloud migration
2010 - selected cassandra
○ dynamo-style, masterless system
○ multi-datacenter support
○ written in Java

Netflix’s C* Prod Deployment
Production clusters

> 65

Production nodes

> 2300

Multi-region clusters

> 40

Most regions used

4 (three clusters)

Total data

~300 TB

Largest cluster

288 nodes (actually, 576 nodes)

Max reads/writes

300k rps / 1.3m wps

Jason’s whiteboard, summer 2013
<image of white board here>

Why Riak?
● Another dynamo-style system
● vector clocks
● not java
● not jvm

Talk Structure
1. Comparisons
a. Write/Read path
b. Conflict resolution
c. Anti-entropy
d. Multiple Datacenter support

2. Riak @ Netflix

Data modeling
Riak
● key -> value
Cassandra
● columnar layout
● row key with one to many columns

Virtual Nodes
● Split hash ring into smaller chunks
● Physical node responsible for 1..n tokens
Cassandra
● purely for routing
Riak
● burrowed deep into the code base

Write Path - Cassandra
● Coordinator gets request
● Determine replica nodes, in all DCs
● Send to all in local DC
● Send to one replica in each remote DC
● All respond back to coordinator
○ block for consistency_level nodes

● Execute triggers

Tunable Consistency
Coordinator blocks for specified replica count to
respond
Consistency Levels:
● ALL
● EACH_QUORUM
● LOCAL_QUORUM / LOCAL_ONE
● ONE / TWO / THREE
● ANY

Put Path - Riak
●
●
●
●
●

Node gets request
determine vnodes (preflist)
if node not in preflist, forward
run precommit hooks
perform coordinating put
○ looks for previous riak object
○ increment vclock

● calls other preflist vnodes to put
● prepare return value (mult vclocks)
● coordinator vnode runs postcommit hooks

riak “consistency levels”
● n (n_val) - vnode replication count
● r - read count
● w - write count
○ {all | one | quorum | <int>}

● pr / pw - primary read/write count
● dw - durable write
bucket defaults can be overriden on request

That was the happy path
...
what about partitions?

Hinted Handoff - Riak
Sloppy quorum
● preflist fallbacks to secondary vnodes
○ skip unavailable primary vnode
○ use next available vnode in ring

● Send data to vnode when available
Put data is written, and available for reads

Hinted Handoff - Cassandra
Coordinator stores hints
○ for unavailable nodes
○ if replica fails to respond

Replay hints to node when available
Mutation is stored, but data not read available
CL.ANY - stores a hint if no replicas available

Read path - Riak
●
●
●
●

coordinator gets request
determine preflist
send request to all vnodes in preflist
when read count of vnodes return
○ merge values
○ possibly read repair

Read Repair - Riak
● compare vclocks for object
● if resolvable differences, ship newest object
to out of date vnodes
● return object with latest vclock to client

Read Path - Cassandra
● Determine replicas to invoke
○ based on consistency level

● First replica responds with full data set,
others send digests
● Coordinator waits for consistency_level
nodes to respond

Consistent Read - Cassandra
● compare digest of columns from replicas
● If any mismatches:
○ re-request full data set from same replicas
○ compare full data sets, send updates
○ block until out of date replicas respond

● Return merged data set to client

Read Repair - Cassandra
Converge requested data across all replicas
Piggy-backs on normal reads, but waits for all
replicas to respond (async)
Follows same alg as consistent reads

Conflict resolution - Riak
Vector clocks
●
●
●
●
●
●

logical clock per object
array of {incrementer, version, timestamp}
maintains causal relationships
safe in face of ‘concurrent’ writes
performance penalty
resolution burden pushed to caller

Conflict Resolution - Cassandra
Last Writer Wins
● every column has timestamp value
● “whatever timestamp caller passed in”
● “What time is it?”
○

http://aphyr.com/posts/299-the-trouble-with-timestamps

● faster
● system resolves conflicts

Anti-entropy
Converge cold data
Merkle Tree exchange
Stream inconsistencies
IO/CPU intensive

Anti-entropy - Cassandra
Node repair - converges ranges owned by a
node with all replicas
● Initiator identifies peers
● Each participant reads range from disk,
generates Merkle Tree, return MT
● Initiator compares all MTs
● Range exchange

Anti-Entropy - Riak
AAE - conceptually similar to Cassandra
Merkle Tree updated on every write
Leaf nodes contain keys, not hash value
Tree is rebuilt periodically
Each execution only between two vnodes

Multi Datacenter support
Cassandra
● in the box
● node interconnects are plain TCP scokets
○ two connections per node pair

● queries not restricted to local DC
○ read repair
○ node repair

Riak
● included in RiakEE (MDC)
● local nodes use disterl
● remote nodes use TCP
● queries do not span multiple regions
● repl types:
○ realtime
○ fullsync

● AAE

Riak @ Netflix
(hypothetically)

Subscriber data
c* = wide row implementation
● row key = custId (long)
● column per distinct attribute
○
○
○
○

subscriberId
name
subscription details
holds

riak = fit reasonably well with JSON/text blob

Movie Ratings
c* implementation:
● new ratings stored in individual columns
● recurring job to aggregate into JSON blob
● reads grab JSON + incremental updates

Riak = JSON blob already, append new ratings

Viewing History
Time-series of ‘viewable’ events
● one column per event
● playback/bookmark serialized JSON blob
● 7-8 months worth of playback data

Riak - time-series data doesn’t feel like a
natural fit

“Large blob” storage
● Team wanted to store images in c*
● key -> one column
● blob size
Right in the wheelhouse for Riak/RiakCS

Operations
Priam (https://github.com/Netflix/priam)
● backup/restore
● Cassandra bootstrap / token assignment
● configuration management
● supports multi-region deployments
Every node decentralized from peers

Priam for Riak
(perceived) challenges in supporting Riak
● some degree of centralization
○ cluster launch
○ backups for eLevelDb

● prod -> test refresh (riak reip)
● MDC

BI Integration
Aegthithus - pipeline for importing into BI
● grab nightly backup snapshot for cluster
● convert to JSON
● merge, dedupe, find new data
● import into Hive, Teradata, etc
Downside is (semi-) stale data into BI

Alternative BI Integration
Live, secondary cluster
● C* - just another datacenter in cluster
● Riak - MultiDataCenter (MDC) sloution
All mutations sent to secondary cluster
● what happens when things get slow?
● now part of c* repairs & riak full-sync

Wrap up
● Choice
● Cassandra and Riak are great databases
○ resilient to failure
○ flexible data modeling
○ strong communities

● Running databases in the cloud ain’t easy

Thank you, Basho!
Mark Phillips, Jordan West, Joe Blomstedt,
Andrew Thompson, @evanmcc,
Basho Tech Support

Ricon2013 preso upload

Recommandé

Recommandé

Contenu connexe

En vedette

En vedette (16)

Dernier

Dernier (20)

Ricon2013 preso upload