2. @doanduyhai
About me!
2
Duy Hai DOAN
Cassandra technical advocate
• talks, meetups, confs
• open-source devs (Achilles, …)
• OSS technical point of contact in Europe
☞ duy_hai.doan@datastax.com
• production troubleshooting
3. @doanduyhai 3
Datastax!
• Founded in April 2010
• We drive Apache Cassandra™
• Home to Cassandra chair & most committers (≈80%)
• Headquarter in San Francisco Bay area
• EU headquarter in London, offices in France and Germany
• Datastax Enterprise = OSS Cassandra + features
4. @doanduyhai
Agenda!
4
• Architecture
• Replication
• Consistency
• Read/write path
• Data Model
• Failure handling
6. @doanduyhai
Cassandra history!
6
NoSQL database
• created at Facebook
• open-sourced since 2008
• current version = 2.1
• column-oriented ☞ distributed table
19. Failure tolerance by replication!
@doanduyhai
19
Replication Factor (RF) = 3
n1
n2
n3
n4
n5
n6
n7
n8
1
2
3
{B, A, H}
{C, B, A}
{D, C, B}
A
B
C
D
E
F
G
H
20. @doanduyhai
Coordinator node!
Incoming requests (read/write)
Coordinator node handles the request
n1
Every node can be coordinator àmasterless
n2
n3
n4
n5
n6
n7
n8
1
2
3
coordinator
Client request
21. @doanduyhai
Consistency!
21
Tunable at runtime
• ONE
• QUORUM (strict majority w.r.t. RF)
• ALL
Apply both to read & write
22. Write consistency!
Write ONE
• write request to all replicas in //
@doanduyhai
n1
n2
n3
n4
n5
n6
n7
n8
1
2
3
coordinator
23. Write consistency!
Write ONE
• write request to all replicas in //
• wait for ONE ack before returning to
@doanduyhai
client
n1
n2
n3
n4
n5
n6
n7
n8
1
2
3
coordinator
5 μs
24. Write consistency!
Write ONE
• write request to all replicas in //
• wait for ONE ack before returning to
@doanduyhai
client
• other acks later, asynchronously
n1
n2
n3
n4
n5
n6
n7
n8
1
2
3
coordinator
5 μs
10 μs
120 μs
25. Write consistency!
Write QUORUM
• write request to all replicas in //
• wait for QUORUM acks before
@doanduyhai
returning to client
• other acks later, asynchronously
n1
n2
n3
n4
n5
n6
n7
n8
1
2
3
coordinator
5 μs
10 μs
120 μs
26. Read consistency!
Read ONE
• read from one node among all replicas
@doanduyhai
n1
n2
n3
n4
n5
n6
n7
n8
1
2
3
coordinator
27. Read consistency!
Read ONE
• read from one node among all replicas
• contact the fastest node (stats)
@doanduyhai
n1
n2
n3
n4
n5
n6
n7
n8
1
2
3
coordinator
29. Read consistency!
Read QUORUM
• read from one fastest node
• AND request digest from other
@doanduyhai
replicas to reach QUORUM
n1
n2
n3
n4
n5
n6
n7
n8
1
2
3
coordinator
30. Read consistency!
Read QUORUM
• read from one fastest node
• AND request digest from other
@doanduyhai
replicas to reach QUORUM
• return most up-to-date data to client
n1
n2
n3
n4
n5
n6
n7
n8
1
2
3
coordinator
31. Read consistency!
Read QUORUM
• read from one fastest node
• AND request digest from other
@doanduyhai
replicas to reach QUORUM
• return most up-to-date data to client
• repair if digest mismatch n1
n2
n3
n4
n5
n6
n7
n8
1
2
3
coordinator
32. Consistency in action!
@doanduyhai
32
RF = 3, Write ONE, Read ONE
B A A
B A A
Read ONE: A
data replication in progress …
Write ONE: B
33. Consistency in action!
data replication in progress …
@doanduyhai
33
RF = 3, Write ONE, Read QUORUM
B A A
Write ONE: B
Read QUORUM: A
B A A
34. Consistency in action!
data replication in progress …
@doanduyhai
34
RF = 3, Write ONE, Read ALL
B A A
Read ALL: B
B A A
Write ONE: B
35. Consistency in action!
data replication in progress …
@doanduyhai
35
RF = 3, Write QUORUM, Read ONE
B B A
Write QUORUM: B
Read ONE: A
B B A
36. Consistency in action!
data replication in progress …
@doanduyhai
36
RF = 3, Write QUORUM, Read QUORUM
B B A
Read QUORUM: B
B B A
Write QUORUM: B
40. Consistency summary!
ONERead + ONEWrite
☞ available for read/write even (N-1) replicas down
QUORUMRead + QUORUMWrite
☞ available for read/write even 1+ replica down
@doanduyhai 40
52. Partition Key Cache!
Partition Key Cache SStable
@doanduyhai
52
#partition001
data
data
….
#partition002
#partition350
data
data
…
#Partition Offset …
#partition001 0x0 …
#partition002 0x153 …
#partition350 0x5464321 …
53. Cassandra Read Path!
4 Key Index Sample
2 Bloom Filter
Bloom Filter
@doanduyhai
53
1 Row Cache
Off Heap
JVM Heap
SELECT .. FROM … WHERE #partition =…;
3 Partition Key Cache
SStable2
Bloom Filter
54. @doanduyhai
Key Index Sample!
54
SStable
#partition001
data
data
….
#partition128
#partition350
data
data
…
Key Index Sample
Sample Offset
#partition001 0x0
#partition128 0x4500
#partition256 0x851513
#partition512 0x5464321
55. Cassandra Read Path!
4 Key Index Sample
2 Bloom Filter
Bloom Filter
@doanduyhai
55
1 Row Cache
Off Heap
JVM Heap
SELECT .. FROM … WHERE #partition =…;
3 Partition Key Cache
5 Compression Offset
SStable2
Bloom Filter
59. Last Write Win (LWW)!
INSERT INTO users(login, name, age) VALUES(‘jdoe’, ‘John DOE’, 33);
@doanduyhai
59
jdoe
age
name
33 John DOE
#partition
60. Last Write Win (LWW)!
@doanduyhai
jdoe
age (t1) name (t1)
33 John DOE
60
INSERT INTO users(login, name, age) VALUES(‘jdoe’, ‘John DOE’, 33);
auto-generated timestamp (μs)
.
61. Last Write Win (LWW)!
SSTable1 SSTable2
@doanduyhai
61
UPDATE users SET age = 34 WHERE login = jdoe;
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
62. Last Write Win (LWW)!
SSTable1 SSTable2 SSTable3
@doanduyhai
62
DELETE age FROM users WHERE login = jdoe;
tombstone
jdoe
age (t3)
ý
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
63. Last Write Win (LWW)!
@doanduyhai
63
SELECT age FROM users WHERE login = jdoe;
? ? ?
SSTable1 SSTable2 SSTable3
jdoe
age (t3)
ý
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
64. Last Write Win (LWW)!
@doanduyhai
64
SELECT age FROM users WHERE login = jdoe;
✕ ✕ ✓
SSTable1 SSTable2 SSTable3
jdoe
age (t3)
ý
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
65. @doanduyhai
Compaction!
65
SSTable1 SSTable2 SSTable3
jdoe
age (t3)
ý
jdoe
age (t1) name (t1)
33 John DOE
jdoe
age (t2)
34
New SSTable
jdoe
age (t3) name (t1)
ý John DOE
66. @doanduyhai
CRUD operations!
66
INSERT INTO users(login, name, age) VALUES(‘jdoe’, ‘John DOE’, 33);
UPDATE users SET age = 34 WHERE login = jdoe;
DELETE age FROM users WHERE login = jdoe;
SELECT age FROM users WHERE login = jdoe;
68. User management & permissions!
@doanduyhai
68
User
• CREATE/ALTER/DROP
Permissions
• GRANT <xxx> PERMISSION ON <resource> TO <user_name>
• REVOKE <xxx> PERMISSION ON <resource> FROM <user_name>
71. @doanduyhai
Queries!
71
Get message by user and message_id (date)
SELECT * FROM mailbox WHERE login = jdoe
and message_id = ‘2014-09-25 16:00:00’;
Get message by user and date interval
SELECT * FROM mailbox WHERE login = jdoe
and message_id <= ‘2014-09-25 16:00:00’
and message_id >= ‘2014-09-20 16:00:00’;
72. @doanduyhai
Queries!
72
Get message by message_id only (#partition not provided)
SELECT * FROM mailbox WHERE message_id = ‘2014-09-25 16:00:00’;
Get message by date interval (#partition not provided)
SELECT * FROM mailbox WHERE
and message_id <= ‘2014-09-25 16:00:00’
and message_id >= ‘2014-09-20 16:00:00’;
73. Get message by user range (range query on #partition)
Get message by user pattern (non exact match on #partition)
@doanduyhai
Queries!
73
SELECT * FROM mailbox WHERE login >= hsue and login <= jdoe;
SELECT * FROM mailbox WHERE login like ‘%doe%‘;
74. WHERE clause restrictions!
@doanduyhai
74
All queries (INSERT/UPDATE/DELETE/SELECT) must provide #partition
Only exact match (=) on #partition, range queries (<, ≤, >, ≥) not allowed
• ☞ full cluster scan
On clustering columns, only range queries (<, ≤, >, ≥) and exact match
WHERE clause only possible on columns defined in PRIMARY KEY
75. Collections & maps!
@doanduyhai
75
CREATE TABLE users (
login text,
name text,
age int,
friends set<text>,
hobbies list<text>,
languages map<int, text>,
…
PRIMARY KEY(login));
76. Collections & maps!
@doanduyhai
76
All elements loaded at each access
• ☞ low cardinality only (≤ 1000 elements)
Should not be used use for 1-N, N-N relations
77. User Defined Type (UDT)!
@doanduyhai
77
Instead of
CREATE TABLE users (
login text,
…
street_number int,
street_name text,
postcode int,
country text,
…
PRIMARY KEY(login));
78. User Defined Type (UDT)!
@doanduyhai
78
CREATE TYPE address (
street_number int,
street_name text,
postcode int,
country text);
CREATE TABLE users (
login text,
…
location frozen <address>,
…
PRIMARY KEY(login));
82. @doanduyhai
From SQL to CQL!
82
Remember…
there is no join
(do you want to scale ?)
83. @doanduyhai
From SQL to CQL!
83
Remember…
there is no integrity constraint
(do you want to read-before-write ?)
84. @doanduyhai
From SQL to CQL!
84
1-1, N-1 : normalized
User
1
n
Comment
CREATE TABLE comments (
article_id uuid,
comment_id timeuuid,
author_id uuid, // typical join id
content text,
PRIMARY KEY((article_id), comment_id));
85. @doanduyhai
From SQL to CQL!
85
1-1, N-1 : de-normalized
User
1
n
Comment
CREATE TABLE comments (
article_id uuid,
comment_id timeuuid,
author person, // person is UDT
content text,
PRIMARY KEY((article_id), comment_id));
86. @doanduyhai
From SQL to CQL!
86
N-1, N-N : normalized
n
User
n
CREATE TABLE friends(
login text,
friend_login text, // typical join id
PRIMARY KEY((login), friend_login));
87. @doanduyhai
From SQL to CQL!
87
N-1, N-N : de-normalized
n
User
n
CREATE TABLE friends(
login text,
friend_login text,
friend person, // person is UDT
PRIMARY KEY((login), friend_login));
88. Data modeling best practices!
@doanduyhai
88
Start by queries
• identify core functional read paths
• 1 read scenario ≈ 1 SELECT
89. Data modeling best practices!
@doanduyhai
89
Start by queries
• identify core functional read paths
• 1 read scenario ≈ 1 SELECT
Denormalize
• wisely, only duplicate necessary & immutable data
• functional/technical trade-off
90. Data modeling best practices!
@doanduyhai
90
Person UDT
- firstname/lastname
- date of birth
- sexe
- mood
- location
91. Data modeling best practices!
@doanduyhai
91
John DOE, male
birthdate: 21/02/1981
subscribed since 03/06/2011
☉ San Mateo, CA
’’Impossible is not John DOE’’
Full detail read from
User table on click
97. Consistent read!
Read at QUORUM or more
• read from one fastest node
• AND request digest from other
@doanduyhai
replicas to reach QUORUM
• return most up-to-date data to client
• repair if digest mismatch n1
n2
n3
n4
n5
n6
n7
n8
1
2
3
coordinator
98. Read Repair!
Read at CL < QUORUM
• read from one least-loaded node
• every x reads, request digests from
@doanduyhai
other replicas
• compare digest & repair
asynchronously
• read_repair_chance (10%)
n1
n2
n3
n4
n5
n6
n7
n8
1
2
3
coordinator
99. Manual repair!
Should be part of cluster operations
• scheduled
• I/O intensive
@doanduyhai
Use incremental repair of 2.1
101. Training Day | December 3rd
Beginner Track
• Introduction to Cassandra
• Introduction to Spark, Shark, Scala and
Cassandra
Advanced Track
• Data Modeling
• Performance Tuning
Conference Day | December 4th
Cassandra Summit Europe 2014 will be the single
largest gathering of Cassandra users in Europe.
Learn how the world's most successful companies are
transforming their businesses and growing faster than
ever using Apache Cassandra.
http://bit.ly/cassandrasummit2014
Compan@yd Coaonndfiudyehnatiia l 101