13. The set of sorted tables are organized into a sequence of levels. The sorted table generated from a log file is placed in a special young level (also
called level-0). When the number of young files exceeds a certain threshold (currently four), all of the young files are merged together with all of the
overlapping level-1 files to produce a sequence of new level-1 files (we create a new level-1 file for every 2MB of data.)
SSTables are organized into levels. A file flushed from the log goes straight into Level-0; when Level-0 exceeds 4 files (each about 1MB), they are merged with the overlapping files in Level-1.
Files in the young level may contain overlapping keys. However files in other levels have distinct non-overlapping (disjoint) key ranges. Consider level
number L where L >= 1. When the combined size of files in level-L exceeds (10^L) MB (i.e., 10MB for level-1, 100MB for level-2, ...), one file in level-L,
and all of the overlapping files in level-(L+1) are merged to form a set of new files for level-(L+1). These merges have the effect of gradually migrating
new updates from the young level to the largest level using only bulk reads and writes (i.e., minimizing expensive seeks).
Level-0 keys may overlap, because its files are flushed directly from memory. When the total size of all files in Level-L (L >= 1) exceeds 10^L MB, one Level-L file is merged with the overlapping Level-(L+1) files.
When the size of level L exceeds its limit, we compact it in a background thread. The compaction picks a file from level L and all overlapping files from
the next level L+1. Note that if a level-L file overlaps only part of a level-(L+1) file, the entire file at level-(L+1) is used as an input to the compaction and
will be discarded after the compaction. Aside: because level-0 is special (files in it may overlap each other), we treat compactions from level-0 to
level-1 specially: a level-0 compaction may pick more than one level-0 file in case some of these files overlap each other.
If a Level-L file covers only part of a Level-(L+1) file, e.g. the Level-L file spans keys [D-F] while the Level-(L+1) file spans [B-G], the entire Level-(L+1) file is still used as an input to the merge.
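A minimal Python sketch of this input selection (the names are illustrative, not LevelDB's API); a file is modeled as a (smallest_key, largest_key) pair:

def overlaps(f, lo, hi):
    smallest, largest = f
    return not (largest < lo or smallest > hi)

def pick_compaction_inputs(level_file, next_level_files):
    # One level-L file plus every level-(L+1) file whose key range
    # intersects it; whole files are taken even on partial overlap.
    lo, hi = level_file
    return [f for f in next_level_files if overlaps(f, lo, hi)]

# The [D-F] vs [B-G] example above: the entire [B-G] file is selected.
print(pick_compaction_inputs(("D", "F"), [("B", "G"), ("H", "K")]))  # [('B', 'G')]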
A compaction merges the contents of the picked files to produce a sequence of level-(L+1) files. We switch to producing a new level-(L+1) file after the
current output file has reached the target file size (2MB). We also switch to a new output file when the key range of the current output file has
grown enough to overlap more than ten level-(L+2) files. This last rule ensures that a later compaction of a level-(L+1) file will not pick up too much
data from level-(L+2). When Level-L and Level-(L+1) are merged, a new output file is started once the current one reaches 2MB; if the output's key range comes to cover more than ten Level-(L+2) files, a new file is also started early.
Compactions for a particular level rotate through the key space. In more detail, for each level L, we remember the ending key of the last compaction at
level L. The next compaction for level L will pick the first file that starts after this key (wrapping around to the beginning of the key space if there is no
such file). To pick Level-L's merge inputs fairly, we remember the end key of the previous compaction, and the next compaction picks the first file that starts after that key.
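A sketch of this rotation, assuming files are (smallest_key, largest_key) pairs sorted by smallest key (pick_next_file is a made-up helper):

def pick_next_file(files, last_end_key):
    # First file starting after the previous compaction's end key,
    # wrapping to the beginning of the key space if none qualifies.
    for f in files:
        if last_end_key is None or f[0] > last_end_key:
            return f
    return files[0]

files = [("a", "c"), ("d", "f"), ("g", "i")]
assert pick_next_file(files, "f") == ("g", "i")
assert pick_next_file(files, "i") == ("a", "c")  # wrapped around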
Level 0: When the log file grows above a certain size (1MB by default):
Create a brand new memtable and log file and direct future updates here. In the background:
1) Write the contents of the previous memtable to an sstable
2) Discard the memtable
3) Delete the old log file and the old memtable (from step 2)
4) Add the new sstable (from step 1) to the young (level-0) level.
Once the log exceeds 1MB: create a new memtable and log file, flush the old memtable to an sstable, discard the old memtable, delete the old log, and add the new sstable to L0 (a toy sketch follows below).
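A self-contained toy of the switch-over (DB, put, and _rotate are stand-ins, not LevelDB's classes; the dict memtable here deduplicates by key, which foreshadows the log-vs-memtable size question on a later slide):

LOG_SIZE_LIMIT = 1 << 20  # 1MB

class DB:
    def __init__(self):
        self.mem, self.log_bytes = {}, 0
        self.level0 = []  # flushed "sstables": sorted (key, value) lists

    def put(self, key, value):
        self.log_bytes += len(key) + len(value)  # the log only appends
        self.mem[key] = value                    # this toy memtable dedups
        if self.log_bytes > LOG_SIZE_LIMIT:
            self._rotate()

    def _rotate(self):
        old_mem = self.mem
        self.mem, self.log_bytes = {}, 0   # new memtable + new log
        sstable = sorted(old_mem.items())  # 1) flush the old memtable
        self.level0.append(sstable)        # 4) add it to the young level
        # 2)/3): the old memtable and old log are simply dropped here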
Level-0 compactions will read up to four 1MB files from level-0, and at worst all the level-1 files (10MB). I.e., we will read 14MB and write 14MB. Other
than the special level-0 compactions, we will pick one 2MB file from level L. In the worst case, this will overlap ~ 12 files from level L+1 (10 because
level-(L+1) is ten times the size of level-L, and another two at the boundaries since the file ranges at level-L will usually not be aligned with the file
ranges at level-L+1). The compaction will therefore read/write 26MB. Assuming a disk IO rate of 100MB/s, the worst compaction cost will be ≈ 0.5 s.
A Level-0 compaction reads at most four 1MB Level-0 files and, in the worst case, all 10MB of Level-1 (Level-1's target size is 10MB, and all of it may overlap Level-0).
For other levels, one 2MB Level-L file is picked; the worst case reads ~12 Level-(L+1) files (10 from full coverage, plus two boundary files).
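The arithmetic above, spelled out (numbers straight from the text):

level0_io = 4 * 1 + 10        # four 1MB L0 files + up to 10MB of L1
print(level0_io)              # 14 -> read 14MB, write 14MB

other_io = 2 + 12 * 2         # one 2MB L file + ~12 overlapping 2MB files
print(other_io)               # 26 -> read 26MB, write 26MB
print(other_io * 2 / 100)     # 0.52 -> ≈ 0.5s at 100MB/s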
https://rawgit.com/google/leveldb/master/doc/impl.html
15. 1. Every sstable is created when a fixed (relatively small) size limit is reached. By default L0 gets 5MB of files, and each subsequent
level is 10x the size (in L1 you'll have 50MB of data, L2 500MB, and so on).
2. Sstables are created with the guarantee that they don't overlap
3. When a level fills up, a compaction is triggered and sstables from level-L are promoted to level-L+1. So, in L1 you'll have 50MB in ~10 files,
L2 500MB in ~100 files, etc.
[Slide diagram: L0 = 5M; L1 = 5M x 10 files = 50M (e.g. files spanning A-E, F-J, K-O, P-T); L2 = 5M x 100 files = 500M (e.g. files spanning A-C, D-F, G-I, J-L, M-O, P-R, S-T).]
http://stackoverflow.com/questions/29766453/how-does-the-leveled-compaction-strategy-ensure-90-of-reads-are-from-one-sstabl
[Slide walkthrough: 1MB sstables A, B, C, D, E flush into L0 until it is filled; they compact into one 5MB L1 file (A-E). The next flushes F, G, H, I, J refill L0 and compact into a second 5MB L1 file (F-J), leaving the newest flush K in L0.]
90% of reads are served from the same file: 5/6 = 83.3%, 10/11 = 90%.
The size limits are the total size of all files in a level, not the size of a single file.
Compaction is triggered when L0 fills up with 5MB of files, when L1 fills with 50M,
and when L2 fills with 500M.
Within one level, the key ranges of the SSTables do not overlap,
but the same key may appear in more than one level.
Since each level's total size grows 10x, the file count usually grows 10x as well.
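Where the 90% figure comes from, using the slide's level sizes: the largest level holds roughly 90% of all data, and levels above L0 are non-overlapping, so a uniformly random key is usually found there:

sizes = [5, 50, 500]            # MB in L0, L1, L2, from the slide
print(sizes[-1] / sum(sizes))   # 500/555 ≈ 0.90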
16. The set of sorted tables are organized into a sequence of levels. The sorted table generated from a log file is placed in a special young level
(also called level-0). When the number of young files exceeds a certain threshold (currently four), all of the young files are merged
together with all of the overlapping level-1 files to produce a sequence of new level-1 files (we create a new level-1 file for every 2MB of data.)
When Level-0 exceeds 4 files, all of its files are merged with the overlapping Level-1 files (the trigger fires right as the count passes 4, so "all" is usually exactly those files).
(Initially Level-1 may hold no files at all, in which case only the Level-0 files are merged; whenever Level-1 does hold files, they take part in the merge.)
The merge does not produce one big file: a new output file is started every 2MB of data, so the result may be several small files.
Files flushed from memory into Level-0 are about 1MB (flushing starts when the log reaches 1MB); files at Level-1 and above are 2MB each.
Question: if the log file is 1MB, must the memtable (and the sstable flushed from it) also be 1MB?
Answer: not necessarily. Suppose the workload keeps writing the same key with different values: memory holds a single record,
while the append-only log records every update, so the log can reach 1MB while memory holds perhaps 1KB.
Question: does the log file size decide when to flush to disk, or does the memtable size?
If the log size decides, then since 1MB of log does not occupy much memory, a configured memtable size would serve little purpose.
If the memtable size decides, a flush happens only once the memtable's memory usage reaches its threshold.
If the memtable were 100MB and each generated SSTable 1MB, a flush would produce 100 files in Level-0.
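For what it's worth, LevelDB's own trigger checks the memtable's approximate memory usage against options.write_buffer_size rather than the log file itself; the two grow together because its skiplist memtable appends a new entry even when a key is overwritten. A hedged sketch (the 1MB constant is illustrative):

WRITE_BUFFER_SIZE = 1 << 20  # illustrative threshold

def should_flush(memtable_bytes):
    return memtable_bytes > WRITE_BUFFER_SIZE

# Overwriting one key 100,000 times still appends 100,000 entries in a
# skiplist memtable, so memtable usage tracks the log size closely.
mem_bytes = sum(len("key") + len(f"value{i}") for i in range(100_000))
print(mem_bytes, should_flush(mem_bytes))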
[Slide diagram: the memtable in memory feeds the young level (L0), with levels L1, L2, L3 below it on disk.]
Level 0: When the log file grows above a certain size (1MB by default):
Create a brand new memtable and log file and direct future updates here.
In the background:
1) Write the contents of the previous memtable to an sstable
2) Discard the memtable
3) Delete the old log file and the old memtable (from step 2)
4) Add the new sstable (from step 1) to the young (level-0) level.
memtable → sstable → Level-0 → Level-0 merged with Level-1
18. Files in the young level may contain overlapping keys. However files in other levels have distinct non-overlapping (disjoint) key ranges. Consider level
number L where L >= 1. When the combined size of files in level-L exceeds (10^L) MB (i.e., 10MB for level-1, 100MB for level-2, ...), one file in level-L,
and all of the overlapping files in level-(L+1) are merged to form a set of new files for level-(L+1). These merges have the effect of gradually migrating
new updates from the young level to the largest level using only bulk reads and writes (i.e., minimizing expensive seeks).
When the size of level L exceeds its limit, we compact it in a background thread. The compaction picks a file from level L and all overlapping files from
the next level L+1. Note that if a level-L file overlaps only part of a level-(L+1) file, the entire file at level-(L+1) is used as an input to the compaction and
will be discarded after the compaction. Aside: because level-0 is special (files in it may overlap each other), we treat compactions from level-0 to
level-1 specially: a level-0 compaction may pick more than one level-0 file in case some of these files overlap each other.
Level-0 keys may repeat; at other levels keys do not repeat (across the different SSTables within one level). For merges at Level-1 and above: when the total size of all files
in Level-L exceeds its target (10M for Level-1, 100M for Level-2, ...), one Level-L file is chosen together with every Level-(L+1) file covering its range, and the merge finally
produces a series of new files in Level-(L+1) (just as an L0/L1 merge produces new files in L1; each output file is still 2MB, the per-file size does not change).
[Slide diagram: level stack L0-L3.
level-1: total target size = 10M, each file 2M, so there are 5 files
level-2: total target size = 100M, each file 2M, so there are 50 files
level-3: total target size = 1000M, each file 2M, so there are 500 files
L0 file ranges may overlap (e.g. k1-k3, k2-k5, k1-k4); deeper levels hold disjoint ranges (e.g. k1-k3, k4-k6, k7-k9).]
Merge inputs: one file from Level-L, plus all overlapping files in Level-(L+1).
21. [Slide diagram: a Level-L file spanning K-O is merged with the Level-(L+1) files covering K, L, M, N, O; shown once with overlapping Level-(L+1) files left out of the merge (wrong) and once with all of them included (right).]
If Level-(L+1) files overlap the chosen Level-L file but are not selected for the merge, Level-(L+1) ends up with overlapping files.
If every overlapping Level-(L+1) file takes part in the merge, Level-(L+1) stays free of overlaps.
For every level above Level-0, the key ranges of the SSTables within one level never overlap. This invariant holds precisely because
a Level-L/Level-(L+1) merge selects one Level-L file plus all Level-(L+1) files that overlap it.
Example: Level-(L+1) holds files #1.sst-#11.sst with ranges A-E, F-H, I-L, M-O, P-R, S-V, W-X, Y-Z, ...; Level-L holds files spanning A-J, K-S, T-Z.
Suppose the chosen Level-L file spans K-S: the Level-(L+1) files covering K-S are [#6, #7, #8, #9].
Files #6 and #9 are only partially covered, but each is selected whole, because a file cannot be partially compacted.
22. gradually migrating new updates from the young level
to the largest level using only bulk reads and writes
New updates migrate gradually from the young level (Level-0) down to
the largest level (Level-n), using only bulk reads and writes.
[Slide diagram: a young file [E',F',G'] merges into the lower-level files [A,B,C] [D,E,F] [G,H,I,J] [K,L,M], producing [A,B,C] [D,E/E',F/F'] [G/G',H] [I,J] [K,L,M].]
[Slide walkthrough: successive L0/L1/L2 snapshots as flushes land in L0 and compactions merge overlapping files down one level at a time.]
The file [a,b,c] is picked;
since nothing in L2 overlaps it,
the file is promoted directly to L2.
23. [Slide walkthrough, continued: more L0/L1/L2 snapshots of the same compaction sequence, including cases where new flushes overlap existing L1 and L2 files.]
25. A compaction merges the contents of the picked files to produce a sequence of level-(L+1) files. We switch to producing a
new level-(L+1) file after the current output file has reached the target file size (2MB). We also switch to a new output file
when the key range of the current output file has grown enough to overlap more than ten level-(L+2) files. This last rule
ensures that a later compaction of a level-(L+1) file will not pick up too much data from level-(L+2).
Compactions for a particular level rotate through the key space. In more detail, for each level L, we remember the ending key
of the last compaction at level L. The next compaction for level L will pick the first file that starts after this key (wrapping
around to the beginning of the key space if there is no such file).
The files picked for a compaction (one from Level-L, several from Level-(L+1)) finally produce several files in Level-(L+1), each of the same
size (2MB). "Switch to a new file" means the merge output is buffered and a new output file is started whenever 2MB is reached.
If a new output file's key range comes to overlap more than ten files in Level-(L+2), the output must be split into several files. For example,
suppose the output would span keys 1-50 at 2MB, while at Level-(L+2) every key occupies its own 2MB sstable (keys 1, 2, 3, ... 50, one file
each): the output is then split into several files (each possibly smaller than 2MB), e.g. keys 1-10 against Level-(L+2) files 1-10, then
11-20, 21-30, and so on.
Without this rule, a single 2MB Level-(L+1) file spanning keys 1-50 could overlap 50 Level-(L+2) files; a later compaction of that one file
would have to pull in all 50 of them as inputs, and a compaction should never select that many overlapping files from the next level!
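A sketch of the two cut rules (constants from the text; note LevelDB's actual grandparent check is on overlapping bytes, while the doc states it as a file count):

TARGET_FILE_SIZE = 2 << 20     # 2MB
MAX_GRANDPARENT_FILES = 10     # max overlap with level-(L+2)

def should_cut_output(output_bytes, out_lo, out_hi, grandparent_files):
    # grandparent_files: (smallest_key, largest_key) pairs at level-(L+2)
    too_big = output_bytes >= TARGET_FILE_SIZE
    overlap = sum(1 for lo, hi in grandparent_files
                  if not (hi < out_lo or lo > out_hi))
    return too_big or overlap > MAX_GRANDPARENT_FILES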
50. C* treats a delete as an insert or upsert. The data being added to the partition in the DELETE command is a deletion marker called a
tombstone. Tombstones go through Cassandra's write path and are written to SSTables on one or more nodes. The key distinguishing feature
of a tombstone: it has a built-in expiration (gc_grace_seconds). At the end of its expiration period the tombstone is deleted as part
of Cassandra's normal compaction process. You can also mark a Cassandra record (row or column) with a time-to-live (TTL) value. After this
amount of time has ended, Cassandra marks the record with a tombstone and handles it like other tombstoned records.
A delete is itself an insert or upsert (insert if the record does not exist, update if it does). After a DELETE command, or once a record's TTL
expires, the record is marked with a tombstone. Tombstoned records travel the normal write path into SSTables and are replicated to multiple nodes.
INSERT INTO xx (column) VALUES ('abc') USING TTL 60; => sstable#1: column, abc, ts=123400
60s later, the record is marked as a tombstone => sstable#2: column, delete, ts=123460
some time later, compaction happens => the two records above are merged and only the tombstone survives (note the tombstone is not removed immediately)
Compaction merges the data in each SSTable by partition key, selecting the version of the data with the latest timestamp.
Compaction keeps the record with the highest timestamp. In the tombstone example, sstable#2's record has a higher timestamp than sstable#1's,
so sstable#2 wins; and since it is a delete record, the column no longer exists after compaction (though the tombstone record itself still does).
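A toy reconciliation showing "latest timestamp wins" (not Cassandra's actual code; None stands for a tombstone):

def reconcile(cells):
    # cells: (timestamp, value_or_None) pairs for one column
    return max(cells, key=lambda c: c[0])

cells = [(123400, "abc"), (123460, None)]  # the insert, then the delete
print(reconcile(cells))  # (123460, None): only the tombstone survives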
the client controls how many replicas to block for on writes, which includes deletions. Thus, a delete operation can't just wipe out all traces of the data
being removed immediately: if we did, and a replica did not receive the delete operation, when it becomes available again it will treat the replicas that
did receive the delete as having missed a write update, and repair them! So, instead of wiping out data on delete, C* replaces it with a special value
called a tombstone. The tombstone can then be propagated to replicas that missed the initial remove request. Cassandra defines a constant,
GCGraceSeconds, and has each node track tombstone age locally; once a tombstone has aged past the constant, it can be GC'd during compaction.
Each node tracks tombstone age locally; once a tombstone is older than GCGraceSeconds, it is removed during compaction.
A node that is down cannot receive the tombstone, but if it recovers within GCGraceSeconds it will still get it.
If it stays down longer than GCGraceSeconds, the tombstone has already disappeared, so the node can never receive it.
Cassandra allows you to set a default_time_to_live property for an entire table. Columns and rows marked with regular TTLs are processed
as described above; but when a record exceeds the table-level TTL, Cassandra deletes it immediately, without tombstoning or compaction.
With a table-level TTL, a record that exceeds the TTL is deleted immediately: no tombstone, no compaction!
Data deleted by TTL isn't the same as issuing a delete: each expiring cell internally has a ttl/timestamp at which it will be converted into a tombstone.
There is no tombstone added to the memtable or flushed to disk–it just treats the expired cells as tombstones once they’re past that timestamp.
There are two kinds of TTL:
1. Row- or column-level TTL: once the time passes, the cell is marked as a tombstone.
2. Table-level TTL: no tombstone; the record is dropped the moment the time is up.
https://wiki.apache.org/cassandra/DistributedDeletes
https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlAboutDeletes.html
Distributed deletes
51. If a node receives a delete for data it stores locally, the node tombstones the specified record and tries to pass the tombstone to other nodes
containing replicas of that record. (Which node does the passing: does the client go through a coordinator, i.e. client -> coordinator -> replicas
1, 2, 3, or does one replica forward it to the others?) But if one replica node is unresponsive at that time, it does not receive the tombstone
immediately, so it still contains the pre-delete version of the record. If the tombstoned record has already been deleted from the rest of the
cluster by the time that node recovers (i.e. the other nodes' compactions have purged the tombstone), Cassandra treats the record on the recovered
node as new data, and propagates it to the rest of the cluster. This kind of deleted but persistent record is called a zombie.
To prevent the reappearance of zombies, Cassandra gives each tombstone a grace period. Its purpose is to give unresponsive nodes time to
recover and process tombstones normally. If a client writes a new update to the tombstoned record during the grace period, Cassandra
overwrites the tombstone. If a client sends a read for that record during the grace period, Cassandra disregards the tombstone, and retrieves
the record from other replicas if possible.
When an unresponsive node recovers, Cassandra uses hinted handoff to replay the database mutations the node missed while it was down.
Cassandra will not replay a mutation for a tombstone during its grace period. But if the node does not recover until after the grace period ends,
Cassandra may miss the deletion. After the tombstone's grace period ends, Cassandra deletes the tombstone during compaction.
Does compaction delete the tombstone itself, or the record the tombstone shadows? Must the deletion wait until after gc_grace_period?
Hinted handoff replays mutations but not tombstones during the grace period, so how does the recovered node receive the tombstone?
An unresponsive node answers neither reads nor writes. A write overwriting the tombstone is easy to understand, but why does a read disregard it?
Tombstones exist for a period of time defined by gc_grace_period, giving unresponsive nodes time to recover and process tombstones normally.
Marking data with a tombstone signals Cassandra to keep retrying the delete request (i.e. the tombstone) against any replica that was down at the
time of the delete. If the replica comes back up within the grace period, it eventually receives the request. However, if a node is down longer
than the grace period, it can miss the delete, because the tombstone disappears after gc_grace_seconds. Cassandra always attempts to replay missed
updates when the node comes back up again.
"The tombstone disappears after gc_grace_period" means that during this window C* keeps trying to deliver the tombstone to the downed node; if the
node recovers in time, it is guaranteed to receive it. If the node cannot recover within the window, it misses the tombstone: the data has been
deleted everywhere else but still exists on the late-recovering node, a read then treats that copy as new data and replicates it back to the other
nodes: a zombie! The record we deleted has come back to life.
When a node goes down, C*'s hinted handoff keeps its missed updates on another node (the coordinator) and replays them when the node recovers;
tombstone delivery works similarly. If a node stays down for a long time, hints pile up, and tombstones should not be kept forever, so past a
limit a still-down node may simply lose the tombstone. The default is 10 days (864000s); a node staying down for 10 straight days is rare.
[Slide diagram: during the grace period a write overwrites the tombstone and a read disregards it; e.g. insert (ABC, D), tombstone ABC, update (ABC, E), query ABC. Distributed delete -> fully consistent delete: no data reappears.]
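A toy predicate for the failure mode above (illustrative only; 10 days is the default gc_grace_seconds expressed in days):

GC_GRACE_DAYS = 10  # default gc_grace_seconds = 864000s

def becomes_zombie(node_down_days):
    # A node down longer than the grace period misses the tombstone
    # entirely; repair then copies its stale data back to the cluster.
    return node_down_days > GC_GRACE_DAYS

print(becomes_zombie(3))   # False: tombstone still delivered on recovery
print(becomes_zombie(14))  # True: tombstone purged, deleted data reappears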
52. 》A tombstone is removed after gc_grace_seconds; if a node stays down longer than gc_grace_seconds, a zombie can still appear.
》If a node recovers within gc_grace_seconds, read repair and anti-entropy propagate the other nodes' tombstones to it.
Both mechanisms prevent zombies within gc_grace: read repair requires a read to actually occur, while anti-entropy runs via a manual repair.
With read repair alone, a record that is never read would leave the other nodes' tombstone unpropagated to the recovered node.
》When compaction merges a record's updates together with its tombstone, and the tombstone carries the newest timestamp, the older records are
dropped but the tombstone itself is kept for a while longer, so that the delete can still be communicated to the other nodes in the cluster.
The expired records can be dropped because the tombstone's timestamp is highest; keeping older-timestamped records would only waste space.
Note that at this point (compaction) the tombstoned data (the data, not the tombstone itself) is directly removed during the compaction. However,
as previously discussed, we do still store the tombstone marker itself on disk, as we need to keep a record of the delete in order to effectively
communicate the delete operation to the rest of the cluster. We need not keep the actual value, as that is not needed for consistency.
》A record becomes tombstoned in two ways: a DELETE command, or an INSERT/UPDATE with a TTL.
DELETE can produce zombies but TTL cannot: DELETE relies on tombstone messages between nodes, some of which may be down,
whereas with TTL every node knows by itself when to drop the record; no inter-node tombstone traffic is needed (what if a node is down?).
TTL are not affected as no node can have the data and miss the associated TTL, it is atomic, the same record.
Any node having the data will also know when the data has to be deleted.
》When is the tombstone record itself deleted?
C* will fully drop those tombstones when a compaction triggers, only after local_delete_time + gc_grace_seconds as defined on the table the data
belongs to. Remember that all the nodes are supposed to have been repaired within gc_grace_seconds to ensure a correct distribution of the
tombstones and prevent deleted data from reappearing as mentioned above.
In order to make sure that all the replicas received the delete and have the tombstone stored (avoiding zombie data issues),
our only option is a full repair. After gc_grace_seconds the tombstone will eventually be evicted, and
if a node missed the tombstone, we are back in the situation where the deleted data can reappear (zombie).
》There are further conditions for removing a tombstone: all data fragments related to the tombstone must take part in the same compaction.
We need all the fragments of a row or partition to be in the same compaction for the tombstone to be removed. Considering a compaction that handles
files 1 to 4: if some data is in sstable 5, the tombstones will not be evicted, as we still need them to mark the data in SSTable 5 as deleted, or
the data from SSTable 5 would come back (zombie). In other words, if files 1-4 were compacted and the tombstone dropped, records still sitting in
file 5 could come back to life, so the tombstone must be kept.
》The tombstone_threshold option (backed by a TTL EstimatedHistogram) triggers a single-SSTable compaction when more than 20% of an SSTable is
tombstones, dropping large numbers of them. Even then, tombstones whose rows overlap other SSTables cannot be dropped, since every file holding
the shadowed record must be compacted away first:
Not clearing tombstones because compacted rows with a lower timestamp were present in other sstables.
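A sketch of the purge conditions just described (parameter names are made up, not Cassandra's internals):

def can_drop_tombstone(local_delete_time, gc_grace_seconds, now,
                       tombstone_ts, min_ts_in_other_sstables):
    # 1) the grace period must have fully elapsed
    aged_out = now >= local_delete_time + gc_grace_seconds
    # 2) no sstable outside this compaction may still hold older data
    #    that the tombstone shadows (else that data would resurrect)
    no_shadowed_data_elsewhere = (min_ts_in_other_sstables is None or
                                  min_ts_in_other_sstables > tombstone_ts)
    return aged_out and no_shadowed_data_elsewhere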
》Setting gc_grace_seconds to 0 can produce zombies, though on a single-node deployment 0 is safe, since there are no distributed deletes.
With gc_grace_seconds = 0, tombstones are removed at the next compaction.
》The purpose of gc_grace_seconds is to give a downed node a window in which to recover (and receive the tombstones it missed). If it is still down
past that window, a zombie can appear and even repair is powerless; in fact it is the repair mechanism that creates the zombie. Without repair, the
stale data on the recovered node would never be propagated to the other nodes (which have already deleted it). But repair cannot be dropped either:
when only the recovered node holds some data, consistency requires copying it to the others.
WARNING: Never set gc_grace_seconds this low, or else previously deleted data may reappear via repair if a node was down while tombstones were
removed. The other nodes drop the tombstone once gc_grace_seconds has passed, while the just-recovered node still holds the old record; with no
tombstone left anywhere in the cluster, the record becomes a zombie.
54. http://www.slideshare.net/planetcassandra/tombstones-and-compaction-48960191
[Slide diagrams: hard delete vs. tombstone while a node is down.]
Tombstones are distributed by AntiEntropy and Read Repair.
But we need to set a limit on how long they will be distributed for.
gc_grace_seconds: do not distribute tombstones older than this many seconds.
Tombstones older than gc_grace_seconds are no longer replicated.
With the double guarantee of anti-entropy and read repair,
a node that recovers within gc_grace_seconds eventually
receives the tombstone, so all three replicas end up holding it.
Purging the deleted data: by compaction.
Purging the tombstone: after gc_grace_seconds.
Repair / Zombie.
Distributed delete & distributed tombstone.
59. https://www.tomaz.me/slides/2014-24-03-cassandra-anti-patterns/#/23
http://distributeddatastore.blogspot.com/2016/04/tombstones.html
How do tombstones affect range query reads?
1. Gimme a single column (e.g. c5) - no biggie: uses the bloom filter and a pointer to an offset in an sstable file, skips tombstones
2. Gimme all the columns between c5 and c10 (inclusive) - houston, we have a problem (need to do late filtering)!
Tombstones can impact the performance of slice queries, especially for wide rows. When Cassandra executes a column slice query it
needs to read columns from all the SSTables that include the given row and filter out tombstones. All these tombstones must be kept in
memory until the row fragments from all SSTables are merged, which increases heap usage.
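A toy merge showing why those tombstones stay in memory for the whole slice (simplified: no timestamps, a tombstone always shadows its column):

import heapq

def slice_query(sstables, lo, hi):
    # sstables: lists of (column_name, value_or_None) sorted by name;
    # None marks a tombstone.
    frags = [[c for c in s if lo <= c[0] <= hi] for s in sstables]
    dead, live = set(), {}
    for name, value in heapq.merge(*frags, key=lambda c: c[0]):
        if value is None:
            dead.add(name)   # held until every fragment has been merged
        else:
            live.setdefault(name, value)
    return [(n, v) for n, v in sorted(live.items()) if n not in dead]

print(slice_query([[("c5", "x"), ("c7", None)], [("c6", "y"), ("c7", "z")]],
                  "c5", "c9"))  # [('c5', 'x'), ('c6', 'y')]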
60. Without GC grace seconds, deleted data can be revived:
data deleted and removed during compaction while a replica is offline
will be restored when the replica comes back online.
With GC grace seconds, a temporary outage will not affect deletes:
deleted data is not removed until after GC grace seconds have passed, so an
outage of node 2 shorter than GC grace seconds cannot corrupt the database.
http://www.planetcassandra.org/blog/qa-starters-guide-to-cassandra/
Removing the tombstone at compaction: when node 2 recovers, repair (node 2 -> node 1) resurrects its stale data as a zombie.
Keeping the tombstone at compaction: when node 2 recovers, repair from node 1 (node 1 -> node 2) propagates the tombstone to node 2.
When data and its tombstone coexist, the two must differ in timestamp: if the data is older than the tombstone, only the tombstone
survives compaction; if the data is newer than the tombstone, only the data survives.
74. First, memtables are flushed to SSTables in the first level (L0). Then compaction merges these SSTables with larger SSTables in level L1.
The SSTables in levels greater than L1 are merged into SSTables with a size greater than or equal to sstable_size_in_mb (default: 160 MB). If an L1
SSTable stores data of a partition that is larger than L2, LCS moves the SSTable past L2 to the next level up.
In each of the levels above L0, LCS creates SSTables that are about the same size. Each level is 10X the size of the last level, so level L1 has 10X as
many SSTables as L0, and level L2 has 100X as many. If the result of the compaction is more than 10 SSTables in level L1, the excess SSTables are
moved to level L2.
The LCS compaction process guarantees that the SSTables within each level starting with L1 have non-overlapping data. For many reads, this
guarantee enables Cassandra to retrieve all the required data from only one or two SSTables. In fact, 90% of all reads can be satisfied from one
SSTable. Since LCS does not compact L0 tables, however, resource-intensive reads involving many L0 SStables may still occur.
At levels beyond L0, LCS requires less disk space for compacting — generally, 10X the fixed size of the SSTable. Obsolete data is evicted more often,
so deleted data uses smaller portions of the SSTables on disk. However, LCS compaction operations take place more often and place more I/O burden
on the node. For write-intensive workloads, the payoff of using this strategy is generally not worth the performance loss to I/O operations. In many
cases, tests of LCS-configured tables reveal I/O saturation on writes and compactions.
new sstables are added to the first level - L0, and immediately compacted with the sstables in L1. When L1 fills up, extra sstables are promoted to L2.
Subsequent sstables generated in L1 will be compacted with the sstables in L2 with which they overlap.
Once a memtable is flushed to an SSTable it is added to L0, and SSTables added to L0 are immediately compacted with the L1 files.
At L1 and above, each SSTable file is roughly 160MB.
Measured by total file size per level, each level is 10x the one before it.
When L1 exceeds 10 files, the excess files are moved to L2 (when L2 has no files yet, they are simply moved there directly).
How big are the L0 SSTables? Also 160MB?
At L1 and above, SSTables within the same level hold no overlapping data.
LCS does not compact L0 files among themselves, so many L0 files can hurt reads, since every L0 file must be consulted.
Compacting needs only about 10 x 160MB of extra space (ten files' worth).
The more often compaction runs, the sooner obsolete data is purged, but the more I/O pressure the node bears.
For write-heavy workloads, the I/O this strategy consumes is generally not worth the performance loss.
SSTables added to L0 are immediately compacted with L1; when L1 fills up, extra sstables are promoted to L2.
SSTables subsequently generated into L1 are compacted with the overlapping files in L2 (is this unconditional, or only under extra conditions?).
A sketch of the level sizing follows.
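A sketch of the sizing rule with the numbers from the text (sstable_size_in_mb = 160, 10x growth per level):

SSTABLE_MB = 160  # sstable_size_in_mb default

def lcs_level_capacity(level):
    # Total MB and approximate file count for level >= 1.
    total_mb = SSTABLE_MB * 10 ** level
    return total_mb, total_mb // SSTABLE_MB

for lvl in (1, 2, 3):
    print(lvl, lcs_level_capacity(lvl))
# 1 (1600, 10)     -> L1: ~10 files
# 2 (16000, 100)   -> L2: ~100 files
# 3 (160000, 1000) -> L3: ~1000 files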