13. The set of sorted tables are organized into a sequence of levels. The sorted table generated from a log file is placed in a special young level (also
called level-0). When the number of young files exceeds a certain threshold (currently four), all of the young files are merged together with all of the
overlapping level-1 files to produce a sequence of new level-1 files (we create a new level-1 file for every 2MB of data.)
SSTables are organized into levels. A file flushed from the log goes straight into Level-0; when Level-0 exceeds 4 files (each about 1MB), they are merged with the overlapping files in Level-1.
Files in the young level may contain overlapping keys. However files in other levels have distinct non-overlapping (disjoint) key ranges. Consider level
number L where L >= 1. When the combined size of files in level-L exceeds (10^L) MB (i.e., 10MB for level-1, 100MB for level-2, ...), one file in level-L,
and all of the overlapping files in level-(L+1) are merged to form a set of new files for level-(L+1). These merges have the effect of gradually migrating
new updates from the young level to the largest level using only bulk reads and writes (i.e., minimizing expensive seeks).
Level-0 keys may overlap, because its files are flushed directly from memory. When the total size of all files in Level-L (L >= 1) exceeds 10^L MB, one Level-L file is merged with the overlapping Level-(L+1) files.
When the size of level L exceeds its limit, we compact it in a background thread. The compaction picks a file from level L and all overlapping files from
the next level L+1. Note that if a level-L file overlaps only part of a level-(L+1) file, the entire file at level-(L+1) is used as an input to the compaction and
will be discarded after the compaction. Aside: because level-0 is special (files in it may overlap each other), we treat compactions from level-0 to
level-1 specially: a level-0 compaction may pick more than one level-0 file in case some of these files overlap each other.
If a Level-L file covers only part of a Level-(L+1) file, e.g. the Level-L file spans keys [D-F] while the Level-(L+1) file spans [B-G], the entire Level-(L+1) file is still used as an input to the merge.
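A minimal Python sketch of this input selection (the names are illustrative, not LevelDB's API); a file is modeled as a (smallest_key, largest_key) pair:

def overlaps(f, lo, hi):
    smallest, largest = f
    return not (largest < lo or smallest > hi)

def pick_compaction_inputs(level_file, next_level_files):
    # One level-L file plus every level-(L+1) file whose key range
    # intersects it; whole files are taken even on partial overlap.
    lo, hi = level_file
    return [f for f in next_level_files if overlaps(f, lo, hi)]

# The [D-F] vs [B-G] example above: the entire [B-G] file is selected.
print(pick_compaction_inputs(("D", "F"), [("B", "G"), ("H", "K")]))  # [('B', 'G')]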
A compaction merges the contents of the picked files to produce a sequence of level-(L+1) files. We switch to producing a new level-(L+1) file after the
current output file has reached the target file size (2MB). We also switch to a new output file when the key range of the current output file has
grown enough to overlap more than ten level-(L+2) files. This last rule ensures that a later compaction of a level-(L+1) file will not pick up too much
data from level-(L+2). When Level-L and Level-(L+1) are merged, a new output file is started once the current one reaches 2MB; if the output's key range comes to cover more than ten Level-(L+2) files, a new file is also started early.
Compactions for a particular level rotate through the key space. In more detail, for each level L, we remember the ending key of the last compaction at
level L. The next compaction for level L will pick the first file that starts after this key (wrapping around to the beginning of the key space if there is no
such file). To pick Level-L's merge inputs fairly, we remember the end key of the previous compaction, and the next compaction picks the first file that starts after that key.
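A sketch of this rotation, assuming files are (smallest_key, largest_key) pairs sorted by smallest key (pick_next_file is a made-up helper):

def pick_next_file(files, last_end_key):
    # First file starting after the previous compaction's end key,
    # wrapping to the beginning of the key space if none qualifies.
    for f in files:
        if last_end_key is None or f[0] > last_end_key:
            return f
    return files[0]

files = [("a", "c"), ("d", "f"), ("g", "i")]
assert pick_next_file(files, "f") == ("g", "i")
assert pick_next_file(files, "i") == ("a", "c")  # wrapped around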
Level 0: When the log file grows above a certain size (1MB by default):
Create a brand new memtable and log file and direct future updates here. In the background:
1) Write the contents of the previous memtable to an sstable
2) Discard the memtable
3) Delete the old log file and the old memtable (from step 2)
4) Add the new sstable (from step 1) to the young (level-0) level.
Once the log exceeds 1MB: create a new memtable and log file, flush the old memtable to an sstable, discard the old memtable, delete the old log, and add the new sstable to L0 (a toy sketch follows below).
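A self-contained toy of the switch-over (DB, put, and _rotate are stand-ins, not LevelDB's classes; the dict memtable here deduplicates by key, which foreshadows the log-vs-memtable size question on a later slide):

LOG_SIZE_LIMIT = 1 << 20  # 1MB

class DB:
    def __init__(self):
        self.mem, self.log_bytes = {}, 0
        self.level0 = []  # flushed "sstables": sorted (key, value) lists

    def put(self, key, value):
        self.log_bytes += len(key) + len(value)  # the log only appends
        self.mem[key] = value                    # this toy memtable dedups
        if self.log_bytes > LOG_SIZE_LIMIT:
            self._rotate()

    def _rotate(self):
        old_mem = self.mem
        self.mem, self.log_bytes = {}, 0   # new memtable + new log
        sstable = sorted(old_mem.items())  # 1) flush the old memtable
        self.level0.append(sstable)        # 4) add it to the young level
        # 2)/3): the old memtable and old log are simply dropped here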
Level-0 compactions will read up to four 1MB files from level-0, and at worst all the level-1 files (10MB). I.e., we will read 14MB and write 14MB. Other
than the special level-0 compactions, we will pick one 2MB file from level L. In the worst case, this will overlap ~ 12 files from level L+1 (10 because
level-(L+1) is ten times the size of level-L, and another two at the boundaries since the file ranges at level-L will usually not be aligned with the file
ranges at level-L+1). The compaction will therefore read/write 26MB. Assuming a disk IO rate of 100MB/s, the worst compaction cost will be ≈ 0.5 s.
A Level-0 compaction reads at most four 1MB Level-0 files and, in the worst case, all 10MB of Level-1 (Level-1's target size is 10MB, and all of it may overlap Level-0).
For other levels, one 2MB Level-L file is picked; the worst case reads ~12 Level-(L+1) files (10 from full coverage, plus two boundary files).
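The arithmetic above, spelled out (numbers straight from the text):

level0_io = 4 * 1 + 10        # four 1MB L0 files + up to 10MB of L1
print(level0_io)              # 14 -> read 14MB, write 14MB

other_io = 2 + 12 * 2         # one 2MB L file + ~12 overlapping 2MB files
print(other_io)               # 26 -> read 26MB, write 26MB
print(other_io * 2 / 100)     # 0.52 -> ≈ 0.5s at 100MB/s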
https://rawgit.com/google/leveldb/master/doc/impl.html
15. 1. Every sstable is created when a fixed (relatively small) size limit is reached. By default L0 gets 5MB of files, and each subsequent
level is 10x the size (in L1 you'll have 50MB of data, L2 500MB, and so on).
2. Sstables are created with the guarantee that they don't overlap
3. When a level fills up, a compaction is triggered and sstables from level-L are promoted to level-L+1. So, in L1 you'll have 50MB in ~10 files,
L2 500MB in ~100 files, etc.
[Slide diagram: L0 = 5M; L1 = 5M x 10 files = 50M (e.g. files spanning A-E, F-J, K-O, P-T); L2 = 5M x 100 files = 500M (e.g. files spanning A-C, D-F, G-I, J-L, M-O, P-R, S-T).]
http://stackoverflow.com/questions/29766453/how-does-the-leveled-compaction-strategy-ensure-90-of-reads-are-from-one-sstabl
[Slide walkthrough: 1MB sstables A, B, C, D, E flush into L0 until it is filled; they compact into one 5MB L1 file (A-E). The next flushes F, G, H, I, J refill L0 and compact into a second 5MB L1 file (F-J), leaving the newest flush K in L0.]
90% of reads are served from the same file: 5/6 = 83.3%, 10/11 = 90%.
The size limits are the total size of all files in a level, not the size of a single file.
Compaction is triggered when L0 fills up with 5MB of files, when L1 fills with 50M,
and when L2 fills with 500M.
Within one level, the key ranges of the SSTables do not overlap,
but the same key may appear in more than one level.
Since each level's total size grows 10x, the file count usually grows 10x as well.
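Where the 90% figure comes from, using the slide's level sizes: the largest level holds roughly 90% of all data, and levels above L0 are non-overlapping, so a uniformly random key is usually found there:

sizes = [5, 50, 500]            # MB in L0, L1, L2, from the slide
print(sizes[-1] / sum(sizes))   # 500/555 ≈ 0.90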
16. The set of sorted tables are organized into a sequence of levels. The sorted table generated from a log file is placed in a special young level
(also called level-0). When the number of young files exceeds a certain threshold (currently four), all of the young files are merged
together with all of the overlapping level-1 files to produce a sequence of new level-1 files (we create a new level-1 file for every 2MB of data.)
When Level-0 exceeds 4 files, all of its files are merged with the overlapping Level-1 files (the trigger fires right as the count passes 4, so "all" is usually exactly those files).
(Initially Level-1 may hold no files at all, in which case only the Level-0 files are merged; whenever Level-1 does hold files, they take part in the merge.)
The merge does not produce one big file: a new output file is started every 2MB of data, so the result may be several small files.
Files flushed from memory into Level-0 are about 1MB (flushing starts when the log reaches 1MB); files at Level-1 and above are 2MB each.
Question: if the log file is 1MB, must the memtable (and the sstable flushed from it) also be 1MB?
Answer: not necessarily. Suppose the workload keeps writing the same key with different values: memory holds a single record,
while the append-only log records every update, so the log can reach 1MB while memory holds perhaps 1KB.
Question: does the log file size decide when to flush to disk, or does the memtable size?
If the log size decides, then since 1MB of log does not occupy much memory, a configured memtable size would serve little purpose.
If the memtable size decides, a flush happens only once the memtable's memory usage reaches its threshold.
If the memtable were 100MB and each generated SSTable 1MB, a flush would produce 100 files in Level-0.
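For what it's worth, LevelDB's own trigger checks the memtable's approximate memory usage against options.write_buffer_size rather than the log file itself; the two grow together because its skiplist memtable appends a new entry even when a key is overwritten. A hedged sketch (the 1MB constant is illustrative):

WRITE_BUFFER_SIZE = 1 << 20  # illustrative threshold

def should_flush(memtable_bytes):
    return memtable_bytes > WRITE_BUFFER_SIZE

# Overwriting one key 100,000 times still appends 100,000 entries in a
# skiplist memtable, so memtable usage tracks the log size closely.
mem_bytes = sum(len("key") + len(f"value{i}") for i in range(100_000))
print(mem_bytes, should_flush(mem_bytes))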
[Slide diagram: the memtable in memory feeds the young level (L0), with levels L1, L2, L3 below it on disk.]
Level 0: When the log file grows above a certain size (1MB by default):
Create a brand new memtable and log file and direct future updates here.
In the background:
1) Write the contents of the previous memtable to an sstable
2) Discard the memtable
3) Delete the old log file and the old memtable (from step 2)
4) Add the new sstable (from step 1) to the young (level-0) level.
memtable → sstable → Level-0 → Level-0 merged with Level-1
18. Files in the young level may contain overlapping keys. However files in other levels have distinct non-overlapping (disjoint) key ranges. Consider level
number L where L >= 1. When the combined size of files in level-L exceeds (10^L) MB (i.e., 10MB for level-1, 100MB for level-2, ...), one file in level-L,
and all of the overlapping files in level-(L+1) are merged to form a set of new files for level-(L+1). These merges have the effect of gradually migrating
new updates from the young level to the largest level using only bulk reads and writes (i.e., minimizing expensive seeks).
When the size of level L exceeds its limit, we compact it in a background thread. The compaction picks a file from level L and all overlapping files from
the next level L+1. Note that if a level-L file overlaps only part of a level-(L+1) file, the entire file at level-(L+1) is used as an input to the compaction and
will be discarded after the compaction. Aside: because level-0 is special (files in it may overlap each other), we treat compactions from level-0 to
level-1 specially: a level-0 compaction may pick more than one level-0 file in case some of these files overlap each other.
Level-0 keys may repeat; at other levels keys do not repeat (across the different SSTables within one level). For merges at Level-1 and above: when the total size of all files
in Level-L exceeds its target (10M for Level-1, 100M for Level-2, ...), one Level-L file is chosen together with every Level-(L+1) file covering its range, and the merge finally
produces a series of new files in Level-(L+1) (just as an L0/L1 merge produces new files in L1; each output file is still 2MB, the per-file size does not change).
[Slide diagram: level stack L0-L3.
level-1: total target size = 10M, each file 2M, so there are 5 files
level-2: total target size = 100M, each file 2M, so there are 50 files
level-3: total target size = 1000M, each file 2M, so there are 500 files
L0 file ranges may overlap (e.g. k1-k3, k2-k5, k1-k4); deeper levels hold disjoint ranges (e.g. k1-k3, k4-k6, k7-k9).]
Merge inputs: one file from Level-L, plus all overlapping files in Level-(L+1).
21. [Slide diagram: a Level-L file spanning K-O is merged with the Level-(L+1) files covering K, L, M, N, O; shown once with overlapping Level-(L+1) files left out of the merge (wrong) and once with all of them included (right).]
If Level-(L+1) files overlap the chosen Level-L file but are not selected for the merge, Level-(L+1) ends up with overlapping files.
If every overlapping Level-(L+1) file takes part in the merge, Level-(L+1) stays free of overlaps.
For every level above Level-0, the key ranges of the SSTables within one level never overlap. This invariant holds precisely because
a Level-L/Level-(L+1) merge selects one Level-L file plus all Level-(L+1) files that overlap it.
Example: Level-(L+1) holds files #1.sst-#11.sst with ranges A-E, F-H, I-L, M-O, P-R, S-V, W-X, Y-Z, ...; Level-L holds files spanning A-J, K-S, T-Z.
Suppose the chosen Level-L file spans K-S: the Level-(L+1) files covering K-S are [#6, #7, #8, #9].
Files #6 and #9 are only partially covered, but each is selected whole, because a file cannot be partially compacted.
22. gradually migrating new updates from the young level
to the largest level using only bulk reads and writes
New updates migrate gradually from the young level (Level-0) down to
the largest level (Level-n), using only bulk reads and writes.
[Slide diagram: a young file [E',F',G'] merges into the lower-level files [A,B,C] [D,E,F] [G,H,I,J] [K,L,M], producing [A,B,C] [D,E/E',F/F'] [G/G',H] [I,J] [K,L,M].]
[Slide walkthrough: successive L0/L1/L2 snapshots as flushes land in L0 and compactions merge overlapping files down one level at a time.]
The file [a,b,c] is picked;
since nothing in L2 overlaps it,
the file is promoted directly to L2.
23. [Slide walkthrough, continued: more L0/L1/L2 snapshots of the same compaction sequence, including cases where new flushes overlap existing L1 and L2 files.]
25. A compaction merges the contents of the picked files to produce a sequence of level-(L+1) files. We switch to producing a
new level-(L+1) file after the current output file has reached the target file size (2MB). We also switch to a new output file
when the key range of the current output file has grown enough to overlap more than ten level-(L+2) files. This last rule
ensures that a later compaction of a level-(L+1) file will not pick up too much data from level-(L+2).
Compactions for a particular level rotate through the key space. In more detail, for each level L, we remember the ending key
of the last compaction at level L. The next compaction for level L will pick the first file that starts after this key (wrapping
around to the beginning of the key space if there is no such file).
The files picked for a compaction (one from Level-L, several from Level-(L+1)) finally produce several files in Level-(L+1), each of the same
size (2MB). "Switch to a new file" means the merge output is buffered and a new output file is started whenever 2MB is reached.
If a new output file's key range comes to overlap more than ten files in Level-(L+2), the output must be split into several files. For example,
suppose the output would span keys 1-50 at 2MB, while at Level-(L+2) every key occupies its own 2MB sstable (keys 1, 2, 3, ... 50, one file
each): the output is then split into several files (each possibly smaller than 2MB), e.g. keys 1-10 against Level-(L+2) files 1-10, then
11-20, 21-30, and so on.
Without this rule, a single 2MB Level-(L+1) file spanning keys 1-50 could overlap 50 Level-(L+2) files; a later compaction of that one file
would have to pull in all 50 of them as inputs, and a compaction should never select that many overlapping files from the next level!
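A sketch of the two cut rules (constants from the text; note LevelDB's actual grandparent check is on overlapping bytes, while the doc states it as a file count):

TARGET_FILE_SIZE = 2 << 20     # 2MB
MAX_GRANDPARENT_FILES = 10     # max overlap with level-(L+2)

def should_cut_output(output_bytes, out_lo, out_hi, grandparent_files):
    # grandparent_files: (smallest_key, largest_key) pairs at level-(L+2)
    too_big = output_bytes >= TARGET_FILE_SIZE
    overlap = sum(1 for lo, hi in grandparent_files
                  if not (hi < out_lo or lo > out_hi))
    return too_big or overlap > MAX_GRANDPARENT_FILES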
50. C* treats a delete as an insert or upsert. The data being added to the partition in the DELETE command is a deletion marker called a
tombstone. Tombstones go through Cassandra's write path and are written to SSTables on one or more nodes. The key distinguishing feature
of a tombstone: it has a built-in expiration (gc_grace_seconds). At the end of its expiration period the tombstone is deleted as part
of Cassandra's normal compaction process. You can also mark a Cassandra record (row or column) with a time-to-live (TTL) value. After this
amount of time has ended, Cassandra marks the record with a tombstone and handles it like other tombstoned records.
A delete is itself an insert or upsert (insert if the record does not exist, update if it does). After a DELETE command, or once a record's TTL
expires, the record is marked with a tombstone. Tombstoned records travel the normal write path into SSTables and are replicated to multiple nodes.
INSERT INTO xx (column) VALUES ('abc') USING TTL 60; => sstable#1: column, abc, ts=123400
60s later, the record is marked as a tombstone => sstable#2: column, delete, ts=123460
some time later, compaction happens => the two records above are merged and only the tombstone survives (note the tombstone is not removed immediately)
Compaction merges the data in each SSTable by partition key, selecting the version of the data with the latest timestamp.
Compaction keeps the record with the highest timestamp. In the tombstone example, sstable#2's record has a higher timestamp than sstable#1's,
so sstable#2 wins; and since it is a delete record, the column no longer exists after compaction (though the tombstone record itself still does).
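A toy reconciliation showing "latest timestamp wins" (not Cassandra's actual code; None stands for a tombstone):

def reconcile(cells):
    # cells: (timestamp, value_or_None) pairs for one column
    return max(cells, key=lambda c: c[0])

cells = [(123400, "abc"), (123460, None)]  # the insert, then the delete
print(reconcile(cells))  # (123460, None): only the tombstone survives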
the client controls how many replicas to block for on writes, which includes deletions. Thus, a delete operation can't just wipe out all traces of the data
being removed immediately: if we did, and a replica did not receive the delete operation, when it becomes available again it will treat the replicas that
did receive the delete as having missed a write update, and repair them! So, instead of wiping out data on delete, C* replaces it with a special value
called a tombstone. The tombstone can then be propagated to replicas that missed the initial remove request. Cassandra defines a constant,
GCGraceSeconds, and has each node track tombstone age locally; once a tombstone has aged past the constant, it can be GC'd during compaction.
Each node tracks tombstone age locally; once a tombstone is older than GCGraceSeconds, it is removed during compaction.
A node that is down cannot receive the tombstone, but if it recovers within GCGraceSeconds it will still get it.
If it stays down longer than GCGraceSeconds, the tombstone has already disappeared, so the node can never receive it.
Cassandra allows you to set a default_time_to_live property for an entire table. Columns and rows marked with regular TTLs are processed
as described above; but when a record exceeds the table-level TTL, Cassandra deletes it immediately, without tombstoning or compaction.
With a table-level TTL, a record that exceeds the TTL is deleted immediately: no tombstone, no compaction!
Data deleted by TTL isn't the same as issuing a delete: each expiring cell internally has a ttl/timestamp at which it will be converted into a tombstone.
There is no tombstone added to the memtable or flushed to disk–it just treats the expired cells as tombstones once they’re past that timestamp.
There are two kinds of TTL:
1. Row- or column-level TTL: once the time passes, the cell is marked as a tombstone.
2. Table-level TTL: no tombstone; the record is dropped the moment the time is up.
https://wiki.apache.org/cassandra/DistributedDeletes
https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlAboutDeletes.html
Distributed deletes
51. If a node receives a delete for data it stores locally, the node tombstones the specified record and tries to pass the tombstone to other nodes
containing replicas of that record. (Which node does the passing: does the client go through a coordinator, i.e. client -> coordinator -> replicas
1, 2, 3, or does one replica forward it to the others?) But if one replica node is unresponsive at that time, it does not receive the tombstone
immediately, so it still contains the pre-delete version of the record. If the tombstoned record has already been deleted from the rest of the
cluster by the time that node recovers (i.e. the other nodes' compactions have purged the tombstone), Cassandra treats the record on the recovered
node as new data, and propagates it to the rest of the cluster. This kind of deleted but persistent record is called a zombie.
To prevent the reappearance of zombies, Cassandra gives each tombstone a grace period. Its purpose is to give unresponsive nodes time to
recover and process tombstones normally. If a client writes a new update to the tombstoned record during the grace period, Cassandra
overwrites the tombstone. If a client sends a read for that record during the grace period, Cassandra disregards the tombstone, and retrieves
the record from other replicas if possible.
When an unresponsive node recovers, Cassandra uses hinted handoff to replay the database mutations the node missed while it was down.
Cassandra will not replay a mutation for a tombstone during its grace period. But if the node does not recover until after the grace period ends,
Cassandra may miss the deletion. After the tombstone's grace period ends, Cassandra deletes the tombstone during compaction.
Does compaction delete the tombstone itself, or the record the tombstone shadows? Must the deletion wait until after gc_grace_period?
Hinted handoff replays mutations but not tombstones during the grace period, so how does the recovered node receive the tombstone?
An unresponsive node answers neither reads nor writes. A write overwriting the tombstone is easy to understand, but why does a read disregard it?
Tombstones exist for a period of time defined by gc_grace_period, giving unresponsive nodes time to recover and process tombstones normally.
Marking data with a tombstone signals Cassandra to keep retrying the delete request (i.e. the tombstone) against any replica that was down at the
time of the delete. If the replica comes back up within the grace period, it eventually receives the request. However, if a node is down longer
than the grace period, it can miss the delete, because the tombstone disappears after gc_grace_seconds. Cassandra always attempts to replay missed
updates when the node comes back up again.
"The tombstone disappears after gc_grace_period" means that during this window C* keeps trying to deliver the tombstone to the downed node; if the
node recovers in time, it is guaranteed to receive it. If the node cannot recover within the window, it misses the tombstone: the data has been
deleted everywhere else but still exists on the late-recovering node, a read then treats that copy as new data and replicates it back to the other
nodes: a zombie! The record we deleted has come back to life.
When a node goes down, C*'s hinted handoff keeps its missed updates on another node (the coordinator) and replays them when the node recovers;
tombstone delivery works similarly. If a node stays down for a long time, hints pile up, and tombstones should not be kept forever, so past a
limit a still-down node may simply lose the tombstone. The default is 10 days (864000s); a node staying down for 10 straight days is rare.
[Slide diagram: during the grace period a write overwrites the tombstone and a read disregards it; e.g. insert (ABC, D), tombstone ABC, update (ABC, E), query ABC. Distributed delete -> fully consistent delete: no data reappears.]
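A toy predicate for the failure mode above (illustrative only; 10 days is the default gc_grace_seconds expressed in days):

GC_GRACE_DAYS = 10  # default gc_grace_seconds = 864000s

def becomes_zombie(node_down_days):
    # A node down longer than the grace period misses the tombstone
    # entirely; repair then copies its stale data back to the cluster.
    return node_down_days > GC_GRACE_DAYS

print(becomes_zombie(3))   # False: tombstone still delivered on recovery
print(becomes_zombie(14))  # True: tombstone purged, deleted data reappears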
52. 》A tombstone is removed after gc_grace_seconds; if a node stays down longer than gc_grace_seconds, a zombie can still appear.
》If a node recovers within gc_grace_seconds, read repair and anti-entropy propagate the other nodes' tombstones to it.
Both mechanisms prevent zombies within gc_grace: read repair requires a read to actually occur, while anti-entropy runs via a manual repair.
With read repair alone, a record that is never read would leave the other nodes' tombstone unpropagated to the recovered node.
》When compaction merges a record's updates together with its tombstone, and the tombstone carries the newest timestamp, the older records are
dropped but the tombstone itself is kept for a while longer, so that the delete can still be communicated to the other nodes in the cluster.
The expired records can be dropped because the tombstone's timestamp is highest; keeping older-timestamped records would only waste space.
Note that at this point (compaction) the tombstoned data (the data, not the tombstone itself) is directly removed during the compaction. However,
as previously discussed, we do still store the tombstone marker itself on disk, as we need to keep a record of the delete in order to effectively
communicate the delete operation to the rest of the cluster. We need not keep the actual value, as that is not needed for consistency.
》A record becomes tombstoned in two ways: a DELETE command, or an INSERT/UPDATE with a TTL.
DELETE can produce zombies but TTL cannot: DELETE relies on tombstone messages between nodes, some of which may be down,
whereas with TTL every node knows by itself when to drop the record; no inter-node tombstone traffic is needed (what if a node is down?).
TTL are not affected as no node can have the data and miss the associated TTL, it is atomic, the same record.
Any node having the data will also know when the data has to be deleted.
》When is the tombstone record itself deleted?
C* will fully drop those tombstones when a compaction triggers, only after local_delete_time + gc_grace_seconds as defined on the table the data
belongs to. Remember that all the nodes are supposed to have been repaired within gc_grace_seconds to ensure a correct distribution of the
tombstones and prevent deleted data from reappearing as mentioned above.
In order to make sure that all the replicas received the delete and have the tombstone stored (avoiding zombie data issues),
our only option is a full repair. After gc_grace_seconds the tombstone will eventually be evicted, and
if a node missed the tombstone, we are back in the situation where the deleted data can reappear (zombie).
》There are further conditions for removing a tombstone: all data fragments related to the tombstone must take part in the same compaction.
We need all the fragments of a row or partition to be in the same compaction for the tombstone to be removed. Considering a compaction that handles
files 1 to 4: if some data is in sstable 5, the tombstones will not be evicted, as we still need them to mark the data in SSTable 5 as deleted, or
the data from SSTable 5 would come back (zombie). In other words, if files 1-4 were compacted and the tombstone dropped, records still sitting in
file 5 could come back to life, so the tombstone must be kept.
》The tombstone_threshold option (backed by a TTL EstimatedHistogram) triggers a single-SSTable compaction when more than 20% of an SSTable is
tombstones, dropping large numbers of them. Even then, tombstones whose rows overlap other SSTables cannot be dropped, since every file holding
the shadowed record must be compacted away first:
Not clearing tombstones because compacted rows with a lower timestamp were present in other sstables.
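A sketch of the purge conditions just described (parameter names are made up, not Cassandra's internals):

def can_drop_tombstone(local_delete_time, gc_grace_seconds, now,
                       tombstone_ts, min_ts_in_other_sstables):
    # 1) the grace period must have fully elapsed
    aged_out = now >= local_delete_time + gc_grace_seconds
    # 2) no sstable outside this compaction may still hold older data
    #    that the tombstone shadows (else that data would resurrect)
    no_shadowed_data_elsewhere = (min_ts_in_other_sstables is None or
                                  min_ts_in_other_sstables > tombstone_ts)
    return aged_out and no_shadowed_data_elsewhere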
》Setting gc_grace_seconds to 0 can produce zombies, though on a single-node deployment 0 is safe, since there are no distributed deletes.
With gc_grace_seconds = 0, tombstones are removed at the next compaction.
》The purpose of gc_grace_seconds is to give a downed node a window in which to recover (and receive the tombstones it missed). If it is still down
past that window, a zombie can appear and even repair is powerless; in fact it is the repair mechanism that creates the zombie. Without repair, the
stale data on the recovered node would never be propagated to the other nodes (which have already deleted it). But repair cannot be dropped either:
when only the recovered node holds some data, consistency requires copying it to the others.
WARNING: Never set gc_grace_seconds this low, or else previously deleted data may reappear via repair if a node was down while tombstones were
removed. The other nodes drop the tombstone once gc_grace_seconds has passed, while the just-recovered node still holds the old record; with no
tombstone left anywhere in the cluster, the record becomes a zombie.
54. http://www.slideshare.net/planetcassandra/tombstones-and-compaction-48960191
[Slide diagrams: hard delete vs. tombstone while a node is down.]
Tombstones are distributed by AntiEntropy and Read Repair.
But we need to set a limit on how long they will be distributed for.
gc_grace_seconds: do not distribute tombstones older than this many seconds.
Tombstones older than gc_grace_seconds are no longer replicated.
With the double guarantee of anti-entropy and read repair,
a node that recovers within gc_grace_seconds eventually
receives the tombstone, so all three replicas end up holding it.
Purging the deleted data: by compaction.
Purging the tombstone: after gc_grace_seconds.
Repair / Zombie.
Distributed delete & distributed tombstone.
59. https://www.tomaz.me/slides/2014-24-03-cassandra-anti-patterns/#/23
http://distributeddatastore.blogspot.com/2016/04/tombstones.html
How do tombstones affect range query reads?
1. Gimme a single column (e.g. c5) - no biggie: uses the bloom filter and a pointer to an offset in an sstable file, skips tombstones
2. Gimme all the columns between c5 and c10 (inclusive) - houston, we have a problem (need to do late filtering)!
Tombstones can impact the performance of slice queries, especially for wide rows. When Cassandra executes a column slice query it
needs to read columns from all the SSTables that include the given row and filter out tombstones. All these tombstones must be kept in
memory until the row fragments from all SSTables are merged, which increases heap usage.
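A toy merge showing why those tombstones stay in memory for the whole slice (simplified: no timestamps, a tombstone always shadows its column):

import heapq

def slice_query(sstables, lo, hi):
    # sstables: lists of (column_name, value_or_None) sorted by name;
    # None marks a tombstone.
    frags = [[c for c in s if lo <= c[0] <= hi] for s in sstables]
    dead, live = set(), {}
    for name, value in heapq.merge(*frags, key=lambda c: c[0]):
        if value is None:
            dead.add(name)   # held until every fragment has been merged
        else:
            live.setdefault(name, value)
    return [(n, v) for n, v in sorted(live.items()) if n not in dead]

print(slice_query([[("c5", "x"), ("c7", None)], [("c6", "y"), ("c7", "z")]],
                  "c5", "c9"))  # [('c5', 'x'), ('c6', 'y')]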
60. Without GC grace seconds, deleted data can be revived:
data deleted and removed during compaction while a replica is offline
will be restored when the replica comes back online.
With GC grace seconds, a temporary outage will not affect deletes:
deleted data is not removed until after GC grace seconds have passed, so an
outage of node 2 shorter than GC grace seconds cannot corrupt the database.
http://www.planetcassandra.org/blog/qa-starters-guide-to-cassandra/
Removing the tombstone at compaction: when node 2 recovers, repair (node 2 -> node 1) resurrects its stale data as a zombie.
Keeping the tombstone at compaction: when node 2 recovers, repair from node 1 (node 1 -> node 2) propagates the tombstone to node 2.
When data and its tombstone coexist, the two must differ in timestamp: if the data is older than the tombstone, only the tombstone
survives compaction; if the data is newer than the tombstone, only the data survives.
74. First, memtables are flushed to SSTables in the first level (L0). Then compaction merges these SSTables with larger SSTables in level L1.
The SSTables in levels greater than L1 are merged into SSTables with a size greater than or equal to sstable_size_in_mb (default: 160 MB). If an L1
SSTable stores data of a partition that is larger than L2, LCS moves the SSTable past L2 to the next level up.
In each of the levels above L0, LCS creates SSTables that are about the same size. Each level is 10X the size of the last level, so level L1 has 10X as
many SSTables as L0, and level L2 has 100X as many. If the result of the compaction is more than 10 SSTables in level L1, the excess SSTables are
moved to level L2.
The LCS compaction process guarantees that the SSTables within each level starting with L1 have non-overlapping data. For many reads, this
guarantee enables Cassandra to retrieve all the required data from only one or two SSTables. In fact, 90% of all reads can be satisfied from one
SSTable. Since LCS does not compact L0 tables, however, resource-intensive reads involving many L0 SStables may still occur.
At levels beyond L0, LCS requires less disk space for compacting — generally, 10X the fixed size of the SSTable. Obsolete data is evicted more often,
so deleted data uses smaller portions of the SSTables on disk. However, LCS compaction operations take place more often and place more I/O burden
on the node. For write-intensive workloads, the payoff of using this strategy is generally not worth the performance loss to I/O operations. In many
cases, tests of LCS-configured tables reveal I/O saturation on writes and compactions.
new sstables are added to the first level - L0, and immediately compacted with the sstables in L1. When L1 fills up, extra sstables are promoted to L2.
Subsequent sstables generated in L1 will be compacted with the sstables in L2 with which they overlap.
Once a memtable is flushed to an SSTable it is added to L0, and SSTables added to L0 are immediately compacted with the L1 files.
At L1 and above, each SSTable file is roughly 160MB.
Measured by total file size per level, each level is 10x the one before it.
When L1 exceeds 10 files, the excess files are moved to L2 (when L2 has no files yet, they are simply moved there directly).
How big are the L0 SSTables? Also 160MB?
At L1 and above, SSTables within the same level hold no overlapping data.
LCS does not compact L0 files among themselves, so many L0 files can hurt reads, since every L0 file must be consulted.
Compacting needs only about 10 x 160MB of extra space (ten files' worth).
The more often compaction runs, the sooner obsolete data is purged, but the more I/O pressure the node bears.
For write-heavy workloads, the I/O this strategy consumes is generally not worth the performance loss.
SSTables added to L0 are immediately compacted with L1; when L1 fills up, extra sstables are promoted to L2.
SSTables subsequently generated into L1 are compacted with the overlapping files in L2 (is this unconditional, or only under extra conditions?).
A sketch of the level sizing follows.
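A sketch of the sizing rule with the numbers from the text (sstable_size_in_mb = 160, 10x growth per level):

SSTABLE_MB = 160  # sstable_size_in_mb default

def lcs_level_capacity(level):
    # Total MB and approximate file count for level >= 1.
    total_mb = SSTABLE_MB * 10 ** level
    return total_mb, total_mb // SSTABLE_MB

for lvl in (1, 2, 3):
    print(lvl, lcs_level_capacity(lvl))
# 1 (1600, 10)     -> L1: ~10 files
# 2 (16000, 100)   -> L2: ~100 files
# 3 (160000, 1000) -> L3: ~1000 files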