3. RocksDB Introduction
• Open source based on LevelDB 1.5, written in C++
• Key-Value persistent store
• Embedded Library
• Pluggable database
• Optimized for fast storage (flash or RAM)
• Optimized for server workloads
• Get(), Put(), Delete()
A persistent key-value store for fast storage environments
4. Three Basic Constructs of RocksDB
• Memtable
– in-memory data structure
– A buffer that temporarily hosts incoming writes
• Logfile
– Sequentially-written file
– On storage
5. Three Basic Constructs of RocksDB
• SSTable(=SSTfile)
– Sorted Static Table on storage
– A file that contains a set of arbitrary, sorted key-value pairs
– Organized in levels
– Immutable for its lifetime
– Sorted data facilitates easy lookup of keys
– Stores the entire database
7. SSTable & Memtable of RocksDB
• On-disk SSTable indexes are always loaded into memory
• All writes go directly to the Memtable index
• Reads check the Memtable first, then the SSTable indexes
• Periodically, the Memtable is flushed to disk as an SSTable
• Periodically, on-disk SSTables are merged
→ Update/delete records overwrite/remove the older data
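The flow on this slide can be sketched as a toy in-memory model. This is only an illustration, not RocksDB's implementation: the class name `ToyLSM`, the `memtable_limit` flush threshold, and the `TOMBSTONE` marker are all hypothetical, and real SSTables live on disk rather than in a Python list.

```python
# Toy model of the Memtable/SSTable flow; not RocksDB's implementation.
TOMBSTONE = object()  # hypothetical marker recording that a key was deleted


class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}                    # in-memory buffer for incoming writes
        self.sstables = []                    # immutable sorted tables, newest first
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def put(self, key, value):
        # All writes go to the memtable; a full memtable is flushed.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # A delete is recorded as a write of a tombstone.
        self.put(key, TOMBSTONE)

    def flush(self):
        # Freeze the memtable into a sorted, immutable table.
        if self.memtable:
            self.sstables.insert(0, tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Reads check the memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for table in self.sstables:
            for k, v in table:
                if k == key:
                    return None if v is TOMBSTONE else v
        return None

    def merge_all(self):
        # Merge every SSTable into one: newer records win, deletions are dropped.
        merged = {}
        for table in reversed(self.sstables):  # oldest first
            merged.update(table)               # newer entries overwrite older ones
        live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.sstables = [tuple(sorted(live.items()))] if live else []
```

Usage: after `put("a", 1)`, `put("b", 2)`, `put("a", 3)`, `delete("b")` with the default limit, two tables exist; `merge_all()` leaves a single table holding only the newest value of `a`.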
15. Log-Structured Merge Tree
• LSM-tree
– N-level merge trees
– Splits a logical tree into several physical pieces, so that the most recently updated portion of data is in a tree in memory
– Transforms random writes into sequential writes using a logfile & an in-memory store (Memtable)
16. Log-Structured Merge DB to minimize “random writes”
[Figure: LSM data flow. Labels: Write Request, Read Request, Read-Write data (in RAM), Read-Only data (in RAM, on disk), Periodic Compaction, Transaction Log]
17. Log-Structured Merge DB to minimize “random writes”
① Data Write(Insert, Update)
• New puts are written to memory (Memtable) & appended sequentially to the logfile
• When the Memtable fills up → it is flushed to an SSTable on disk
• Writes are applied in memory with no random disk access → faster writes than a B+ tree
② Data Read
• Memtable → SSTable
• Maintain all the SSTable indexes in memory
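The write path above (append to the logfile, then update the Memtable) can be sketched as follows. This is a minimal sketch under stated assumptions: `LoggedMemtable`, the JSON-lines log format, and the `demo` helper are hypothetical and have nothing to do with RocksDB's actual log format. It shows why the logfile matters: the in-memory state can be rebuilt by replaying the log after a restart.

```python
import json
import os
import tempfile


class LoggedMemtable:
    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}
        self._replay()  # rebuild in-memory state from an existing log
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                key, value = json.loads(line)
                self.memtable[key] = value  # later records overwrite earlier ones

    def put(self, key, value):
        # Append to the log first (sequential I/O), then update memory.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        self.memtable[key] = value

    def close(self):
        self.log.close()


def demo():
    # Write a few records, "restart", and recover the memtable from the log.
    path = os.path.join(tempfile.mkdtemp(), "toy.log")
    db = LoggedMemtable(path)
    db.put("k1", "v1")
    db.put("k1", "v2")
    db.put("k2", "x")
    db.close()
    recovered = LoggedMemtable(path)
    state = dict(recovered.memtable)
    recovered.close()
    return state
```

Every write touches the disk only by appending to the end of one file, which is the sense in which random writes become sequential ones.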
18. RocksDB Compaction
Multi-threaded compactions
• Multiple background threads
→ periodically perform “compaction”
→ parallel compactions on different parts of the database can occur simultaneously
• Merge SSTfiles into a bigger SSTfile
• Remove multiple copies of the same key
– Duplicate or overwritten keys
• Process deletions of keys
• Supports two different styles of compaction
– Tunable compaction to trade off read, write, and space amplification
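What a single compaction does to the data (merge sorted files, keep only the newest copy of each key, process deletions) can be sketched with a k-way merge. This is an illustrative sketch, not RocksDB code: the function name `compact` and the use of `None` as a deletion marker are assumptions. Dropping deletion markers is only safe here because the sketch merges every table that could hold the key.

```python
import heapq

TOMBSTONE = None  # assumption: None marks a deleted key in this sketch


def compact(tables):
    """Merge sorted tables (ordered newest first) into one sorted table."""
    # Tag each record with its table's age so that, for equal keys,
    # the record from the newest table sorts first.
    tagged = [
        [(key, age, value) for key, value in table]
        for age, table in enumerate(tables)
    ]
    output = []
    last_key = object()  # sentinel that equals no real key
    for key, _age, value in heapq.merge(*tagged):
        if key == last_key:
            continue  # older copy of a key that was already handled
        last_key = key
        if value is not TOMBSTONE:
            output.append((key, value))  # keep the newest live value
    return output
```

For example, merging a newer table `[("a", 3), ("b", None)]` with an older table `[("a", 1), ("b", 2), ("c", 5)]` keeps the new value of `a`, removes `b`, and carries `c` over unchanged.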
20. 1. Level Style Compaction
• RocksDB default compaction style
• Stores data in multiple levels in the database
• More recent data → L0
The oldest data → Lmax
• Files in L0
- overlapping keys, sorted by flush time
Files in L1 and higher
- non-overlapping keys, sorted by key
• Each level is 10 times larger than the previous one
Inherited from LevelDB
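The 10x rule can be made concrete with a small calculation. The helper name `level_targets` and the 256 MiB base size are assumptions chosen for illustration; the actual base size depends on configuration.

```python
def level_targets(base_bytes, multiplier=10, num_levels=4):
    """Target size of L1..Ln when each level is `multiplier` times the previous one."""
    return {
        f"L{i}": base_bytes * multiplier ** (i - 1)
        for i in range(1, num_levels + 1)
    }


# Assumed example: a 256 MiB L1 gives 2.5 GiB, 25 GiB, and 250 GiB for L2..L4.
targets = level_targets(base_bytes=256 * 1024 * 1024)
```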
21. Level Style Compaction
Compaction process
[Figure: cache, log, and SSTable levels level0 through level3]
① Pick one file from level N
② Compact it with all its overlapping files from level N+1
③ Replace them with new files in level N+1
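Step ② depends on finding the files in level N+1 whose key range overlaps the picked file. A minimal sketch of that selection follows; the `SstFile` type, its field names, and `plan_compaction` are hypothetical and only model each file as a closed key range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SstFile:
    name: str
    smallest: str  # smallest key stored in the file
    largest: str   # largest key stored in the file


def overlaps(a, b):
    # Two closed key ranges overlap when each starts before the other ends.
    return a.smallest <= b.largest and b.smallest <= a.largest


def plan_compaction(picked, next_level):
    """Return the files of level N+1 whose key range overlaps `picked`."""
    return [f for f in next_level if overlaps(picked, f)]
```

Because files in L1 and higher hold non-overlapping key ranges, the result is a contiguous run of files, and only those files need to be rewritten.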
23. Level 0 → Level 1 Compaction
• Level 0 files have overlapping keys
• So an L0 → L1 compaction includes all files from L1: all of them are compacted with L0
• When L0 → L1 compaction completes, L1 → L2 compaction starts
• L0 → L1 compaction is single-threaded → poor throughput
• Solution: make the size of L0 similar to the size of L1
Tricky Compaction
24. 2. Universal Style Compaction
• For write-heavy workloads
→ Level Style Compaction may be bottlenecked on disk throughput
• Stores all files in L0
• All files are arranged in time order
• Temporarily increases size amplification by a factor of two
• Intended to decrease write amplification
• But increases space amplification
25. Universal Style Compaction
① Pick a few files that are chronologically adjacent to one another
② Merge them
③ Replace them with a new file in level 0
Compaction process
26. Universal Style Compaction
• size_ratio
- Percentage flexibility while comparing file size
- Default : 1
• min_merge_width
- The minimum number of files in a single compaction
- Default : 2
• max_merge_width
- The maximum number of files in a single compaction
- Default : UINT_MAX
Compaction options
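How the three options might interact when choosing files can be sketched as below. This is a simplified sketch of a size-ratio rule, assuming that an older file joins the candidate set while its size is at most the accumulated size plus `size_ratio` percent; it is not RocksDB's picker, which has additional triggers. The function name `pick_by_size_ratio` is hypothetical, and `None` stands in for the UINT_MAX default.

```python
def pick_by_size_ratio(sizes, size_ratio=1, min_merge_width=2, max_merge_width=None):
    """sizes: file sizes ordered newest first. Returns indices of files to merge."""
    limit = len(sizes) if max_merge_width is None else max_merge_width
    for start in range(len(sizes)):
        picked = [start]
        total = sizes[start]
        for nxt in range(start + 1, len(sizes)):
            if len(picked) >= limit:
                break  # max_merge_width reached
            # Stop when the next (older) file is too large relative to
            # everything accumulated so far, with size_ratio percent of slack.
            if sizes[nxt] * 100 > total * (100 + size_ratio):
                break
            picked.append(nxt)
            total += sizes[nxt]
        if len(picked) >= min_merge_width:
            return picked  # a run of chronologically adjacent files
    return []  # no run satisfies min_merge_width
```

With sizes `[1, 1, 2, 4, 100]` the sketch picks the four newest files and leaves the large, old file alone, which is how this style avoids rewriting old data on every compaction.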
27. Universal Style Compaction
Compaction example
[Figure: single compaction by Universal Style Compaction in Level-0, Stage 1 → Stage 2; file sizes shown: 5 bytes, 6 bytes, 10 bytes, 10 bytes]
Editor's notes
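A database software development project carried out at Facebook to handle massive amounts of data on fast storage devices (SSD) / Generally well suited to large-scale data processing / C++ library / Unlike traditional relational databases, it uses key-value storage / Concepts and features of traditional relational databases, such as relationships between tables and Join operations, are greatly reduced / Compared with LevelDB, it can take full advantage of the high-speed access performance of flash storage, so random reads/writes and bulk uploads are faster across the board. It achieves a 10x speedup for random writes and bulk uploads, and a 30% speedup for random reads.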
MANIFEST files are formatted as a log: every change that causes a state change (adding or deleting a file) is appended to the log. A MANIFEST file lists the set of sorted tables that make up each level.
Informational messages are printed to files named LOG and LOG.old.
CURRENT is a text file that contains the name of the latest MANIFEST file.
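The idea of a MANIFEST as an append-only log of add/delete changes can be sketched by replaying such a log to recover the set of tables in each level. The record shape `(action, level, file_name)`, the function `replay_manifest`, and the file names are hypothetical; the real MANIFEST uses its own binary record format.

```python
def replay_manifest(edits):
    """Apply (action, level, file_name) records in order; return files per level."""
    levels = {}
    for action, level, name in edits:
        files = levels.setdefault(level, set())
        if action == "add":
            files.add(name)
        elif action == "delete":
            files.discard(name)
        else:
            raise ValueError(f"unknown action: {action}")
    # Report only levels that still hold files, with names in sorted order.
    return {level: sorted(files) for level, files in levels.items() if files}


# Hypothetical history: two flushes into level 0, then one compaction
# that replaces both files with a single new file in level 1.
edits = [
    ("add", 0, "000005.sst"),
    ("add", 0, "000007.sst"),
    ("add", 1, "000009.sst"),
    ("delete", 0, "000005.sst"),
    ("delete", 0, "000007.sst"),
]
current_state = replay_manifest(edits)
```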