BLUESTORE: A NEW STORAGE BACKEND FOR CEPH –
ONE YEAR IN
SAGE WEIL
2017.03.23
2
OUTLINE
● Ceph background and context
– FileStore, and why POSIX failed us
● BlueStore – a new Ceph OSD backend
● Performance
● Recent challenges
● Future
● Status and availability
● Summary
MOTIVATION
4
CEPH
● Object, block, and file storage in a single cluster
● All components scale horizontally
● No single point of failure
● Hardware agnostic, commodity hardware
● Self-manage whenever possible
● Open source (LGPL)
● “A Scalable, High-Performance Distributed File System”
● “performance, reliability, and scalability”
5
CEPH COMPONENTS
RGW – a web services gateway for object storage, compatible with S3 and Swift (OBJECT)
RBD – a reliable, fully-distributed block device with cloud platform integration (BLOCK)
CEPHFS – a distributed file system with POSIX semantics and scale-out metadata management (FILE)
LIBRADOS – a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RADOS – a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
6
OBJECT STORAGE DAEMONS (OSDS)
[Diagram: each OSD sits on top of a local file system (xfs, btrfs, or ext4) on its own disk; monitor daemons (M) run alongside the OSDs]
7
OBJECT STORAGE DAEMONS (OSDS)
[Diagram: the same OSD stack, with FileStore as the layer between each OSD and its local file system]
8
● ObjectStore
– abstract interface for storing local data
– EBOFS, FileStore
● EBOFS
– a user-space extent-based object file system
– deprecated in favor of FileStore on btrfs in 2009
● Object – “file”
– data (file-like byte stream)
– attributes (small key/value)
– omap (unbounded key/value)
● Collection – “directory”
– placement group shard (slice of the RADOS pool)
● All writes are transactions
– Atomic + Consistent + Durable
– Isolation provided by OSD
OBJECTSTORE AND DATA MODEL
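To make the transaction model above concrete, here is a minimal sketch of how an OSD update might be expressed against the ObjectStore interface, combining a byte-range write, an xattr update, and an omap insert into one atomic unit. It assumes the Jewel/Kraken-era Ceph types and method names (coll_t, ghobject_t, bufferlist, Sequencer); exact signatures vary between releases, so treat it as illustrative rather than the real code.

#include "os/ObjectStore.h"  // Ceph ObjectStore interface; header path is illustrative
#include <map>
#include <string>
#include <utility>

// Sketch: one atomic + consistent + durable unit touching data, attrs, and omap.
void queue_update(ObjectStore* store, ObjectStore::Sequencer* osr,
                  const coll_t& cid, const ghobject_t& oid,
                  bufferlist& data, bufferlist& pglog_entry)
{
  ObjectStore::Transaction t;
  t.write(cid, oid, 0, data.length(), data);              // file-like byte stream
  bufferlist tag;
  t.setattr(cid, oid, "_", tag);                          // small key/value attribute
  std::map<std::string, bufferlist> keys;
  keys["0000000005.00000000000000000006"] = pglog_entry;  // unbounded omap key/value (e.g. a PG log entry)
  t.omap_setkeys(cid, oid, keys);
  store->queue_transaction(osr, std::move(t), nullptr);   // commits all-or-nothing
}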
9
● FileStore
– PG = collection = directory
– object = file
● Leveldb
– large xattr spillover
– object omap (key/value) data
● Originally just for development...
– later, only supported backend (on XFS)
● /var/lib/ceph/osd/ceph-123/
– current/
● meta/
– osdmap123
– osdmap124
● 0.1_head/
– object1
– object12
● 0.7_head/
– object3
– object5
● 0.a_head/
– object4
– object6
● omap/
– <leveldb files>
FILESTORE
10
● Most transactions are simple
– write some bytes to object (file)
– update object attribute (file
xattr)
– append to update log (kv insert)
...but others are arbitrarily
large/complex
● Serialize and write-ahead txn to
journal for atomicity
– We double-write everything!
– Lots of ugly hackery to make
replayed events idempotent
[
{
"op_name": "write",
"collection": "0.6_head",
"oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
"length": 4194304,
"offset": 0,
"bufferlist length": 4194304
},
{
"op_name": "setattrs",
"collection": "0.6_head",
"oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#",
"attr_lens": {
"_": 269,
"snapset": 31
}
},
{
"op_name": "omap_setkeys",
"collection": "0.6_head",
"oid": "#0:60000000::::head#",
"attr_lens": {
"0000000005.00000000000000000006": 178,
"_info": 847
}
}
]
POSIX FAILS: TRANSACTIONS
11
POSIX FAILS: ENUMERATION
● Ceph objects are distributed by a 32-bit hash
● Enumeration is in hash order
– scrubbing
– “backfill” (data rebalancing, recovery)
– enumeration via librados client API
● POSIX readdir is not well-ordered
– And even if it were, it would be a different hash
● Need O(1) “split” for a given shard/range
● Build directory tree by hash-value prefix
– split any directory when size > ~100 files
– merge when size < ~50 files
– read entire directory, sort in-memory
…
DIR_A/
DIR_A/A03224D3_qwer
DIR_A/A247233E_zxcv
…
DIR_B/
DIR_B/DIR_8/
DIR_B/DIR_8/B823032D_foo
DIR_B/DIR_8/B8474342_bar
DIR_B/DIR_9/
DIR_B/DIR_9/B924273B_baz
DIR_B/DIR_A/
DIR_B/DIR_A/BA4328D2_asdf
…
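As a rough illustration of the hash-prefix directory scheme (not FileStore's actual HashIndex code), the hypothetical helper below maps an object's 32-bit hash to a nested directory path, one hex nibble per split level, so an entire hash range can be split or merged by rearranging a single directory subtree.

#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical sketch: consume one hex nibble of the hash (most significant first)
// per split level to build the nested directory path.
std::string dir_for_hash(uint32_t hash, int split_levels) {
  std::string path;
  for (int level = 0; level < split_levels; ++level) {
    unsigned nibble = (hash >> (28 - 4 * level)) & 0xf;
    char buf[8];
    std::snprintf(buf, sizeof(buf), "DIR_%X/", nibble);
    path += buf;
  }
  return path;
}

int main() {
  // With two split levels, hash 0xB823032D lands in DIR_B/DIR_8/,
  // matching the example layout above.
  std::printf("%s\n", dir_for_hash(0xB823032D, 2).c_str());
}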
12
THE HEADACHES CONTINUE
● New FileStore problems continue to surface as we approach switch to
BlueStore
– Recently discovered bug in FileStore omap implementation, revealed by new
CephFS scrubbing
– FileStore directory splits lead to throughput collapse when an entire pool’s
PG directories split in unison
– Read/modify/write workloads perform horribly
● RGW index objects
● RBD object bitmaps
– QoS efforts thwarted by deep queues and periodicity in FileStore throughput
– Cannot bound deferred writeback work, even with fsync(2)
– {RBD, CephFS} snapshots triggering inefficient 4MB object copies to create
object clones
BLUESTORE
14
● BlueStore = Block + NewStore
– consume raw block device(s)
– key/value database (RocksDB) for metadata
– data written directly to block device
– pluggable block Allocator (policy)
– pluggable compression
– checksums, ponies, ...
● We must share the block device with RocksDB
BLUESTORE
[Diagram: BlueStore implements ObjectStore; object data goes directly to the BlockDevice, while metadata goes to RocksDB via BlueRocksEnv and BlueFS, which share the same block device]
15
ROCKSDB: BLUEROCKSENV + BLUEFS
● class BlueRocksEnv : public rocksdb::EnvWrapper
– passes “file” operations to BlueFS
● BlueFS is a super-simple “file system”
– all metadata lives in the journal
– all metadata loaded in RAM on start/mount
– no need to store block free list
– coarse allocation unit (1 MB blocks)
– journal rewritten/compacted when it gets large
[Diagram: BlueFS on-disk layout — a superblock, a journal of file/allocation metadata (file 10, file 11, file 12, rm file 12, file 13, ...), and coarse data extents]
● Map “directories” to different block devices
– db.wal/ – on NVRAM, NVMe, SSD
– db/ – level0 and hot SSTs on SSD
– db.slow/ – cold SSTs on HDD
● BlueStore periodically balances free space
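A minimal sketch of the directory-to-device mapping idea: BlueRocksEnv hands BlueFS paths like "db.wal/000123.log", and BlueFS can choose the backing device from the prefix. The enum and helper below are hypothetical, not the actual BlueFS code.

#include <string>

// Hypothetical sketch of BlueFS's device choice by directory prefix.
enum class BlueFSDevice { WAL, DB, SLOW };

BlueFSDevice device_for(const std::string& path) {
  if (path.rfind("db.wal/", 0) == 0)  return BlueFSDevice::WAL;   // NVRAM / NVMe / SSD
  if (path.rfind("db.slow/", 0) == 0) return BlueFSDevice::SLOW;  // cold SSTs on HDD
  return BlueFSDevice::DB;                                        // level0 + hot SSTs on SSD
}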
16
● Single device
– HDD or SSD
● Bluefs db.wal/ + db/ (wal and sst files)
● object data blobs
● Two devices
– 512MB of SSD or NVRAM
● bluefs db.wal/ (rocksdb wal)
– big device
● bluefs db/ (sst files, spillover)
● object data blobs
MULTI-DEVICE SUPPORT
● Two devices
– a few GB of SSD
● bluefs db.wal/ (rocksdb wal)
● bluefs db/ (warm sst files)
– big device
● bluefs db.slow/ (cold sst files)
● object data blobs
● Three devices
– 512MB NVRAM
● bluefs db.wal/ (rocksdb wal)
– a few GB SSD
● bluefs db/ (warm sst files)
– big device
● bluefs db.slow/ (cold sst files)
● object data blobs
METADATA
18
BLUESTORE METADATA
● Everything in flat kv database (rocksdb)
● Partition namespace for different metadata
– S* – “superblock” properties for the entire store
– B* – block allocation metadata (free block bitmap)
– T* – stats (bytes used, compressed, etc.)
– C* – collection name → cnode_t
– O* – object name → onode_t or bnode_t
– X* – shared blobs
– L* – deferred writes (promises of future IO)
– M* – omap (user key/value data, stored in objects)
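A sketch of how one flat RocksDB namespace can be partitioned with one-letter prefixes; the helpers below are purely illustrative and ignore BlueStore's real binary key encoding and escaping.

#include <cstdint>
#include <cstdio>
#include <string>

// Illustrative only: build prefixed keys for a flat kv namespace.
std::string superblock_key(const std::string& prop) { return "S" + prop; }
std::string collection_key(const std::string& cname) { return "C" + cname; }
std::string omap_key(uint64_t onode_id, const std::string& user_key) {
  // "M" + fixed-width onode id (so keys sort per object) + user key
  char buf[17];
  std::snprintf(buf, sizeof(buf), "%016llx", (unsigned long long)onode_id);
  return "M" + std::string(buf) + "." + user_key;
}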
19
● Collection metadata
– Interval of object namespace
shard pool hash name bits
C<NOSHARD,12,3d3e0000> “12.e3d3” = <19>
shard pool hash name snap gen
O<NOSHARD,12,3d3d880e,foo,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d3d9223,bar,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d3e02c2,baz,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d3e125d,zip,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d3e1d41,dee,NOSNAP,NOGEN> = …
O<NOSHARD,12,3d3e3832,dah,NOSNAP,NOGEN> = …
struct spg_t {
uint64_t pool;
uint32_t hash;
shard_id_t shard;
};
struct bluestore_cnode_t {
uint32_t bits;
};
● Nice properties
– Ordered enumeration of objects
– We can “split” collections by adjusting
collection metadata only
CNODE
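Because object keys embed the (reversed) hash and a collection is just an interval of that key space, membership reduces to a prefix match on the hash, and a split only bumps bits in the cnode. The check below is a simplified illustration of that idea (it ignores shard/pool and Ceph's exact bit ordering), not the real implementation.

#include <cstdint>

// Simplified illustration: a collection "owns" every object whose (reversed) hash
// shares its top `bits` bits. Splitting rewrites only the cnode with bits+1;
// no object keys have to move.
bool collection_contains(uint32_t coll_hash, uint32_t bits, uint32_t obj_hash) {
  if (bits == 0) return true;                 // collection spans the whole pool
  uint32_t shift = 32 - bits;
  return (coll_hash >> shift) == (obj_hash >> shift);
}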
20
● Per object metadata
– Lives directly in key/value pair
– Serializes to 100s of bytes
● Size in bytes
● Attributes (user attr data)
● Inline extent map (maybe)
struct bluestore_onode_t {
uint64_t size;
map<string,bufferptr> attrs;
uint64_t flags;
struct shard_info {
uint32_t offset;
uint32_t bytes;
};
vector<shard_info> shards;
bufferlist inline_extents;
bufferlist spanning_blobs;
};
ONODE
21
● Blob
– Extent(s) on device
– Lump of data originating from
same object
– May later be referenced by
multiple objects
– Normally checksummed
– May be compressed
● SharedBlob
– Extent ref count on cloned
blobs
– In-memory buffer cache
BLOBS
struct bluestore_blob_t {
vector<bluestore_pextent_t> extents;
uint32_t compressed_length_orig = 0;
uint32_t compressed_length = 0;
uint32_t flags = 0;
uint16_t unused = 0; // bitmap
uint8_t csum_type = CSUM_NONE;
uint8_t csum_chunk_order = 0;
bufferptr csum_data;
};
struct bluestore_shared_blob_t {
uint64_t sbid;
bluestore_extent_ref_map_t ref_map;
};
22
● Map object extents → blob extents
● Extent map serialized in chunks
– stored inline in onode value if small
– otherwise stored in adjacent keys
● Blobs stored inline in each shard
– unless it is referenced across shard
boundaries
– “spanning” blobs stored in onode key
● If blob is “shared” (cloned)
– ref count on allocated extents stored
in external key
– only needed (loaded) on
deallocations
EXTENT MAP
...
O<,,foo,,> =onode + inline extent map
O<,,bar,,> =onode + spanning blobs
O<,,bar,,0> =extent map shard
O<,,bar,,4> =extent map shard
O<,,baz,,> =onode + inline extent map
...
[Diagram: logical offsets map onto blobs; most blobs live inside a single extent map shard (shard #1, shard #2), while a spanning blob crosses shard boundaries]
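To show the shape of the logical-offset lookup, here is a toy extent map keyed by logical offset: finding the blob extent that covers an offset is a bounded search on an ordered map. This is a simplified stand-in for BlueStore's sharded ExtentMap, with made-up types.

#include <cstdint>
#include <map>

// Toy model: logical offset -> (length, blob id, offset within blob).
struct ToyExtent {
  uint32_t length;
  uint64_t blob_id;
  uint32_t blob_offset;
};
using ToyExtentMap = std::map<uint64_t, ToyExtent>;   // key = logical offset

// Find the extent containing logical offset `off`, or nullptr if it falls in a hole.
const ToyExtent* find_extent(const ToyExtentMap& m, uint64_t off) {
  auto it = m.upper_bound(off);      // first extent starting strictly after off
  if (it == m.begin()) return nullptr;
  --it;                              // candidate extent starting at or before off
  if (off < it->first + it->second.length)
    return &it->second;
  return nullptr;
}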
DATA PATH
24
Terms
● Sequencer
– An independent, totally ordered
queue of transactions
– One per PG
● TransContext
– State describing an executing
transaction
DATA PATH BASICS
Three ways to write
● New allocation
– Any write larger than
min_alloc_size goes to a new,
unused extent on disk
– Once that IO completes, we commit
the transaction
● Unused part of existing blob
● Deferred writes
– Commit temporary promise to
(over)write data with transaction
● includes data!
– Do async (over)write
– Then clean up temporary k/v pair
25
TRANSCONTEXT STATE MACHINE
[State machine diagram]
Direct write path: PREPARE → AIO_WAIT → KV_QUEUED → KV_COMMITTING → FINISH
Deferred writes continue: KV_COMMITTING → DEFERRED_QUEUED (formerly WAL_QUEUED) → DEFERRED_AIO_WAIT → DEFERRED_CLEANUP → DEFERRED_CLEANUP_COMMITTING → FINISH
Annotations: PREPARE initiates some AIO; the Sequencer queue waits for the next TransContext(s) in the Sequencer to be ready; deferred AIO is initiated later and its cleanup waits for the next commit batch.
26
● Blobs can be compressed
– Tunables target min and max sizes
– Min size sets ceiling on size reduction
– Max size caps max read amplification
● Garbage collection to limit occluded/wasted space
– compacted (rewritten) when waste exceeds threshold
INLINE COMPRESSION
[Diagram: an uncompressed blob region within an object — allocated space vs. written vs. written (compressed), from the start of the object to the end of the object]
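A sketch of the accept/reject policy implied above: compress the blob, then keep the compressed form only if it saves enough space once rounded to allocation units. The ratio and parameter names are hypothetical stand-ins for the tunables, not BlueStore's exact ones.

#include <cstdint>

// Hypothetical policy check: keep the compressed blob only if the savings are real.
bool keep_compressed(uint64_t raw_len, uint64_t compressed_len,
                     uint64_t min_alloc_size, double required_ratio = 0.875) {
  // Round the compressed size up to whole allocation units -- that is what
  // would actually be consumed on disk.
  uint64_t allocated =
      (compressed_len + min_alloc_size - 1) / min_alloc_size * min_alloc_size;
  return allocated <= static_cast<uint64_t>(raw_len * required_ratio);
}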
27
● OnodeSpace per collection
– in-memory ghobject_t → Onode map of decoded onodes
● BufferSpace for in-memory blobs
– all in-flight writes
– may contain cached on-disk data
● Both buffers and onodes have lifecycles linked to a Cache
– LRUCache – trivial LRU
– TwoQCache – implements 2Q cache replacement algorithm (default)
● Cache is sharded for parallelism
– Collection → shard mapping matches OSD's op_wq
– same CPU context that processes client requests will touch the LRU/2Q lists
– aio completion execution not yet sharded – TODO?
IN-MEMORY CACHE
PERFORMANCE
29
HDD: RANDOM WRITE
[Charts: Bluestore vs Filestore HDD Random Write Throughput (MB/s) and IOPS, by IO size from 4 KB to 4 MB; series: Filestore, Bluestore (wip-bitmap-alloc-perf), BS (Master-a07452d), BS (Master-d62a4948), Bluestore (wip-bluestore-dw)]
30
HDD: MIXED READ/WRITE
[Charts: Bluestore vs Filestore HDD Random RW Throughput (MB/s) and IOPS, by IO size from 4 KB to 4 MB; same five series as above]
31
HDD+NVME: MIXED READ/WRITE
[Charts: Bluestore vs Filestore HDD/NVMe Random RW Throughput (MB/s) and IOPS, by IO size from 4 KB to 4 MB; same five series as above]
32
HDD: SEQUENTIAL WRITE
[Charts: Bluestore vs Filestore HDD Sequential Write Throughput (MB/s) and IOPS, by IO size from 4 KB to 4 MB; same five series as above]
WTF?
33
WHEN TO JOURNAL WRITES
● min_alloc_size – smallest allocation unit (16KB on SSD, 64KB on HDD)
– writes >= min_alloc_size go to newly allocated or unwritten space
– writes < min_alloc_size are journaled as deferred small overwrites
● Pretty bad for HDDs, especially sequential writes
● New tunable threshold for direct vs deferred writes
– Separate default for HDD and SSD
● Batch deferred writes
– journal + journal + … + journal + many deferred writes + journal + …
● TODO: plenty more tuning and tweaking here!
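The decision described above boils down to a size and placement check; a minimal sketch, with a hypothetical deferred_threshold standing in for the new tunable, might look like this:

#include <cstdint>

enum class WriteMode { NEW_ALLOCATION, DEFERRED };

// Sketch: small overwrites of already-written space are journaled (deferred);
// larger writes, or writes to unwritten/new space, go straight to disk and
// the transaction commits once that IO completes.
WriteMode choose_write_mode(uint64_t length, bool target_is_unwritten,
                            uint64_t min_alloc_size, uint64_t deferred_threshold) {
  if (target_is_unwritten || length >= min_alloc_size)
    return WriteMode::NEW_ALLOCATION;
  if (length <= deferred_threshold)
    return WriteMode::DEFERRED;        // write the data into the kv journal, apply later
  return WriteMode::NEW_ALLOCATION;    // read-modify-write into a newly allocated blob
}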
34
RGW ON HDD, 3X REPLICATION
[Chart: 3X Replication RadosGW Write Tests — 32MB objects, 24 HDD OSDs on 4 servers, 4 clients; throughput (MB/s) for 1/128 buckets on 1 RGW server, 4/512 buckets on 4 RGW servers, and rados bench; series: Filestore 512KB Chunks, Filestore 4MB Chunks, Bluestore 512KB Chunks, Bluestore 4MB Chunks]
35
RGW ON HDD+NVME, EC 4+2
[Chart: 4+2 Erasure Coding RadosGW Write Tests — 32MB objects, 24 HDD/NVMe OSDs on 4 servers, 4 clients; throughput (MB/s) for the same bucket/RGW-server combinations and series as above]
36
ERASURE CODE OVERWRITES
● Luminous allows overwrites of EC objects
– Requires two-phase commit to avoid “RAID-hole” like failure conditions
– OSD creates rollback objects
● clone_range $extent to temporary object
● write $extent with overwrite data
● clone[_range] marks blobs immutable, creates SharedBlob record
– Future small overwrites to this blob disallowed; new allocation
– Overhead of SharedBlob ref-counting record
● TODO: un-share and restore mutability in EC case
– Either a targeted hack, since (in general) all references are already in cache
– Or general un-sharing solution (if it doesn’t incur any additional cost)
37
BLUESTORE vs FILESTORE
3X vs EC 4+2 vs EC 5+1
[Chart: RBD 4K Random Writes — 16 HDD OSDs, 8 32GB volumes, 256 IOs in flight; IOPS for Bluestore vs Filestore under 3X replication, EC 4+2, and EC 5+1]
OTHER CHALLENGES
39
USERSPACE CACHE
● Built ‘mempool’ accounting infrastructure
– easily annotate/tag C++ classes and containers
– low overhead
– debug mode provides per-type (vs per-pool) accounting (items and bytes)
● All data managed by ‘bufferlist’ type
– manually track bytes in cache
– ref-counting behavior can lead to memory use amplification
● Requires configuration
– bluestore_cache_size (default 1GB)
● not as convenient as auto-sized kernel caches
– bluestore_cache_meta_ratio (default .9)
● Finally have meaningful implementation of fadvise NOREUSE (vs DONTNEED)
40
MEMORY EFFICIENCY
● Careful attention to struct sizes: packing, redundant fields
● Write path changes to reuse and expand existing blobs
– expand extent list for existing blob
– big reduction in metadata size, increase in performance
● Checksum chunk sizes
– client hints to expect sequential read/write → large csum chunks
– can optionally select weaker checksum (16 or 8 bits per chunk)
● In-memory red/black trees (std::map<>, boost::intrusive::set<>)
– low temporal write locality → many CPU cache misses, failed prefetches
– per-onode slab allocators for extent and blob structs
41
ROCKSDB
● Compaction
– Awkward to control priority
– Overall impact grows as total metadata corpus grows
– Invalidates rocksdb block cache (needed for range queries)
● we prefer O_DIRECT libaio; the current workaround is to use buffered reads and writes
● Bluefs write buffer
● Many deferred write keys end up in L0
● High write amplification
– SSDs with low-cost random reads care more about total write overhead
● Open to alternatives for SSD/NVM
– ZetaScale (recently open sourced by SanDisk)
– Or let Facebook et al make RocksDB great?
FUTURE
43
MORE RUN TO COMPLETION (RTC)
● “Normal” write has several context switches
– A: prepare transaction, initiate any aio
– B: io completion handler, queue txc for commit
– C: commit to kv db
– D: completion callbacks
● Metadata-only transactions or deferred writes?
– skip B, do half of C
● Very fast journal devices (e.g., NVDIMM)?
– do C commit synchronously
● Some completion callbacks back into OSD can be done synchronously
– avoid D
● Pipeline nature of each Sequencer makes this all opportunistic
44
SMR
● Described high-level strategy at Vault’16
– GSoC project
● Recent work shows less-horrible performance on DM-SMR
– Evolving ext4 for shingled disks (FAST’17, Vault’17)
– “Keep metadata in journal, avoid random writes”
● Still plan SMR-specific allocator/freelist implementation
– Tracking released extents in each zone is useless
– Need reverse map to identify remaining objects
– Clean least-full zones
● Some speculation that this will be good strategy for non-SMR devices too
45
“TIERING” TODAY
● We do only very basic multi-device
– WAL, KV, object data
● Not enthusiastic about doing proper tiering within
BlueStore
– Basic multi-device infrastructure not difficult, but
– Policy and auto-tiering are complex and
unbounded
[Diagram: BlueStore (ObjectStore) with data and metadata (BlueFS/RocksDB) spread across HDD, SSD, and NVDIMM devices]
46
TIERING BELOW!
● Prefer to defer tiering to block layer
– bcache, dm-cache, FlashCache
● Extend libaio interface to enable hints
– HOT and COLD flags
– Add new IO_CMD_PWRITEV2 with a usable flags
field
● and eventually DIF/DIX for passing csum to
device?
● Modify bcache, dm-cache, etc to respect hints
● (And above! This is unrelated to RADOS cache
tiering and future tiering plans across OSDs)
[Diagram: the same BlueStore stack, with dm-cache, bcache, etc. below it tiering between HDD and SSD]
47
SPDK – KERNEL BYPASS FOR NVME
● SPDK support is in-tree, but
– Lack of caching for bluefs/rocksdb
– DPDK polling thread per OSD not practical
● Ongoing work to allow multiple logical OSDs to coexist in same process
– Share DPDK infrastructure
– Share some caches (e.g., OSDMap cache)
– Multiplex across shared network connections (good for RDMA)
– DPDK backend for AsyncMessenger
● Blockers
– msgr2 messenger protocol to allow multiplexing
– some common shared code cleanup (g_ceph_context)
STATUS
49
STATUS
● Early prototype in Jewel v10.2.z
– Very different than current code; no longer useful or interesting
● Stable (code and on-disk format) in Kraken v11.2.z
– Still marked ‘experimental’
● Stable and recommended default in Luminous v12.2.z (out this Spring)
● Current efforts
– Workload throttling
– Performance anomalies
– Optimizing for CPU time
50
MIGRATION FROM FILESTORE
● Fail in place
– Fail FileStore OSD
– Create new BlueStore OSD on same device
→ period of reduced redundancy
● Disk-wise replacement
– Format new BlueStore on spare disk
– Stop FileStore OSD on same host
– Local host copy between OSDs/disks
→ reduced online redundancy, but data still available offline
– Requires extra drive slot per host
● Host-wise replacement
– Provision new host with BlueStore OSDs
– Swap new host into old host’s CRUSH position
→ no reduced redundancy during migration
– Requires spare host per rack, or extra host migration for each rack
51
SUMMARY
● Ceph is great at scaling out
● POSIX was poor choice for storing objects
● Our new BlueStore backend is so much better
– Good (and rational) performance!
– Inline compression and full data checksums
● We are definitely not done yet
– Performance, performance, performance
– Other stuff
● We can finally solve our IO problems
52
● 2 TB 2.5” HDDs
● 1 TB 2.5” SSDs (SATA)
● 400 GB SSDs (NVMe)
● Kraken 11.2.0
● CephFS
● Cache tiering
● Erasure coding
● BlueStore
● CRUSH device classes
● Untrained IT staff!
BASEMENT CLUSTER
THANK YOU!
Sage Weil
CEPH PRINCIPAL ARCHITECT
sage@redhat.com
@liewegas
54
● FreelistManager
– persist list of free extents to
key/value store
– prepare incremental updates for
allocate or release
● Initial implementation
– extent-based
<offset> = <length>
– kept in-memory copy
– small initial memory footprint,
very expensive when fragmented
– imposed ordering constraint on
commits :(
● Newer bitmap-based approach
<offset> = <region bitmap>
– where region is N blocks
● 128 blocks = 8 bytes
– use k/v merge operator to XOR
allocation or release
merge 10=0000000011
merge 20=1110000000
– RocksDB log-structured-merge
tree coalesces keys during
compaction
– no in-memory state or ordering
BLOCK FREE LIST
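To make the bitmap representation concrete, the sketch below computes the per-region XOR masks that an allocation (or a release — the same function works for both) of a block range would merge into the kv store. The region size and in-memory types are illustrative; the real FreelistManager encodes keys and values in binary.

#include <algorithm>
#include <bitset>
#include <cstdint>
#include <map>

constexpr uint64_t kBlocksPerRegion = 128;   // one bit per block in each region's value

// Returns region-start-block -> XOR mask covering [block, block + count).
std::map<uint64_t, std::bitset<kBlocksPerRegion>>
xor_masks(uint64_t block, uint64_t count) {
  std::map<uint64_t, std::bitset<kBlocksPerRegion>> out;
  while (count > 0) {
    uint64_t region = block / kBlocksPerRegion * kBlocksPerRegion;
    uint64_t offset = block - region;
    uint64_t n = std::min(count, kBlocksPerRegion - offset);
    for (uint64_t i = 0; i < n; ++i)
      out[region].flip(offset + i);   // merge-operator semantics: XOR into the stored value
    block += n;
    count -= n;
  }
  return out;
}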
55
● Allocator
– abstract interface to allocate blocks
● StupidAllocator
– extent-based
– bin free extents by size (powers of 2)
– choose sufficiently large extent
closest to hint
– highly variable memory usage
● btree of free extents
– implemented, works
– based on ancient ebofs policy
● BitmapAllocator
– hierarchy of indexes
● L1: 2 bits = 2^6 blocks
● L2: 2 bits = 2^12 blocks
● ...
00 = all free, 11 = all used,
01 = mix
– fixed memory consumption
● ~35 MB RAM per TB
BLOCK ALLOCATOR
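A sketch of the power-of-two binning behind StupidAllocator: free extents are grouped into bins by floor(log2(size)), and an allocation starts scanning from the smallest bin guaranteed to hold a large-enough extent. Bin math only; the real allocator also honors the placement hint and falls back to other bins.

#include <cstdint>

// Bin index = floor(log2(length)): extents of 2^k .. 2^(k+1)-1 blocks share bin k.
int bin_for_length(uint64_t length_blocks) {
  int bin = 0;
  while (length_blocks > 1) {
    length_blocks >>= 1;
    ++bin;
  }
  return bin;
}

// Smallest bin whose every extent is guaranteed to satisfy a request of `want` blocks:
// round the request up to the next power of two before binning.
int first_bin_to_search(uint64_t want_blocks) {
  uint64_t p = 1;
  int bin = 0;
  while (p < want_blocks) { p <<= 1; ++bin; }
  return bin;
}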
56
● We scrub... periodically
– window before we detect error
– we may read bad data
– we may not be sure which copy
is bad
● We want to validate checksum
on every read
● Blobs include csum metadata
– crc32c (default), xxhash{64,32}
● Overhead
– 32-bit csum metadata for 4MB
object and 4KB blocks = 4KB
– larger csum blocks (compression!)
– smaller csums
● crc32c_8 or 16
● IO hints
– seq read + write → big chunks
– compression → big chunks
● Per-pool policy
CHECKSUMS
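To illustrate the overhead arithmetic (one 32-bit csum per 4 KB chunk of a 4 MB object is 4 KB of csum metadata), here is a self-contained sketch that computes per-chunk crc32c values for a blob. The bitwise CRC is written for clarity rather than speed, and the helper is a toy, not the bluestore_blob_t code.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78) -- clarity over speed.
uint32_t crc32c(uint32_t crc, const uint8_t* data, size_t len) {
  crc = ~crc;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int k = 0; k < 8; ++k)
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return ~crc;
}

// Toy csum metadata: one 32-bit csum per (1 << csum_chunk_order)-byte chunk,
// verified against the stored values on every read.
std::vector<uint32_t> make_csums(const std::vector<uint8_t>& blob,
                                 unsigned csum_chunk_order) {
  const size_t chunk = size_t(1) << csum_chunk_order;   // e.g. order 12 -> 4 KB chunks
  std::vector<uint32_t> csums;
  for (size_t off = 0; off < blob.size(); off += chunk) {
    size_t n = std::min(chunk, blob.size() - off);
    csums.push_back(crc32c(0, blob.data() + off, n));
  }
  return csums;
}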
More Related Content

What's hot

Performance optimization for all flash based on aarch64 v2.0
Performance optimization for all flash based on aarch64 v2.0Performance optimization for all flash based on aarch64 v2.0
Performance optimization for all flash based on aarch64 v2.0Ceph Community
 
Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph Community
 
2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific Dashboard2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific DashboardCeph Community
 
A crash course in CRUSH
A crash course in CRUSHA crash course in CRUSH
A crash course in CRUSHSage Weil
 
Ceph - A distributed storage system
Ceph - A distributed storage systemCeph - A distributed storage system
Ceph - A distributed storage systemItalo Santos
 
Ceph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsCeph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsKaran Singh
 
Seastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for CephSeastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for CephScyllaDB
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing GuideJose De La Rosa
 
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...OpenStack Korea Community
 
ceph optimization on ssd ilsoo byun-short
ceph optimization on ssd ilsoo byun-shortceph optimization on ssd ilsoo byun-short
ceph optimization on ssd ilsoo byun-shortNAVER D2
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices:  A Deep DiveCeph Block Devices:  A Deep Dive
Ceph Block Devices: A Deep DiveRed_Hat_Storage
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringScyllaDB
 
Ceph Month 2021: RADOS Update
Ceph Month 2021: RADOS UpdateCeph Month 2021: RADOS Update
Ceph Month 2021: RADOS UpdateCeph Community
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Odinot Stanislas
 
Storage tiering and erasure coding in Ceph (SCaLE13x)
Storage tiering and erasure coding in Ceph (SCaLE13x)Storage tiering and erasure coding in Ceph (SCaLE13x)
Storage tiering and erasure coding in Ceph (SCaLE13x)Sage Weil
 
Ceph QoS: How to support QoS in distributed storage system - Taewoong Kim
Ceph QoS: How to support QoS in distributed storage system - Taewoong KimCeph QoS: How to support QoS in distributed storage system - Taewoong Kim
Ceph QoS: How to support QoS in distributed storage system - Taewoong KimCeph Community
 
2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph2019.06.27 Intro to Ceph
2019.06.27 Intro to CephCeph Community
 
Storage 101: Rook and Ceph - Open Infrastructure Denver 2019
Storage 101: Rook and Ceph - Open Infrastructure Denver 2019Storage 101: Rook and Ceph - Open Infrastructure Denver 2019
Storage 101: Rook and Ceph - Open Infrastructure Denver 2019Sean Cohen
 

What's hot (20)

Performance optimization for all flash based on aarch64 v2.0
Performance optimization for all flash based on aarch64 v2.0Performance optimization for all flash based on aarch64 v2.0
Performance optimization for all flash based on aarch64 v2.0
 
Ceph RBD Update - June 2021
Ceph RBD Update - June 2021Ceph RBD Update - June 2021
Ceph RBD Update - June 2021
 
2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific Dashboard2021.02 new in Ceph Pacific Dashboard
2021.02 new in Ceph Pacific Dashboard
 
A crash course in CRUSH
A crash course in CRUSHA crash course in CRUSH
A crash course in CRUSH
 
Ceph - A distributed storage system
Ceph - A distributed storage systemCeph - A distributed storage system
Ceph - A distributed storage system
 
Ceph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsCeph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion Objects
 
Seastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for CephSeastore: Next Generation Backing Store for Ceph
Seastore: Next Generation Backing Store for Ceph
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing Guide
 
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
[OpenInfra Days Korea 2018] Day 2 - CEPH 운영자를 위한 Object Storage Performance T...
 
ceph optimization on ssd ilsoo byun-short
ceph optimization on ssd ilsoo byun-shortceph optimization on ssd ilsoo byun-short
ceph optimization on ssd ilsoo byun-short
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices:  A Deep DiveCeph Block Devices:  A Deep Dive
Ceph Block Devices: A Deep Dive
 
Ceph
CephCeph
Ceph
 
High-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uringHigh-Performance Networking Using eBPF, XDP, and io_uring
High-Performance Networking Using eBPF, XDP, and io_uring
 
Ceph Month 2021: RADOS Update
Ceph Month 2021: RADOS UpdateCeph Month 2021: RADOS Update
Ceph Month 2021: RADOS Update
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Storage tiering and erasure coding in Ceph (SCaLE13x)
Storage tiering and erasure coding in Ceph (SCaLE13x)Storage tiering and erasure coding in Ceph (SCaLE13x)
Storage tiering and erasure coding in Ceph (SCaLE13x)
 
Ceph issue 해결 사례
Ceph issue 해결 사례Ceph issue 해결 사례
Ceph issue 해결 사례
 
Ceph QoS: How to support QoS in distributed storage system - Taewoong Kim
Ceph QoS: How to support QoS in distributed storage system - Taewoong KimCeph QoS: How to support QoS in distributed storage system - Taewoong Kim
Ceph QoS: How to support QoS in distributed storage system - Taewoong Kim
 
2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph2019.06.27 Intro to Ceph
2019.06.27 Intro to Ceph
 
Storage 101: Rook and Ceph - Open Infrastructure Denver 2019
Storage 101: Rook and Ceph - Open Infrastructure Denver 2019Storage 101: Rook and Ceph - Open Infrastructure Denver 2019
Storage 101: Rook and Ceph - Open Infrastructure Denver 2019
 

Viewers also liked

Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud StorageCeph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud StorageSage Weil
 
Keeping OpenStack storage trendy with Ceph and containers
Keeping OpenStack storage trendy with Ceph and containersKeeping OpenStack storage trendy with Ceph and containers
Keeping OpenStack storage trendy with Ceph and containersSage Weil
 
Fundamentos de access
Fundamentos de accessFundamentos de access
Fundamentos de accessCristhian2002
 
2. grammar reference llibre face 2
2. grammar reference llibre face 22. grammar reference llibre face 2
2. grammar reference llibre face 2vizzabona
 
CMBS | Arbor Realty Trust: Growing Financial Partnerships
CMBS | Arbor Realty Trust: Growing Financial PartnershipsCMBS | Arbor Realty Trust: Growing Financial Partnerships
CMBS | Arbor Realty Trust: Growing Financial PartnershipsIvan Kaufman
 
What's new in Jewel and Beyond
What's new in Jewel and BeyondWhat's new in Jewel and Beyond
What's new in Jewel and BeyondSage Weil
 
A Parent's Guide to Understanding Social Media
A Parent's Guide to Understanding Social Media A Parent's Guide to Understanding Social Media
A Parent's Guide to Understanding Social Media Adam McLane
 
2. grammar reference llibre face 2
2. grammar reference llibre face 22. grammar reference llibre face 2
2. grammar reference llibre face 2vizzabona
 
Evolución histórica de la mecatronica
Evolución histórica de la mecatronicaEvolución histórica de la mecatronica
Evolución histórica de la mecatronicavanaliju011525
 
El índice de confianza del consumidor se mantuvo estable en marzo
El índice de confianza del consumidor se mantuvo estable en marzoEl índice de confianza del consumidor se mantuvo estable en marzo
El índice de confianza del consumidor se mantuvo estable en marzoEconomis
 
Xii material huacho jueves 23 de marzo del 2017
Xii material huacho jueves 23 de marzo del 2017Xii material huacho jueves 23 de marzo del 2017
Xii material huacho jueves 23 de marzo del 2017Isela Guerrero Pacheco
 
Apache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceApache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceSijie Guo
 
Your 1st Ceph cluster
Your 1st Ceph clusterYour 1st Ceph cluster
Your 1st Ceph clusterMirantis
 
The State of Ceph, Manila, and Containers in OpenStack
The State of Ceph, Manila, and Containers in OpenStackThe State of Ceph, Manila, and Containers in OpenStack
The State of Ceph, Manila, and Containers in OpenStackSage Weil
 
Distributed Storage and Compute With Ceph's librados (Vault 2015)
Distributed Storage and Compute With Ceph's librados (Vault 2015)Distributed Storage and Compute With Ceph's librados (Vault 2015)
Distributed Storage and Compute With Ceph's librados (Vault 2015)Sage Weil
 

Viewers also liked (19)

Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud StorageCeph, Now and Later: Our Plan for Open Unified Cloud Storage
Ceph, Now and Later: Our Plan for Open Unified Cloud Storage
 
Keeping OpenStack storage trendy with Ceph and containers
Keeping OpenStack storage trendy with Ceph and containersKeeping OpenStack storage trendy with Ceph and containers
Keeping OpenStack storage trendy with Ceph and containers
 
Fundamentos de access
Fundamentos de accessFundamentos de access
Fundamentos de access
 
2. grammar reference llibre face 2
2. grammar reference llibre face 22. grammar reference llibre face 2
2. grammar reference llibre face 2
 
CMBS | Arbor Realty Trust: Growing Financial Partnerships
CMBS | Arbor Realty Trust: Growing Financial PartnershipsCMBS | Arbor Realty Trust: Growing Financial Partnerships
CMBS | Arbor Realty Trust: Growing Financial Partnerships
 
What's new in Jewel and Beyond
What's new in Jewel and BeyondWhat's new in Jewel and Beyond
What's new in Jewel and Beyond
 
A Parent's Guide to Understanding Social Media
A Parent's Guide to Understanding Social Media A Parent's Guide to Understanding Social Media
A Parent's Guide to Understanding Social Media
 
2. grammar reference llibre face 2
2. grammar reference llibre face 22. grammar reference llibre face 2
2. grammar reference llibre face 2
 
Evolución histórica de la mecatronica
Evolución histórica de la mecatronicaEvolución histórica de la mecatronica
Evolución histórica de la mecatronica
 
звіт мо 2017
звіт мо 2017звіт мо 2017
звіт мо 2017
 
Actividades primer grado
Actividades primer gradoActividades primer grado
Actividades primer grado
 
El índice de confianza del consumidor se mantuvo estable en marzo
El índice de confianza del consumidor se mantuvo estable en marzoEl índice de confianza del consumidor se mantuvo estable en marzo
El índice de confianza del consumidor se mantuvo estable en marzo
 
Xii material huacho jueves 23 de marzo del 2017
Xii material huacho jueves 23 de marzo del 2017Xii material huacho jueves 23 de marzo del 2017
Xii material huacho jueves 23 de marzo del 2017
 
2017 Global Food Policy Report
2017 Global Food Policy Report 2017 Global Food Policy Report
2017 Global Food Policy Report
 
Apache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage ServiceApache BookKeeper: A High Performance and Low Latency Storage Service
Apache BookKeeper: A High Performance and Low Latency Storage Service
 
Your 1st Ceph cluster
Your 1st Ceph clusterYour 1st Ceph cluster
Your 1st Ceph cluster
 
Ceph Object Store
Ceph Object StoreCeph Object Store
Ceph Object Store
 
The State of Ceph, Manila, and Containers in OpenStack
The State of Ceph, Manila, and Containers in OpenStackThe State of Ceph, Manila, and Containers in OpenStack
The State of Ceph, Manila, and Containers in OpenStack
 
Distributed Storage and Compute With Ceph's librados (Vault 2015)
Distributed Storage and Compute With Ceph's librados (Vault 2015)Distributed Storage and Compute With Ceph's librados (Vault 2015)
Distributed Storage and Compute With Ceph's librados (Vault 2015)
 

Similar to BlueStore, A New Storage Backend for Ceph, One Year In

Ceph Tech Talk: Bluestore
Ceph Tech Talk: BluestoreCeph Tech Talk: Bluestore
Ceph Tech Talk: BluestoreCeph Community
 
INFINISTORE(tm) - Scalable Open Source Storage Arhcitecture
INFINISTORE(tm) - Scalable Open Source Storage ArhcitectureINFINISTORE(tm) - Scalable Open Source Storage Arhcitecture
INFINISTORE(tm) - Scalable Open Source Storage ArhcitectureThomas Uhl
 
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.The Hive
 
What's new in Luminous and Beyond
What's new in Luminous and BeyondWhat's new in Luminous and Beyond
What's new in Luminous and BeyondSage Weil
 
Open Source Storage at Scale: Ceph @ GRNET
Open Source Storage at Scale: Ceph @ GRNETOpen Source Storage at Scale: Ceph @ GRNET
Open Source Storage at Scale: Ceph @ GRNETNikos Kormpakis
 
20171101 taco scargo luminous is out, what's in it for you
20171101 taco scargo   luminous is out, what's in it for you20171101 taco scargo   luminous is out, what's in it for you
20171101 taco scargo luminous is out, what's in it for youTaco Scargo
 
Community Update at OpenStack Summit Boston
Community Update at OpenStack Summit BostonCommunity Update at OpenStack Summit Boston
Community Update at OpenStack Summit BostonSage Weil
 
CephFS in Jewel: Stable at Last
CephFS in Jewel: Stable at LastCephFS in Jewel: Stable at Last
CephFS in Jewel: Stable at LastCeph Community
 
Building the Enterprise infrastructure with PostgreSQL as the basis for stori...
Building the Enterprise infrastructure with PostgreSQL as the basis for stori...Building the Enterprise infrastructure with PostgreSQL as the basis for stori...
Building the Enterprise infrastructure with PostgreSQL as the basis for stori...PavelKonotopov
 
Ceph Day New York 2014: Future of CephFS
Ceph Day New York 2014:  Future of CephFS Ceph Day New York 2014:  Future of CephFS
Ceph Day New York 2014: Future of CephFS Ceph Community
 
Optimizing RocksDB for Open-Channel SSDs
Optimizing RocksDB for Open-Channel SSDsOptimizing RocksDB for Open-Channel SSDs
Optimizing RocksDB for Open-Channel SSDsJavier González
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices: A Deep DiveCeph Block Devices: A Deep Dive
Ceph Block Devices: A Deep Divejoshdurgin
 
Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introductionkanedafromparis
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...Glenn K. Lockwood
 
Quick-and-Easy Deployment of a Ceph Storage Cluster
Quick-and-Easy Deployment of a Ceph Storage ClusterQuick-and-Easy Deployment of a Ceph Storage Cluster
Quick-and-Easy Deployment of a Ceph Storage ClusterPatrick Quairoli
 
AI橋渡しクラウド(ABCI)における高性能計算とAI/ビッグデータ処理の融合
AI橋渡しクラウド(ABCI)における高性能計算とAI/ビッグデータ処理の融合AI橋渡しクラウド(ABCI)における高性能計算とAI/ビッグデータ処理の融合
AI橋渡しクラウド(ABCI)における高性能計算とAI/ビッグデータ処理の融合Hitoshi Sato
 

Similar to BlueStore, A New Storage Backend for Ceph, One Year In (20)

Ceph Tech Talk: Bluestore
Ceph Tech Talk: BluestoreCeph Tech Talk: Bluestore
Ceph Tech Talk: Bluestore
 
Bluestore
BluestoreBluestore
Bluestore
 
INFINISTORE(tm) - Scalable Open Source Storage Arhcitecture
INFINISTORE(tm) - Scalable Open Source Storage ArhcitectureINFINISTORE(tm) - Scalable Open Source Storage Arhcitecture
INFINISTORE(tm) - Scalable Open Source Storage Arhcitecture
 
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
 
What's new in Luminous and Beyond
What's new in Luminous and BeyondWhat's new in Luminous and Beyond
What's new in Luminous and Beyond
 
Open Source Storage at Scale: Ceph @ GRNET
Open Source Storage at Scale: Ceph @ GRNETOpen Source Storage at Scale: Ceph @ GRNET
Open Source Storage at Scale: Ceph @ GRNET
 
20171101 taco scargo luminous is out, what's in it for you
20171101 taco scargo   luminous is out, what's in it for you20171101 taco scargo   luminous is out, what's in it for you
20171101 taco scargo luminous is out, what's in it for you
 
Community Update at OpenStack Summit Boston
Community Update at OpenStack Summit BostonCommunity Update at OpenStack Summit Boston
Community Update at OpenStack Summit Boston
 
CephFS in Jewel: Stable at Last
CephFS in Jewel: Stable at LastCephFS in Jewel: Stable at Last
CephFS in Jewel: Stable at Last
 
Scale 10x 01:22:12
Scale 10x 01:22:12Scale 10x 01:22:12
Scale 10x 01:22:12
 
Strata - 03/31/2012
Strata - 03/31/2012Strata - 03/31/2012
Strata - 03/31/2012
 
Building the Enterprise infrastructure with PostgreSQL as the basis for stori...
Building the Enterprise infrastructure with PostgreSQL as the basis for stori...Building the Enterprise infrastructure with PostgreSQL as the basis for stori...
Building the Enterprise infrastructure with PostgreSQL as the basis for stori...
 
Ceph Day New York 2014: Future of CephFS
Ceph Day New York 2014:  Future of CephFS Ceph Day New York 2014:  Future of CephFS
Ceph Day New York 2014: Future of CephFS
 
RocksDB meetup
RocksDB meetupRocksDB meetup
RocksDB meetup
 
Optimizing RocksDB for Open-Channel SSDs
Optimizing RocksDB for Open-Channel SSDsOptimizing RocksDB for Open-Channel SSDs
Optimizing RocksDB for Open-Channel SSDs
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices: A Deep DiveCeph Block Devices: A Deep Dive
Ceph Block Devices: A Deep Dive
 
Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introduction
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
 
Quick-and-Easy Deployment of a Ceph Storage Cluster
Quick-and-Easy Deployment of a Ceph Storage ClusterQuick-and-Easy Deployment of a Ceph Storage Cluster
Quick-and-Easy Deployment of a Ceph Storage Cluster
 
AI橋渡しクラウド(ABCI)における高性能計算とAI/ビッグデータ処理の融合
AI橋渡しクラウド(ABCI)における高性能計算とAI/ビッグデータ処理の融合AI橋渡しクラウド(ABCI)における高性能計算とAI/ビッグデータ処理の融合
AI橋渡しクラウド(ABCI)における高性能計算とAI/ビッグデータ処理の融合
 

Recently uploaded

Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingShane Coughlan
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencessuser9e7c64
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 

Recently uploaded (20)

Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conference
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 

BlueStore, A New Storage Backend for Ceph, One Year In

  • 1. BLUESTORE: A NEW STORAGE BACKEND FOR CEPH – ONE YEAR IN SAGE WEIL 2017.03.23
  • 2. 2 OUTLINE ● Ceph background and context – FileStore, and why POSIX failed us ● BlueStore – a new Ceph OSD backend ● Performance ● Recent challenges ● Future ● Status and availability ● Summary
  • 4. 4 CEPH ● Object, block, and file storage in a single cluster ● All components scale horizontally ● No single point of failure ● Hardware agnostic, commodity hardware ● Self-manage whenever possible ● Open source (LGPL) ● “A Scalable, High-Performance Distributed File System” ● “performance, reliability, and scalability”
  • 5. 5 CEPH COMPONENTS RGW A web services gateway for object storage, compatible with S3 and Swift LIBRADOS A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP) RADOS A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors RBD A reliable, fully-distributed block device with cloud platform integration CEPHFS A distributed file system with POSIX semantics and scale-out metadata management OBJECT BLOCK FILE
  • 6. 6 OBJECT STORAGE DAEMONS (OSDS) FS DISK OSD DISK OSD FS DISK OSD FS DISK OSD FS xfs btrfs ext4 M M M
  • 7. 7 OBJECT STORAGE DAEMONS (OSDS) FS DISK OSD DISK OSD FS DISK OSD FS DISK OSD FS xfs btrfs ext4 M M M FileStore FileStoreFileStoreFileStore
  • 8. 8 ● ObjectStore – abstract interface for storing local data – EBOFS, FileStore ● EBOFS – a user-space extent-based object file system – deprecated in favor of FileStore on btrfs in 2009 ● Object – “file” – data (file-like byte stream) – attributes (small key/value) – omap (unbounded key/value) ● Collection – “directory” – placement group shard (slice of the RADOS pool) ● All writes are transactions – Atomic + Consistent + Durable – Isolation provided by OSD OBJECTSTORE AND DATA MODEL
  • 9. 9 ● FileStore – PG = collection = directory – object = file ● Leveldb – large xattr spillover – object omap (key/value) data ● Originally just for development... – later, only supported backend (on XFS) ● /var/lib/ceph/osd/ceph-123/ – current/ ● meta/ – osdmap123 – osdmap124 ● 0.1_head/ – object1 – object12 ● 0.7_head/ – object3 – object5 ● 0.a_head/ – object4 – object6 ● omap/ – <leveldb files> FILESTORE
  • 10. 10 ● Most transactions are simple – write some bytes to object (file) – update object attribute (file xattr) – append to update log (kv insert) ...but others are arbitrarily large/complex ● Serialize and write-ahead txn to journal for atomicity – We double-write everything! – Lots of ugly hackery to make replayed events idempotent [ { "op_name": "write", "collection": "0.6_head", "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#", "length": 4194304, "offset": 0, "bufferlist length": 4194304 }, { "op_name": "setattrs", "collection": "0.6_head", "oid": "#0:73d87003:::benchmark_data_gnit_10346_object23:head#", "attr_lens": { "_": 269, "snapset": 31 } }, { "op_name": "omap_setkeys", "collection": "0.6_head", "oid": "#0:60000000::::head#", "attr_lens": { "0000000005.00000000000000000006": 178, "_info": 847 } } ] POSIX FAILS: TRANSACTIONS
  • 11. 11 POSIX FAILS: ENUMERATION ● Ceph objects are distributed by a 32-bit hash ● Enumeration is in hash order – scrubbing – “backfill” (data rebalancing, recovery) – enumeration via librados client API ● POSIX readdir is not well-ordered – And even if it were, it would be a different hash ● Need O(1) “split” for a given shard/range ● Build directory tree by hash-value prefix – split any directory when size > ~100 files – merge when size < ~50 files – read entire directory, sort in-memory … DIR_A/ DIR_A/A03224D3_qwer DIR_A/A247233E_zxcv … DIR_B/ DIR_B/DIR_8/ DIR_B/DIR_8/B823032D_foo DIR_B/DIR_8/B8474342_bar DIR_B/DIR_9/ DIR_B/DIR_9/B924273B_baz DIR_B/DIR_A/ DIR_B/DIR_A/BA4328D2_asdf …
  • 12. 12 THE HEADACHES CONTINUE ● New FileStore problems continue to surface as we approach switch to BlueStore – Recently discovered bug in FileStore omap implementation, revealed by new CephFS scrubbing – FileStore directory splits lead to throughput collapse when an entire pool’s PG directories split in unison – Read/modify/write workloads perform horribly ● RGW index objects ● RBD object bitmaps – QoS efforts thwarted by deep queues and periodicity in FileStore throughput – Cannot bound deferred writeback work, even with fsync(2) – {RBD, CephFS} snapshots triggering inefficient 4MB object copies to create object clones
  • 14. 14 ● BlueStore = Block + NewStore – consume raw block device(s) – key/value database (RocksDB) for metadata – data written directly to block device – pluggable block Allocator (policy) – pluggable compression – checksums, ponies, ... ● We must share the block device with RocksDB BLUESTORE BlueStore BlueFS RocksDB BlueRocksEnv data metadata ObjectStore BlockDevice
  • 15. 15 ROCKSDB: BLUEROCKSENV + BLUEFS ● class BlueRocksEnv : public rocksdb::EnvWrapper – passes “file” operations to BlueFS ● BlueFS is a super-simple “file system” – all metadata lives in the journal – all metadata loaded in RAM on start/mount – no need to store block free list – coarse allocation unit (1 MB blocks) – journal rewritten/compacted when it gets large superblock journal … more journal … data data data data file 10 file 11 file 12 file 12 file 13 rm file 12 file 13 ... ● Map “directories” to different block devices – db.wal/ – on NVRAM, NVMe, SSD – db/ – level0 and hot SSTs on SSD – db.slow/ – cold SSTs on HDD ● BlueStore periodically balances free space
  • 16. 16 MULTI-DEVICE SUPPORT
● Single device
– HDD or SSD: bluefs db.wal/ + db/ (wal and sst files), plus object data blobs
● Two devices
– 512MB of SSD or NVRAM: bluefs db.wal/ (rocksdb wal)
– big device: bluefs db/ (sst files, spillover), plus object data blobs
● Two devices
– a few GB of SSD: bluefs db.wal/ (rocksdb wal) and bluefs db/ (warm sst files)
– big device: bluefs db.slow/ (cold sst files), plus object data blobs
● Three devices
– 512MB NVRAM: bluefs db.wal/ (rocksdb wal)
– a few GB SSD: bluefs db/ (warm sst files)
– big device: bluefs db.slow/ (cold sst files), plus object data blobs
  • 18. 18 BLUESTORE METADATA
● Everything in flat kv database (rocksdb)
● Partition namespace for different metadata
– S* – “superblock” properties for the entire store
– B* – block allocation metadata (free block bitmap)
– T* – stats (bytes used, compressed, etc.)
– C* – collection name → cnode_t
– O* – object name → onode_t or bnode_t
– X* – shared blobs
– L* – deferred writes (promises of future IO)
– M* – omap (user key/value data, stored in objects)
  • 19. 19 CNODE
● Collection metadata
– Interval of object namespace
– collection key fields: shard, pool, hash, name (value is bits)
    C<NOSHARD,12,3d3e0000> “12.e3d3” = <19>
– object key fields: shard, pool, hash, name, snap, gen
    O<NOSHARD,12,3d3d880e,foo,NOSNAP,NOGEN> = …
    O<NOSHARD,12,3d3d9223,bar,NOSNAP,NOGEN> = …
    O<NOSHARD,12,3d3e02c2,baz,NOSNAP,NOGEN> = …
    O<NOSHARD,12,3d3e125d,zip,NOSNAP,NOGEN> = …
    O<NOSHARD,12,3d3e1d41,dee,NOSNAP,NOGEN> = …
    O<NOSHARD,12,3d3e3832,dah,NOSNAP,NOGEN> = …
– struct spg_t { uint64_t pool; uint32_t hash; shard_id_t shard; };
– struct bluestore_cnode_t { uint32_t bits; };
● Nice properties
– Ordered enumeration of objects
– We can “split” collections by adjusting collection metadata only (see the sketch after this slide)
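A rough sketch of why splitting is metadata-only, assuming the low `bits` bits of the 32-bit object hash select the collection (the keys above also appear to store the hash nibble-reversed so each collection occupies a contiguous key range); the function and variable names below are illustrative, not the Ceph code:

    #include <cstdint>

    // Simplified sketch: a collection "owns" every object whose low `bits`
    // bits of the 32-bit hash match the PG's hash seed. Splitting a PG just
    // bumps `bits` by one; no object data or keys need to move.
    static bool collection_contains(uint32_t pg_seed, uint32_t bits, uint32_t obj_hash) {
      const uint32_t mask = (bits >= 32) ? 0xffffffffu : ((1u << bits) - 1);
      return (obj_hash & mask) == (pg_seed & mask);
    }

    int main() {
      uint32_t seed = 0x3;            // hypothetical PG seed
      uint32_t bits = 4;              // match on the low 4 bits of the hash
      uint32_t hash = 0x3d3e0013;     // low 4 bits = 0x3, low 5 bits = 0x13
      bool before = collection_contains(seed, bits, hash);      // true
      bool after  = collection_contains(seed, bits + 1, hash);  // false after a split
      return (before && !after) ? 0 : 1;
    }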
  • 20. 20 ONODE
● Per object metadata
– Lives directly in key/value pair
– Serializes to 100s of bytes
● Size in bytes
● Attributes (user attr data)
● Inline extent map (maybe)
● Definition:
    struct bluestore_onode_t {
      uint64_t size;
      map<string,bufferptr> attrs;
      uint64_t flags;
      struct shard_info {
        uint32_t offset;
        uint32_t bytes;
      };
      vector<shard_info> shards;
      bufferlist inline_extents;
      bufferlist spanning_blobs;
    };
  • 21. 21 BLOBS
● Blob
– Extent(s) on device
– Lump of data originating from same object
– May later be referenced by multiple objects
– Normally checksummed
– May be compressed
● SharedBlob
– Extent ref count on cloned blobs
– In-memory buffer cache
● Definitions:
    struct bluestore_blob_t {
      vector<bluestore_pextent_t> extents;
      uint32_t compressed_length_orig = 0;
      uint32_t compressed_length = 0;
      uint32_t flags = 0;
      uint16_t unused = 0;  // bitmap
      uint8_t csum_type = CSUM_NONE;
      uint8_t csum_chunk_order = 0;
      bufferptr csum_data;
    };
    struct bluestore_shared_blob_t {
      uint64_t sbid;
      bluestore_extent_ref_map_t ref_map;
    };
  • 22. 22 EXTENT MAP
● Map object extents → blob extents
● Extent map serialized in chunks
– stored inline in onode value if small
– otherwise stored in adjacent keys
● Blobs stored inline in each shard
– unless it is referenced across shard boundaries
– “spanning” blobs stored in onode key
● If blob is “shared” (cloned)
– ref count on allocated extents stored in external key
– only needed (loaded) on deallocations
● Key layout example:
    ...
    O<,,foo,,>  = onode + inline extent map
    O<,,bar,,>  = onode + spanning blobs
    O<,,bar,,0> = extent map shard
    O<,,bar,,4> = extent map shard
    O<,,baz,,>  = onode + inline extent map
    ...
● [Diagram: logical offsets mapped onto blobs across extent map shard #1 and shard #2, with a spanning blob crossing the shard boundary]
  • 24. 24 DATA PATH BASICS
● Terms
– Sequencer: an independent, totally ordered queue of transactions; one per PG (see the sketch below)
– TransContext: state describing an executing transaction
● Three ways to write
– New allocation
● Any write larger than min_alloc_size goes to a new, unused extent on disk
● Once that IO completes, we commit the transaction
– Unused part of existing blob
– Deferred writes
● Commit temporary promise to (over)write data with the transaction (includes data!)
● Do async (over)write
● Then clean up temporary k/v pair
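A minimal sketch of the Sequencer idea, assuming a plain FIFO per PG; SequencerSketch and TransContextSketch are illustrative stand-ins, not the real BlueStore classes:

    #include <deque>
    #include <memory>

    // Stand-in for BlueStore's TransContext; the real state machine is on the
    // next slide.
    struct TransContextSketch {
      enum State { PREPARE, AIO_WAIT, KV_QUEUED, KV_COMMITTING, FINISH } state = PREPARE;
    };

    struct SequencerSketch {
      std::deque<std::unique_ptr<TransContextSketch>> q;  // one queue per PG

      void queue(std::unique_ptr<TransContextSketch> txc) {
        q.push_back(std::move(txc));                       // total order per PG
      }
      // A txc may only proceed to commit once it reaches the front of its
      // Sequencer, preserving per-PG ordering even though its IO may complete
      // out of order relative to other transactions.
      TransContextSketch* next_ready() {
        return q.empty() ? nullptr : q.front().get();
      }
      void finish_front() { if (!q.empty()) q.pop_front(); }
    };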
  • 25. 25 TRANSCONTEXT STATE MACHINE
● [State diagram] Simple writes: PREPARE → AIO_WAIT → KV_QUEUED → KV_COMMITTING → FINISH
– PREPARE initiates some AIO
– KV_QUEUED waits for the next TransContext(s) in the Sequencer queue to be ready, and for the next commit batch
● Deferred writes continue: DEFERRED_QUEUED (labeled WAL_QUEUED in the diagram) → DEFERRED_AIO_WAIT → DEFERRED_CLEANUP / DEFERRED_CLEANUP_COMMITTING → FINISH
– after commit, initiate the deferred AIO, then clean up the temporary k/v record
  • 26. 26 INLINE COMPRESSION
● Blobs can be compressed
– Tunables target min and max sizes (see the sketch below)
– Min size sets ceiling on size reduction
– Max size caps max read amplification
● Garbage collection to limit occluded/wasted space
– compacted (rewritten) when waste exceeds threshold
● [Diagram: a blob region between start and end of an object, holding allocated, written, and written (compressed) extents over occluded uncompressed data]
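As an illustration of the trade-off the tunables control, a sketch of a “keep the compressed copy only if it shrinks enough” check; the 0.875 ratio and all names are placeholders, not BlueStore's exact policy:

    #include <cstdint>

    // If compression barely shrinks the blob, storing it compressed only adds
    // read amplification and occluded space, so fall back to the raw data.
    static bool keep_compressed(uint64_t raw_len, uint64_t compressed_len,
                                double required_ratio = 0.875) {
      return compressed_len < static_cast<uint64_t>(raw_len * required_ratio);
    }

    int main() {
      bool a = keep_compressed(65536, 30000);  // good ratio: keep compressed
      bool b = keep_compressed(65536, 64000);  // marginal: store uncompressed
      return (a && !b) ? 0 : 1;
    }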
  • 27. 27 IN-MEMORY CACHE
● OnodeSpace per collection
– in-memory ghobject_t → Onode map of decoded onodes
● BufferSpace for in-memory blobs
– all in-flight writes
– may contain cached on-disk data
● Both buffers and onodes have lifecycles linked to a Cache
– LRUCache – trivial LRU
– TwoQCache – implements 2Q cache replacement algorithm (default)
● Cache is sharded for parallelism
– Collection → shard mapping matches OSD's op_wq
– same CPU context that processes client requests will touch the LRU/2Q lists
– aio completion execution not yet sharded – TODO?
  • 29. 29 HDD: RANDOM WRITE
● [Charts: BlueStore vs FileStore HDD random write throughput (MB/s) and IOPS across IO sizes from 4 to 4096, comparing Filestore, Bluestore (wip-bitmap-alloc-perf), BS (Master-a07452d), BS (Master-d62a4948), and Bluestore (wip-bluestore-dw)]
  • 30. 30 HDD: MIXED READ/WRITE
● [Charts: BlueStore vs FileStore HDD random read/write throughput (MB/s) and IOPS across IO sizes from 4 to 4096, same five configurations as above]
  • 31. 31 HDD+NVME: MIXED READ/WRITE
● [Charts: BlueStore vs FileStore HDD/NVMe random read/write throughput (MB/s) and IOPS across IO sizes from 4 to 4096, same five configurations as above]
  • 32. 32 HDD: SEQUENTIAL WRITE
● [Charts: BlueStore vs FileStore HDD sequential write throughput (MB/s) and IOPS across IO sizes from 4 to 4096, same five configurations as above]
● The IOPS chart is annotated “WTF?” – small sequential writes show an anomaly, addressed on the next slide
  • 33. 33 WHEN TO JOURNAL WRITES
● min_alloc_size – smallest allocation unit (16KB on SSD, 64KB on HDD)
– writes >= min_alloc_size are sent to newly allocated or unwritten space
– writes < min_alloc_size are journaled as deferred small overwrites
● Pretty bad for HDDs, especially sequential writes
● New tunable threshold for direct vs deferred writes (see the sketch below)
– Separate default for HDD and SSD
● Batch deferred writes
– journal + journal + … + journal + many deferred writes + journal + …
● TODO: plenty more tuning and tweaking here!
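A hedged sketch of the direct-vs-deferred decision described above; the struct and constants are illustrative (the actual tunable is, to my knowledge, bluestore_prefer_deferred_size, with separate HDD and SSD defaults):

    #include <cstdint>

    struct DeviceTuning {
      uint64_t min_alloc_size;        // 64 KB HDD, 16 KB SSD per the slide
      uint64_t prefer_deferred_size;  // below this, journal first then write back
    };

    // Small writes (and small overwrites of existing blocks) are cheaper to
    // stage in RocksDB and apply asynchronously; big writes go straight to
    // newly allocated space and skip the double write.
    static bool write_is_deferred(uint64_t len, const DeviceTuning& t) {
      return len < t.min_alloc_size || len < t.prefer_deferred_size;
    }

    int main() {
      DeviceTuning hdd{64 * 1024, 32 * 1024};  // placeholder values, not defaults
      return (write_is_deferred(4 * 1024, hdd) && !write_is_deferred(1 << 20, hdd)) ? 0 : 1;
    }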
  • 34. 34 RGW ON HDD, 3X REPLICATION
● [Chart: RadosGW write throughput (MB/s) with 3x replication, 32MB objects, 24 HDD OSDs on 4 servers, 4 clients; test cases span 1, 4, 128, and 512 buckets across 1 and 4 RGW servers, plus Rados Bench; bars compare Filestore vs Bluestore with 512KB and 4MB chunks]
  • 35. 35 RGW ON HDD+NVME, EC 4+2
● [Chart: RadosGW write throughput (MB/s) with 4+2 erasure coding, 32MB objects, 24 HDD/NVMe OSDs on 4 servers, 4 clients; test cases span 1, 4, 128, and 512 buckets across 1 and 4 RGW servers, plus Rados Bench; bars compare Filestore vs Bluestore with 512KB and 4MB chunks]
  • 36. 36 ERASURE CODE OVERWRITES
● Luminous allows overwrites of EC objects
– Requires two-phase commit to avoid “RAID-hole”-like failure conditions
– OSD creates rollback objects
● clone_range $extent to temporary object
● write $extent with overwrite data
● clone[_range] marks blobs immutable, creates SharedBlob record
– Future small overwrites to this blob disallowed; new allocation
– Overhead of SharedBlob ref-counting record
● TODO: un-share and restore mutability in EC case
– Either hack since (in general) all references are in cache
– Or general un-sharing solution (if it doesn’t incur any additional cost)
  • 37. 37 BLUESTORE vs FILESTORE: 3X vs EC 4+2 vs EC 5+1
● [Chart: RBD 4K random write IOPS, 16 HDD OSDs, 8 32GB volumes, 256 IOs in flight; BlueStore vs FileStore bars for 3x replication, EC 4+2, and EC 5+1]
  • 39. 39 USERSPACE CACHE
● Built ‘mempool’ accounting infrastructure
– easily annotate/tag C++ classes and containers
– low overhead
– debug mode provides per-type (vs per-pool) accounting (items and bytes)
● All data managed by ‘bufferlist’ type
– manually track bytes in cache
– ref-counting behavior can lead to memory use amplification
● Requires configuration (see the sketch below)
– bluestore_cache_size (default 1GB)
● not as convenient as auto-sized kernel caches
– bluestore_cache_meta_ratio (default .9)
● Finally have meaningful implementation of fadvise NOREUSE (vs DONTNEED)
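A back-of-the-envelope sketch of how the two options above carve the per-OSD cache into metadata and data budgets; this mirrors the slide's description, not the exact BlueStore code:

    #include <cstdint>

    struct CacheBudget {
      uint64_t meta_bytes;  // onodes, extent maps, blobs (RocksDB-backed metadata)
      uint64_t data_bytes;  // cached object data buffers
    };

    static CacheBudget split_cache(uint64_t cache_size = 1ull << 30,  // default 1GB
                                   double meta_ratio = 0.9) {         // default .9
      CacheBudget b;
      b.meta_bytes = static_cast<uint64_t>(cache_size * meta_ratio);
      b.data_bytes = cache_size - b.meta_bytes;
      return b;
    }

    int main() {
      CacheBudget b = split_cache();  // roughly 920 MiB metadata, 100 MiB data
      return b.meta_bytes > b.data_bytes ? 0 : 1;
    }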
  • 40. 40 MEMORY EFFICIENCY
● Careful attention to struct sizes: packing, redundant fields
● Write path changes to reuse and expand existing blobs
– expand extent list for existing blob
– big reduction in metadata size, increase in performance
● Checksum chunk sizes
– client hints to expect sequential read/write → large csum chunks
– can optionally select weaker checksum (16 or 8 bits per chunk)
● In-memory red/black trees (std::map<>, boost::intrusive::set<>)
– low temporal write locality → many CPU cache misses, failed prefetches
– per-onode slab allocators for extent and blob structs
  • 41. 41 ROCKSDB
● Compaction
– Awkward to control priority
– Overall impact grows as total metadata corpus grows
– Invalidates rocksdb block cache (needed for range queries)
● we prefer O_DIRECT libaio
● workaround by using buffered reads and writes
● BlueFS write buffer
● Many deferred write keys end up in L0
● High write amplification
– SSDs with low-cost random reads care more about total write overhead
● Open to alternatives for SSD/NVM
– ZetaScale (recently open sourced by SanDisk)
– Or let Facebook et al make RocksDB great?
  • 43. 43 MORE RUN TO COMPLETION (RTC)
● “Normal” write has several context switches
– A: prepare transaction, initiate any aio
– B: io completion handler, queue txc for commit
– C: commit to kv db
– D: completion callbacks
● Metadata-only transactions or deferred writes?
– skip B, do half of C
● Very fast journal devices (e.g., NVDIMM)?
– do C commit synchronously
● Some completion callbacks back into OSD can be done synchronously
– avoid D
● Pipeline nature of each Sequencer makes this all opportunistic
  • 44. 44 SMR
● Described high-level strategy at Vault’16
– GSoC project
● Recent work shows less-horrible performance on DM-SMR
– Evolving ext4 for shingled disks (FAST’17, Vault’17)
– “Keep metadata in journal, avoid random writes”
● Still plan SMR-specific allocator/freelist implementation
– Tracking released extents in each zone useless
– Need reverse map to identify remaining objects
– Clean least-full zones
● Some speculation that this will be a good strategy for non-SMR devices too
  • 45. 45 “TIERING” TODAY
● We do only very basic multi-device
– WAL, KV, object data
● Not enthusiastic about doing proper tiering within BlueStore
– Basic multi-device infrastructure not difficult, but
– Policy and auto-tiering are complex and unbounded
● [Diagram: ObjectStore → BlueStore, with RocksDB/BlueFS metadata and object data placed across NVDIMM, SSD, and HDD]
  • 46. 46 TIERING BELOW!
● Prefer to defer tiering to the block layer
– bcache, dm-cache, FlashCache
● Extend libaio interface to enable hints
– HOT and COLD flags
– Add new IO_CMD_PWRITEV2 with a usable flags field
● and eventually DIF/DIX for passing csum to device?
● Modify bcache, dm-cache, etc. to respect hints
● (And above! This is unrelated to RADOS cache tiering and future tiering plans across OSDs)
● [Diagram: ObjectStore → BlueStore, with RocksDB/BlueFS metadata and data on top of dm-cache/bcache over SSD and HDD]
  • 47. 47 SPDK – KERNEL BYPASS FOR NVME
● SPDK support is in-tree, but
– Lack of caching for bluefs/rocksdb
– DPDK polling thread per OSD not practical
● Ongoing work to allow multiple logical OSDs to coexist in the same process
– Share DPDK infrastructure
– Share some caches (e.g., OSDMap cache)
– Multiplex across shared network connections (good for RDMA)
– DPDK backend for AsyncMessenger
● Blockers
– msgr2 messenger protocol to allow multiplexing
– some common shared code cleanup (g_ceph_context)
  • 49. 49 STATUS
● Early prototype in Jewel v10.2.z
– Very different from current code; no longer useful or interesting
● Stable (code and on-disk format) in Kraken v11.2.z
– Still marked ‘experimental’
● Stable and recommended default in Luminous v12.2.z (out this spring)
● Current efforts
– Workload throttling
– Performance anomalies
– Optimizing for CPU time
  • 50. 50 MIGRATION FROM FILESTORE
● Fail in place
– Fail FileStore OSD
– Create new BlueStore OSD on same device
→ period of reduced redundancy
● Disk-wise replacement
– Format new BlueStore on spare disk
– Stop FileStore OSD on same host
– Local host copy between OSDs/disks
→ reduced online redundancy, but data still available offline
– Requires extra drive slot per host
● Host-wise replacement
– Provision new host with BlueStore OSDs
– Swap new host into old host’s CRUSH position
→ no reduced redundancy during migration
– Requires spare host per rack, or extra host migration for each rack
  • 51. 51 SUMMARY
● Ceph is great at scaling out
● POSIX was a poor choice for storing objects
● Our new BlueStore backend is so much better
– Good (and rational) performance!
– Inline compression and full data checksums
● We are definitely not done yet
– Performance, performance, performance
– Other stuff
● We can finally solve our IO problems
  • 52. 52 BASEMENT CLUSTER
● 2 TB 2.5” HDDs
● 1 TB 2.5” SSDs (SATA)
● 400 GB SSDs (NVMe)
● Kraken 11.2.0
● CephFS
● Cache tiering
● Erasure coding
● BlueStore
● CRUSH device classes
● Untrained IT staff!
  • 53. THANK YOU! Sage Weil CEPH PRINCIPAL ARCHITECT sage@redhat.com @liewegas
  • 54. 54 BLOCK FREE LIST
● FreelistManager
– persist list of free extents to key/value store
– prepare incremental updates for allocate or release
● Initial implementation
– extent-based: <offset> = <length>
– kept in-memory copy
– small initial memory footprint, very expensive when fragmented
– imposed ordering constraint on commits :(
● Newer bitmap-based approach
– <offset> = <region bitmap>, where a region is N blocks (128 blocks = 8 bytes)
– use k/v merge operator to XOR allocation or release (see the sketch below)
    merge 10=0000000011
    merge 20=1110000000
– RocksDB log-structured-merge tree coalesces keys during compaction
– no in-memory state or ordering
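A toy sketch of the XOR-merge idea outside RocksDB (the real code plugs this into a RocksDB merge operator); the region size and the in-memory map are illustrative only:

    #include <cstdint>
    #include <map>

    // Each key covers a fixed region of blocks; allocating or releasing a block
    // toggles its bit. Because XOR is associative, the merge operands can be
    // folded together during compaction without reading the old value first.
    using Bitmap = uint64_t;                     // e.g. 64 blocks per region here
    static std::map<uint64_t, Bitmap> freelist;  // region offset -> bitmap

    static void merge_xor(uint64_t region, Bitmap delta) {
      freelist[region] ^= delta;                 // same operation for allocate and release
    }

    int main() {
      merge_xor(10, 0b0000000011);               // allocate blocks 0 and 1 of region 10
      merge_xor(20, 0b1110000000);               // allocate blocks 7-9 of region 20
      merge_xor(10, 0b0000000001);               // release block 0 of region 10 again
      return freelist[10] == 0b0000000010 ? 0 : 1;
    }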
  • 55. 55 BLOCK ALLOCATOR
● Allocator
– abstract interface to allocate blocks
● StupidAllocator
– extent-based
– bin free extents by size (powers of 2), see the sketch below
– choose sufficiently large extent closest to hint
– highly variable memory usage (btree of free extents)
– implemented, works
– based on ancient ebofs policy
● BitmapAllocator
– hierarchy of indexes
● L1: 2 bits = 2^6 blocks
● L2: 2 bits = 2^12 blocks
● ...
● 00 = all free, 11 = all used, 01 = mix
– fixed memory consumption
● ~35 MB RAM per TB
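A rough sketch of the power-of-two binning described for StupidAllocator; the real allocator also honors the allocation hint and merges adjacent extents, which this deliberately omits:

    #include <cstdint>
    #include <map>
    #include <vector>

    struct Extent { uint64_t offset, length; };

    static int bin_for(uint64_t length) {
      int b = 0;
      while ((1ull << (b + 1)) <= length) ++b;   // floor(log2(length))
      return b;
    }

    struct StupidishAllocator {                  // illustrative name
      std::map<int, std::vector<Extent>> bins;   // bin index -> free extents

      void release(Extent e) { bins[bin_for(e.length)].push_back(e); }

      // Find a free extent at least `want` bytes long, searching the smallest
      // bin that could possibly satisfy the request and moving upward.
      bool allocate(uint64_t want, Extent* out) {
        for (auto it = bins.lower_bound(bin_for(want)); it != bins.end(); ++it) {
          for (auto& e : it->second) {
            if (e.length >= want) { *out = e; return true; }
          }
        }
        return false;
      }
    };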
  • 56. 56 CHECKSUMS
● We scrub... periodically
– window before we detect error
– we may read bad data
– we may not be sure which copy is bad
● We want to validate checksum on every read
● Blobs include csum metadata
– crc32c (default), xxhash{64,32}
● Overhead
– 32-bit csum metadata for 4MB object and 4KB blocks = 4KB (worked out below)
– larger csum blocks (compression!)
– smaller csums
● crc32c_8 or 16
● IO hints
– seq read + write → big chunks
– compression → big chunks
● Per-pool policy
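The 4KB overhead figure works out as follows; this is just a compile-time check of the slide's arithmetic, not Ceph code:

    #include <cstdint>

    // A 4MB object checksummed in 4KB chunks with a 4-byte (32-bit) checksum
    // per chunk carries 1024 * 4 bytes = 4KB of checksum metadata.
    static constexpr uint64_t object_size = 4ull << 20;   // 4 MB
    static constexpr uint64_t csum_chunk  = 4ull << 10;   // 4 KB
    static constexpr uint64_t csum_bytes  = 4;            // crc32c = 32 bits

    static constexpr uint64_t chunks   = object_size / csum_chunk;  // 1024
    static constexpr uint64_t overhead = chunks * csum_bytes;       // 4096 bytes

    static_assert(overhead == 4096, "matches the 4KB figure on the slide");

    int main() { return 0; }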