Introduction to Ceph, an open-source, massively scalable distributed file system.
This document explains the architecture of Ceph and its integration with OpenStack.
15. RADOS
Reliable
  Replicated to avoid data loss
Autonomic
  OSDs communicate with one another to detect failures
  Replication is done transparently
Distributed
Object Store
16. RADOS (2)
Fundamentals of Ceph
Everything is stored in RADOS, including Ceph FS metadata
Two components: mon, osd
Objects are placed by the CRUSH algorithm
17. OSD
Object storage daemon; one OSD per disk
Uses xfs/btrfs as the backend (btrfs is experimental!)
Keeps a write-ahead journal for integrity and performance (a ceph.conf sketch follows)
3 to 10,000s of OSDs in a cluster
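The journal is configured per OSD in ceph.conf; a minimal sketch, with the size and path purely illustrative:

    [osd]
        ; write-ahead journal size in MB (illustrative value)
        osd journal size = 1024
        ; illustrative path; a raw partition on a fast device is common
        osd journal = /var/lib/ceph/osd/$cluster-$id/journal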
20. Locating objects
RADOS locates objects with an algorithm called CRUSH
The location is decided by pure calculation
No central metadata server, so no SPoF
Massive scalability
21. CRUSH
1. Assign a placement group: pg = Hash(object name) % num_pg
2. CRUSH(pg, cluster map, rule) → a list of OSDs
A toy sketch of this two-step lookup follows.
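This is not the real implementation: md5 stands in for Ceph's rjenkins hash, and rendezvous hashing stands in for CRUSH's bucket selection (real CRUSH also walks the cluster map hierarchy under a placement rule); all names here are invented.

    import hashlib

    def pg_for_object(name, num_pg):
        # Step 1: hash the object name into a placement group.
        # md5 is a stand-in for Ceph's rjenkins hash.
        h = int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "little")
        return h % num_pg

    def crush_like(pg, osds, num_replicas):
        # Step 2: deterministically map the PG to a set of OSDs.
        # Rendezvous (highest-random-weight) hashing stands in for real
        # CRUSH, which also honors the map hierarchy and placement rules.
        def score(osd):
            return hashlib.md5(f"{pg}:{osd}".encode()).digest()
        return sorted(osds, key=score, reverse=True)[:num_replicas]

    osds = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
    pg = pg_for_object("my-object", num_pg=128)
    print(pg, crush_like(pg, osds, num_replicas=3))

Because both steps are pure functions of the name and the (shared) map, any client computes the same answer with no lookup.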
27. Wrap-up: CRUSH
Object name + cluster map → object locations
Deterministic
No metadata at all
Calculation is done on the clients
The cluster map reflects the network hierarchy
28. RADOSGW
The Ceph stack:
RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP (example after this list)
RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift
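Because LIBRADOS exposes Python bindings, storing and reading an object can be sketched like this; it assumes a reachable cluster, the default /etc/ceph/ceph.conf, and a pool named "data":

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('data')   # I/O context bound to the 'data' pool
        try:
            ioctx.write_full('greeting', b'hello rados')   # store one object
            print(ioctx.read('greeting'))                  # read it back
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()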
30. RBD
(The same architecture diagram as on slide 28, here introducing RBD; a Python sketch follows.)
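RBD is scriptable through the same bindings; a minimal sketch, with the pool name "rbd" and the image name invented:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('rbd')
        try:
            # Create a 1 GiB image, striped over RADOS objects.
            rbd.RBD().create(ioctx, 'vm-disk-0', 1024 ** 3)
            image = rbd.Image(ioctx, 'vm-disk-0')
            try:
                image.write(b'some bytes', 0)   # write at offset 0
                print(image.size())
            finally:
                image.close()
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()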
33. Ceph FS
(The same architecture diagram as on slide 28, here introducing Ceph FS.)
34. Ceph FS
A POSIX-compliant file system built on top of RADOS
Can be mounted with the native Linux kernel driver (cephfs) or via FUSE (see the commands below)
Metadata servers (mds) manage the metadata of the file system tree
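The two mount paths look roughly like this; the monitor address, key, and mount point are placeholders:

    # Native kernel client:
    mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secret=<key>
    # FUSE client:
    ceph-fuse /mnt/cephfs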
35. Ceph FS is reliable
The MDS writes its journal to RADOS, so metadata is not lost when an MDS fails
Multiple MDSes can run for HA and load balancing
36. Ceph FS and OSD
(Diagram: clients send data I/O directly to the OSDs; the MDS holds POSIX metadata (directory, time, owner, etc.) in memory and writes its metadata journal to the OSDs.)
45. Erasure coding
Uses erasure coding instead of replication for data durability (worked example below)
Suitable for rarely modified or accessed objects

                                      Erasure coding   Replication
  Space overhead (survive 2 fails)    approx. 40%      200%
  CPU                                 High             Low
  Latency                             High             Low
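To make the 40% vs 200% row concrete, here is the arithmetic, assuming a k=5, m=2 erasure-code profile (the slide does not name the profile):

    k, m = 5, 2                  # 5 data chunks + 2 coding chunks; survives any 2 failures
    ec_overhead = m / k          # 2 extra chunks per 5 data chunks -> 0.4 (40%)
    replicas = 3                 # 3 full copies also survive 2 failures
    rep_overhead = replicas - 1  # 2 extra copies -> 2.0 (200%)
    print(f"erasure coding: {ec_overhead:.0%}, replication: {rep_overhead:.0%}")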
46. Cache tiering
(Diagram: librados clients read and write against the cache tier, e.g. SSD; the tiering is transparent to clients. On a miss, the read goes to the base tier, e.g. HDD, erasure coded, and the object is fetched into the cache; dirty objects are flushed back to the base tier. Example setup commands follow.)
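Wiring a write-back cache tier is a few ceph CLI commands along these lines; the pool names are invented:

    ceph osd tier add base-pool cache-pool           # attach cache-pool in front of base-pool
    ceph osd tier cache-mode cache-pool writeback    # absorb writes; flush to the base tier later
    ceph osd tier set-overlay base-pool cache-pool   # route client I/O through the cache tier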
47. Key-value OSD backend
Uses LevelDB as the OSD backend (instead of xfs)
Better performance, especially for small objects
Plans to support RocksDB, NVMKV, etc. (a config sketch follows)
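The backend is selected per OSD in ceph.conf; a sketch only, since the backend was experimental and its exact name varied by release:

    [osd]
        ; experimental key-value backend on LevelDB
        osd objectstore = keyvaluestore-dev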
54. Benefits of using Ceph with OpenStack
Unified storage for both images and volumes
Copy-on-write cloning and snapshot support
Native qemu/KVM support for better performance (config sketch below)
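The OpenStack side is a handful of config options; a sketch, with the pool names assumed:

    # glance-api.conf: store images in RADOS via RBD
    default_store = rbd
    rbd_store_pool = images

    # cinder.conf: serve volumes from RBD, with copy-on-write cloning of Glance images
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes
    glance_api_version = 2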
57. Ceph is
Massively scalable storage
Unified architecture for object / block / POSIX FS
OpenStack integration is ready to use & awesome
58. Ceph and GlusterFS

                   Ceph                             GlusterFS
  Distribution     Object based                     File based
  File location    Deterministic algorithm (CRUSH)  Distributed hash table, stored in xattrs
  Replication      Server side                      Client side
  Primary usage    Object / block storage           POSIX-like file system
  Challenge        POSIX file system needs          Object / block storage needs
                   improvement                      improvement
59. Further reading
Ceph documentation
https://ceph.com/docs/master/
Well documented.
Sébastien Han's blog
http://www.sebastien-han.fr/blog/
An awesome blog.
CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data
http://ceph.com/papers/weil-crush-sc06.pdf
The CRUSH algorithm paper.
Ceph: A Scalable, High-Performance Distributed File System
http://www.ssrc.ucsc.edu/Papers/weil-osdi06.pdf
The original Ceph paper.
Ceph の覚え書きのインデックス ("Index of Ceph notes")
http://www.nminoru.jp/~nminoru/unix/ceph/
A well-written introduction in Japanese.
61. Calamari will be open sourced
“Calamari, the monitoring and diagnostics tool that Inktank has developed as part of the Inktank Ceph Enterprise product, will soon be open sourced.”
http://ceph.com/community/red-hat-to-acquire-inktank/