PaxosStore is a high-availability storage system developed to support the comprehensive business of WeChat. It employs a combinational design in the storage layer, engaging multiple storage engines constructed for different storage models. A distinctive feature of PaxosStore is that it extracts the Paxos-based distributed consensus protocol into a middleware that is universally accessible to the underlying multi-model storage engines. This facilitates tuning, maintaining, scaling, and extending the storage engines. In our engineering experience, implementing a practical consistent read/write protocol is far more complex than its theory suggests. To tackle this engineering complexity, we propose a layered design of the Paxos-based storage protocol stack, where PaxosLog, the key data structure used in the protocol, is devised to bridge the programming-oriented consistent read/write to the storage-oriented Paxos procedure. Additionally, we present Paxos-based optimizations that make fault tolerance more efficient.
PaxosStore is open source:
https://github.com/tencent/paxosstore
For details of the design, please refer to our VLDB 2017 paper:
http://www.vldb.org/pvldb/vol10/p1730-lin.pdf
Video of the presentation is also available:
https://youtu.be/5zNRfuaCgBI
6. Storage Protocol Stack
The protocol stack has three layers:
- Consistent Read/Write: data access based on PaxosLog
- PaxosLog: each log entry is determined by Paxos
- Paxos: determining a value with consensus
PaxosStore implements the Paxos procedure using semi-symmetric message passing (read our paper for details):
- Prepare phase: making a preliminary agreement
- Accept phase: reaching the eventual consensus
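The two phases can be illustrated from an acceptor's point of view. This is a minimal Python sketch of classic Paxos, not PaxosStore's actual C++ implementation; the class and method names are ours:

```python
class Acceptor:
    """One Paxos acceptor: remembers its promise and its last accepted value."""

    def __init__(self):
        self.promised = 0      # highest proposal number promised so far
        self.accepted = None   # (proposal_no, value) last accepted, if any

    def prepare(self, n):
        """Prepare phase: promise to ignore proposals numbered below n,
        and report any previously accepted value to the proposer."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        """Accept phase: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False
```

A value is chosen once a majority of acceptors accept the same proposal; a proposer that sees a previously accepted value in the prepare responses must re-propose that value.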
7. Storage Protocol Stack
A PaxosLog is a sequence of entries. Each entry consists of:
- Request ID: timestamp (16 bits), request seq. (16 bits), client ID (32 bits)
- Promise No.
- Proposal No.
- Value
- Proposer ID
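The three Request ID fields fit in a single 64-bit word. A packing sketch follows; the field order within the word is our assumption for illustration, not taken from the paper:

```python
def pack_request_id(timestamp, seq, client_id):
    """Pack the Request ID fields into one 64-bit integer
    (assumed layout: timestamp | request seq. | client ID)."""
    assert 0 <= timestamp < 1 << 16   # timestamp: 16 bits
    assert 0 <= seq < 1 << 16         # request seq.: 16 bits
    assert 0 <= client_id < 1 << 32   # client ID: 32 bits
    return (timestamp << 48) | (seq << 32) | client_id

def unpack_request_id(rid):
    """Recover (timestamp, seq, client_id) from the packed word."""
    return rid >> 48, (rid >> 32) & 0xFFFF, rid & 0xFFFFFFFF
```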
8. Storage Protocol Stack
[Figure: a data object (value r, indexed by its Data Key) paired with its PaxosLog; entries ⋯, i−2, i−1, and i are chosen, while entry i+1 is still pending.]
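The relation between chosen and pending entries and the data object can be sketched as follows (hypothetical names, not PaxosStore's API; a pending gap blocks the application of later entries):

```python
class PaxosLogStore:
    """A per-key PaxosLog whose chosen entries are applied, in log order,
    to the data object (sketch)."""

    def __init__(self):
        self.log = {}       # entry index -> value, recorded once chosen
        self.applied = 0    # highest contiguously applied entry index
        self.data = None    # current value of the data object

    def choose(self, index, value):
        """Record a chosen entry, then apply any contiguous run of
        chosen entries; a pending (missing) entry blocks later ones."""
        self.log[index] = value
        while self.applied + 1 in self.log:
            self.applied += 1
            self.data = self.log[self.applied]
```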
9. Storage Protocol Stack
[Figure: PaxosLog-as-Value (for key-value storage); the Data Key maps directly to a PaxosLog whose latest entries (r_i, r_{i+1}) carry the value itself.]
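A rough sketch of the PaxosLog-as-Value idea: since each key-value write carries the whole value, an older entry becomes useless once a newer one is chosen, so the log can be truncated aggressively. We simplify here to keeping only the latest chosen entry; the class name is ours:

```python
class PaxosLogAsValue:
    """Simplified PaxosLog-as-Value: the log under a key retains only the
    latest chosen entry, which doubles as the key's current value."""

    def __init__(self):
        self.entry_id = 0
        self.value = None

    def choose(self, entry_id, value):
        # a newer chosen entry supersedes (and garbage-collects) the old one
        if entry_id > self.entry_id:
            self.entry_id, self.value = entry_id, value

    def read(self):
        return self.value
```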
10. Storage Protocol Stack
Consistent Read. For a data object r:
1) the system reads its value from any of the up-to-date replicas of r, and
2) these up-to-date replicas must dominate (i.e., form a majority of) the total replicas of r.
For read-frequent data, these criteria are likely to be satisfied already. Under data contention, a trial Paxos procedure is used to synchronize the replicas; its log entries do not correspond to any substantive write operation.
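Reading "dominate" as a majority condition, the consistent-read precondition can be sketched as a check over per-replica versions (the function and its argument are our illustration):

```python
def can_read_directly(replica_versions):
    """Consistent-read precondition (sketch): the replicas holding the
    newest version must form a majority of all replicas of the object."""
    latest = max(replica_versions)
    up_to_date = sum(1 for v in replica_versions if v == latest)
    return 2 * up_to_date > len(replica_versions)
```

When the check fails, a trial Paxos procedure can bring the lagging replicas up to date before the read is served.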
11. Storage Protocol Stack
Consistent Write: relying on the Paxos procedures
- Liveness
- PaxosLog-entry batched applying
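One way to picture PaxosLog-entry batched applying, under our own simplifying assumption that the object is whole-value so a contiguous run of chosen entries collapses into a single storage write:

```python
def batch_apply(store, key, chosen_entries):
    """Apply a contiguous run of chosen PaxosLog entries with one storage
    write instead of one write per entry (sketch). Returns the number of
    storage writes issued."""
    if not chosen_entries:
        return 0
    store[key] = chosen_entries[-1]   # last chosen value wins
    return 1
```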
18. Data Recovery
When recovery starts, the strategy is chosen as follows:
- If incremental PaxosLog entries exist: recover through the PaxosLog.
- Otherwise, if the data object is append-only: recover through delta updates of the data image.
- Otherwise: recover through the whole data image.
Recovery time decreases from whole-image recovery toward PaxosLog-based recovery.
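The recovery decision on this slide can be sketched as a small selector (the function and strategy names are illustrative):

```python
def recovery_strategy(has_incremental_log, append_only):
    """Pick the cheapest applicable recovery source (sketch)."""
    if has_incremental_log:
        return "paxoslog"        # replay the missing PaxosLog entries
    if append_only:
        return "delta-updates"   # ship only the delta of the data image
    return "whole-image"         # fall back to copying the full data image
```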
Lazy Recovery
Obsolete data replicas are not recovered immediately upon node recovery, but are recovered when they are subsequently accessed.
- Failover reads
- De-duplicated processing
19. Implementation
• Use coroutine to program asynchronous procedure in the
synchronous paradigm
Search Repository https://github.com/Tencent/libco
Much more efficient than Boost.Coroutine, while easy to use
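The pseudo-synchronous style that libco enables in C/C++ can be illustrated in Python with asyncio; this is an analogy only, not libco's API:

```python
import asyncio

async def rpc(x):
    """Stand-in for an asynchronous network call."""
    await asyncio.sleep(0)
    return x + 1

async def handler():
    # Reads like straight-line synchronous code; each await yields to the
    # event loop instead of registering a callback.
    a = await rpc(1)
    b = await rpc(a)
    return b

print(asyncio.run(handler()))  # -> 3
```

libco achieves the same effect for blocking socket calls by hooking them and switching coroutines under the hood, so existing synchronous-looking code becomes asynchronous.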
20. Failure Recovery in WeChat Production
• Read/write ratio is 15:1 on average
[Figure: throughput over time; a failure happens at 14:20 and the node resumes at 15:27, after which throughput is restored to 95% of normal within 3 minutes.]
21. Summary
• What is covered in the paper
– The design of PaxosStore, with emphasis on the construction of the consistent read/write protocol
– The fault-tolerance scheme and data recovery strategies
– Pragmatic optimizations drawn from our engineering practice
• Key lessons learned
– Apart from faults and failures, system overload is also a critical factor that affects system availability
o In particular, the potential avalanche effect caused by overload must be given enough attention when designing the system's fault-tolerance scheme.
– Use coroutines and socket hooks to program asynchronous procedures in a pseudo-synchronous style
o This helps eliminate error-prone function callbacks and simplifies the implementation of asynchronous logic.