Apache Kafka has shown that the log is a powerful abstraction for data-intensive applications. It can play a key role in managing data and distributing it across the enterprise efficiently. Vital to any data plane is not just performance, but availability and scalability. In this session, we examine what a distributed log is, how it works, and how it can achieve these goals. Specifically, we'll discuss lessons learned while building NATS Streaming, a reliable messaging layer built on NATS that provides similar semantics. We'll cover core components like leader election, data replication, log persistence, and message delivery. Come learn about distributed systems!
20. @tyler_treat
The purpose of this talk is to learn…
-> a bit about the internals of a log abstraction.
-> how it can achieve performance, availability, and scalability.
-> some applied distributed systems theory.
25. @tyler_treat
Some first principles…
• The log is an ordered, immutable sequence of messages
• Messages are atomic (meaning they can’t be broken up)
• The log has a notion of message retention based on some policies
(time, number of messages, bytes, etc.)
• The log can be played back from any arbitrary position
• The log is stored on disk
• Sequential disk access is fast*
• OS page cache means sequential access often avoids disk
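The first principles above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how Kafka or NATS Streaming store data: the `Log` class, its `max_messages` retention policy, and the method names are all assumptions made for the example, and a real log would append to segment files on disk.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Log:
    """In-memory stand-in for an append-only message log (hypothetical)."""
    max_messages: int = 1000   # retention policy: keep at most this many messages
    base_offset: int = 0       # offset of the oldest retained message
    _messages: List[bytes] = field(default_factory=list)

    def append(self, message: bytes) -> int:
        """Append a whole message and return its offset; old entries never change."""
        offset = self.base_offset + len(self._messages)
        self._messages.append(message)
        self._enforce_retention()
        return offset

    def _enforce_retention(self) -> None:
        """Drop the oldest messages once the count limit is exceeded."""
        excess = len(self._messages) - self.max_messages
        if excess > 0:
            del self._messages[:excess]
            self.base_offset += excess

    def read_from(self, offset: int) -> List[Tuple[int, bytes]]:
        """Replay (offset, message) pairs starting at any retained offset."""
        start = max(offset, self.base_offset) - self.base_offset
        return list(enumerate(self._messages[start:], self.base_offset + start))
```

Offsets keep increasing even after retention trims the head of the log, so a consumer can always resume from the position it last recorded (or from the oldest retained message if that position has aged out).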
45. @tyler_treat
Questions:
-> How do we ensure continuity of reads/writes?
-> How do we replicate data?
-> How do we ensure replicas are consistent?
-> How do we keep things fast?
-> How do we ensure data is durable?
52. @tyler_treat
Replication in Kafka
1. Select a leader
2. Maintain in-sync replica set (ISR) (initially every replica)
3. Leader writes messages to write-ahead log (WAL)
4. Leader commits messages when all replicas in ISR ack
5. Leader maintains high-water mark (HW) of last
committed message
6. Piggyback HW on replica fetch responses; replicas
periodically checkpoint it to disk
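The steps above can be sketched as a toy leader that tracks follower acks and advances the high-water mark. This is a simplified model under stated assumptions, not Kafka's implementation: the `Leader` class and its methods are hypothetical, the ISR here holds follower IDs only (the leader is implicitly in sync with itself), and the HW is modeled as the count of committed messages.

```python
from typing import Dict, List, Set


class Leader:
    """Toy leader tracking an ISR and a high-water mark (hypothetical names)."""

    def __init__(self, follower_ids: Set[str]) -> None:
        self.log: List[bytes] = []                 # leader's write-ahead log
        self.isr: Set[str] = set(follower_ids)     # initially every replica
        self.acked: Dict[str, int] = {r: 0 for r in follower_ids}
        self.high_water_mark = 0                   # number of committed messages

    def append(self, message: bytes) -> None:
        """Write to the WAL; the message is not committed until the ISR acks."""
        self.log.append(message)
        self._advance_hw()

    def on_ack(self, replica_id: str, count: int) -> None:
        """Record that a follower has replicated the first `count` messages."""
        self.acked[replica_id] = max(self.acked[replica_id], count)
        self._advance_hw()

    def shrink_isr(self, replica_id: str) -> None:
        """Remove a slow or failed follower so it no longer blocks commits."""
        self.isr.discard(replica_id)
        self._advance_hw()

    def _advance_hw(self) -> None:
        """Commit everything that every ISR member has acknowledged."""
        committed = min([len(self.log)] + [self.acked[r] for r in self.isr])
        self.high_water_mark = max(self.high_water_mark, committed)

    def fetch_response(self, from_offset: int) -> dict:
        """Piggyback the HW on the data returned to a fetching follower."""
        return {"messages": self.log[from_offset:],
                "high_water_mark": self.high_water_mark}
```

Shrinking the ISR is what lets the leader keep committing when a follower falls behind, at the cost of fewer copies backing each committed message.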
69. @tyler_treat
Replication in NATS Streaming
1. Raft replicates client state, messages, and
subscriptions
2. Conceptually, two logs: Raft log and message log
3. Parallels work implementing Raft in RabbitMQ
71. @tyler_treat
Replication in NATS Streaming
• Initially used Raft group per topic and separate
metadata group
• A couple issues with this:
-> Topic scalability
-> Increased complexity due to lack of ordering between Raft groups
76. @tyler_treat
Scaling Raft
Technique 2: run an entire node’s worth of topics as a
single group using a layer on top of Raft.
https://www.cockroachlabs.com/blog/scaling-raft
92. @tyler_treat
Performance
1. Publisher acks
-> broker acks on commit (slow but safe)
-> broker acks on local log append (fast but unsafe)
-> publisher doesn’t wait for ack (fast but unsafe)
2. Don’t fsync, rely on replication for durability
3. Keep disk access sequential and maximize zero-copy reads
4. Batch aggressively
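As an illustration of the batching point, here is a minimal size-based batcher. It is a sketch only: the `Batcher` class, the `flush` callback, and the `max_batch_bytes` threshold are assumptions for the example, and real clients typically also flush on a time limit.

```python
from typing import Callable, List


class Batcher:
    """Groups small messages into batches before handing them to `flush`."""

    def __init__(self, flush: Callable[[List[bytes]], None],
                 max_batch_bytes: int = 16384) -> None:
        self._flush = flush              # e.g. one network send or one disk write
        self._max = max_batch_bytes
        self._pending: List[bytes] = []
        self._pending_bytes = 0

    def add(self, message: bytes) -> None:
        """Queue a message, flushing first if it would overflow the batch."""
        if self._pending and self._pending_bytes + len(message) > self._max:
            self.flush()
        self._pending.append(message)
        self._pending_bytes += len(message)

    def flush(self) -> None:
        """Hand the current batch to the callback and start a new one."""
        if self._pending:
            self._flush(self._pending)
            self._pending = []
            self._pending_bytes = 0
```

Each flush amortizes one system call or network round trip over many messages, trading a little latency for throughput.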
94. @tyler_treat
Durability
1. Quorum guarantees durability
-> Comes for free with Raft
-> In Kafka, need to configure min.insync.replicas and acks, e.g.
topic with replication factor 3, min.insync.replicas=2, and
acks=all
2. Disable unclean leader elections
3. At odds with availability,
i.e. no quorum == no reads/writes
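The interaction between acks, min.insync.replicas, and availability can be modeled with two small functions. This is a simplified model rather than broker code: the function names are hypothetical, and it assumes the leader counts as a member of the ISR and that min.insync.replicas is only enforced for acks=all.

```python
def write_accepted(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Return True if a produce request would be accepted in this model."""
    if acks == "all":
        # Writes are rejected when too few replicas are in sync.
        return isr_size >= min_insync_replicas
    # With weaker ack settings, only a live leader is needed.
    return isr_size >= 1


def tolerated_failures(acks: str, min_insync_replicas: int) -> int:
    """Replica losses an acknowledged message is guaranteed to survive."""
    if acks == "all":
        # An acked message exists on at least min_insync_replicas replicas.
        return min_insync_replicas - 1
    return 0
```

With replication factor 3, min.insync.replicas=2, and acks=all, an acknowledged message survives one replica failure, but writes stop once the ISR shrinks to a single replica. That is the durability versus availability trade-off stated above.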
106. @tyler_treat
High Fan-Out
1. Observation: with an immutable log, there are no
stale/phantom reads
2. This should make it “easy” (in theory) to scale to a
large number of consumers
3. With Raft, we can use “non-voters” to act as read
replicas and load balance consumers
108. @tyler_treat
Push vs. Pull
• In Kafka, consumers pull data from brokers
• In NATS Streaming, brokers push data to consumers
• Design implications:
• Fan-out
• Flow control
• Optimizing for latency vs. throughput
• Client complexity
111. @tyler_treat
Competing Goals
1. Performance
-> Easy to make something fast that’s not fault-tolerant or scalable
-> Simplicity of mechanism makes this easier
-> Simplicity of “UX” makes this harder
2. Scalability and fault-tolerance
-> At odds with simplicity
-> Cannot be an afterthought
3. Simplicity
-> Simplicity of mechanism shifts complexity elsewhere (e.g. client)
-> Easy to let server handle complexity; hard when that needs to be
distributed, consistent, and fast
116. @tyler_treat
“A complex system designed from
scratch never works and cannot
be patched up to make it work.
You have to start over, beginning
with a working simple system.”
119. @tyler_treat
Trade-Offs and Lessons Learned
1. Competing goals
2. Aim for simplicity
3. You can’t effectively bolt on fault-tolerance
4. Lean on existing work
5. There are probably edge cases for which you
haven’t written tests
120. @tyler_treat
There are many failure modes, and you can
only write so many tests.
Formal methods and property-based/
generative testing can help.
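A rough idea of what generative testing looks like, using only the Python standard library. The `BoundedLog` under test and its invariants are hypothetical stand-ins; a dedicated library such as Hypothesis would add input shrinking and richer strategies on top of the same idea.

```python
import random
from collections import deque


class BoundedLog:
    """Tiny log under test: keeps at most `limit` newest messages (hypothetical)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.next_offset = 0
        self.entries = deque()  # (offset, message) pairs, oldest first

    def append(self, message: bytes) -> int:
        offset = self.next_offset
        self.entries.append((offset, message))
        self.next_offset += 1
        while len(self.entries) > self.limit:
            self.entries.popleft()
        return offset


def check_invariants(log: BoundedLog) -> None:
    """Properties that must hold after every operation, whatever the input."""
    offsets = [o for o, _ in log.entries]
    assert len(offsets) <= log.limit
    # Retained offsets are contiguous and end at the most recent append.
    assert offsets == list(range(log.next_offset - len(offsets), log.next_offset))


def run_generative_test(trials: int = 200, seed: int = 0) -> int:
    """Drive the log with random limits and messages, checking invariants."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        log = BoundedLog(limit=rng.randint(1, 8))
        for _ in range(rng.randint(0, 50)):
            size = rng.randint(0, 16)
            log.append(bytes(rng.getrandbits(8) for _ in range(size)))
            check_invariants(log)
    return trials
```

Instead of enumerating cases by hand, the test states properties once and lets random inputs search for a counterexample.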
122. @tyler_treat
Trade-Offs and Lessons Learned
1. Competing goals
2. Aim for simplicity
3. You can’t effectively bolt on fault-tolerance
4. Lean on existing work
5. There are probably edge cases for which you
haven’t written tests
6. Be honest with your users
123. @tyler_treat
Don’t try to be everything to everyone.
Be explicit about design decisions, trade-offs,
guarantees, defaults, etc.