Apache Kafka has shown that the log is a powerful abstraction for data-intensive applications. It can play a key role in managing data and distributing it across the enterprise efficiently. Vital to any data plane is not just performance, but availability and scalability. In this session, we examine what a distributed log is, how it works, and how it can achieve these goals. Specifically, we'll discuss lessons learned while building NATS Streaming, a reliable messaging layer built on NATS that provides similar semantics. We'll cover core components like leader election, data replication, log persistence, and message delivery. Come learn about distributed systems!
20. The purpose of this talk is to learn…
-> a bit about the internals of a log abstraction.
-> how it can achieve these goals.
-> some applied distributed systems theory.
21. You will probably never need to
build something like this yourself,
but it helps to know how it works.
24. Some first principles…
Storage Mechanics
• The log is an ordered, immutable sequence of messages
• Messages are atomic (meaning they can’t be broken up)
• The log has a notion of message retention based on some policies
(time, number of messages, bytes, etc.)
• The log can be played back from any arbitrary position
• The log is stored on disk
• Sequential disk access is fast*
• OS page cache means sequential access often avoids disk
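A minimal in-memory sketch of these mechanics in Go (type and method names are illustrative, not NATS Streaming's actual implementation; a real log lives on disk in segment files and leans on sequential I/O):

```go
package log

import "sync"

// Message is an atomic unit in the log: it is never split or mutated.
type Message struct {
	Offset uint64
	Data   []byte
}

// Log is an ordered, immutable sequence of messages. Appends go at the
// tail; reads can start from any offset (playback from an arbitrary position).
type Log struct {
	mu       sync.RWMutex
	messages []Message
	base     uint64 // offset of messages[0]; advances as retention trims the head
}

// Append adds a message at the tail and returns its offset.
func (l *Log) Append(data []byte) uint64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	off := l.base + uint64(len(l.messages))
	l.messages = append(l.messages, Message{Offset: off, Data: data})
	return off
}

// Read returns up to max messages starting at the given offset.
func (l *Log) Read(offset uint64, max int) []Message {
	l.mu.RLock()
	defer l.mu.RUnlock()
	if offset < l.base {
		offset = l.base // trimmed by retention; start at the oldest message
	}
	i := int(offset - l.base)
	if i >= len(l.messages) {
		return nil
	}
	end := i + max
	if end > len(l.messages) {
		end = len(l.messages)
	}
	return l.messages[i:end]
}

// Truncate enforces a count-based retention policy by dropping the
// oldest messages (time- and byte-based policies work the same way).
func (l *Log) Truncate(retain int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if n := len(l.messages) - retain; n > 0 {
		l.base += uint64(n)
		l.messages = l.messages[n:]
	}
}
```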
42. How do we achieve high availability
and fault tolerance?
43. Questions:
-> How do we ensure continuity of reads/writes?
-> How do we replicate data?
-> How do we ensure replicas are consistent?
-> How do we keep things fast?
-> How do we ensure data is durable?
51. Consensus-Based Replication
• All Replicas
-> Pro: tolerates f failures with f+1 replicas
-> Con: latency pegged to the slowest replica
• Quorum
-> Pro: hides delay from a slow replica
-> Con: tolerates f failures with 2f+1 replicas
52. Replication in Kafka
1. Select a leader
2. Maintain in-sync replica set (ISR) (initially every replica)
3. Leader writes messages to write-ahead log (WAL)
4. Leader commits messages when all replicas in ISR ack
5. Leader maintains high-water mark (HW) of last
committed message
6. Piggyback the HW on replica fetch responses; replicas then
periodically checkpoint it to disk
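A simplified sketch of the commit rule in Go (illustrative, not Kafka's actual code): the leader tracks the highest offset each ISR member has acknowledged and advances the HW to the minimum across the ISR.

```go
package replication

import "math"

// Leader tracks replication state for one partition.
type Leader struct {
	isr map[string]uint64 // replica ID -> highest acked log offset
	hw  uint64            // high-water mark: last committed offset
}

// Ack records an acknowledgment from an ISR member and advances the
// high-water mark if every in-sync replica has caught up further.
func (l *Leader) Ack(replica string, offset uint64) {
	if offset > l.isr[replica] {
		l.isr[replica] = offset
	}
	// A message is committed once *all* ISR members have acked it, so
	// the HW is the minimum acked offset across the ISR.
	min := uint64(math.MaxUint64)
	for _, o := range l.isr {
		if o < min {
			min = o
		}
	}
	if min != math.MaxUint64 && min > l.hw {
		l.hw = min
	}
	// Kafka piggybacks the HW on fetch responses so followers learn it
	// and periodically checkpoint it to disk.
}
```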
74. Replication in NATS Streaming
1. Metadata Raft group replicates client state
2. Separate Raft group per topic replicates messages
and subscriptions
3. Conceptually, two logs: Raft log and message log
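NATS Streaming's clustering is built on the hashicorp/raft library. A sketch of how the two logs relate (simplified, not the server's actual FSM): Raft replicates and commits entries, and the FSM's Apply hook moves each committed entry into the message log that consumers read.

```go
package stream

import (
	"io"

	"github.com/hashicorp/raft"
)

// messageLog is the application-level log (see Storage Mechanics above);
// the Raft log is separate and managed by the raft library.
type messageLog interface {
	Append(data []byte) uint64
}

// fsm bridges the two logs: once Raft commits an entry, Apply moves it
// into the message log, which is what consumers actually read.
type fsm struct {
	log messageLog
}

// Apply is invoked by Raft for every committed entry, in log order.
func (f *fsm) Apply(entry *raft.Log) interface{} {
	return f.log.Append(entry.Data)
}

// Snapshot/Restore are elided for brevity; a real FSM must implement
// both so Raft can compact its own log.
func (f *fsm) Snapshot() (raft.FSMSnapshot, error) { return nil, nil }
func (f *fsm) Restore(rc io.ReadCloser) error      { return rc.Close() }
```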
77. Scaling Raft
With a single topic, one node is elected leader and it
heartbeats messages to followers
78. Scaling Raft
As the number of topics grows unbounded, so does the
number of Raft groups.
79. Scaling Raft
Technique 1: run a fixed number of Raft groups and use
a consistent hash to map a topic to a group.
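A sketch of technique 1 in Go (illustrative). With a fixed number of groups, a plain hash mod the group count is enough; a full consistent-hash ring only earns its keep when the set of groups can change.

```go
package sharding

import "hash/fnv"

// groupFor deterministically maps a topic to one of numGroups fixed
// Raft groups, so every node agrees on which group owns a topic.
// e.g. groupFor("orders", 16) yields the same group on every node.
func groupFor(topic string, numGroups int) int {
	h := fnv.New32a()
	h.Write([]byte(topic))
	return int(h.Sum32() % uint32(numGroups))
}
```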
80. Scaling Raft
Technique 2: run an entire node’s worth of topics as a
single group using a layer on top of Raft.
https://www.cockroachlabs.com/blog/scaling-raft
94. Questions:
-> How do we keep things fast?
95. Performance
1. Publisher acks
-> broker acks on commit (slow but safe)
-> broker acks on local log append (fast but unsafe)
-> publisher doesn’t wait for ack (fast but unsafe)
2. Don’t fsync, rely on replication for durability
3. Keep disk access sequential and maximize zero-copy reads
4. Batch aggressively
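A sketch of the first and last points in Go (names are illustrative): the ack policy is a knob the publisher chooses, and batching amortizes per-message costs.

```go
package publish

// AckPolicy captures the publisher-ack trade-off.
type AckPolicy int

const (
	AckOnCommit AckPolicy = iota // broker acks after replication: slow but safe
	AckOnAppend                  // broker acks after local log append: fast but unsafe
	AckNone                      // fire-and-forget: fastest, least safe
)

// Batcher accumulates messages and flushes them together, amortizing
// per-message network and disk costs ("batch aggressively").
type Batcher struct {
	buf   [][]byte
	limit int
	flush func(batch [][]byte)
}

// Add buffers a message and flushes once the batch is full. A real
// implementation would also flush on a timer to bound latency, which
// is exactly the latency-vs-throughput decision noted later.
func (b *Batcher) Add(msg []byte) {
	b.buf = append(b.buf, msg)
	if len(b.buf) >= b.limit {
		b.flush(b.buf)
		b.buf = nil
	}
}
```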
96. Questions:
-> How do we ensure data is durable?
97. Durability
1. Quorum guarantees durability
-> Comes for free with Raft
-> In Kafka, you need to configure min.insync.replicas and acks, e.g. a
topic with replication factor 3, min.insync.replicas=2, and
acks=all (see the sketch after this list)
2. Disable unclean leader elections
3. At odds with availability,
i.e. no quorum == no reads/writes
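For the Kafka settings in point 1, a sketch of the producer side using the sarama Go client (the client choice is an assumption; any Kafka client exposes an equivalent acks setting). min.insync.replicas and unclean leader election are topic/broker settings configured server-side.

```go
package main

import (
	"log"

	"github.com/IBM/sarama"
)

func main() {
	// acks=all: the leader waits for every replica in the ISR.
	cfg := sarama.NewConfig()
	cfg.Producer.RequiredAcks = sarama.WaitForAll
	cfg.Producer.Return.Successes = true // required by SyncProducer

	// Server side (not shown): create the topic with replication factor 3,
	// min.insync.replicas=2, and unclean.leader.election.enable=false.
	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	_, _, err = producer.SendMessage(&sarama.ProducerMessage{
		Topic: "events",
		Value: sarama.StringEncoder("hello"),
	})
	if err != nil {
		log.Fatal(err) // e.g. "not enough replicas" if the ISR drops below 2
	}
}
```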
106. High Fan-out
1. Observation: with an immutable log, there are no
stale/phantom reads
2. This should make it “easy” (in theory) to scale to a
large number of consumers (e.g. hundreds of
thousands of IoT/edge devices)
3. With Raft, we can use “non-voters” to act as read
replicas and load balance consumers
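A sketch of point 3 using the hashicorp/raft API (assumes a running *raft.Raft handle on the leader; the ID and address are illustrative):

```go
package cluster

import (
	"time"

	"github.com/hashicorp/raft"
)

// addReadReplica joins a node to the group as a non-voter: it receives
// the replicated log but does not count toward election or commit
// quorum, so it can absorb consumer reads without slowing down writes.
// Must be invoked on the leader.
func addReadReplica(r *raft.Raft, id raft.ServerID, addr raft.ServerAddress) error {
	f := r.AddNonvoter(id, addr, 0 /* prevIndex: no index check */, 10*time.Second)
	return f.Error()
}
```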
108. Push vs. Pull
• In Kafka, consumers pull data from brokers
• In NATS Streaming, brokers push data to consumers
• Pros/cons to both:
-> With push we need flow control; it's implicit in pull (see the sketch after this list)
-> Need to make decisions about optimizing for
latency vs. throughput
-> Thick vs. thin client and API ergonomics
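A sketch of credit-based flow control for the push model (illustrative, not NATS Streaming's actual protocol): the consumer grants credits, and the broker only pushes while credits remain, so a slow consumer backpressures the broker instead of being overrun.

```go
package delivery

// pushSession implements credit-based flow control: the broker may only
// push while the consumer has outstanding credits.
type pushSession struct {
	credits chan struct{} // one token per message the consumer can accept
	send    func(msg []byte)
}

func newPushSession(maxInflight int, send func([]byte)) *pushSession {
	return &pushSession{credits: make(chan struct{}, maxInflight), send: send}
}

// Grant is called when the consumer signals it can accept n more messages.
func (s *pushSession) Grant(n int) {
	for i := 0; i < n; i++ {
		s.credits <- struct{}{}
	}
}

// Push blocks until a credit is available, then delivers the message;
// a slow consumer therefore throttles the broker automatically.
func (s *pushSession) Push(msg []byte) {
	<-s.credits
	s.send(msg)
}
```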
110. Bookkeeping
• Two ways to track position in the log:
-> Have the server track it for consumers
-> Have consumers track it
• Trade-off between API simplicity and performance/server
complexity
• Also, consumers might not have stable storage (e.g. IoT device,
ephemeral container, etc.)
• Can we split the difference?
111. Offset Storage
• Can store offsets themselves in the log (in Kafka,
originally had to store them in ZooKeeper)
• Clients periodically checkpoint offset to log
• Use log compaction to retain only latest offsets
• On recovery, fetch latest offset from log
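A sketch of this scheme in Go (illustrative; in Kafka this is the compacted __consumer_offsets topic): checkpoints are appended as (consumer, offset) records, compaction keeps only the newest record per consumer, and recovery reads back the latest record.

```go
package offsets

import "sync"

type record struct {
	consumer string
	offset   uint64
}

// offsetStore keeps consumer positions in the log itself.
type offsetStore struct {
	mu        sync.Mutex
	records   []record          // raw checkpoint log since last compaction
	compacted map[string]uint64 // latest offset per consumer after compaction
}

// Checkpoint appends a new offset record for the consumer.
func (s *offsetStore) Checkpoint(consumer string, offset uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.records = append(s.records, record{consumer, offset})
}

// Compact drops all but the latest record per consumer, mirroring
// key-based log compaction.
func (s *offsetStore) Compact() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.compacted == nil {
		s.compacted = make(map[string]uint64)
	}
	for _, r := range s.records {
		s.compacted[r.consumer] = r.offset // later records win
	}
	s.records = s.records[:0]
}

// Recover returns the last checkpointed offset for a consumer.
func (s *offsetStore) Recover(consumer string) (uint64, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	// Check un-compacted records newest-first, then the compacted map.
	for i := len(s.records) - 1; i >= 0; i-- {
		if s.records[i].consumer == consumer {
			return s.records[i].offset, true
		}
	}
	off, ok := s.compacted[consumer]
	return off, ok
}
```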
117. Competing Goals
1. Performance
-> Easy to make something fast that’s not fault-tolerant or scalable
-> Simplicity of mechanism makes this easier
-> Simplicity of “UX” makes this harder
2. Scalability (and fault-tolerance)
-> Scalability and FT are at odds with simplicity
-> Cannot be an afterthought—needs to be designed from day 1
3. Simplicity (“UX”)
-> Simplicity of mechanism shifts complexity elsewhere (e.g. client)
-> Easy to let server handle complexity; hard when that needs to be
distributed and consistent while still being fast
119. Availability vs. Consistency
• CAP theorem
• Consistency requires quorum which hinders
availability and performance
• Minimize what you need to replicate
120. Trade-offs and Lessons Learned
1. Competing goals
2. Availability vs. Consistency
3. Aim for simplicity
122. Trade-offs and Lessons Learned
1. Competing goals
2. Availability vs. Consistency
3. Aim for simplicity
4. Lean on existing work
123. Don’t roll your own coordination protocol;
use Raft, ZooKeeper, etc.
124. Trade-offs and Lessons Learned
1. Competing goals
2. Availability vs. Consistency
3. Aim for simplicity
4. Lean on existing work
5. There are probably edge cases for which you
haven’t written tests
125. There are many failure modes, and you can
only write so many tests.
Formal methods and property-based/generative testing can help.
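As a small illustration of the property-based approach, using Go's testing/quick against the Log sketch from the Storage Mechanics section: the property is that any sequence of appends reads back unchanged and in order.

```go
package log

import (
	"bytes"
	"testing"
	"testing/quick"
)

// Property: for any sequence of appends, reading from offset 0 returns
// the same messages in the same order (the log is ordered and
// immutable). testing/quick generates the random inputs for us.
func TestAppendThenReadRoundTrip(t *testing.T) {
	prop := func(msgs [][]byte) bool {
		var l Log
		for _, m := range msgs {
			l.Append(m)
		}
		got := l.Read(0, len(msgs))
		if len(got) != len(msgs) {
			return false
		}
		for i := range msgs {
			if !bytes.Equal(got[i].Data, msgs[i].Data) {
				return false
			}
		}
		return true
	}
	if err := quick.Check(prop, nil); err != nil {
		t.Error(err)
	}
}
```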
127. Trade-offs and Lessons Learned
1. Competing goals
2. Availability vs. Consistency
3. Aim for simplicity
4. Lean on existing work
5. There are probably edge cases for which you
haven’t written tests
6. Be honest with your users
128. Don’t try to be everything to everyone. Be
explicit about design decisions, trade-offs,
guarantees, defaults, etc.