Building a Distributed Message Log from Scratch

Apache Kafka has shown that the log is a powerful abstraction for data-intensive applications. It can play a key role in managing data and distributing it across the enterprise efficiently. Vital to any data plane is not just performance, but availability and scalability. In this session, we examine what a distributed log is, how it works, and how it can achieve these goals. Specifically, we'll discuss lessons learned while building NATS Streaming, a reliable messaging layer built on NATS that provides similar semantics. We'll cover core components like leader election, data replication, log persistence, and message delivery. Come learn about distributed systems!


  1. 1. Building a Distributed Message Log from Scratch Tyler Treat · Iowa Code Camp · 11/04/17
  2. 2. Tyler Treat - Messaging Nerd @ Apcera - Working on nats.io - Distributed systems - bravenewgeek.com
  3. 3. Outline
 - The Log
 -> What?
 -> Why?
 - Implementation
 -> Storage mechanics
 -> Data-replication techniques
 -> Scaling message delivery
 -> Trade-offs and lessons learned
  4. 4. The Log
  5. 5. The Log A totally-ordered, append-only data structure.
  6. 6. The Log 0
  7. 7. 0 1 The Log
  8. 8. 0 1 2 The Log
  9. 9. 0 1 2 3 The Log
  10. 10. 0 1 2 3 4 The Log
  11. 11. 0 1 2 3 4 5 The Log
  12. 12. 0 1 2 3 4 5 (oldest record -> newest record) The Log
  13. 13. (oldest record -> newest record) The Log
  14. 14. Logs record what happened and when.
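
A minimal in-memory sketch of that abstraction (illustrative only; as the later slides show, the real thing lives on disk): appends return monotonically increasing offsets, and reads can start from any offset.

```go
package main

import "fmt"

// Log is a totally-ordered, append-only sequence of records: new records are
// only ever added at the end, and existing records are never modified.
type Log struct {
	records [][]byte
}

// Append adds a record at the tail and returns its offset.
func (l *Log) Append(record []byte) uint64 {
	l.records = append(l.records, record)
	return uint64(len(l.records) - 1)
}

// Read returns every record from the given offset onward, so the log can be
// played back from any position.
func (l *Log) Read(offset uint64) [][]byte {
	if offset >= uint64(len(l.records)) {
		return nil
	}
	return l.records[offset:]
}

func main() {
	var l Log
	l.Append([]byte("oldest record")) // offset 0
	l.Append([]byte("newest record")) // offset 1
	fmt.Println(len(l.Read(0)))       // 2
}
```
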
  15. 15. caches databases indexes writes
  16. 16. Examples in the wild: -> Apache Kafka
 -> Amazon Kinesis -> NATS Streaming
 -> Tank
  17. 17. Key Goals: -> Performance -> High Availability -> Scalability
  18. 18. The purpose of this talk is to learn…
 -> a bit about the internals of a log abstraction. -> how it can achieve these goals. -> some applied distributed systems theory.
  19. 19. You will probably never need to build something like this yourself, but it helps to know how it works.
  20. 20. Implementation
  21. 21. Implementation Don’t try this at home.
  22. 22. Storage Mechanics: some first principles…
 • The log is an ordered, immutable sequence of messages
 • Messages are atomic (meaning they can’t be broken up)
 • The log has a notion of message retention based on some policies (time, number of messages, bytes, etc.)
 • The log can be played back from any arbitrary position
 • The log is stored on disk
 • Sequential disk access is fast*
 • OS page cache means sequential access often avoids disk
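
A tiny sketch of the "stored on disk, accessed sequentially" principle (the file name is illustrative): records are only ever appended to the end of the active segment file.

```go
package main

import "os"

func main() {
	// Open the active segment for append-only writes: every write lands at
	// the end of the file, keeping disk access sequential.
	f, err := os.OpenFile("00000000.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if _, err := f.Write([]byte("record\n")); err != nil {
		panic(err)
	}
	// Note: no f.Sync() here; whether to fsync on every write is a
	// durability/performance trade-off the talk returns to later.
}
```
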
  23. 23. http://queue.acm.org/detail.cfm?id=1563874
  24. 24. iostat: avg-cpu: %user 13.53, %nice 0.00, %system 11.28, %iowait 0.00, %steal 0.00, %idle 75.19; Device xvda: tps 0.00, Blk_read/s 0.00, Blk_wrtn/s 0.00, Blk_read 0, Blk_wrtn 0
  25. 25. Storage Mechanics log file 0
  26. 26. Storage Mechanics log file 0 1
  27. 27. Storage Mechanics log file 0 1 2
  28. 28. Storage Mechanics log file 0 1 2 3
  29. 29. Storage Mechanics log file 0 1 2 3 4
  30. 30. Storage Mechanics log file 0 1 2 3 4 5
  31. 31. Storage Mechanics log file … 0 1 2 3 4 5
  32. 32. Storage Mechanics: log segment 0 file: [0 1 2]; log segment 3 file: [3 4 5]
  33. 33. Storage Mechanics: log segment 0 file: [0 1 2] + index segment 0 file: [0 1 2]; log segment 3 file: [3 4 5] + index segment 3 file: [0 1 2]
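
A sketch of the segment/index layout above (struct and field names are illustrative, not an actual on-disk format): each segment file has a base offset, and its index maps a record's relative offset to that record's byte position in the segment file.

```go
package storage

// indexEntry maps a record's offset, relative to the segment's base offset,
// to the byte position of that record inside the segment's log file.
type indexEntry struct {
	relativeOffset uint32
	position       uint32
}

// segment pairs a log file with its index file.
type segment struct {
	baseOffset uint64       // logical offset of the first record in this segment
	index      []indexEntry // loaded (or memory-mapped) from the index file
}

// lookup translates a logical offset into a byte position within the segment,
// so a read can seek straight to the record without scanning the file.
func (s *segment) lookup(offset uint64) (position uint32, ok bool) {
	rel := uint32(offset - s.baseOffset)
	for _, e := range s.index {
		if e.relativeOffset == rel {
			return e.position, true
		}
	}
	return 0, false
}
```
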
  34. 34. Zero-copy Reads: traditional path: disk -> page cache (kernel space) -> application read (user space) -> socket send (kernel space) -> NIC
  35. 35. Zero-copy Reads: sendfile path: disk -> page cache -> NIC, staying entirely in kernel space
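
In Go, for example, the sendfile path is reachable without calling the syscall directly; a sketch, assuming the consumer is connected over TCP:

```go
package transport

import (
	"io"
	"net"
	"os"
)

// sendSegment streams a segment file to a consumer connection. On Linux,
// io.Copy from an *os.File to a *net.TCPConn takes the ReadFrom fast path,
// which uses sendfile(2): bytes move from the page cache to the socket
// without ever being copied into user space.
func sendSegment(conn *net.TCPConn, path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(conn, f)
	return err
}
```
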
  36. 36. Left as an exercise for the listener…
 -> Batching
 -> Compression
  37. 37. caches databases indexes writes
  38. 38. caches databases indexes writes
  39. 39. caches databases indexes writes
  40. 40. How do we achieve high availability and fault tolerance?
  41. 41. Questions:
 -> How do we ensure continuity of reads/writes? -> How do we replicate data? -> How do we ensure replicas are consistent? -> How do we keep things fast? -> How do we ensure data is durable?
  42. 42. Questions:
 -> How do we ensure continuity of reads/writes? -> How do we replicate data? -> How do we ensure replicas are consistent? -> How do we keep things fast? -> How do we ensure data is durable?
  43. 43. caches databases indexes writes
  44. 44. Questions:
 -> How do we ensure continuity of reads/writes? -> How do we replicate data? -> How do we ensure replicas are consistent? -> How do we keep things fast? -> How do we ensure data is durable?
  45. 45. Data-Replication Techniques 1. Gossip/multicast protocols Epidemic broadcast trees, bimodal multicast, SWIM, HyParView, NeEM
 2. Consensus protocols 2PC/3PC, Paxos, Raft, Zab, chain replication
  46. 46. Questions:
 -> How do we ensure continuity of reads/writes? -> How do we replicate data? -> How do we ensure replicas are consistent? -> How do we keep things fast? -> How do we ensure data is durable?
  47. 47. Data-Replication Techniques 1. Gossip/multicast protocols Epidemic broadcast trees, bimodal multicast, SWIM, HyParView, NeEM
 2. Consensus protocols 2PC/3PC, Paxos, Raft, Zab, chain replication
  48. 48. Consensus-Based Replication 1. Designate a leader 2. Replicate by either:
 a) waiting for all replicas
 —or— b) waiting for a quorum of replicas
  49. 49. Consensus-Based Replication: All Replicas: pro: tolerates f failures with f+1 replicas; con: latency pegged to slowest replica. Quorum: pro: hides delay from a slow replica; con: tolerates f failures with 2f+1 replicas.
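
A sketch of the two options, assuming a hypothetical sendAppend RPC to each follower: the only difference between "wait for all" and "wait for a quorum" is how many acknowledgements the leader blocks on.

```go
package replication

import (
	"errors"
	"time"
)

// sendAppend stands in for the RPC that ships a record to one follower.
type sendAppend func(record []byte) error

// replicate appends a record and waits for `needed` acknowledgements,
// counting the leader's own local append as one. With
// needed = len(followers)+1 this is "wait for all replicas"; with
// needed = (len(followers)+1)/2 + 1 it is quorum-based replication.
func replicate(record []byte, followers []sendAppend, needed int, timeout time.Duration) error {
	acks := make(chan struct{}, len(followers))
	for _, send := range followers {
		go func(send sendAppend) {
			if send(record) == nil {
				acks <- struct{}{}
			}
		}(send)
	}

	got := 1 // the leader's local append
	deadline := time.After(timeout)
	for got < needed {
		select {
		case <-acks:
			got++
		case <-deadline:
			return errors.New("timed out waiting for replica acks")
		}
	}
	return nil
}
```
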
  50. 50. Replication in Kafka 1. Select a leader 2. Maintain in-sync replica set (ISR) (initially every replica) 3. Leader writes messages to write-ahead log (WAL) 4. Leader commits messages when all replicas in ISR ack 5. Leader maintains high-water mark (HW) of last committed message 6. Piggyback HW on replica fetch responses which replicas periodically checkpoint to disk
  51. 51. Replication in Kafka: writes -> b1 (leader): [0 1 2 3 4 5], HW: 3; b2 (follower): [0 1 2 3 4], HW: 3; b3 (follower): [0 1 2 3], HW: 3; ISR: {b1, b2, b3}
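
A drastically simplified sketch of steps 4-6 above (not Kafka's actual code; names are illustrative): the leader tracks how far each ISR member has fetched, and the high-water mark is the minimum of those positions; everything up to the HW is committed.

```go
package commit

// leaderState is a toy model of the leader's bookkeeping for one partition.
type leaderState struct {
	logEndOffset uint64            // next offset the leader will write
	isrFetched   map[string]uint64 // highest offset each ISR follower has replicated
}

// onFetch records that a follower has replicated up through `offset` and
// returns the new HW; in the real protocol the HW is piggybacked on the
// fetch response, and followers periodically checkpoint it to disk.
func (l *leaderState) onFetch(follower string, offset uint64) uint64 {
	if offset > l.isrFetched[follower] {
		l.isrFetched[follower] = offset
	}
	return l.highWaterMark()
}

// highWaterMark is the last committed offset: the minimum position replicated
// across the ISR (the leader itself is implicitly at logEndOffset).
func (l *leaderState) highWaterMark() uint64 {
	hw := l.logEndOffset
	for _, fetched := range l.isrFetched {
		if fetched < hw {
			hw = fetched
		}
	}
	return hw
}
```
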
  52. 52. Failure Modes 1. Leader fails
  53. 53. Leader fails: writes -> b1 (leader): [0 1 2 3 4 5], HW: 3; b2 (follower): [0 1 2 3 4], HW: 3; b3 (follower): [0 1 2 3], HW: 3; ISR: {b1, b2, b3}
  54. 54. Leader fails: writes -> b1 (leader): [0 1 2 3 4 5], HW: 3; b2 (follower): [0 1 2 3 4], HW: 3; b3 (follower): [0 1 2 3], HW: 3; ISR: {b1, b2, b3}
  55. 55. Leader fails: writes -> b1 (leader): [0 1 2 3 4 5], HW: 3; b2 (follower): [0 1 2 3 4], HW: 3; b3 (follower): [0 1 2 3], HW: 3; ISR: {b1, b2, b3}
  56. 56. Leader fails: writes -> b2 (leader): [0 1 2 3], HW: 3; b3 (follower): [0 1 2 3], HW: 3; ISR: {b2, b3}
  57. 57. Failure Modes 1. Leader fails
 2. Follower fails
  58. 58. Follower fails: writes -> b1 (leader): [0 1 2 3 4 5], HW: 3; b2 (follower): [0 1 2 3 4], HW: 3; b3 (follower): [0 1 2 3], HW: 3; ISR: {b1, b2, b3}
  59. 59. Follower fails: writes -> b1 (leader): [0 1 2 3 4 5], HW: 3; b2 (follower): [0 1 2 3 4], HW: 3; b3 (follower): [0 1 2 3], HW: 3; ISR: {b1, b2, b3}
  60. 60. Follower fails (replica.lag.time.max.ms): writes -> b1 (leader): [0 1 2 3 4 5], HW: 3; b2 (follower): [0 1 2 3 4], HW: 3; b3 (follower): [0 1 2 3], HW: 3; ISR: {b1, b2, b3}
  61. 61. Follower fails (replica.lag.time.max.ms): writes -> b1 (leader): [0 1 2 3 4 5], HW: 3; b3 (follower): [0 1 2 3], HW: 3; ISR: {b1, b3}
  62. 62. Failure Modes 1. Leader fails
 2. Follower fails
 3. Follower temporarily partitioned
  63. 63. Follower temporarily partitioned: writes -> b1 (leader): [0 1 2 3 4 5], HW: 3; b2 (follower): [0 1 2 3 4], HW: 3; b3 (follower): [0 1 2 3], HW: 3; ISR: {b1, b2, b3}
  64. 64. Follower temporarily partitioned: writes -> b1 (leader): [0 1 2 3 4 5], HW: 3; b2 (follower): [0 1 2 3 4], HW: 3; b3 (follower): [0 1 2 3], HW: 3; ISR: {b1, b2, b3}
  65. 65. Follower temporarily partitioned (replica.lag.time.max.ms): writes -> b1 (leader): [0 1 2 3 4 5], HW: 3; b2 (follower): [0 1 2 3 4], HW: 3; b3 (follower): [0 1 2 3], HW: 3; ISR: {b1, b2, b3}
  66. 66. Follower temporarily partitioned (replica.lag.time.max.ms): writes -> b1 (leader): [0 1 2 3 4 5], HW: 3; b2 (follower): [0 1 2 3 4], HW: 3; b3 (follower): [0 1 2 3], HW: 3; ISR: {b1, b2}
  67. 67. Follower temporarily partitioned: writes -> b1 (leader): [0 1 2 3 4 5], HW: 5; b2 (follower): [0 1 2 3 4 5], HW: 5; b3 (follower): [0 1 2 3], HW: 3; ISR: {b1, b2}
  68. 68. Follower temporarily partitioned: writes -> b1 (leader): [0 1 2 3 4 5], HW: 5; b2 (follower): [0 1 2 3 4 5], HW: 5; b3 (follower): [0 1 2 3], HW: 3; ISR: {b1, b2}
  69. 69. Follower temporarily partitioned: writes -> b1 (leader): [0 1 2 3 4 5], HW: 5; b2 (follower): [0 1 2 3 4 5], HW: 5; b3 (follower): [0 1 2 3 4], HW: 4; ISR: {b1, b2}
  70. 70. Follower temporarily partitioned: writes -> b1 (leader): [0 1 2 3 4 5], HW: 5; b2 (follower): [0 1 2 3 4 5], HW: 5; b3 (follower): [0 1 2 3 4 5], HW: 5; ISR: {b1, b2}
  71. 71. Follower temporarily partitioned: writes -> b1 (leader): [0 1 2 3 4 5], HW: 5; b2 (follower): [0 1 2 3 4 5], HW: 5; b3 (follower): [0 1 2 3 4 5], HW: 5; ISR: {b1, b2, b3}
  72. 72. Replication in NATS Streaming 1. Metadata Raft group replicates client state
 2. Separate Raft group per topic replicates messages and subscriptions
 3. Conceptually, two logs: Raft log and message log
  73. 73. http://thesecretlivesofdata.com/raft
  74. 74. Challenges 1. Scaling Raft
  75. 75. Scaling Raft With a single topic, one node is elected leader and it heartbeats messages to followers
  76. 76. Scaling Raft As the number of topics grows unbounded, so does the number of Raft groups.
  77. 77. Scaling Raft Technique 1: run a fixed number of Raft groups and use a consistent hash to map a topic to a group.
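
A sketch of technique 1 (the function name is illustrative): with a fixed number of groups, hashing the topic name is enough to spread topics across them; a full consistent-hash ring only becomes important if groups can be added or removed without remapping most topics.

```go
package sharding

import "hash/fnv"

// groupFor maps a topic to one of a fixed number of Raft groups by hashing
// the topic name, so every node agrees on the mapping without coordination.
func groupFor(topic string, numGroups int) int {
	h := fnv.New32a()
	h.Write([]byte(topic))
	return int(h.Sum32()) % numGroups
}
```
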
  78. 78. Scaling Raft Technique 2: run an entire node’s worth of topics as a single group using a layer on top of Raft. https://www.cockroachlabs.com/blog/scaling-raft
  79. 79. Challenges 1. Scaling Raft 2. Dual writes
  80. 80. Dual Writes Raft Store committed
  81. 81. Dual Writes msg 1Raft Store committed
  82. 82. Dual Writes msg 1 msg 2Raft Store committed
  83. 83. Dual Writes msg 1 msg 2Raft msg 1 msg 2Store committed
  84. 84. Dual Writes msg 1 msg 2 subRaft msg 1 msg 2Store committed
  85. 85. Dual Writes msg 1 msg 2 sub msg 3Raft msg 1 msg 2Store committed
  86. 86. Dual Writes msg 1 msg 2 sub msg 3 add peer msg 4Raft msg 1 msg 2 msg 3Store committed
  87. 87. Dual Writes msg 1 msg 2 sub msg 3 add peer msg 4Raft msg 1 msg 2 msg 3Store committed
  88. 88. Dual Writes msg 1 msg 2 sub msg 3 add peer msg 4Raft msg 1 msg 2 msg 3 msg 4Store commit
  89. 89. Dual Writes msg 1 msg 2 sub msg 3 add peer msg 4Raft msg 1 msg 2 msg 3 msg 4Store 0 1 2 3 4 5 0 1 2 3 physical offset logical offset
  90. 90. Dual Writes msg 1 msg 2 sub msg 3 add peer msg 4Raft msg 1 msg 2Index 0 1 2 3 4 5 0 1 2 3 physical offset logical offset msg 3 msg 4
  91. 91. Treat the Raft log as our message write-ahead log.
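
A sketch of what that means in practice (types are illustrative, not the NATS Streaming source): message entries are interleaved with other Raft entries such as subscriptions and membership changes, so an index maps logical message offsets to physical Raft log offsets instead of writing messages to a second store.

```go
package wal

// entryKind distinguishes the operations that share the Raft log.
type entryKind int

const (
	kindMessage entryKind = iota
	kindSubscribe
	kindAddPeer
)

type raftEntry struct {
	kind entryKind
	data []byte
}

// buildIndex returns, for each logical message offset, the physical offset of
// the corresponding message entry in the Raft log; non-message entries are
// applied to other state and skipped.
func buildIndex(raftLog []raftEntry) []uint64 {
	var index []uint64 // index[logical] == physical
	for physical, e := range raftLog {
		if e.kind == kindMessage {
			index = append(index, uint64(physical))
		}
	}
	return index
}
```
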
  92. 92. Questions:
 -> How do we ensure continuity of reads/writes? -> How do we replicate data? -> How do we ensure replicas are consistent? -> How do we keep things fast? -> How do we ensure data is durable?
  93. 93. Performance 1. Publisher acks 
 -> broker acks on commit (slow but safe)
 -> broker acks on local log append (fast but unsafe)
 -> publisher doesn’t wait for ack (fast but unsafe) 
 2. Don’t fsync, rely on replication for durability
 3. Keep disk access sequential and maximize zero-copy reads
 4. Batch aggressively
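
A sketch of point 4 ("batch aggressively"), with illustrative names: records are buffered and flushed together when the batch fills up or a linger interval elapses, amortizing replication and disk writes across many records.

```go
package batching

import (
	"sync"
	"time"
)

// Batcher buffers records and hands them to flush in batches.
type Batcher struct {
	mu      sync.Mutex
	pending [][]byte

	maxBatch int
	flush    func(batch [][]byte) // e.g. one replicated append covering many records
}

// NewBatcher starts a background loop that flushes every linger interval even
// if the batch isn't full, bounding the latency added by batching.
func NewBatcher(maxBatch int, linger time.Duration, flush func([][]byte)) *Batcher {
	b := &Batcher{maxBatch: maxBatch, flush: flush}
	go func() {
		for range time.Tick(linger) {
			b.Flush()
		}
	}()
	return b
}

// Add buffers one record and flushes if the batch is full.
func (b *Batcher) Add(record []byte) {
	b.mu.Lock()
	b.pending = append(b.pending, record)
	full := len(b.pending) >= b.maxBatch
	b.mu.Unlock()
	if full {
		b.Flush()
	}
}

// Flush hands the current batch to the flush callback.
func (b *Batcher) Flush() {
	b.mu.Lock()
	batch := b.pending
	b.pending = nil
	b.mu.Unlock()
	if len(batch) > 0 {
		b.flush(batch)
	}
}
```
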
  94. 94. Questions:
 -> How do we ensure continuity of reads/writes? -> How do we replicate data? -> How do we ensure replicas are consistent? -> How do we keep things fast? -> How do we ensure data is durable?
  95. 95. Durability 1. Quorum guarantees durability
 -> Comes for free with Raft
 -> In Kafka, need to configure min.insync.replicas and acks, e.g.
 topic with replication factor 3, min.insync.replicas=2, and
 acks=all
 2. Disable unclean leader elections
 3. At odds with availability,
 i.e. no quorum == no reads/writes
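
On the Kafka side, the producer half of that configuration might look like this with the sarama Go client (an assumption for illustration; the topic itself would be created with replication factor 3 and min.insync.replicas=2 on the broker side):

```go
package durability

import "github.com/Shopify/sarama"

// newProducerConfig returns a sarama config where the broker only acks a
// publish once every in-sync replica has the message (acks=all). Combined
// with min.insync.replicas=2 on a replication-factor-3 topic, a write is
// durable as long as a quorum survives.
func newProducerConfig() *sarama.Config {
	cfg := sarama.NewConfig()
	cfg.Producer.RequiredAcks = sarama.WaitForAll // acks=all
	cfg.Producer.Return.Successes = true          // block SyncProducer sends until the ack arrives
	return cfg
}
```
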
  96. 96. Scaling Message Delivery 1. Partitioning
  97. 97. Partitioning is how we scale linearly.
  98. 98. caches databases indexes writes
  99. 99. HELLA WRITES caches databases indexes
  100. 100. caches databases indexes HELLA WRITES
  101. 101. caches databases indexes; writes split by topic: Topic: purchases, Topic: inventory
  102. 102. caches databases indexes; writes split by partition: Topic: purchases (Accounts A-M, Accounts N-Z), Topic: inventory (SKUs A-M, SKUs N-Z)
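
A sketch of the split on this slide (purely illustrative): a range partitioner that routes accounts A-M to one partition and N-Z to another; hash partitioning on the key is the more common default.

```go
package partitioning

import "strings"

// partitionFor implements the "Accounts A-M / Accounts N-Z" split: records
// are routed by the first letter of their key so each partition can be
// written, stored, and consumed independently.
func partitionFor(key string) int {
	k := strings.ToUpper(key)
	if len(k) > 0 && k[0] >= 'N' && k[0] <= 'Z' {
		return 1 // Accounts N-Z
	}
	return 0 // Accounts A-M (anything else also lands here)
}
```
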
  103. 103. Scaling Message Delivery 1. Partitioning 2. High fan-out
  104. 104. High Fan-out 1. Observation: with an immutable log, there are no stale/phantom reads
 2. This should make it “easy” (in theory) to scale to a large number of consumers (e.g. hundreds of thousands of IoT/edge devices)
 3. With Raft, we can use “non-voters” to act as read replicas and load balance consumers
  105. 105. Scaling Message Delivery 1. Partitioning 2. High fan-out 3. Push vs. pull
  106. 106. Push vs. Pull • In Kafka, consumers pull data from brokers • In NATS Streaming, brokers push data to consumers • Pros/cons to both:
 -> With push we need flow control; implicit in pull
 -> Need to make decisions about optimizing for
 latency vs. throughput
 -> Thick vs. thin client and API ergonomics
  107. 107. Scaling Message Delivery 1. Partitioning 2. High fan-out 3. Push vs. pull 4. Bookkeeping
  108. 108. Bookkeeping • Two ways to track position in the log:
 -> Have the server track it for consumers
 -> Have consumers track it
 • Trade-off between API simplicity and performance/server complexity
 • Also, consumers might not have stable storage (e.g. IoT device, ephemeral container, etc.)
 • Can we split the difference?
  109. 109. Offset Storage • Can store offsets themselves in the log (in Kafka, originally had to store them in ZooKeeper)
 • Clients periodically checkpoint offset to log
 • Use log compaction to retain only latest offsets
 • On recovery, fetch latest offset from log
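
A sketch of the scheme (in-memory for illustration; real log compaction rewrites segments on disk): checkpoints are keyed by consumer/topic/partition, and keeping only the latest entry per key is exactly what compaction provides.

```go
package offsets

// checkpoint is one entry in the offsets log, e.g. key "bob-foo-0" meaning
// consumer bob on topic foo, partition 0.
type checkpoint struct {
	key    string
	offset uint64
}

// compact keeps only the latest offset per key, mirroring what log compaction
// does to the offsets log over time.
func compact(log []checkpoint) map[string]uint64 {
	latest := make(map[string]uint64)
	for _, c := range log {
		latest[c.key] = c.offset // later entries overwrite earlier ones
	}
	return latest
}

// resumeOffset is what a consumer fetches on recovery to find where to resume.
func resumeOffset(log []checkpoint, key string) (uint64, bool) {
	offset, ok := compact(log)[key]
	return offset, ok
}
```
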
  110. 110. Offset Storage: Offsets log: [0] bob-foo-0: 11, [1] alice-foo-0: 15, [2] bob-foo-1: 20, [3] bob-foo-0: 18, [4] bob-foo-0: 21
  111. 111. Offset Storage: Offsets log: [0] bob-foo-0: 11, [1] alice-foo-0: 15, [2] bob-foo-1: 20, [3] bob-foo-0: 18, [4] bob-foo-0: 21
  112. 112. Offset Storage: after compaction: [1] alice-foo-0: 15, [2] bob-foo-1: 20, [4] bob-foo-0: 21
  113. 113. Offset Storage Advantages:
 -> Fault-tolerant
 -> Consistent reads
 -> High write throughput (unlike ZooKeeper)
 -> Reuses existing structures, so less server
 complexity
  114. 114. Trade-offs and Lessons Learned 1. Competing goals
  115. 115. Competing Goals 1. Performance
 -> Easy to make something fast that’s not fault-tolerant or scalable
 -> Simplicity of mechanism makes this easier
 -> Simplicity of “UX” makes this harder 2. Scalability (and fault-tolerance)
 -> Scalability and FT are at odds with simplicity
 -> Cannot be an afterthought—needs to be designed from day 1 3. Simplicity (“UX”)
 -> Simplicity of mechanism shifts complexity elsewhere (e.g. client)
 -> Easy to let server handle complexity; hard when that needs to be
 distributed and consistent while still being fast
  116. 116. Trade-offs and Lessons Learned 1. Competing goals 2. Availability vs. Consistency
  117. 117. Availability vs. Consistency • CAP theorem • Consistency requires quorum which hinders availability and performance • Minimize what you need to replicate
  118. 118. Trade-offs and Lessons Learned 1. Competing goals 2. Availability vs. Consistency 3. Aim for simplicity
  119. 119. Distributed systems are complex enough.
 Simple is usually better (and faster).
  120. 120. Trade-offs and Lessons Learned 1. Competing goals 2. Availability vs. Consistency 3. Aim for simplicity 4. Lean on existing work
  121. 121. Don’t roll your own coordination protocol; use Raft, ZooKeeper, etc.
  122. 122. Trade-offs and Lessons Learned 1. Competing goals 2. Availability vs. Consistency 3. Aim for simplicity 4. Lean on existing work 5. There are probably edge cases for which you haven’t written tests
  123. 123. There are many failure modes, and you can only write so many tests. Formal methods and property-based/generative testing can help.
  124. 124. Trade-offs and Lessons Learned 1. Competing goals 2. Availability vs. Consistency 3. Aim for simplicity 4. Lean on existing work 5. There are probably edge cases for which you haven’t written tests 6. Be honest with your users
  125. 125. Don’t try to be everything to everyone. Be explicit about design decisions, trade-offs, guarantees, defaults, etc.
  126. 126. Thanks! @tyler_treat
 bravenewgeek.com
