A talk given at JDay Lviv 2015 in Ukraine; originally developed by Yoav Abrahami, and based on the works of Kyle "Aphyr" Kingsbury:
Consistency, availability and partition tolerance: these seemingly innocuous concepts have been giving engineers and researchers of distributed systems headaches for over 15 years. But despite how important they are to the design and architecture of modern software, they are still poorly understood by many engineers.
This session covers the definition and practical ramifications of the CAP theorem; you may think that this has nothing to do with you because you "don't work on distributed systems", or possibly that it doesn't matter because you "run over a local network." Yet even traditional enterprise CRUD applications must obey the laws of physics, which are exactly what the CAP theorem describes. Know the rules of the game and they'll serve you well, or ignore them at your own peril...
5. By Example
• I want this book!
– I add it to the cart
– Then continue browsing
• There’s only one copy in stock!
6. By Example
• I want this book!
– I add it to the cart
– Then continue browsing
• There’s only one copy in stock!
• … and someone else just bought it.
8. Consistency: Defined
• In a consistent system: all participants see the same value at the same time
• “Do you have this book in stock?”
9. Consistency: Defined
• If our book store is an inconsistent system:
– Two customers may buy the book
– But there’s only one item in inventory!
• We’ve just violated a business constraint.
11. Availability: Defined
• An available system:
– Is reachable
– Responds to requests (within SLA)
• Availability does not guarantee success!
– The operation may fail
– “This book is no longer available”
12. Availability: Defined
• What if the system is unavailable?
– I complete the checkout
– And click on “Pay”
– And wait
– And wait some more
– And…
• Did I purchase the book or not?!
14. Partition Tolerance: Defined
• Partition: one or more nodes are unreachable
• No practical system runs on a single node
• So all systems are susceptible!
15. “The Network is Reliable”
• All four failure modes happen in an IP network: drop, delay, duplicate, reorder
• To a client, delays and drops are the same
• Perfect failure detection is provably impossible1!
1 “Impossibility of Distributed Consensus with One Faulty Process”, Fischer, Lynch and Paterson
16. Partition Tolerance: Reified
• External causes:
– Bad network config
– Faulty equipment
– Scheduled maintenance
• Even software causes partitions:
– Bad network config.
– GC pauses
– Overloaded servers
• Plenty of war stories!
– Netflix
– Twilio
– GitHub
– Wix :-)
• Some hard numbers1:
– 5.2 failed devices/day
– 59K lost packets/day
– Adding redundancy only improves by 40%
1 “Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications”, Gill et al
18. In Pictures
• Let’s consider a simple system:
– Service A writes values
– Service B reads values
– Values are replicated between nodes
• These are “ideal” systems
– Bug-free, predictable
19. In Pictures
• “Sunny day scenario”:
– A writes a new value V1
– The value is replicated to node 2
– B reads the new value
20. In Pictures
• What happens if the network drops?
– A writes a new value V1
– Replication fails
– B still sees the old value
– The system is inconsistent
21. In Pictures
• Possible mitigation: synchronous replication
– A writes a new value V1
– Cannot replicate, so the write is rejected
– Both A and B still see V0
– The system is logically unavailable
23. The network is not reliable
• Distributed systems must handle partitions
• Any modern system runs on more than one node…
• … and is therefore distributed
• Ergo, you have to choose:
– Consistency over availability
– Availability over consistency
24. Granularity
• Real systems comprise many operations
– “Add book to cart”
– “Pay for the book”
• Each has different properties
• It’s a spectrum, not a binary choice!
[Spectrum from Consistency to Availability: Checkout sits near the consistency end, Shopping Cart near the availability end]
25. CAP IN THE REAL WORLD
Kyle “Aphyr” Kingsbury: breaking consistency guarantees since 2013
26. PostgreSQL
• Traditional RDBMS
– Transactional
– ACID compliant
• Primarily a CP system
– Writes against a master node
• “Not a distributed system”
– Except with a client at play!
27. PostgreSQL
• Writes are a simplified 2PC:
– Client votes to commit
– Server validates transaction
– Server stores changes
– Server acknowledges commit
– Client receives acknowledgement
28. PostgreSQL
• But what if the ack is never received?
• The commit is already stored…
• … but the client has no indication!
• The system is in an inconsistent state
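To make the failure mode concrete, here is a minimal JDBC sketch of the client’s view (the connection URL, table and values are invented for illustration): if the acknowledgement is lost mid-commit, the catch block cannot tell whether the transaction was durably applied.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class LostAckDemo {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details, for illustration only.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/shop", "shop", "secret")) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPDATE inventory SET stock = stock - 1 WHERE book_id = ?")) {
                ps.setLong(1, 42L);
                ps.executeUpdate();
            }
            try {
                conn.commit();
                // Ack received: the write definitely happened.
            } catch (SQLException e) {
                // Ack lost (e.g. the network dropped mid-commit): the server may
                // or may not have durably stored the transaction, and the client
                // cannot tell the difference from here.
            }
        }
    }
}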
29. PostgreSQL
• Let’s experiment!
• 5 clients write to a PostgreSQL instance
• We then drop the server from the network
• Results:
– 1000 writes
– 950 acknowledged
– 952 survivors
30. So what can we do?
1. Accept false-negatives
– May not be acceptable for your use case!
2. Use idempotent operations
3. Apply unique transaction IDs
– Query state after the partition is resolved (see the sketch below)
• These strategies apply to any RDBMS
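One possible shape for strategy 3, assuming a hypothetical orders table keyed by a client-generated transaction ID; once the partition resolves, the client asks the server whether the in-doubt write actually made it.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.UUID;

// Hypothetical schema: CREATE TABLE orders (txn_id text PRIMARY KEY, book_id bigint);
public class TaggedWrite {
    static String buyBook(Connection conn, long bookId) throws SQLException {
        conn.setAutoCommit(false);
        String txnId = UUID.randomUUID().toString();   // unique per logical operation
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO orders (txn_id, book_id) VALUES (?, ?)")) {
            ps.setString(1, txnId);
            ps.setLong(2, bookId);
            ps.executeUpdate();
        }
        conn.commit();                                  // the ack may be lost right here
        return txnId;
    }

    // Run on a fresh connection after the partition is resolved.
    static boolean wasApplied(Connection conn, String txnId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT 1 FROM orders WHERE txn_id = ?")) {
            ps.setString(1, txnId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();                       // true: the write survived
            }
        }
    }
}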
31. MongoDB
• A document-oriented database
• Availability/scale via replica sets
– Client writes to a master node
– Master replicates writes to n replicas
• User-selectable consistency guarantees
32. MongoDB
• When a partition occurs:
– If the master is in the minority, it is demoted
– The majority promotes a new master…
– … selected by the highest optime
33. MongoDB
• The cluster “heals” after partition resolution:
– The “old” master rejoins the cluster
– Acknowledged minority writes are reverted!
34. MongoDB
• Let’s experiment!
• Set up a 5-node MongoDB cluster
• 5 clients write to the cluster
• We then partition the cluster
• … and restore it to see what happens
35. MongoDB
• With write concern unacknowledged:
– Server does not ack writes (except TCP)
– The default prior to November 2012
• Results:
– 6000 writes
– 5700 acknowledged
– 3319 survivors
– 42% data loss!
36. MongoDB
• With write concern acknowledged:
– Server acknowledges writes (after store)
– The default guarantee
• Results:
– 6000 writes
– 5900 acknowledged
– 3692 survivors
– 37% data loss!
37. MongoDB
• With write concern replica acknowledged:
– Client specifies minimum replicas
– Server acks after writes to replicas
• Results:
– 6000 writes
– 5695 acknowledged
– 3768 survivors
– 33% data loss!
38. MongoDB
• With write concern majority:
– For an n-node cluster, requires acknowledgement from a majority (more than n/2) of nodes
– Also called “quorum”
• Results:
– 6000 writes
– 5700 acknowledged
– 5701 survivors
– No data loss
39. So what can we do?
1. Keep calm and carry on
– As Aphyr puts it, “not all applications need consistency”
– Have a reliable backup strategy
– … and make sure you drill restores!
2. Use write concern majority
– And take the performance hit
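For reference, opting into the majority write concern with the MongoDB Java driver looks roughly like this; the database, collection and document are made up, and driver APIs have shifted between versions, so treat it as a sketch.
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MajorityWriteDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders = client
                    .getDatabase("shop")
                    .getCollection("orders")
                    .withWriteConcern(WriteConcern.MAJORITY);  // wait for a quorum
            // insertOne only returns once a majority of replica set members
            // have acknowledged the write: better durability, higher latency.
            orders.insertOne(new Document("bookId", 42).append("qty", 1));
        }
    }
}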
40. The prime suspects
• Aphyr’s Jepsen tests include:
– Redis
– Riak
– Zookeeper
– Kafka
– Cassandra
– RabbitMQ
– etcd (and consul)
– ElasticSearch
• If you’re considering them, go read his posts
• In fact, go read his posts regardless
http://aphyr.com/tags/jepsen
42. Immutable Data
• Immutable (adj.): “Unchanging over time or unable to be changed.”
• Meaning:
– No deletes
– No updates
– No merge conflicts
– Replication is trivial
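As a toy illustration of the idea (all names hypothetical): a cart modelled as an append-only log of immutable events never needs an update or a delete, and the current state is derived by folding over the recorded facts.
import java.time.Instant;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class CartEventLog {
    // An immutable fact about the past; its fields never change once created.
    record CartEvent(String cartId, String bookId, int delta, Instant at) {}

    private final List<CartEvent> log = new CopyOnWriteArrayList<>();

    void append(CartEvent event) {
        log.add(event);                       // append-only: no update, no delete
    }

    // The current quantity is derived by summing the relevant events.
    long quantityOf(String cartId, String bookId) {
        return log.stream()
                .filter(e -> e.cartId().equals(cartId) && e.bookId().equals(bookId))
                .mapToLong(CartEvent::delta)
                .sum();
    }
}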
43. Idempotence
• An idempotent operation:
– Can be applied one or more times with the same effect
• Enables retries
• Not always possible
– Side-effects are key
– Consider: payments
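A toy contrast between a non-idempotent and an idempotent operation (store and method names are invented for illustration); only the second can be blindly retried after an ambiguous failure.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotenceDemo {
    private final Map<String, Integer> stock = new ConcurrentHashMap<>();

    // NOT idempotent: a retried request decrements the inventory twice.
    void decrementStock(String bookId) {
        stock.merge(bookId, -1, Integer::sum);
    }

    // Idempotent: setting an absolute value can be retried safely.
    void setStock(String bookId, int quantity) {
        stock.put(bookId, quantity);
    }

    public static void main(String[] args) {
        IdempotenceDemo demo = new IdempotenceDemo();
        demo.setStock("cap-book", 1);
        demo.decrementStock("cap-book");
        demo.decrementStock("cap-book");            // a retry: applied twice
        System.out.println(demo.stock);             // {cap-book=-1}
        demo.setStock("cap-book", 0);
        demo.setStock("cap-book", 0);               // a retry: no harm done
        System.out.println(demo.stock);             // {cap-book=0}
    }
}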
44. Eventual Consistency
• A design which prefers availability
• … but guarantees that clients will eventually see consistent reads
• Consider git:
– Always available locally
– Converges via push/pull
– Human conflict resolution
45. Eventual Consistency
• The system expects data to diverge
• … and includes mechanisms to regain convergence
– Partial ordering to minimize conflicts
– A merge function to resolve conflicts
46. Vector Clocks
• A technique for partial ordering
• Each node has a logical clock
– The clock increases on every write
– Track the last observed clocks for each item
– Include this vector on replication
• When neither the observed nor the inbound vector descends from the other, we have a conflict
• This lets us know when history diverged
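A minimal vector clock sketch (node identifiers and method names are illustrative, not any particular library's API), showing the comparison that detects diverged histories.
import java.util.HashMap;
import java.util.Map;

public class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    // Increment this node's logical clock on every local write.
    void tick(String nodeId) {
        counters.merge(nodeId, 1L, Long::sum);
    }

    long get(String nodeId) {
        return counters.getOrDefault(nodeId, 0L);
    }

    // True if this clock has seen at least everything the other clock has,
    // i.e. this[i] >= other[i] for every node i.
    boolean descendsFrom(VectorClock other) {
        for (Map.Entry<String, Long> e : other.counters.entrySet()) {
            if (get(e.getKey()) < e.getValue()) return false;
        }
        return true;
    }

    // Neither clock descends from the other: the histories diverged.
    boolean conflictsWith(VectorClock other) {
        return !this.descendsFrom(other) && !other.descendsFrom(this);
    }

    public static void main(String[] args) {
        VectorClock a = new VectorClock();
        VectorClock b = new VectorClock();
        a.tick("node1");                            // write seen only by node1
        b.tick("node2");                            // concurrent write seen only by node2
        System.out.println(a.conflictsWith(b));     // true: we have a conflict
    }
}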
47. CRDTs
• Commutative Replicated Data Types1
• A CRDT is a data structure that:
– Eventually converges to a consistent state
– Guarantees no conflicts on replication
1 “A comprehensive study of Convergent and Commutative Replicated Data Types”, Shapiro et al
48. CRDTs
• CRDTs provide specialized semantics:
– G-Counter: Monotonically increasing counter (see the sketch after this list)
– PN-Counter: Also supports decrements
– G-Set: A set that only supports adds
– 2P-Set: Supports removals, but each element can be removed only once
• OR-Sets are particularly useful
– Keeps track of both additions and removals
– Can be used for shopping carts
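Finally, a sketch of the G-Counter mentioned above (the API is illustrative): each node increments only its own slot, and merge takes the per-slot maximum, so replicas converge regardless of the order in which they exchange state.
import java.util.HashMap;
import java.util.Map;

public class GCounter {
    private final String nodeId;
    private final Map<String, Long> counts = new HashMap<>();

    GCounter(String nodeId) {
        this.nodeId = nodeId;
    }

    // Each node only ever increments its own slot.
    void increment() {
        counts.merge(nodeId, 1L, Long::sum);
    }

    // The counter's value is the sum over all slots.
    long value() {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }

    // Merge is commutative, associative and idempotent: take the max per slot.
    void merge(GCounter other) {
        other.counts.forEach((node, count) -> counts.merge(node, count, Math::max));
    }

    public static void main(String[] args) {
        GCounter a = new GCounter("node-a");
        GCounter b = new GCounter("node-b");
        a.increment();
        a.increment();                       // two increments observed by node-a
        b.increment();                       // one increment observed by node-b
        a.merge(b);
        b.merge(a);                          // replicate state in both directions
        System.out.println(a.value() + " " + b.value());   // 3 3: converged
    }
}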
50. WE’RE DONE HERE!
Thank you for listening
tomer@tomergabel.com
@tomerg
http://il.linkedin.com/in/tomergabel
Aphyr’s “Call Me Maybe” blog posts:
http://aphyr.com/tags/jepsen