Outwards 
from the middle of the maze 
Peter Alvaro 
UC Berkeley
Outline 
1. Mourning the death of transactions 
2. What is so hard about distributed systems? 
3. Distributed consistency: managing asynchrony 
4. Fault-tolerance: progress despite failures
The transaction concept 
DEBIT_CREDIT: 
  BEGIN_TRANSACTION; 
  GET MESSAGE; 
  EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; 
  FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; 
  IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN 
    PUT NEGATIVE RESPONSE; 
  ELSE DO; 
    ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; 
    POST HISTORY RECORD ON ACCOUNT (DELTA); 
    CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; 
    BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; 
    PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); 
  END; 
  COMMIT;
The “top-down” ethos
Transactions: a holistic contract 
Write 
Read 
Application 
Opaque 
store 
Transactions
Transactions: a holistic contract 
Write 
Read 
Application 
Opaque 
store 
Transactions 
Assert: 
balance > 0
Incidental complexities 
• The “Internet.” Searching it. 
• Cross-datacenter replication schemes 
• CAP Theorem 
• Dynamo & MapReduce 
• “Cloud”
Fundamental complexity 
“[…] distributed systems require that the 
programmer be aware of latency, have a different 
model of memory access, and take into account 
issues of concurrency and partial failure.” 
Jim Waldo et al., 
A Note on Distributed Computing (1994)
A holistic contract 
…stretched to the limit 
Write 
Read 
Application 
Opaque 
store 
Transactions
Are you blithely asserting 
that transactions aren’t webscale? 
Some people just want to see the world burn. 
Those same people want to see the world use inconsistent databases. 
- Emin Gun Sirer
Alternative to top-down design? 
The “bottom-up,” systems tradition: 
Simple, reusable components first. 
Semantics later.
Alternative: 
the “bottom-up,” systems ethos
The “bottom-up” ethos
The “bottom-up” ethos 
“‘Tis a fine barn, but sure ‘tis no castle, English”
The “bottom-up” ethos 
Simple, reusable components first. 
Semantics later. 
This is how we live now. 
Question: Do we ever get those 
application-level guarantees back?
Low-level contracts 
Write 
Read 
Application 
Distributed 
store KVS
Low-level contracts 
Write 
Read 
Application 
Distributed 
store KVS 
R1(X=1) 
R2(X=1) 
W1(X=2) 
W2(X=0) 
W1(X=1) 
W1(Y=2) 
R2(Y=2) 
R2(X=0)
Low-level contracts 
Write 
Read 
Application 
Distributed 
store KVS 
Assert: 
balance > 0 
R1(X=1) 
R2(X=1) 
W1(X=2) 
W2(X=0) 
W1(X=1) 
W1(Y=2) 
R2(Y=2) 
R2(X=0)
Low-level contracts 
Write 
Read 
Application 
Distributed 
store KVS 
Assert: 
balance > 0 
causal? 
PRAM? 
delta? 
fork/join? 
red/blue? 
Release? 
R1(X=1) 
R2(X=1) 
W1(X=2) 
W2(X=0) 
W1(X=1) 
W1(Y=2) 
R2(Y=2) 
R2(X=0)
When do contracts compose? 
Application 
Distributed 
service 
Assert: 
balance > 0
ew, did I get mongo in my riak? 
Assert: 
balance > 0
Composition is the last hard 
problem 
Composing modules is hard enough 
We must learn how to compose guarantees
Outline 
1. Mourning the death of transactions 
2. What is so hard about distributed systems? 
3. Distributed consistency: managing asynchrony 
4. Fault-tolerance: progress despite failures
Why distributed systems are hard² 
Asynchrony Partial Failure 
Fundamental Uncertainty
Asynchrony isn’t that hard 
Amelioration: 
Logical timestamps 
Deterministic interleaving
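To make the first amelioration concrete, here is a minimal sketch (mine, not from the talk) of a Lamport-style logical clock in Python; the class and method names are illustrative, not any particular system's API.

    class LamportClock:
        """Logical timestamps: order events without synchronized wall clocks."""
        def __init__(self):
            self.time = 0

        def tick(self):
            # Local event: advance the logical clock.
            self.time += 1
            return self.time

        def stamp_send(self):
            # Attach the current logical time to an outgoing message.
            return self.tick()

        def on_receive(self, msg_time):
            # Advance past the sender's timestamp so causality is preserved.
            self.time = max(self.time, msg_time) + 1
            return self.time

Sorting events by (timestamp, node id) then gives one deterministic interleaving that respects causal order.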
Partial failure isn’t that hard 
Amelioration: 
Replication 
Replay
(asynchrony * partial failure) = hard² 
Logical timestamps 
Deterministic interleaving 
Replication 
Replay
(asynchrony * partial failure) = hard² 
Tackling one clown at a time 
Poor strategy for programming distributed systems 
Winning strategy for analyzing distributed programs
Outline 
1. Mourning the death of transactions 
2. What is so hard about distributed systems? 
3. Distributed consistency: managing asynchrony 
4. Fault-tolerance: progress despite failures
Distributed consistency 
Today: A quick summary of some great work.
Consider a (distributed) graph 
[Figure: a graph of nodes T1–T14]
Partitioned, for scalability 
[Figure: the graph T1–T14 split across partitions]
Replicated, for availability 
[Figure: two replicas of the graph T1–T14]
Deadlock detection 
Task: Identify strongly-connected components 
Waits-for graph 
[Figure: waits-for graph over T1–T14]
Garbage collection 
Task: Identify nodes not reachable from Root. 
Refers-to graph 
[Figure: refers-to graph over T1–T14, rooted at Root]
Correctness 
Deadlock detection 
• Safety: No false positives 
• Liveness: Identify all deadlocks 
Garbage collection 
• Safety: Never GC live memory! 
• Liveness: GC all orphaned memory 
[Figure: the graph T1–T14 again]
Consistency at the extremes 
Application 
Language 
Custom solutions? 
Flow 
Object 
Storage 
Linearizable key-value store?
Consistency at the extremes 
Application 
Language 
Custom solutions? 
Flow 
Efficient / Correct 
Object 
Storage 
Linearizable key-value store?
Object-level consistency 
Capture semantics of data structures that 
• allow greater concurrency 
• maintain guarantees (e.g. convergence) 
Application 
Language 
Flow 
Object 
Storage
Object-level consistency 
Insert 
Read 
Convergent 
data structure 
(e.g., Set CRDT) 
Insert 
Read 
Commutativity 
Associativity 
Idempotence
Object-level consistency 
Insert / Read        Insert / Read 
Convergent data structure (e.g., Set CRDT) 
Commutativity, Associativity, Idempotence 
Tolerant to: reordering, batching, retry/duplication (see the sketch below)
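As a concrete illustration of the properties named above, here is a minimal sketch (assumed, not code from the talk) of a grow-only set CRDT in Python; merge is set union, which is commutative, associative, and idempotent, so replicas converge under reordering, batching, and retry/duplication.

    class GSet:
        """Grow-only set CRDT: state is a set, merge is union."""
        def __init__(self):
            self.elems = set()

        def insert(self, x):
            self.elems.add(x)

        def read(self):
            return frozenset(self.elems)

        def merge(self, other):
            # Union is commutative, associative, and idempotent, so the
            # order and multiplicity of merges does not matter.
            self.elems |= other.elems

    a, b = GSet(), GSet()
    a.insert(1); b.insert(2)
    a.merge(b); b.merge(a); b.merge(a)      # duplicated, out-of-order merges
    assert a.read() == b.read()             # replicas converge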
Object-level composition? 
Application 
Convergent 
data structures 
Assert: 
Graph replicas 
converge
Object-level composition? 
Application 
Convergent 
data structures 
GC Assert: 
No live nodes are reclaimed 
Assert: 
Graph replicas 
converge
Object-level composition? 
Application 
Convergent 
data structures 
GC Assert: 
No live nodes are reclaimed 
? 
? 
Assert: 
Graph replicas 
converge
Flow-level consistency 
Application 
Language 
Flow 
Object 
Storage
Flow-level consistency 
Capture semantics of data in motion 
• Asynchronous dataflow model 
• component properties → system-wide guarantees 
Graph 
store 
Transaction 
manager 
Transitive 
closure 
Deadlock 
detector
Flow-level consistency 
Order-insensitivity (confluence) 
output set = f(input set)
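A minimal sketch (assumed, not from the talk) of what order-insensitivity means operationally: a set-valued operator is confluent if every delivery order of the same inputs yields the same output set.

    from itertools import permutations

    def confluent_op(inputs):
        # A monotone, set-based operator: emit every even number seen so far.
        return {x for x in inputs if x % 2 == 0}

    stream = [3, 8, 5, 2, 8]
    outputs = {frozenset(confluent_op(order)) for order in permutations(stream)}
    assert len(outputs) == 1   # same output set for every arrival order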
Confluence is compositional 
output set = f ∘ g(input set)
Graph queries as dataflow 
Deadlock detection pipeline: Graph store, Transaction manager, Transitive closure, Deadlock detector (all confluent) 
Garbage collection pipeline: Graph store, Memory allocator, Transitive closure (confluent), Garbage collector (not confluent) 
Coordinate here: at the non-confluent garbage collector
Coordination: what is that? 
Strategy 1: Establish a total order 
[Dataflow figure: Graph store, Memory allocator, Transitive closure (confluent), Garbage collector (not confluent); coordinate here, at the non-confluent step]
Coordination: what is that? 
Strategy 2: Establish a producer-consumer barrier 
[Dataflow figure: the same garbage-collection pipeline, with a barrier at the non-confluent step]
Fundamental costs: FT via replication 
(mostly) free! 
[Figure: two replicas of the confluent deadlock-detection dataflow (Graph store, Transaction manager, Transitive closure, Deadlock detector)]
Fundamental costs: FT via replication 
global synchronization! 
[Figure: two replicas of the garbage-collection dataflow; the non-confluent Garbage Collectors are coordinated across replicas with Paxos]
Fundamental costs: FT via replication 
“The first principle of successful scalability is to batter the consistency mechanisms down to a minimum.” – James Hamilton 
[Figure: the replicated garbage-collection dataflow again, with barriers in front of the non-confluent Garbage Collectors]
Language-level consistency 
DSLs for distributed programming? 
• Capture consistency concerns in the 
type system 
Application 
Language 
Flow 
Object 
Storage
Language-level consistency 
CALM Theorem: 
Monotonic à confluent 
Conservative, syntactic test for confluence
Language-level consistency 
Deadlock detector 
Garbage collector
Language-level consistency 
Deadlock detector 
Garbage collector 
nonmonotonic
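A minimal sketch (assumed) of why the two queries differ: deadlock detection (transitive closure) is monotonic, since new edges can only add cycles, while garbage collection asks about non-reachability, so a new edge can retract an earlier "garbage" answer.

    def reachable(edges, root):
        # Transitive reachability: a monotonic, confluent computation.
        seen, frontier = {root}, [root]
        while frontier:
            n = frontier.pop()
            for a, b in edges:
                if a == n and b not in seen:
                    seen.add(b)
                    frontier.append(b)
        return seen

    edges = {("Root", "T1")}
    assert "T2" not in reachable(edges, "Root")   # T2 looks like garbage...
    edges.add(("T1", "T2"))                       # ...until another edge arrives
    assert "T2" in reachable(edges, "Root")       # the earlier GC decision was wrong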
Let’s review 
• Consistency is tolerance to asynchrony 
• Tricks: 
– focus on data in motion, not at rest 
– avoid coordination when possible 
– choose coordination carefully otherwise 
(Tricks are great, but tools are better)
Outline 
1. Mourning the death of transactions 
2. What is so hard about distributed systems? 
3. Distributed consistency: managing asynchrony 
4. Fault-tolerance: progress despite failures
Grand challenge: composition 
Hard problem: 
Is a given component fault-tolerant? 
Much harder: 
Is this system (built up from components) 
fault-tolerant?
Example: Atomic multi-partition update 
[Figure: the partitioned graph T1–T14, updated across partitions] 
Two-phase commit
Example: replication 
[Figure: two replicas of the graph T1–T14] 
Reliable broadcast
Popular wisdom: don’t reinvent
Example: Kafka replication bug 
Three “correct” components: 
1. Primary/backup replication 
2. Timeout-based failure detectors 
3. Zookeeper 
One nasty bug: 
Acknowledged writes are lost
A guarantee would be nice 
Bottom-up approach: 
• Use formal methods to verify individual components (e.g. protocols) 
• Build systems from verified components 
Shortcomings: 
• Hard to use 
• Hard to compose 
[Figure: investment vs. returns]
Bottom-up assurances 
Formal verification 
Environment 
Program 
Correctness 
Spec
Composing bottom-up 
assurances 
Issue 1: incompatible failure models 
(e.g., crash failures vs. omission failures) 
Issue 2: Specs do not compose 
(FT is an end-to-end property) 
If you take 10 components off the shelf, you are putting 10 world views 
together, and the result will be a mess. -- Butler Lampson
Top-down “assurances” 
Testing 
Fault injection
End-to-end testing would be nice 
Top-down approach: 
• Build a large-scale system 
• Test the system under faults 
Shortcomings: 
• Hard to identify complex bugs 
• Fundamentally incomplete 
[Figure: investment vs. returns]
Lineage-driven fault injection 
Goal: top-down testing that 
• finds all of the fault-tolerance bugs, or 
• certifies that none exist
Lineage-driven fault injection 
Correctness 
Specification 
Malevolent 
sentience 
Molly
Lineage-driven fault injection 
(LDFI) 
Approach: think backwards from outcomes 
Question: could a bad thing ever happen? 
Reframe: 
• Why did a good thing happen? 
• What could have gone wrong along the way?
Thomasina: What a faint-heart! We must 
work outward from the middle of the 
maze. We will start with something simple.
The game 
• Both players agree on a failure model 
• The programmer provides a protocol 
• The adversary observes executions and 
chooses failures for the next execution.
Dedalus: it’s about data 
log(B, “data”)@5 
What: the log relation    Where: node B    Some data: “data”    When: timestamp 5
Dedalus: it’s like Datalog 
consequence :- premise[s] 
log(Node, Pload) :- bcast(Node, Pload); 
(Which is like SQL) 
create view log as 
select Node, Pload from bcast;
Dedalus: it’s about time 
consequence@when :- premise[s] 
node(Node, Neighbor)@next :- node(Node, Neighbor);        (state change) 
log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);        (communication, via a natural join: bcast.Node1 == node.Node1)
The match 
Protocol: 
Reliable broadcast 
Specification: 
Pre: A correct process delivers a message m 
Post: All correct processes deliver m 
Failure Model: 
(Permanent) crash failures 
Message loss / partitions
Round 1 
node(Node, Neighbor)@next :- node(Node, Neighbor); 
log(Node, Pload)@next :- log(Node, Pload); 
log(Node, Pload) :- bcast(Node, Pload); 
log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); 
“An effort” delivery protocol
Round 1 in space / time 
[Space/time diagram: a broadcasts at time 1; b and c log at time 2]
Round 1: Lineage 
log(B, data)@5 ← log(B, data)@4 ← log(B, data)@3 ← log(B, data)@2 ← log(A, data)@1 
Persistence step: log(Node, Pload)@next :- log(Node, Pload);   e.g. log(B, data)@5 :- log(B, data)@4; 
Communication step: log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);   e.g. log(B, data)@2 :- bcast(A, data)@1, node(A, B)@1;
An execution is a (fragile) “proof” of an outcome 
[Proof-tree figure: log(B, data)@5 derived from log(A, data)@1 and node(A, B)@1, via the persistence and communication rules] 
(which required a message from A to B at time 1)
Valentine: “The unpredictable and the 
predetermined unfold together to make 
everything the way it is.”
Round 1: counterexample 
[Space/time diagram: the message from a to b is lost; only c logs] 
The adversary wins!
Round 2 
Same as Round 1, but A retries. 
bcast(N, P)@next :- bcast(N, P);
Round 2 in spacetime 
[Space/time diagram: a rebroadcasts at every timestep; b and c log at every step]
Round 2: Lineage 
Retry provides redundancy in time: log(B, data)@5 now has a derivation through A’s copy of the log at every earlier timestep, each one a chance for the rebroadcast to be delivered, all tracing back to log(A, data)@1. 
log(Node, Pload)@next :- log(Node, Pload);   e.g. log(B, data)@5 :- log(B, data)@4; 
log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);   e.g. log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;
Traces are forests of proof trees 
[Figure: four proof trees for log(B, data)@5, one relying on each of the messages AB1, AB2, AB3, AB4] 
AB1 ∧ AB2 ∧ AB3 ∧ AB4
Round 2: counterexample 
[Space/time diagram: the message from a to b is lost and a crashes at time 2, before retrying; only c logs] 
The adversary wins!
Round 3 
Same as in Round 2, but symmetrical. 
bcast(N, P)@next :- log(N, P);
Round 3 in space / time 
[Space/time diagram: every process that has logged the message rebroadcasts it at every timestep] 
Redundancy in space and time
Round 3 -- lineage 
log(B, data)@5 is now supported by copies of the log at A, B, and C at every earlier timestep, all tracing back to log(A, data)@1: redundancy in space as well as time.
Round 3 
The programmer wins!
Let’s reflect 
Fault-tolerance is redundancy in space and 
time. 
Best strategy for both players: reason 
backwards from outcomes using lineage 
Finding bugs: find a set of failures that 
“breaks” all derivations 
Fixing bugs: add additional derivations
The role of the adversary 
can be automated 
1. Break a proof by dropping any contributing message: a disjunction, e.g. (AB1 ∨ BC2) 
2. Find a set of failures that breaks all proofs of a good outcome: a conjunction of disjunctions (AKA CNF), e.g. (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2) 
(a brute-force version is sketched below)
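The clause set above can be attacked mechanically. Below is a minimal brute-force sketch in Python (Molly hands the CNF to a solver; this just illustrates the idea): each proof contributes a clause saying "drop at least one of my messages", and a counterexample candidate is a set of drops that satisfies every clause.

    from itertools import combinations

    # Each proof of the good outcome, as the set of messages it depends on.
    proofs = [{"AB1", "BC2"}, {"AC1"}, {"AC2"}]     # (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)

    def candidate_failures(proofs, max_drops):
        msgs = sorted(set().union(*proofs))
        for k in range(1, max_drops + 1):
            for drops in combinations(msgs, k):
                if all(p & set(drops) for p in proofs):   # every proof loses a support
                    yield set(drops)

    print(next(candidate_failures(proofs, 3)))   # e.g. {'AB1', 'AC1', 'AC2'}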
Molly, the LDFI prototype 
Molly finds fault-tolerance violations quickly 
or guarantees that none exist. 
Molly finds bugs by explaining good 
outcomes – then it explains the bugs. 
Bugs identified: 2pc, 2pc-ctp, 3pc, Kafka 
Certified correct: paxos (synod), Flux, bully 
leader election, reliable broadcast
Commit protocols 
Problem: 
Atomically change things 
Correctness properties: 
1. Agreement (All or nothing) 
2. Termination (Something)
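For reference, a minimal sketch (assumed, not the talk's Dedalus) of the coordinator side of two-phase commit in Python; send and recv are hypothetical transport callbacks. It gets agreement, but if the coordinator stops between collecting votes and broadcasting the decision, the agents block, which is the termination violation explored next.

    def two_phase_commit(agents, send, recv):
        # Phase 1: solicit votes.
        for a in agents:
            send(a, "prepare")
        votes = [recv(a) for a in agents]            # blocks until every agent answers
        # Phase 2: broadcast the decision.
        decision = "commit" if all(v == "yes" for v in votes) else "abort"
        for a in agents:
            send(a, decision)
        return decision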
Two-phase commit 
[Space/time diagram: the coordinator sends prepare to agents a, b, and d; the agents reply with votes; the coordinator sends commit]
Two-phase commit 
[The same space/time diagram, annotated: prepare (“Can I kick it?”), vote (“YES YOU CAN”), commit (“Well I’m gone”)]
Two-phase commit 
[Space/time diagram: the coordinator crashes after collecting votes, before sending the decision; the agents block] 
Violation: Termination
The collaborative termination protocol 
Basic idea: 
Agents talk amongst themselves when the coordinator fails. 
Protocol: On timeout, ask other agents about decision.
2PC - CTP 
[Space/time diagram: the coordinator crashes after collecting votes (“Can I kick it?” “YES YOU CAN” “……?”); the agents time out and exchange decision_req messages with each other]
3PC 
Basic idea: 
Add a round, a state, and simple failure detectors (timeouts). 
Protocol: 
1. Phase 1: Just like in 2PC 
– Agent timeout → abort 
2. Phase 2: send canCommit, collect acks 
– Agent timeout → commit 
3. Phase 3: Just like phase 2 of 2PC
3PC 
[Space/time diagram: coordinator sends cancommit; agents reply vote_msg; coordinator sends precommit; agents ack; coordinator sends commit. 
Agent timeout while waiting for precommit → abort; agent timeout while waiting for commit → commit]
Network partitions make 3pc act crazy 
[Space/time diagram, annotated in sequence: 
Agent crash (agent d). 
Agents learn the commit decision (via precommit). 
d is dead; the coordinator decides to abort. 
Brief network partition: the abort messages are lost. 
Agents A & B decide to commit.]
Kafka durability bug 
[Space/time diagram over replicas a, b, c, Zookeeper, and a client, annotated in sequence: 
Brief network partition. 
a becomes leader and sole replica. 
a ACKs client write. 
A crash follows: data loss.]
Molly summary 
Lineage allows us to reason backwards 
from good outcomes 
Molly: surgically-targeted fault injection 
Investment similar to testing 
Returns similar to formal methods
Where we’ve been; where we’re headed 
1. Mourning the death of transactions 
2. What is so hard about distributed systems? 
3. Distributed consistency: managing asynchrony 
4. Fault-tolerance: progress despite failures
Where we’ve been; where we’re headed 
1. We need application-level guarantees 
2. What is so hard about distributed systems? 
3. Distributed consistency: managing asynchrony 
4. Fault-tolerance: progress despite failures
Where we’ve been; where we’re headed 
1. We need application-level guarantees 
2. (asynchrony X partial failure) = too hard to 
hide! We need tools to manage it. 
3. Distributed consistency: managing asynchrony 
4. Fault-tolerance: progress despite failures
Where we’ve been; where we’re headed 
1. We need application-level guarantees 
2. asynchrony X partial failure = too hard to hide! 
We need tools to manage it. 
3. Distributed consistency: managing asynchrony 
4. Fault-tolerance: progress despite failures
Where we’ve been; where we’re headed 
1. We need application-level guarantees 
2. asynchrony X partial failure = too hard to hide! 
We need tools to manage it. 
3. Focus on flow: data in motion 
4. Fault-tolerance: progress despite failures
Outline 
1. We need application-level guarantees 
2. asynchrony X partial failure = too hard to hide! 
We need tools to manage it. 
3. Focus on flow: data in motion 
4. Fault-tolerance: progress despite failures
Outline 
1. We need application-level guarantees 
2. asynchrony X partial failure = too hard to hide! 
We need tools to manage it. 
3. Focus on flow: data in motion 
4. Backwards from outcomes
Remember 
1. We need application-level guarantees 
2. asynchrony X partial failure = too hard to hide! We 
need tools to manage it. 
3. Focus on flow: data in motion 
4. Backwards from outcomes 
Composition is the hardest problem
A happy crisis 
Valentine: “It makes me so happy. To be at 
the beginning again, knowing almost 
nothing.... It's the best possible time of 
being alive, when almost everything you 
thought you knew is wrong.”

Contenu connexe

Similaire à RICON keynote: outwards from the middle of the maze

The Reactive Principles: Design Principles For Cloud Native Applications
The Reactive Principles: Design Principles For Cloud Native ApplicationsThe Reactive Principles: Design Principles For Cloud Native Applications
The Reactive Principles: Design Principles For Cloud Native ApplicationsJonas Bonér
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
The Reactive Principles: Eight Tenets For Building Cloud Native Applications
The Reactive Principles: Eight Tenets For Building Cloud Native ApplicationsThe Reactive Principles: Eight Tenets For Building Cloud Native Applications
The Reactive Principles: Eight Tenets For Building Cloud Native ApplicationsLightbend
 
MongoDB World 2019: Distributed Transactions: With Great Power Comes Great Re...
MongoDB World 2019: Distributed Transactions: With Great Power Comes Great Re...MongoDB World 2019: Distributed Transactions: With Great Power Comes Great Re...
MongoDB World 2019: Distributed Transactions: With Great Power Comes Great Re...MongoDB
 
Forward Chaining in HALO
Forward Chaining in HALOForward Chaining in HALO
Forward Chaining in HALOESUG
 
ArchSummit Shenzhen - Using sagas to maintain data consistency in a microserv...
ArchSummit Shenzhen - Using sagas to maintain data consistency in a microserv...ArchSummit Shenzhen - Using sagas to maintain data consistency in a microserv...
ArchSummit Shenzhen - Using sagas to maintain data consistency in a microserv...Chris Richardson
 
Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Building & Operating High-Fidelity Data Streams - QCon Plus 2021Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Building & Operating High-Fidelity Data Streams - QCon Plus 2021Sid Anand
 
The free lunch is over
The free lunch is overThe free lunch is over
The free lunch is overThadeu Russo
 
Cassandra Tutorial
Cassandra TutorialCassandra Tutorial
Cassandra Tutorialmubarakss
 
The Art Of Readable Code
The Art Of Readable CodeThe Art Of Readable Code
The Art Of Readable CodeBaidu, Inc.
 
Vertica the convertro way
Vertica   the convertro wayVertica   the convertro way
Vertica the convertro wayZvika Gutkin
 
Full Consistency Lag and its Applications
Full Consistency Lag and its ApplicationsFull Consistency Lag and its Applications
Full Consistency Lag and its ApplicationsCassandra Austin
 
Chronicle accelerate building a digital currency
Chronicle accelerate   building a digital currencyChronicle accelerate   building a digital currency
Chronicle accelerate building a digital currencyPeter Lawrey
 
Return of the transaction king
Return of the transaction kingReturn of the transaction king
Return of the transaction kingRyan Knight
 
Embrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with RippleEmbrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with RippleSean Cribbs
 
Attacking and Exploiting Ethereum Smart Contracts: Auditing 101
Attacking and Exploiting Ethereum Smart Contracts: Auditing 101Attacking and Exploiting Ethereum Smart Contracts: Auditing 101
Attacking and Exploiting Ethereum Smart Contracts: Auditing 101Simone Onofri
 
Nelson: Rigorous Deployment for a Functional World
Nelson: Rigorous Deployment for a Functional WorldNelson: Rigorous Deployment for a Functional World
Nelson: Rigorous Deployment for a Functional WorldTimothy Perrett
 
The End of a Myth: Ultra-Scalable Transactional Management
The End of a Myth: Ultra-Scalable Transactional ManagementThe End of a Myth: Ultra-Scalable Transactional Management
The End of a Myth: Ultra-Scalable Transactional ManagementRicardo Jimenez-Peris
 
Event Sourcing & CQRS: Down the rabbit hole
Event Sourcing & CQRS: Down the rabbit holeEvent Sourcing & CQRS: Down the rabbit hole
Event Sourcing & CQRS: Down the rabbit holeJulian May
 

Similaire à RICON keynote: outwards from the middle of the maze (20)

The Reactive Principles: Design Principles For Cloud Native Applications
The Reactive Principles: Design Principles For Cloud Native ApplicationsThe Reactive Principles: Design Principles For Cloud Native Applications
The Reactive Principles: Design Principles For Cloud Native Applications
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
The Reactive Principles: Eight Tenets For Building Cloud Native Applications
The Reactive Principles: Eight Tenets For Building Cloud Native ApplicationsThe Reactive Principles: Eight Tenets For Building Cloud Native Applications
The Reactive Principles: Eight Tenets For Building Cloud Native Applications
 
MongoDB World 2019: Distributed Transactions: With Great Power Comes Great Re...
MongoDB World 2019: Distributed Transactions: With Great Power Comes Great Re...MongoDB World 2019: Distributed Transactions: With Great Power Comes Great Re...
MongoDB World 2019: Distributed Transactions: With Great Power Comes Great Re...
 
Forward Chaining in HALO
Forward Chaining in HALOForward Chaining in HALO
Forward Chaining in HALO
 
Bloom plseminar-sp15
Bloom plseminar-sp15Bloom plseminar-sp15
Bloom plseminar-sp15
 
ArchSummit Shenzhen - Using sagas to maintain data consistency in a microserv...
ArchSummit Shenzhen - Using sagas to maintain data consistency in a microserv...ArchSummit Shenzhen - Using sagas to maintain data consistency in a microserv...
ArchSummit Shenzhen - Using sagas to maintain data consistency in a microserv...
 
Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Building & Operating High-Fidelity Data Streams - QCon Plus 2021Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Building & Operating High-Fidelity Data Streams - QCon Plus 2021
 
The free lunch is over
The free lunch is overThe free lunch is over
The free lunch is over
 
Cassandra Tutorial
Cassandra TutorialCassandra Tutorial
Cassandra Tutorial
 
The Art Of Readable Code
The Art Of Readable CodeThe Art Of Readable Code
The Art Of Readable Code
 
Vertica the convertro way
Vertica   the convertro wayVertica   the convertro way
Vertica the convertro way
 
Full Consistency Lag and its Applications
Full Consistency Lag and its ApplicationsFull Consistency Lag and its Applications
Full Consistency Lag and its Applications
 
Chronicle accelerate building a digital currency
Chronicle accelerate   building a digital currencyChronicle accelerate   building a digital currency
Chronicle accelerate building a digital currency
 
Return of the transaction king
Return of the transaction kingReturn of the transaction king
Return of the transaction king
 
Embrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with RippleEmbrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with Ripple
 
Attacking and Exploiting Ethereum Smart Contracts: Auditing 101
Attacking and Exploiting Ethereum Smart Contracts: Auditing 101Attacking and Exploiting Ethereum Smart Contracts: Auditing 101
Attacking and Exploiting Ethereum Smart Contracts: Auditing 101
 
Nelson: Rigorous Deployment for a Functional World
Nelson: Rigorous Deployment for a Functional WorldNelson: Rigorous Deployment for a Functional World
Nelson: Rigorous Deployment for a Functional World
 
The End of a Myth: Ultra-Scalable Transactional Management
The End of a Myth: Ultra-Scalable Transactional ManagementThe End of a Myth: Ultra-Scalable Transactional Management
The End of a Myth: Ultra-Scalable Transactional Management
 
Event Sourcing & CQRS: Down the rabbit hole
Event Sourcing & CQRS: Down the rabbit holeEvent Sourcing & CQRS: Down the rabbit hole
Event Sourcing & CQRS: Down the rabbit hole
 

Dernier

Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxuniversity
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naJASISJULIANOELYNV
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationColumbia Weather Systems
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》rnrncn29
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024AyushiRastogi48
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensorsonawaneprad
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxNandakishor Bhaurao Deshmukh
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxJorenAcuavera1
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentationtahreemzahra82
 
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXDole Philippines School
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxmaryFF1
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPirithiRaju
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptxpallavirawat456
 
Servosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicServosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicAditi Jain
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalMAESTRELLAMesa2
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsSérgio Sacani
 

Dernier (20)

Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by na
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather Station
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensor
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptx
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentation
 
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptx
 
Servosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicServosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by Petrovic
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and Vertical
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive stars
 

RICON keynote: outwards from the middle of the maze

  • 1. Outwards from the middle of the maze Peter Alvaro UC Berkeley
  • 2. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 3. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  • 4. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  • 5. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  • 6. The transaction concept DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;
  • 13. Transactions: a holistic contract Write Read Application Opaque store Transactions
  • 14. Transactions: a holistic contract Write Read Application Opaque store Transactions Assert: balance > 0
  • 15. Transactions: a holistic contract Assert: balance > 0 Write Read Application Opaque store Transactions
  • 16. Transactions: a holistic contract Write Read Application Opaque store Transactions Assert: balance > 0
  • 17. Transactions: a holistic contract Write Read Application Opaque store Transactions Assert: balance > 0
  • 18. Incidental complexities • The “Internet.” Searching it. • Cross-datacenter replication schemes • CAP Theorem • Dynamo & MapReduce • “Cloud”
  • 19. Fundamental complexity “[…] distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure.” Jim Waldo et al., A Note on Distributed Computing (1994)
  • 20. A holistic contract …stretched to the limit Write Read Application Opaque store Transactions
  • 21. A holistic contract …stretched to the limit Write Read Application Opaque store Transactions
  • 22. Are you blithely asserting that transactions aren’t webscale? Some people just want to see the world burn. Those same people want to see the world use inconsistent databases. - Emin Gun Sirer
  • 23. Alternative to top-down design? The “bottom-up,” systems tradition: Simple, reusable components first. Semantics later.
  • 31. The “bottom-up” ethos “‘Tis a fine barn, but sure ‘tis no castle, English”
  • 32. The “bottom-up” ethos Simple, reusable components first. Semantics later. This is how we live now. Question: Do we ever get those application-level guarantees back?
  • 33. Low-level contracts Write Read Application Distributed store KVS
  • 34. Low-level contracts Write Read Application Distributed store KVS
  • 35. Low-level contracts Write Read Application Distributed store KVS R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  • 36. Low-level contracts Write Read Application Distributed store KVS Assert: balance > 0 R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  • 37. Low-level contracts Write Read Application Distributed store KVS Assert: balance > 0 causal? PRAM? delta? fork/join? red/blue? Release? R1(X=1) R2(X=1) W1(X=2) W2(X=0) W1(X=1) W1(Y=2) R2(Y=2) R2(X=0)
  • 38. When do contracts compose? Application Distributed service Assert: balance > 0
  • 39. iw, did I get mongo in my riak? Assert: balance > 0
  • 40. Composition is the last hard problem Composing modules is hard enough We must learn how to compose guarantees
  • 41. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 42. Why distributed systems are hard2 Asynchrony Partial Failure Fundamental Uncertainty
  • 43. Asynchrony isn’t that hard Ameloriation: Logical timestamps Deterministic interleaving
  • 44. Partial failure isn’t that hard Ameloriation: Replication Replay
  • 45. (asynchrony * partial failure) = hard2 Logical timestamps Deterministic interleaving Replication Replay
  • 46. (asynchrony * partial failure) = hard2 Logical timestamps Deterministic interleaving Replication Replay
  • 47. (asynchrony * partial failure) = hard2 Tackling one clown at a time Poor strategy for programming distributed systems Winning strategy for analyzing distributed programs
  • 48. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 49. Distributed consistency Today: A quick summary of some great work.
  • 50. Consider a (distributed) graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 51. Partitioned, for scalability T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 52. Replicated, for availability T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 53. Deadlock detection Task: Identify strongly-connected components Waits-for graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 54. Garbage collection Task: Identify nodes not reachable from Root. Root Refers-to graph T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14
  • 55. T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Correctness Deadlock detection • Safety: No false positives • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory
  • 56. T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Correctness Deadlock detection • Safety: No false positives- • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory
  • 57. Correctness Deadlock detection • Safety: No false positives • Liveness: Identify all deadlocks Garbage collection • Safety: Never GC live memory! • Liveness: GC all orphaned memory T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Root
  • 58. Consistency at the extremes Application Language Custom s olutions? Flow Object Storage Linearizable key-value store?
  • 59. Consistency at the extremes Application Language Custom s olutions? Flow Object Storage Linearizable key-value store?
  • 60. Consistency at the extremes Application Language Custom s olutions? Flow Efficient Object Correct Storage Linearizable key-value store?
  • 61. Object-level consistency Capture semantics of data structures that • allow greater concurrency • maintain guarantees (e.g. convergence) Application Language Flow Object Storage
  • 63. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence
  • 64. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence
  • 65. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence
  • 66. Object-level consistency Insert Read Convergent data structure (e.g., Set CRDT) Insert Read Commutativity Associativity Idempotence Reordering Batching Retry/duplication Tolerant to
  • 67. Object-level composition? Application Convergent data structures Assert: Graph replicas converge
  • 68. Object-level composition? Application Convergent data structures GC Assert: No live nodes are reclaimed Assert: Graph replicas converge
  • 69. Object-level composition? Application Convergent data structures GC Assert: No live nodes are reclaimed ? ? Assert: Graph replicas converge
  • 70. Flow-level consistency Application Language Flow Object Storage
  • 71. Flow-level consistency Capture semantics of data in motion • Asynchronous dataflow model • component properties à system-wide guarantees Graph store Transaction manager Transitive closure Deadlock detector
  • 72. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  • 73. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  • 74. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  • 75. Flow-level consistency Order-insensitivity (confluence) output set = f(input set)
  • 76. Flow-level consistency Order-insensitivity (confluence) output set = f(input set) =
  • 77. Flow-level consistency Order-insensitivity (confluence) output set = f(input set) { } = { }
  • 78. Confluence is compositional output set = (f ∘ g)(input set)
  • 79. Confluence is compositional output set = (f ∘ g)(input set)
  • 80. Confluence is compositional output set = (f ∘ g)(input set)
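To make "output set = f(input set)" and its composition concrete, here is a hedged illustration in hypothetical Python (not the talk's Dedalus; the function names are made up): two confluent, set-valued operators from the deadlock example, composed, produce the same output under every delivery order of the same inputs.

    # Hypothetical sketch: a confluent component maps a *set* of inputs to a
    # set of outputs, so delivery order cannot change what it produces.
    from itertools import permutations

    def transitive_closure(edges):
        """Confluent: the output depends only on the set of edges received."""
        closure = set(edges)
        changed = True
        while changed:
            changed = False
            for (a, b) in list(closure):
                for (c, d) in list(closure):
                    if b == c and (a, d) not in closure:
                        closure.add((a, d))
                        changed = True
        return closure

    def deadlocked(closure):
        """Also confluent: nodes that sit on a cycle of the waits-for closure."""
        return {a for (a, b) in closure if (b, a) in closure}

    edges = [("T1", "T2"), ("T2", "T3"), ("T3", "T1"), ("T4", "T1")]

    # Every arrival order of the same input set gives the same output set, and
    # the composition (deadlocked after transitive_closure) is confluent too.
    outputs = {frozenset(deadlocked(transitive_closure(p)))
               for p in permutations(edges)}
    assert len(outputs) == 1   # one answer: T1, T2, T3 are deadlocked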
  • 81. Graph queries as dataflow Graph store Memory allocator Transitive closure Garbage collector Confluent Not Confluent Confluent Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent
  • 82. Graph queries as dataflow Graph store Memory allocator Confluent Transitive closure Garbage collector Confluent Not Confluent Confluent Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent Coordinate here
  • 83. Coordination: what is that? Strategy 1: Establish a total order Graph store Memory allocator Coordinate here Transitive closure Garbage collector Confluent Not Confluent Confluent
  • 84. Coordination: what is that? Strategy 2: Establish a producer-consumer barrier Graph store Memory allocator Coordinate here Transitive closure Garbage collector Confluent Not Confluent Confluent
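A hedged sketch, again in hypothetical Python (SealedStream and collect_garbage are invented names), of the cheaper strategy 2: instead of totally ordering every message, the allocator seals its stream, and the non-confluent garbage collector waits at that barrier before it reads.

    # Hypothetical sketch of strategy 2: a producer/consumer barrier ("sealing").
    # The garbage collector is not confluent -- it computes "not reachable" --
    # so it must not run until the allocator has sealed its stream of refs.

    class SealedStream:
        def __init__(self):
            self.items, self.sealed = set(), False

        def produce(self, item):
            assert not self.sealed, "no new refs after the seal"
            self.items.add(item)

        def seal(self):
            self.sealed = True

    def collect_garbage(nodes, refs, root="Root"):
        assert refs.sealed, "barrier: wait until the refers-to graph is complete"
        live, frontier = {root}, [root]
        while frontier:
            n = frontier.pop()
            for (a, b) in refs.items:
                if a == n and b not in live:
                    live.add(b)
                    frontier.append(b)
        return nodes - live   # safe only because no more refs can arrive

    refs = SealedStream()
    for edge in [("Root", "T1"), ("T1", "T2")]:
        refs.produce(edge)
    refs.seal()
    print(collect_garbage({"Root", "T1", "T2", "T3"}, refs))   # {'T3'}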
  • 85. Fundamental costs: FT via replication (mostly) free! Graph store Transaction manager Transitive closure Deadlock detector Confluent Confluent Confluent Graph store Transitive closure Deadlock detector Confluent Confluent Confluent
  • 86. Fundamental costs: FT via replication global synchronization! Graph store Transaction manager Transitive closure Garbage Collector Confluent Confluent Graph store Transitive closure Garbage Collector Confluent Not Confluent Confluent Paxos Not Confluent
  • 87. Fundamental costs: FT via replication The first principle of successful scalability is to batter the consistency mechanisms down to a minimum. – James Hamilton Garbage Collector Graph store Transaction manager Transitive closure Garbage Collector Confluent Confluent Graph store Transitive closure Confluent Not Confluent Confluent Barrier Not Confluent Barrier
  • 88. Language-level consistency DSLs for distributed programming? • Capture consistency concerns in the type system Application Language Flow Object Storage
  • 89. Language-level consistency CALM Theorem: Monotonic → confluent Conservative, syntactic test for confluence
  • 90. Language-level consistency Deadlock detector Garbage collector
  • 91. Language-level consistency Deadlock detector Garbage collector nonmonotonic
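The CALM test itself is syntactic, over Dedalus/Bloom programs; the hypothetical Python below (all names invented) only illustrates the intuition behind the labels on this slide: the deadlock detector is monotonic — more input can only add deadlocks — while the garbage collector is nonmonotonic — a later input can retract something it already called garbage, so it needs to know its input is complete, i.e. coordination.

    # Hypothetical sketch of the CALM intuition (the real test is syntactic).

    def closure(edges):
        c = set(edges)
        while True:
            new = {(a, d) for (a, b) in c for (b2, d) in c if b == b2} - c
            if not new:
                return c
            c |= new

    def deadlocks(edges):                        # monotonic
        c = closure(edges)
        return {a for (a, b) in c if (b, a) in c}

    def garbage(nodes, edges, root="Root"):      # nonmonotonic: uses "not reachable"
        reachable = {root} | {b for (a, b) in closure(edges) if a == root}
        return nodes - reachable

    e1 = {("T1", "T2"), ("T2", "T1")}
    e2 = e1 | {("Root", "T3")}
    nodes = {"Root", "T1", "T2", "T3"}

    assert deadlocks(e1) <= deadlocks(e2)               # output only ever grows
    assert not garbage(nodes, e1) <= garbage(nodes, e2) # T3 was retracted later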
  • 92. Let’s review • Consistency is tolerance to asynchrony • Tricks: – focus on data in motion, not at rest – avoid coordination when possible – choose coordination carefully otherwise (Tricks are great, but tools are better)
  • 93. Outline 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 94. Grand challenge: composition Hard problem: Is a given component fault-tolerant? Much harder: Is this system (built up from components) fault-tolerant?
  • 95. Example: Atomic multi-partition update T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Two-phase commit
  • 96. Example: replication T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 T1 T2 T4 T3 T10 T6 T5 T9 T7 T11 T8 T12 T13 T14 Reliable broadcast
  • 98. Example: Kafka replication bug Three “correct” components: 1. Primary/backup replication 2. Timeout-based failure detectors 3. Zookeeper One nasty bug: Acknowledged writes are lost
  • 99. A guarantee would be nice Bottom up approach: • use formal methods to verify individual components (e.g. protocols) • Build systems from verified components Shortcomings: • Hard to use • Hard to compose Investment Returns
  • 100. Bottom-up assurances Formal verification Environment Program Correctness Spec
  • 102. Composing bottom-up assurances Issue 1: incompatible failure models e.g., crash failure vs. omissions Issue 2: Specs do not compose (FT is an end-to-end property) If you take 10 components off the shelf, you are putting 10 world views together, and the result will be a mess. -- Butler Lampson
  • 108. Top-down “assurances” Fault injection Testing
  • 109. Top-down “assurances” Fault injection Testing
  • 110. End-to-end testing would be nice Top-down approach: • Build a large-scale system • Test the system under faults Shortcomings: • Hard to identify complex bugs • Fundamentally incomplete Investment Returns
  • 111. Lineage-driven fault injection Goal: top-down testing that • finds all of the fault-tolerance bugs, or • certifies that none exist
  • 112. Lineage-driven fault injection Correctness Specification Malevolent sentience Molly
  • 113. Lineage-driven fault injection Molly Correctness Specification Malevolent sentience
  • 114. Lineage-driven fault injection (LDFI) Approach: think backwards from outcomes Question: could a bad thing ever happen? Reframe: • Why did a good thing happen? • What could have gone wrong along the way?
  • 115. Thomasina: What a faint-heart! We must work outward from the middle of the maze. We will start with something simple.
  • 116. The game • Both players agree on a failure model • The programmer provides a protocol • The adversary observes executions and chooses failures for the next execution.
  • 117. Dedalus: it’s about data log(B, “data”)@5 What Where When Some data
  • 118. Dedalus: it’s like Datalog consequence :- premise[s] log(Node, Pload) :- bcast(Node, Pload);
  • 119. Dedalus: it’s like Datalog consequence :- premise[s] log(Node, Pload) :- bcast(Node, Pload); (Which is like SQL) create view log as select Node, Pload from bcast;
  • 120. Dedalus: it’s about time consequence@when :- premise[s] node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
  • 121. Dedalus: it’s about time consequence@when :- premise[s] node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); State change Natural join (bcast.Node1 == node.Node1) Communication
  • 122. The match Protocol: Reliable broadcast Specification: Pre: A correct process delivers a message m Post: All correct processes deliver m Failure Model: (Permanent) crash failures Message loss / partitions
  • 123. Round 1 node(Node, Neighbor)@next :- node(Node, Neighbor); log(Node, Pload)@next :- log(Node, Pload); log(Node, Pload) :- bcast(Node, Pload); log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); “An effort” delivery protocol
  • 124. Round 1 in space / time Process b Process a Process c 2 1 2 log log
  • 125. Round 1: Lineage log(B, data)@5
  • 126. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(Node, Pload)@next :- log(Node, Pload); log(B, data)@5 :- log(B, data)@4;
  • 127. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3
  • 128. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3 log(B,data)@2
  • 129. Round 1: Lineage log(B, data)@5 log(B, data)@4 log(B, data)@3 log(B, data)@2 log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); log(B, data)@2 :- bcast(A, data)@1, node(A, B)@1; log(A, data)@1
  • 130. An execution is a (fragile) “proof” of an outcome (proof-tree figure: log(A, data)@1 and node(A, B)@1 derive log(B, data)@2 via the async rule over message AB1 — which required a message from A to B at time 1 — and the persistence rule carries it forward to log(B, data)@5)
  • 131. Valentine: “The unpredictable and the predetermined unfold together to make everything the way it is.”
  • 132. Round 1: counterexample Process b Process a Process c 1 2 log (LOST) log The adversary wins!
  • 133. Round 2 Same as Round 1, but A retries. bcast(N, P)@next :- bcast(N, P);
  • 134. Round 2 in spacetime Process b Process a Process c 2 3 4 5 1 2 3 4 2 3 4 5 log log log log log log log log
  • 135. Round 2 log(B, data)@5
  • 136. Round 2 log(B, data)@5 log(B, data)@4 log(Node, Pload)@next :- log(Node, Pload); log(B, data)@5 :- log(B, data)@4;
  • 137. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;
  • 138. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3
  • 139. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2
  • 140. Round 2 log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2 log(A, data)@1
  • 141. Round 2 Retry provides redundancy in time log(B, data)@5 log(B, data)@4 log(A, data)@4 log(B, data)@3 log(A, data)@3 log(B,data)@2 log(A, data)@2 log(A, data)@1
  • 142. Traces are forests of proof trees (proof-forest figure: four proof trees of log(B, data)@5, one per retry round, each requiring one of the messages AB1, AB2, AB3, AB4) AB1 ^ AB2 ^ AB3 ^ AB4
  • 143. Traces are forests of proof trees (build of the previous slide)
  • 144. Round 2: counterexample Process b Process a Process c 1 log (LOST) log CRASHED 2 The adversary wins!
  • 145. Round 3 Same as in Round 2, but symmetrical. bcast(N, P)@next :- log(N, P);
  • 146. Round 3 in space / time Process b Process a Process c 2 3 4 5 1 log log 2 3 4 5 2 3 4 5 log log log log log log log log log log log log log log log log log log Redundancy in space and time
  • 147. Round 3 -- lineage log(B, data)@5
  • 148. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4
  • 149. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 log(B, data)@3 log(A, data)@3 log(C, data)@3
  • 150. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 log(B, data)@3 log(A, data)@3 log(C, data)@3 log(B, data)@2 log(A, data)@2 log(C, data)@2 log(A, data)@1
  • 151. Round 3 -- lineage log(B, data)@5 log(B, data)@4 log(A, data)@4 log(C, data)@4 log(B, data)@3 log(A, data)@3 log(C, data)@3 log(B, data)@2 log(A, data)@2 log(C, data)@2 log(A, data)@1
  • 152. Round 3 The programmer wins!
  • 153. Let’s reflect Fault-tolerance is redundancy in space and time. Best strategy for both players: reason backwards from outcomes using lineage Finding bugs: find a set of failures that “breaks” all derivations Fixing bugs: add additional derivations
  • 154. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. (AB1 ∨ BC2) Disjunction
  • 155. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. 2. Find a set of failures that breaks all proofs of a good outcome. (AB1 ∨ BC2) Disjunction ∧ (AC1) ∧ (AC2) Conjunction of disjunctions (AKA CNF)
  • 156. The role of the adversary can be automated 1. Break a proof by dropping any contributing message. 2. Find a set of failures that breaks all proofs of a good outcome. (AB1 ∨ BC2) Disjunction ∧ (AC1) ∧ (AC2) Conjunction of disjunctions (AKA CNF)
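A hedged, brute-force sketch of that automated adversary in hypothetical Python (Molly hands the real formula to a SAT solver; the clause contents here are only the example from the slide): each clause lists the message drops that would break one proof of the good outcome, and a counterexample is a failure set that intersects every clause.

    # Hypothetical sketch of LDFI's automated adversary as CNF solving.
    # Each clause lists the message drops that break one proof of the good
    # outcome; a satisfying failure set breaks *all* proofs at once.
    from itertools import chain, combinations

    cnf = [{"AB1", "BC2"}, {"AC1"}, {"AC2"}]   # one clause per proof tree

    def candidate_failure_sets(messages, max_drops):
        return chain.from_iterable(
            combinations(sorted(messages), k) for k in range(1, max_drops + 1))

    def find_counterexample(cnf, max_drops=3):
        messages = set().union(*cnf)
        for drops in candidate_failure_sets(messages, max_drops):
            if all(clause & set(drops) for clause in cnf):
                return set(drops)      # this failure set breaks every proof
        return None                    # nothing within the failure budget

    print(find_counterexample(cnf))    # e.g. {'AB1', 'AC1', 'AC2'}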
  • 157. Molly, the LDFI prototype Molly finds fault-tolerance violations quickly or guarantees that none exist. Molly finds bugs by explaining good outcomes – then it explains the bugs. Bugs identified: 2pc, 2pc-ctp, 3pc, Kafka Certified correct: paxos (synod), Flux, bully leader election, reliable broadcast
  • 158. Commit protocols Problem: Atomically change things Correctness properties: 1. Agreement (All or nothing) 2. Termination (Something)
  • 159. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit
  • 160. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit Can I kick it?
  • 161. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit Can I kick it? YES YOU CAN
  • 162. Two-phase commit Agent a Agent b Coordinator Agent d 2 5 2 5 1 prepare prepare prepare 3 4 2 5 vote vote vote commit commit commit Can I kick it? YES YOU CAN Well I’m gone
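A minimal sketch of the exchange above in hypothetical Python (not Molly's Dedalus encoding; the function and agent names are invented): two-phase commit gives agreement, but a coordinator crash after collecting votes leaves every agent blocked — the termination violation on the next slide.

    # Hypothetical sketch of two-phase commit from the coordinator's side.
    # Agreement holds (commit only if every vote is yes); termination does not:
    # a crash between collecting votes and sending the decision blocks the agents.

    def two_phase_commit(agents, crash_after_votes=False):
        votes = {name: vote("prepare") for name, vote in agents.items()}  # phase 1
        if crash_after_votes:
            return {name: "blocked" for name in agents}     # decision never sent
        decision = "commit" if all(v == "yes" for v in votes.values()) else "abort"
        return {name: decision for name in agents}          # phase 2

    willing = lambda msg: "yes"
    agents = {"a": willing, "b": willing, "d": willing}

    print(two_phase_commit(agents))                           # everyone commits
    print(two_phase_commit(agents, crash_after_votes=True))   # everyone blocks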
  • 163. Two-phase commit Agent a Agent b Coordinator Agent d 2 2 1 p p p 3 CRASHED 2 v v v Violation: Termination
  • 164. The collaborative termination protocol Basic idea: Agents talk amongst themselves when the coordinator fails. Protocol: On timeout, ask other agents about decision.
  • 165. 2PC - CTP Agent a Agent b Coordinator Agent d 2 3 4 5 6 7 prepare prepare prepare 2 3 4 5 6 7 1 2 3 CRASHED 2 3 4 5 6 7 vote decision_req decision_req vote decision_req decision_req vote decision_req decision_req
  • 166. 2PC - CTP Agent a Agent b Coordinator Agent d 2 3 4 5 6 7 prepare prepare prepare 2 3 4 5 6 7 1 2 3 CRASHED 2 3 4 5 6 7 vote decision_req decision_req vote decision_req decision_req vote decision_req decision_req Can I kick it? YES YOU CAN ……?
  • 167. 3PC Basic idea: Add a round, a state, and simple failure detectors (timeouts). Protocol: 1. Phase 1: Just like in 2PC – Agent timeout → abort 2. Phase 2: send precommit, collect acks – Agent timeout → commit 3. Phase 3: Just like phase 2 of 2PC
  • 168. 3PC Process a Process b Process C Process d 2 4 7 2 4 7 1 cancommit cancommit cancommit 3 vote_msg precommit precommit precommit 5 6 2 4 7 vote_msg ack vote_msg ack ack commit commit commit
  • 169. 3PC Process a Process b Process C Process d 2 4 7 2 4 7 1 cancommit cancommit cancommit 3 vote_msg precommit precommit precommit 5 6 2 4 7 vote_msg ack vote_msg ack ack commit commit commit Timeout → Abort Timeout → Commit
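A hedged sketch, in hypothetical Python (names invented), of the agent-side timeout defaults annotated above; the partition scenario on the next slides works precisely by making the two defaults fire on different agents.

    # Hypothetical sketch of a 3PC agent's timeout defaults.
    #   still waiting for precommit  -> timeout means abort
    #   already acked a precommit    -> timeout means commit

    def agent_decision(state, last_event):
        if last_event in ("abort", "commit"):
            return last_event                      # obey an explicit decision
        if last_event == "timeout":
            return "abort" if state == "voted" else "commit"
        raise ValueError(last_event)

    # The run on the next slides: a and b had already reached "precommitted"
    # when the partition hit, so their timeout default commits, while the
    # coordinator (and anyone still merely "voted") decides abort.
    print(agent_decision("precommitted", "timeout"))   # commit
    print(agent_decision("voted", "abort"))            # abort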
  • 170. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg
  • 171. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Agent crash Agents learn commit decision
  • 172. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Agent crash Agents learn commit decision d is dead; coordinator decides to abort
  • 173. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Brief network partition Agent crash Agents learn commit decision d is dead; coordinator decides to abort
  • 174. Network partitions make 3pc act crazy Process a Process b Process C Process d 2 4 7 8 2 4 7 8 1 3 5 6 7 8 2 CRASHED vote_msg ack commit vote_msg ack commit cancommit cancommit cancommit precommit precommit precommit abort (LOST) abort (LOST) abort abort vote_msg Brief network partition Agent crash Agents learn commit decision d is dead; coordinator decides to abort Agents A & B decide to commit
  • 175. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w
  • 176. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition
  • 177. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition a becomes leader and sole replica
  • 178. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition a becomes leader and sole replica a ACKs client write
  • 179. Kafka durability bug Replica b Replica c Zookeeper Replica a Client 1 1 2 1 3 4 CRASHED 1 3 5 m m m m l a c w Brief network partition a becomes leader and sole replica a ACKs client write Data loss
  • 180. Molly summary Lineage allows us to reason backwards from good outcomes Molly: surgically-targeted fault injection Investment similar to testing Returns similar to formal methods
  • 181. Where we’ve been; where we’re headed 1. Mourning the death of transactions 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 182. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 183. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. What is so hard about distributed systems? 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 184. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. (asynchrony X partial failure) = too hard to hide! We need tools to manage it. 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 185. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Distributed consistency: managing asynchrony 4. Fault-tolerance: progress despite failures
  • 186. Where we’ve been; where we’re headed 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Fault-tolerance: progress despite failures
  • 187. Outline 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Fault-tolerance: progress despite failures
  • 188. Outline 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Backwards from outcomes
  • 189. Remember 1. We need application-level guarantees 2. asynchrony X partial failure = too hard to hide! We need tools to manage it. 3. Focus on flow: data in motion 4. Backwards from outcomes Composition is the hardest problem
  • 190. A happy crisis Valentine: “It makes me so happy. To be at the beginning again, knowing almost nothing.... It's the best possible time of being alive, when almost everything you thought you knew is wrong.”

Editor's notes

  1. USER-CENTRIC
  2. OMG pause here. Remember brewer 2012? Top-down vs bottom-up designs? We had this top-down thing and it was beautiful.
  3. It was so beautiful that it didn’t matter that it was somewhat ugly
  4. The abstraction was so beautiful, IT DOESN”T MATTER WHAT”S UNDERNEATH. Wait, or does it? When does it?
  5. We’ve known for a long time that it is hard to hide the complexities of distribution
  6. Focus not on semantics, but on the properties of components: thin interfaces, understandable latency & failure modes. DEV-centric But can we ever recover those guarantees? I mean real guarantees, at the application level? Are my (app-level) constraints upheld? No? What can go wrong?
  7. FIX ME: joe’s idea: sketch of a castle being filled in, vs bricks But can we ever recover those guarantees? I mean real guarantees, at the application level? Are my (app-level) constraints upheld? No? What can go wrong?
  8. In a world without transactions, one programmer must risk inconsistency to build a distributed application out of individually-verified components
  9. In a world without transactions, one programmer must risk inconsistency to build a distributed application out of individually-verified components
  10. In a world without transactions, one programmer must risk inconsistency to build a distributed application out of individually-verified components
  11. In a world without transactions, one programmer must risk inconsistency to build a distributed application out of individually-verified components
  12. In a world without transactions, one programmer must risk inconsistency to build a distributed application out of individually-verified components
  13. Meaning: translation
  14. DS are hard because of uncertainty – nondeterminism – which is fundamental to the environment and can “leak” into the results. It’s astoundingly difficult to face these demons at the same time – tempting to try to defeat them one at a time.
  15. Async isn’t a problem: just need to be careful to number messages and interleave correctly. Ignore arrival order. Whoa, this is easy so far.
  16. Failure isn’t a problem: just do redundant computation and store redundant data. Make more copies than there will be failures. I win.
  17. We can’t do deterministic interleaving if producers may fail. Nd message order makes it hard to keep replicas in agreement
  18. We can’t do deterministic interleaving if producers may fail. Nd message order makes it hard to keep replicas in agreement
  19. We can’t do deterministic interleaving if producers may fail. Nd message order makes it hard to keep replicas in agreement
  20. To guard against failures, we replicate. NB: asynchrony => replicas might not agree
  21. Very similar looking criteria (1 safe 1 live). Takes some work, even on a single site. But hard in our scenario: disorder => replica disagreement, partial failure => missing partitions
  22. Very similar looking criteria (1 safe 1 live). Takes some work, even on a single site. But hard in our scenario: disorder => replica disagreement, partial failure => missing partitions
  23. Very similar looking criteria (1 safe 1 live). Takes some work, even on a single site. But hard in our scenario: disorder => replica disagreement, partial failure => missing partitions
  24. FIX: make it about translation vs. prayer
  25. FIX: make it about translation vs. prayer
  26. FIX: make it about translation vs. prayer
  27. I.e., reorderability, batchability, tolerance to duplication / retry. Now the programmer must map from application invariants to the object API (with richer semantics than read/write).
  28. I.e., reorderability, batchability, tolerance to duplication / retry. Now the programmer must map from application invariants to the object API (with richer semantics than read/write).
  29. Convergence is a property of component state. It rules out divergence, but it does not readily compose.
  30. Convergence is a property of component state. It rules out divergence, but it does not readily compose.
  31. Convergence is a property of component state. It rules out divergence, but it does not readily compose.
  32. Convergence is a property of component state. It rules out divergence, but it does not readily compose.
  33. However, not sufficient to synchronize GC. Perhaps more importantly, not *compositional* -- what guarantees does my app – pieced together from many convergent objects – give? To reason compositionally, need guarantees about what comes OUT of my objects, and how it transits the app. *** main point to make here: we’d like to reason backwards from the outcomes, at the level of abstraction of the appplication.
  34. However, not sufficient to synchronize GC. Perhaps more importantly, not *compositional* -- what guarantees does my app – pieced together from many convergent objects – give? To reason compositionally, need guarantees about what comes OUT of my objects, and how it transits the app. *** main point to make here: we’d like to reason backwards from the outcomes, at the level of abstraction of the appplication.
  35. However, not sufficient to synchronize GC. Perhaps more importantly, not *compositional* -- what guarantees does my app – pieced together from many convergent objects – give? To reason compositionally, need guarantees about what comes OUT of my objects, and how it transits the app. *** main point to make here: we’d like to reason backwards from the outcomes, at the level of abstraction of the appplication.
  36. We are interested in the properties of component *outputs* rather than just internal state. Hence we are interested in a different property: confluence. A confluent module behaves like a function from sets (of inputs) to sets (of outputs)
  37. We are interested in the properties of component *outputs* rather than just internal state. Hence we are interested in a different property: confluence. A confluent module behaves like a function from sets (of inputs) to sets (of outputs)
  38. We are interested in the properties of component *outputs* rather than just internal state. Hence we are interested in a different property: confluence. A confluent module behaves like a function from sets (of inputs) to sets (of outputs)
  39. We are interested in the properties of component *outputs* rather than just internal state. Hence we are interested in a different property: confluence. A confluent module behaves like a function from sets (of inputs) to sets (of outputs)
  40. We are interested in the properties of component *outputs* rather than just internal state. Hence we are interested in a different property: confluence. A confluent module behaves like a function from sets (of inputs) to sets (of outputs)
  41. We are interested in the properties of component *outputs* rather than just internal state. Hence we are interested in a different property: confluence. A confluent module behaves like a function from sets (of inputs) to sets (of outputs)
  42. Confluence is compositional: Composing confluent components yields a confluent dataflow
  43. Confluence is compositional: Composing confluent components yields a confluent dataflow
  44. Confluence is compositional: Composing confluent components yields a confluent dataflow
  45. All of these components are confluent! Composing confluent components yields a confluent dataflow But annotations are burdensome
  46. All of these components are confluent! Composing confluent components yields a confluent dataflow But annotations are burdensome
  47. A separate question is choosing a coordination strategy that “fits” the problem without “overpaying.” for example, we could establish a global ordering of messages, but that would essentially cost us what linearizable storage cost us. We can solve the GC problem with SEALING: establishing a big barrier; damming the stream.
  48. A separate question is choosing a coordination strategy that “fits” the problem without “overpaying.” for example, we could establish a global ordering of messages, but that would essentially cost us what linearizable storage cost us. We can solve the GC problem with SEALING: establishing a big barrier; damming the stream.
  49. M – a semantic property of code – implies confluence An appropriately constrained language provides a conservative syntactic test for M.
  50. M – a semantic property of code – implies confluence An appropriately constrained language provides a conservative syntactic test for M.
  51. Also note that a data-centric language give us the dataflow graph automatically, via dependencies (across LOC, modules, processes, nodes, etc)
  52. Also note that a data-centric language give us the dataflow graph automatically, via dependencies (across LOC, modules, processes, nodes, etc)
  53. Try to not use it! Learn how to choose it. Tools help!
  54. Start with a hard problem. Hard problem: does my FT protocol work? Harder: is the composition of my components FT?
  55. Point: we need to replicate data to both copies of a replica We need to commit multiple partitions together
  56. Start with a hard problem. Hard problem: does my FT protocol work? Harder: is the composition of my components FT?
  57. Examples! 2pc and replication. Properties, etc etc
  58. Talk about speed too.
  59. After all, FT is an end-to-end concern.
  60. (synchronous)
  61. (synchronous)
  62. (synchronous)
  63. TALK ABOUT SAT!!!
  64. TALK ABOUT SAT!!!