Lineage-driven Fault Injection
Peter Alvaro, Joshua Rosen, Joseph M. Hellerstein
UC Berkeley
The future is disorder
•  Data-intensive systems are increasingly
distributed and heterogeneous
•  Distributed systems suffer partial failures
•  Fault-tolerant code is hard to get right
•  Composing FT components is hard too!
Motivation: Kafka replication bug
Three correct components:
1.  Primary/backup replication
2.  Timeout-based failure detectors
3.  Zookeeper
One nasty bug:
Acknowledged writes are lost
‘Molly’ witnesses the bug
[Space-time diagram with columns Replica b, Replica c, Zookeeper, Replica a, Client; one process is marked CRASHED. Annotations, added one per build:]
•  Brief network partition
•  a becomes primary and sole replica
•  a ACKs client write
•  Data loss
Fault-tolerance:
the state of the art
1.  Bottom-up approaches
(e.g. verification)
2.  Top-down approaches
(e.g. fault injection)
[Plots, one per approach: Returns versus Investment]
Lineage-driven fault injection
Goal: whole-system testing that
•  finds all of the fault-tolerance bugs, or
•  certifies that none exist
Main idea: fault-tolerance is redundancy.
Lineage-driven fault injection
Approach: think backwards from outcomes
Use lineage to find evidence of redundancy
Original Question:
•  Could a bad thing ever happen?
Reframed question:
•  Why did a good thing happen?
•  What could have gone wrong?
A game
Protocol:
Reliable broadcast
Specification:
Pre: A correct process delivers a message m
Post: All correct processes deliver m
Failure Model:
(Permanent) crash failures
Message loss / partitions
[Diagram labels: Program, Output, Constraints]
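The specification above can be written as a check over one finished execution. This is a minimal sketch, not Molly's own encoding: the function name, the `delivered` map, and the `crashed` set are hypothetical names chosen for illustration.

```python
def satisfies_reliable_broadcast(delivered, crashed=frozenset()):
    """Check the Pre/Post pair on one execution.

    delivered: dict mapping each process to the set of messages it delivered.
    crashed:   processes that crashed (they are not "correct").
    Both structures are assumptions made for this sketch.
    """
    correct = [p for p in delivered if p not in crashed]
    # Pre: messages delivered by at least one correct process.
    triggered = set().union(*(delivered[p] for p in correct))
    # Post: every such message is delivered by all correct processes.
    return all(m in delivered[p] for m in triggered for p in correct)


if __name__ == "__main__":
    run = {"a": {"data"}, "b": set(), "c": {"data"}}
    print(satisfies_reliable_broadcast(run))                 # b is correct but missed the message
    print(satisfies_reliable_broadcast(run, crashed={"b"}))  # b crashed, so it is exempt
```

Crashed processes are excluded on both sides of the implication, which matches the crash-failure model listed above.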
Round 1
“An effort” delivery protocol:
The broadcaster makes an attempt to relay the message to the other nodes
Round 1 in space / time
[Space-time diagram: Process a sends log to Process b and Process c at time 1; both receive it at time 2]
Outcomes are data
log(B, “data”)@5
•  What: log
•  Where: B
•  When: @5
•  Some data: “data”
Round 1: Lineage
[Lineage chain for log(B, data)@5, built up step by step:]
log(B, data)@5
log(B, data)@4
log(B, data)@3
log(B, data)@2
bcast(A, data)@1
log(Node, Pload)@next :- log(Node, Pload);
log(B, data)@5 :- log(B, data)@4;
log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
log(B, data)@2 :- bcast(A, data)@1, node(A, B)@1;
An execution is a (fragile) “proof”
of an outcome
[Proof tree for log(B, data)@5:]
log(A, data)@1, node(A, B)@1
→ r2 (message AB1) → log(B, data)@2
→ r1 → log(B, data)@3
→ r1 → log(B, data)@4
→ r1 → log(B, data)@5
(which required a message from A to B at time 1)
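The proof tree can be walked mechanically, from the outcome back to the facts it rests on. A minimal sketch, assuming the ground rule instances are stored in a plain dictionary; `DERIVED_FROM`, `lineage`, and `base_facts` are hypothetical names, and the leaves follow the proof-tree slide (log(A, data)@1 and node(A, B)@1).

```python
# Ground rule instances for Round 1: head -> (rule name, body facts).
# The table is hand-written for this sketch; a real tool would record it
# while evaluating the program.
DERIVED_FROM = {
    "log(B, data)@5": ("r1", ["log(B, data)@4"]),
    "log(B, data)@4": ("r1", ["log(B, data)@3"]),
    "log(B, data)@3": ("r1", ["log(B, data)@2"]),
    "log(B, data)@2": ("r2", ["log(A, data)@1", "node(A, B)@1"]),
}


def lineage(fact, depth=0):
    """Yield (depth, fact, rule) for every node of the proof tree rooted at fact."""
    rule, body = DERIVED_FROM.get(fact, (None, []))
    yield depth, fact, rule
    for premise in body:
        yield from lineage(premise, depth + 1)


def base_facts(fact):
    """Leaves of the proof tree: facts that no rule instance derives."""
    return [f for _, f, rule in lineage(fact) if rule is None]


if __name__ == "__main__":
    for depth, fact, rule in lineage("log(B, data)@5"):
        print("  " * depth + fact + (f"   [{rule}]" if rule else ""))
```

Only the r2 step crosses the network, so this single derivation depends on exactly one message, AB1.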
Round 1: counterexample
[Space-time diagram: Process a sends log to Process b and Process c at time 1; the message to Process b is marked (LOST)]
The adversary wins!
Round 2
“Sender retries” delivery protocol:
The broadcaster makes repeated attempts to relay the message to the other nodes
Round 2 in spacetime
[Space-time diagram: Process a sends log to Process b and Process c at each of times 1 through 4; the messages arrive at times 2 through 5]
Round 2: sender retries
[Lineage graph for log(B, data)@5, built up step by step:]
log(B, data)@5
log(B, data)@4    log(A, data)@4
log(B, data)@3    log(A, data)@3
log(B, data)@2    log(A, data)@2
                  log(A, data)@1
log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;
Retry provides redundancy in time
Traces are forests of proof trees
[Four proof trees for log(B, data)@5, one per message from A to B:]
AB1: log(A, data)@1, node(A, B)@1 → r2 → log(B, data)@2 → r1 → log(B, data)@3 → r1 → log(B, data)@4 → r1 → log(B, data)@5
AB2: log(A, data)@1 → r1 → log(A, data)@2; node(A, B)@1 → r3 → node(A, B)@2; then r2 → log(B, data)@3 → r1 → log(B, data)@4 → r1 → log(B, data)@5
AB3: log(A, data)@1 → r1 → log(A, data)@2 → r1 → log(A, data)@3; node(A, B)@1 → r3 → node(A, B)@2 → r3 → node(A, B)@3; then r2 → log(B, data)@4 → r1 → log(B, data)@5
AB4: log(A, data)@1 → r1 → log(A, data)@2 → r1 → log(A, data)@3 → r1 → log(A, data)@4; node(A, B)@1 → r3 → node(A, B)@2 → r3 → node(A, B)@3 → r3 → node(A, B)@4; then r2 → log(B, data)@5
AB1 ∧ AB2 ∧ AB3 ∧ AB4
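The forest can be enumerated by recursing over the rules. A minimal sketch under stated assumptions: it models only the link from A to B, treats "retry" as "A re-sends at every step while it holds the message", and names each message sender + receiver + send time (AB1, AB2, and so on). `make_prover`, `ORIGIN`, and `NEIGHBORS` are hypothetical names.

```python
from functools import lru_cache

ORIGIN = "A"                 # holds log(A, data)@1
NEIGHBORS = {"A": ["B"]}     # node(A, B)


def make_prover(retry):
    """Build a function that lists every proof of log(node, data)@t.

    Each proof is represented by the frozenset of messages it relies on.
    retry=False: the origin sends at time 1 only (Round 1).
    retry=True:  the origin sends at every step (Round 2).
    """
    @lru_cache(maxsize=None)
    def proofs_of_log(node, t):
        if t < 1:
            return frozenset()
        found = set()
        if node == ORIGIN and t == 1:
            found.add(frozenset())              # base fact, needs no message
        found |= proofs_of_log(node, t - 1)     # r1: persistence
        for sender, receivers in NEIGHBORS.items():   # r2: delivery
            if node in receivers and (retry or t - 1 == 1):
                for proof in proofs_of_log(sender, t - 1):
                    found.add(proof | {f"{sender}{node}{t - 1}"})
        return frozenset(found)

    return proofs_of_log


if __name__ == "__main__":
    for label, retry in (("Round 1", False), ("Round 2", True)):
        proofs = make_prover(retry)("B", 5)
        print(label, sorted(sorted(p) for p in proofs))
```

With retries the outcome has four independent supports, so an adversary limited to message loss has to drop all four messages to break it.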
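Round 2: counterexample
[Space-time diagram: Process a sends log to Process b and Process c at time 1; the message to Process b is marked (LOST), and Process a is then marked CRASHED]
The adversary wins!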
Round 3
“Symmetric retry” delivery protocol:
All participants make repeated attempts to relay the message to the other nodes
Round 3 in space / time
[Space-time diagram: Processes a, b, and c over times 1 through 5; every process that holds the message sends log to the other processes at each step]
Round 3: symmetric retry
[Lineage graph for log(B, data)@5, built up step by step:]
log(B, data)@5
log(B, data)@4    log(A, data)@4    log(C, data)@4
log(B, data)@3    log(A, data)@3    log(C, data)@3
log(B, data)@2    log(A, data)@2    log(C, data)@2
                  log(A, data)@1
Redundancy in space and time
Round 3: symmetric retry
The programmer wins!
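The three rounds can be replayed against the adversary's fault choices in a few lines. A minimal sketch, assuming synchronous steps in which a message sent at time t arrives at t + 1; `run`, the protocol names, and the `lost` / `crashes` encodings are hypothetical, not Molly's interface.

```python
def run(protocol, lost=frozenset(), crashes=None, nodes=("A", "B", "C"), eot=5):
    """Replay a broadcast from A and return the nodes holding the message at eot.

    lost:    set of (sender, receiver, send_time) messages the adversary drops.
    crashes: dict node -> crash time; a crashed node sends nothing from then on.
    """
    crashes = crashes or {}
    has_log = {"A"}                        # log(A, data)@1
    for t in range(1, eot):
        if protocol == "best_effort":      # Round 1: one attempt
            senders = {"A"} if t == 1 else set()
        elif protocol == "sender_retry":   # Round 2: A keeps trying
            senders = {"A"}
        elif protocol == "symmetric_retry":  # Round 3: everyone relays
            senders = set(has_log)
        else:
            raise ValueError(f"unknown protocol: {protocol}")
        senders = {s for s in senders if crashes.get(s, eot + 1) > t}
        arrived = {r for s in senders for r in nodes
                   if r != s and (s, r, t) not in lost}
        has_log |= arrived
    return has_log


if __name__ == "__main__":
    drop_ab1 = {("A", "B", 1)}
    print(sorted(run("best_effort", lost=drop_ab1)))
    print(sorted(run("sender_retry", lost=drop_ab1, crashes={"A": 2})))
    print(sorted(run("symmetric_retry", lost=drop_ab1, crashes={"A": 2})))
```

The sketch only shows that symmetric retry survives the Round 2 counterexample (AB1 lost, then A crashes); it does not establish tolerance to every fault combination.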
Let’s reflect
Intuition:
Fault-tolerance is redundancy in space and time.
Strategy:
Reason backwards from outcomes using lineage
Lineage exposes redundancy of outcome support.
Finding bugs: choose failures that “break” all derivations
Fixing bugs: add additional derivations
Automating the role of the adversary
1.  Break a proof by dropping any contributing message.
    Disjunction: (AB1 ∨ BC2)
2.  Find a set of failures that breaks all proofs of a good outcome.
    Conjunction of disjunctions (AKA CNF): (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)
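For a formula this small, the CNF can be solved by exhaustive search. A minimal sketch, assuming each variable means "drop this message" and each clause lists the messages one proof depends on; `minimal_fault_sets` is a hypothetical name, and the brute-force loop stands in for whatever solver a real tool would use.

```python
from itertools import combinations


def minimal_fault_sets(cnf):
    """Return every minimal set of dropped messages that satisfies all clauses.

    cnf: list of sets; a clause is satisfied when at least one of its
    messages is dropped, which breaks the corresponding proof.
    """
    variables = sorted({v for clause in cnf for v in clause})
    solutions = []
    for size in range(1, len(variables) + 1):
        for candidate in combinations(variables, size):
            chosen = set(candidate)
            if any(sol <= chosen for sol in solutions):
                continue                  # a smaller solution already covers this
            if all(chosen & clause for clause in cnf):
                solutions.append(chosen)
    return solutions


if __name__ == "__main__":
    cnf = [{"AB1", "BC2"}, {"AC1"}, {"AC2"}]   # the formula from the slide
    for faults in minimal_fault_sets(cnf):
        print(sorted(faults))
```

Each returned set is a candidate fault scenario to inject: if the good outcome still appears under it, the run has revealed a new proof and the formula gains clauses; if not, it is a counterexample.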
By injecting only “interesting” faults…
Molly finds bugs quickly
By injecting only “interesting” faults…
Molly provides guarantees that
outcomes are fault-tolerant
Program        Bound   Combinations    Executions
redun-deliv    11      8.07 × 10^18    11
ack-deliv      8       3.08 × 10^13    673
paxos-synod    7       4.81 × 10^11    173
bully-leader   10      1.26 × 10^17    2
flux           22      6.20 × 10^76    187
Molly, the LDFI prototype
Molly finds fault-tolerance violations
quickly or guarantees that none exist.
Molly uses data lineage to reason about
redundancy of support (or lack thereof)
for system outcomes.
Case study: commit protocols
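2-Phase commit
[Space-time diagram: Agent a, Agent b, Coordinator, Agent d. The Coordinator sends p to the three agents at time 1 and the agents send v at time 2; the Coordinator is then marked CRASHED]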
Collaborative termination
[Space-time diagram: Agent a, Agent b, Coordinator, Agent d, times 1 through 6. The Coordinator sends prepare to the three agents and the agents send vote; the Coordinator is marked CRASHED; the agents then exchange decision_req messages]
3-Phase commit
[Space-time diagram: Process a, Process b, Process c, Process d, times 1 through 8. Messages: cancommit (×3), vote_msg (×3), precommit (×3), ack (×2), abort (LOST) (×2), abort (×2), commit (×2). Process d is marked CRASHED]
3PC in an asynchronous network
[Same space-time diagram as the 3-Phase commit slide, with annotations:]
•  Brief network partition
•  Agent crash
•  Agents learn commit decision
•  d is dead; coordinator decides to abort
•  Agents A & B decide to commit

Lineage-driven Fault Injection, SIGMOD'15

《Queensland毕业文凭-昆士兰大学毕业证成绩单》
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical Engineering
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
 

Lineage-driven Fault Injection, SIGMOD'15

  • 1. Lineage-driven Fault Injection Peter Alvaro Joshua Rosen Joseph M. Hellerstein UC Berkeley
  • 2. The future is disorder •  Data-intensive systems are increasingly distributed and heterogeneous •  Distributed systems suffer partial failures •  Fault-tolerant code is hard to get right •  Composing FT components is hard too!
  • 3. Motivation: Kafka replication bug Three correct components: 1.  Primary/backup replication 2.  Timeout-based failure detectors 3.  Zookeeper One nasty bug: Acknowledged writes are lost
  • 4. ‘Molly’ witnesses the bug [Message diagram with lanes Replica b, Replica c, Zookeeper, Replica a, Client; one lane is marked CRASHED]
  • 5. ‘Molly’ witnesses the bug [Same diagram] Brief network partition
  • 6. ‘Molly’ witnesses the bug [Same diagram] Brief network partition; a becomes primary and sole replica
  • 7. ‘Molly’ witnesses the bug [Same diagram] Brief network partition; a becomes primary and sole replica; a ACKs client write
  • 8. ‘Molly’ witnesses the bug [Same diagram] Brief network partition; a becomes primary and sole replica; a ACKs client write; Data loss
  • 9. Fault-tolerance: the state of the art 1. Bottom-up approaches (e.g. verification) 2. Top-down approaches (e.g. fault injection) [Two plots, each with axes Investment and Returns]
  • 10. Fault-tolerance: the state of the art 1. Bottom-up approaches (e.g. verification) 2. Top-down approaches (e.g. fault injection) [Two plots, each with axes Investment and Returns]
  • 11. Fault-tolerance: the state of the art 1. Bottom-up approaches (e.g. verification) 2. Top-down approaches (e.g. fault injection) [One plot with axes Investment and Returns]
  • 12. Fault-tolerance: the state of the art 1. Bottom-up approaches (e.g. verification) 2. Top-down approaches (e.g. fault injection) [One plot with axes Investment and Returns]
  • 13. Fault-tolerance: the state of the art 1. Bottom-up approaches (e.g. verification) 2. Top-down approaches (e.g. fault injection) [One plot with axes Investment and Returns]
  • 14. Fault-tolerance: the state of the art 1. Bottom-up approaches (e.g. verification) 2. Top-down approaches (e.g. fault injection)
  • 15. Lineage-driven fault injection Goal: whole-system testing that •  finds all of the fault-tolerance bugs, or •  certifies that none exist Main idea: fault-tolerance is redundancy.
  • 16. Lineage-driven fault injection Approach: think backwards from outcomes Use lineage to find evidence of redundancy Original Question: •  Could a bad thing ever happen? Reframed question: •  Why did a good thing happen? •  What could have gone wrong?
  • 17. A game Protocol: Reliable broadcast Specification: Pre: A correct process delivers a message m Post: All correct processes deliver m Failure Model: (Permanent) crash failures; message loss / partitions [Diagram labels: Program, Output constraints]
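The pre/post specification on this slide can be read as a predicate over the final state of a run. The following is a minimal sketch of that reading, not code from the talk: the function name, the `delivered` mapping, and the two sample runs are all hypothetical.

```python
from typing import Dict, Set


def violates_reliable_broadcast(delivered: Dict[str, Set[str]],
                                crashed: Set[str]) -> bool:
    """Return True if some correct process delivered a message that
    another correct process never delivered.

    delivered maps a process name to the messages in its log at the end
    of the run; crashed is the set of processes that failed.
    """
    correct = [p for p in delivered if p not in crashed]
    for p in correct:
        for m in delivered[p]:
            # Pre holds for (p, m), so Post requires every correct
            # process to hold m as well.
            if any(m not in delivered[q] for q in correct):
                return True
    return False


# Hypothetical run: a's message to b was lost, c received it, a crashed.
bad_run = {"a": {"data"}, "b": set(), "c": {"data"}}
# Hypothetical run: every correct process ends up with the message.
good_run = {"a": {"data"}, "b": {"data"}, "c": {"data"}}

print(violates_reliable_broadcast(bad_run, crashed={"a"}))   # True
print(violates_reliable_broadcast(good_run, crashed={"a"}))  # False
```

Note that a run in which only a crashed process ever held the message satisfies the specification vacuously, because the precondition speaks only about correct processes.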
  • 18. Round 1 “An effort” delivery protocol: The broadcaster makes an attempt to relay the message to the other nodes
  • 19. Round 1 in space / time [Space-time diagram: Process a sends at time 1; Process b and Process c each receive a log message at time 2]
  • 20. Outcomes are data log(B, “data”)@5 [Annotations on the fact: What, Where, When, Some data]
  • 21. Round 1: Lineage log(B, data)@5
  • 22. Round 1: Lineage log(B, data)@5, log(B, data)@4. log(Node, Pload)@next :- log(Node, Pload); log(B, data)@5 :- log(B, data)@4;
  • 23. Round 1: Lineage log(B, data)@5, log(B, data)@4, log(B, data)@3
  • 24. Round 1: Lineage log(B, data)@5, log(B, data)@4, log(B, data)@3, log(B, data)@2
  • 25. Round 1: Lineage log(B, data)@5, log(B, data)@4, log(B, data)@3, log(B, data)@2, bcast(A, data)@1. log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); log(B, data)@2 :- bcast(A, data)@1, node(A, B)@1;
  • 26. An execution is a (fragile) “proof” of an outcome [Proof tree: log(A, data)@1 and node(A, B)@1 derive log(B, data)@2 via rule r2 and message AB1; rule r1 then carries log(B, data) from time 2 to time 5] (which required a message from A to B at time 1)
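To make the backward walk over the Round 1 lineage concrete, here is a small sketch in Python. It is an illustration under stated assumptions, not Molly's implementation: the `Fact` type, `show`, and `lineage` are hypothetical names, an asynchronous message is assumed to arrive exactly one step after it is sent, and the only two rules considered are the persistence rule and the broadcast rule quoted on the slides.

```python
from typing import List, NamedTuple, Tuple


class Fact(NamedTuple):
    relation: str  # what
    node: str      # where
    payload: str
    time: int      # when


def show(f: Fact) -> str:
    return f"{f.relation}({f.node}, {f.payload})@{f.time}"


def lineage(goal: Fact, sender: str, send_time: int) -> Tuple[List[str], List[str]]:
    """Walk backwards from goal for the one-shot protocol.

    Returns the derivation steps and the messages the proof depends on.
    """
    steps, messages = [], []
    fact = goal
    # Persistence rule: log(Node, Pload)@next :- log(Node, Pload)
    while fact.time > send_time + 1:
        prev = fact._replace(time=fact.time - 1)
        steps.append(f"{show(fact)} <- r1 <- {show(prev)}")
        fact = prev
    # Broadcast rule: log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2)
    source = Fact("bcast", sender, fact.payload, send_time)
    steps.append(f"{show(fact)} <- r2 <- {show(source)}")
    messages.append(f"{sender}{fact.node}{send_time}")
    return steps, messages


steps, messages = lineage(Fact("log", "B", "data", 5), sender="A", send_time=1)
for line in steps:
    print(line)
print("depends on messages:", messages)
```

The single entry in `messages` is the point of the slide: this proof of log(B, data)@5 rests on one message, so losing it leaves the outcome with no support.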
  • 27. Round 1: counterexample [Space-time diagram: Process a sends at time 1; one log message is LOST and the other is delivered at time 2] The adversary wins!
  • 28. Round 2 “Sender retries” delivery protocol: The broadcaster makes repeated attempts to relay the message to the other nodes
  • 29. Round 2 in spacetime [Space-time diagram: Process a sends log messages to Process b and Process c at times 1 through 4; they arrive at times 2 through 5]
  • 30. Round 2: sender retries log(B, data)@5
  • 31. Round 2: sender retries log(B, data)@5, log(B, data)@4
  • 32. Round 2: sender retries log(B, data)@5, log(B, data)@4, log(A, data)@4. log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2); log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;
  • 33. Round 2: sender retries log(B, data)@5, log(B, data)@4, log(A, data)@4, log(B, data)@3, log(A, data)@3
  • 34. Round 2: sender retries log(B, data)@5, log(B, data)@4, log(A, data)@4, log(B, data)@3, log(A, data)@3, log(B, data)@2, log(A, data)@2
  • 35. Round 2: sender retries log(B, data)@5, log(B, data)@4, log(A, data)@4, log(B, data)@3, log(A, data)@3, log(B, data)@2, log(A, data)@2, log(A, data)@1
  • 36. Round 2: sender retries log(B, data)@5, log(B, data)@4, log(A, data)@4, log(B, data)@3, log(A, data)@3, log(B, data)@2, log(A, data)@2, log(A, data)@1. Retry provides redundancy in time
  • 37. Traces are forests of proof trees [Four proof trees for log(B, data)@5. In tree k (k = 1 to 4), rule r1 carries log(A, data) and rule r3 carries node(A, B) from time 1 to time k, rule r2 with message ABk derives log(B, data) at time k+1, and rule r1 carries it to time 5] AB1 ∧ AB2 ∧ AB3 ∧ AB4
  • 38. Traces are forests of proof trees [Same four proof trees] AB1 ∧ AB2 ∧ AB3 ∧ AB4
  • 39. Traces are forests of proof trees [Same four proof trees, with three ✖ marks] AB1 ∧ AB2 ∧ AB3 ∧ AB4
  • 40. Traces are forests of proof trees [Same four proof trees, with three ✖ marks] AB1 ∧ AB2 ∧ AB3 ∧ AB4
  • 41. Round 2: counterexample [Space-time diagram: Process a sends at time 1 and is then CRASHED; one log message is LOST and the other is delivered at time 2] The adversary wins!
  • 42. Round 3 “Symmetric retry” delivery protocol: All participants make repeated attempts to relay the message to the other nodes
  • 43. Round 3 in space / time [Space-time diagram: Process a is active from time 1, Process b and Process c from time 2; log messages are exchanged among all three through time 5]
  • 44. Round 3: symmetric retry log(B, data)@5
  • 45. Round 3: symmetric retry log(B, data)@5, log(B, data)@4, log(A, data)@4, log(C, data)@4
  • 46. Round 3: symmetric retry log(B, data)@5, log(B, data)@4, log(A, data)@4, log(C, data)@4, log(B, data)@3, log(A, data)@3, log(C, data)@3
  • 47. Round 3: symmetric retry log(B, data)@5, log(B, data)@4, log(A, data)@4, log(C, data)@4, log(B, data)@3, log(A, data)@3, log(C, data)@3, log(B, data)@2, log(A, data)@2, log(C, data)@2, log(A, data)@1
  • 48. Round 3: symmetric retry log(B, data)@5, log(B, data)@4, log(A, data)@4, log(C, data)@4, log(B, data)@3, log(A, data)@3, log(C, data)@3, log(B, data)@2, log(A, data)@2, log(C, data)@2, log(A, data)@1. Redundancy in space and time
  • 49. Round 3: symmetric retry The programmer wins!
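The three rounds of the game can be replayed in a toy simulator. The sketch below is an assumption-laden illustration rather than anything from the talk: the protocol names, `run`, `violated`, the 5-step horizon, and the one-step message delay are all hypothetical choices, and the adversary is limited to the drops and crashes passed in explicitly.

```python
from typing import Dict, FrozenSet, Optional, Tuple

NODES = ["a", "b", "c"]
EOT = 5  # last simulated time step (hypothetical bound)


def run(protocol: str,
        drops: FrozenSet[Tuple[str, str, int]] = frozenset(),
        crashes: Optional[Dict[str, int]] = None) -> Dict[str, bool]:
    """Simulate one run and report which nodes hold the message at EOT.

    drops holds (sender, receiver, send_time) triples for lost messages;
    crashes maps a node to the time from which it stops acting.
    """
    crashes = crashes or {}
    has = {n: n == "a" for n in NODES}  # a starts with the message
    for t in range(1, EOT):
        arriving = set()
        for src in NODES:
            if not has[src] or crashes.get(src, EOT + 1) <= t:
                continue  # nothing to send, or already crashed
            if protocol == "one_shot" and not (src == "a" and t == 1):
                continue  # Round 1: a single attempt by the broadcaster
            if protocol == "sender_retry" and src != "a":
                continue  # Round 2: only the broadcaster retries
            for dst in NODES:
                if dst != src and (src, dst, t) not in drops:
                    arriving.add(dst)
        for dst in arriving:  # delivery takes one step
            if crashes.get(dst, EOT + 1) > t + 1:
                has[dst] = True
    return has


def violated(has: Dict[str, bool], crashes: Dict[str, int]) -> bool:
    """Some correct node holds the message while another correct node does not."""
    correct = [n for n in NODES if n not in crashes]
    return any(has[n] for n in correct) and not all(has[n] for n in correct)


lost = frozenset({("a", "b", 1)})
print(violated(run("one_shot", lost), {}))                         # True
print(violated(run("sender_retry", lost), {}))                     # False
print(violated(run("sender_retry", lost, {"a": 2}), {"a": 2}))     # True
print(violated(run("symmetric_retry", lost, {"a": 2}), {"a": 2}))  # False
```

Under these assumptions the output mirrors the slides: one lost message defeats the single attempt, a lost message plus a crash of the broadcaster defeats sender retries, and symmetric retry survives that same pair of faults because c relays the message to b.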
  • 50. Let’s reflect Intuition: Fault-tolerance is redundancy in space and time. Strategy: Reason backwards from outcomes using lineage Lineage exposes redundancy of outcome support. Finding bugs: choose failures that “break” all derivations Fixing bugs: add additional derivations
  • 51. Automating the role of the adversary 1.  Break a proof by dropping any contributing message. (AB1 ∨ BC2)
  • 52. Automating the role of the adversary 1. Break a proof by dropping any contributing message. 2. Find a set of failures that breaks all proofs of a good outcome. Disjunction: (AB1 ∨ BC2). Conjunction of disjunctions (AKA CNF): (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)
  • 53. Automating the role of the adversary 1. Break a proof by dropping any contributing message. 2. Find a set of failures that breaks all proofs of a good outcome. Disjunction: (AB1 ∨ BC2). Conjunction of disjunctions (AKA CNF): (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)
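The CNF formula on these slides can be solved by hand, but a short sketch shows the mechanics. Each clause lists the messages one proof depends on, and a fault set breaks every proof when it contains at least one message from every clause. The function name `smallest_fault_sets` is hypothetical, and exhaustive search stands in here for whatever solver a real tool would use.

```python
from itertools import combinations
from typing import List, Set


def smallest_fault_sets(proofs: List[Set[str]]) -> List[Set[str]]:
    """Return every minimum-size set of message drops that hits all proofs."""
    messages = sorted(set().union(*proofs))
    for size in range(1, len(messages) + 1):
        hits = [set(combo) for combo in combinations(messages, size)
                if all(set(combo) & proof for proof in proofs)]
        if hits:
            return hits  # stop at the first size that works
    return []


# (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2), one clause per proof.
proofs = [{"AB1", "BC2"}, {"AC1"}, {"AC2"}]
for faults in smallest_fault_sets(proofs):
    print(sorted(faults))
# ['AB1', 'AC1', 'AC2']
# ['AC1', 'AC2', 'BC2']
```

Passing four single-message proofs, `[{"AB1"}, {"AB2"}, {"AB3"}, {"AB4"}]`, returns one set containing all four messages, which matches the AB1 ∧ AB2 ∧ AB3 ∧ AB4 condition from the sender-retry slides.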
  • 54. By injecting only “interesting” faults… Molly finds bugs quickly
  • 55. By injecting only “interesting” faults… Molly finds bugs quickly
  • 56. By injecting only “interesting” faults… Molly provides guarantees that outcomes are fault-tolerant
        Program        Bound   Combinations    Executions
        redun-deliv    11      8.07 × 10^18    11
        ack-deliv      8       3.08 × 10^13    673
        paxos-synod    7       4.81 × 10^11    173
        bully-leader   10      1.26 × 10^17    2
        flux           22      6.20 × 10^76    187
  • 57. Molly, the LDFI prototype Molly finds fault-tolerance violations quickly or guarantees that none exist. Molly uses data lineage to reason about redundancy of support (or lack thereof) for system outcomes.
  • 58.
  • 59. Case study: commit protocols [Three space-time diagrams. 2-Phase commit: agents and a Coordinator exchange messages labeled p and v; one process is CRASHED. Collaborative termination: agents and a Coordinator exchange prepare, vote, and decision_req messages; one process is CRASHED. 3-Phase commit: Process a, Process b, Process C, and Process d exchange cancommit, vote_msg, precommit, ack, commit, and abort messages; two abort messages are LOST and one process is CRASHED]
  • 60. 3PC in an asynchronous network [Same 3-Phase commit diagram as slide 59] Annotations: Brief network partition; Agent crash; Agents learn commit decision; d is dead, coordinator decides to abort; Agents A & B decide to commit