Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Delegation and Replication in Reactive Systems

Resiliency,
Failures vs Errors,
Isolation, Delegation
and Replication
in Reactive Systems
Jonas Bonér
CTO TypEsafe
@jboner

“But it ain’t how hard you’re hit;
it’s about how hard you can get
hit, and keep moving forward.
How much you can take, and
keep moving forward. That’s
how winning is done.”
- Rocky Balboa

“But it ain’t how hard you’re hit;
it’s about how hard you can get
hit, and keep moving forward.
How much you can take, and
keep moving forward. That’s
how winning is done.”
- Rocky Balboa
This is Fault Tolerance

Resilience
is Beyond
Fault Tolerance

Resilience
“The ability of a substance or
object to spring back into shape.
The capacity to recover quickly
from difficulties.”
-Merriam Webster

Without Resilience
Nothing Else Matters

“We can model and understand in isolation.  
But, when released into competitive nominally
regulated societies, their connections proliferate,  
their interactions and interdependencies multiply,  
their complexities mushroom.  
And we are caught short.”
- Sidney Dekker
Drift into Failure - Sidney Dekker

We Need to Study
Resilience in
Complex Systems

“Counterintuitive. That’s [Jay] Forrester’s word
to describe complex systems. Leverage points
are not intuitive. Or if they are, we intuitively
use them backward, systematically worsening
whatever problems we are trying to solve.”
- Donella Meadows
Leverage Points: Places to Intervene in a System - Donella Meadows

‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen
Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Operating at the Edge of Failure

Economic
Failure
Boundary

Economic
Failure
Boundary
Unacceptable
Workload
Boundary

Economic
Failure
Boundary
Unacceptable
Workload
Boundary
Accident
Boundary

Economic
Failure
Boundary
Unacceptable
Workload
Boundary
Operating Point
Accident
Boundary

Economic
Failure
Boundary
Unacceptable
Workload
Boundary
FAILURE
Accident
Boundary

Economic
Failure
Boundary
Unacceptable
Workload
Boundary
Accident
Boundary

Economic
Failure
Boundary
Unacceptable
Workload
Boundary
Accident
Boundary
Management Pressure
Towards Economic Efficiency

Economic
Failure
Boundary
Unacceptable
Workload
Boundary
Accident
Boundary
Management Pressure
Gradient Towards
Least Effort

Economic
Failure
Boundary
Unacceptable
Workload
Boundary
Accident
Boundary
Management Pressure
Gradient Towards
Least Effort
Counter Gradient
For More Resilience

Economic
Failure
Boundary
Unacceptable
Workload
Boundary
Accident
Boundary
Error Margin
Marginal
Boundary

Accident
Boundary
Marginal
Boundary

Marginal
Boundary
?

Accident
Boundary
Marginal
Boundary

Simple Critical Infrastructure Map
Understanding vital services, and how they keep you safe
6ways to die
3sets of essential services
7layers of PROTECTION
Dealing in Security - Mike Bennet, Vinay Gupta

7 Principles for Building
Resilience in Social Systems
1. Maintain diversity & Redundancy
2. Manage connectivity
3. Manage slow variables & feedback
4. Foster complex adaptive systems thinking
5. Encourage learning
6. Broaden participation
7. Promote polycentric governance
Principles for Building Resilience: Sustaining Ecosystem Services in Social-Ecological Systems - Reinette Biggs et. al.

Resilience in
Biological Systems

MeerkatsPuppies! Now that I’ve got your attention, complexity theory - Nicolas Perony, TED talk

What We Can Learn
From Biological Systems
1. Feature Diversity and redundancy
2. Inter-Connected network structure
3. Wide distribution across all scales
4. Capacity to self-adapt & self-organize
Toward Resilient Architectures 1: Biology Lessons - Michael Mehaffy, Nikos A. Salingaros

“Animals show extraordinary social complexity,
and this allows them to adapt and  
respond to changes in their environment.
In three words, in the animal kingdom,
simplicity leads to complexity  
which leads to resilience.”
- Nicolas Perony
Puppies! Now that I’ve got your attention, complexity theory - Nicolas Perony, TED talk

Resilience in
Computer Systems

“Complex systems run in degraded mode.”
“Complex systems run as broken systems.”
- richard Cook
How Complex Systems Fail - Richard Cook

Resilience
is by
Design
Photo courtesy of FEMA/Joselyne Augustino

“Post-accident attribution to a  
‘root cause’ is fundamentally wrong:  
Because overt failure requires multiple faults,
there is no isolated ‘cause’ of an accident.”
- richard Cook
How Complex Systems Fail - Richard Cook

Crash Only
Software
Crash-Only Software - George Candea, Armando Fox
Stop = Crash Safely
Start = Recover Fast

“To make a system of interconnected components
crash-only, it must be designed so that components
can tolerate the crashes and temporary unavailability
of their peers. This means we require: [1] strong
modularity with relatively impermeable component
boundaries, [2] timeout-based communication and
lease-based resource allocation, and [3] self-
describing requests that carry a time-to-live and
information on whether they are idempotent.”
- George Candea, Armando Fox
Crash-Only Software - George Candea, Armando Fox

Recursive Restartability
Turning the Crash-Only Sledgehammer into a Scalpel
Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel - George Candea, Armando Fox

Services need to accept
NO for an answer

"Software components should be designed such
that they can deny service for any request or call.
Then, if an underlying component can say No,
apps must be designed to take No for an answer
and decide how to proceed: give up, wait and
retry, reduce fidelity, etc.”
- George Candea, Armando Fox
Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel - George Candea, Armando Fox
Learn to take
NO for an answer

“The explosive growth of software has
added greatly to systems’ interactive
complexity. With software, the possible
states that a system can end up in become
mind-boggling.”
- Sidney Dekker

Classification
of State
• Static Data
• Scratch Data
• Dynamic Data
• Recomputable
• not recomputable

Classification
of State
• Static Data
• Scratch Data
• Dynamic Data
• Recomputable
• not recomputable
Critical

We Need a
Way Out of the
State Tar Pit
Out of the Tar Pit - Ben Moseley , Peter Marks

Essential
State
We Need a
Way Out of the
State Tar Pit

Essential
State
We Need a
Way Out of the
State Tar Pit
Essential
Logic

Essential
State
We Need a
Way Out of the
State Tar Pit
Essential
Logic
Accidental
State and
Control

Traditional
State Management
Object
Critical state
that needs protection
Client
Thread boundary

Traditional
State Management
Object
Critical state
Client
Thread boundary
Synchronous dispatch Thread boundary

Traditional
State Management
Object
Critical state
Client
Thread boundary
?

Traditional
State Management
Object
Critical state
Client
Thread boundary
?
Utterly broken

Requirements for a
Sane Failure Mode
1. Contained
2. Reified—as messages
3. Signalled—Asynchronously
4. Observed—by 1-N
5. Managed
Failures need to be

Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging {
def receive = {
case Greeting(who) => log.info(s”Hello ${who}")
}
}

Akka Actors
def receive = {
}
}
Define the message(s) the Actor
should be able to respond to

Akka Actors
def receive = {
}
}
Define the Actor class

Akka Actors
def receive = {
}
}
Define the Actor class
Define the Actor’s behavior

Akka Actors
def receive = {
}
}
val system = ActorSystem("MySystem")
val greeter = system.actorOf(Props[GreetingActor], "greeter")

Akka Actors
def receive = {
}
}
Create an Actor system

Akka Actors
def receive = {
}
}
Actor configuration

Akka Actors
def receive = {
}
}
Give it a name
Actor configuration

Akka Actors
def receive = {
}
}
Give it a nameCreate the Actor
Actor configuration

Akka Actors
def receive = {
}
}
Give it a nameCreate the ActorYou get an ActorRef back
Actor configuration

Akka Actors
def receive = {
}
}
greeter ! Greeting("Charlie Parker")

Akka Actors
def receive = {
}
}
greeter ! Greeting("Charlie Parker")
Send the message asynchronously

Think Vending Machine
Coffee
Machine
Programmer

Coffee
Machine
Programmer
Inserts coins

Coffee
Machine
Programmer
Inserts coins
Add more coins

Coffee
Machine
Programmer
Inserts coins
Gets coffee
Add more coins

Programmer
Coffee
Machine

Programmer
Inserts coins
Coffee
Machine

Programmer
Inserts coins
Out of coffee beans error Coffee
Machine

Programmer
Inserts coins
Out of coffee beans error
WRONG
Coffee
Machine

Programmer
Inserts coins
Out of
coffee beans
failure
Coffee
Machine

Programmer
Service
Guy
Inserts coins
Out of
coffee beans
failure
Coffee
Machine

Programmer
Service
Guy
Inserts coins
Out of
coffee beans
failure
Adds
more
beans
Coffee
Machine

Programmer
Service
Guy
Inserts coins
Gets coffee
Out of
coffee beans
failure
Adds
more
beans
Coffee
Machine

ServiceClient

ServiceClient
Request

ServiceClient
Request
Response

ServiceClient
Request
Response
Validation Error

ServiceClient
Request
Response
Validation Error
Application
Failure

ServiceClient
Supervisor
Request
Response
Validation Error
Application
Failure

ServiceClient
Supervisor
Request
Response
Validation Error
Application
Failure
Manages
Failure

“Accidents come from relationships
not broken parts.”
- Sidney dekker

Error Kernel
Pattern
Onion-layered state & Failure management
Making reliable distributed systems in the presence of software errors - Joe Armstrong
On Erlang, State and Crashes - Jesper Louis Andersen

Onion Layered
State Management
Object
Critical state
Client
Thread boundary

Onion Layered
State Management
Error Kernel
Object
Critical state
Client
Thread boundary

Onion Layered
State Management
Error Kernel
Object
Critical state
Client
Supervision
Thread boundary

Onion Layered
State Management
Error Kernel
Object
Critical state
Client
Supervision
Supervision
Thread boundary

Supervision in Akka
Every actor has a default supervisor strategy.
Which can, and often should, be overridden.
class Supervisor extends Actor {
override val supervisorStrategy =
OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1 minute) {
case _: ArithmeticException => Resume
case _: NullPointerException => Restart
case _: Exception => Escalate
}
val worker = context.actorOf(Props[Worker], name = "worker")
def receive = {
case number: Int => worker.forward(number)
}
}

Supervision in Akka
}
def receive = {
}
}
Parent actor

Supervision in Akka
}
def receive = {
}
}
Parent actor
All its children have their life-cycle managed
through this declarative supervision strategy

Supervision in Akka
}
def receive = {
}
}
Create a supervised child actor
Parent actor
All its children have their life-cycle managed
through this declarative supervision strategy

Monitor through
Death Watch
class Watcher extends Actor {
val child = context.actorOf(Props.empty, "child")
context.watch(child)
def receive = {
case Terminated(`child`) => … // handle child termination
}
}

Monitor through
Death Watch
def receive = {
}
}
Create a child actor

Monitor through
Death Watch
def receive = {
}
}
Watch it

Monitor through
Death Watch
def receive = {
}
}
Watch it
Receive termination signal

Maintain Diversity
and Redundancy

Akka Routing
akka {
actor {
deployment {
/service/router {
router = round-robin-pool
resizer {
lower-bound = 12
upper-bound = 15
}
}
}
}
}

Akka Cluster
akka {
actor {
deployment {
/service/router {
router = round-robin-pool
resizer {
lower-bound = 12
upper-bound = 15
}
}
}
provider = "akka.cluster.ClusterActorRefProvider"
}

cluster {
seed-nodes = [
“akka.tcp://ClusterSystem@127.0.0.1:2551",
“akka.tcp://ClusterSystem@127.0.0.1:2552"
]
}
}

We are living in the
Looming
Shadowof
Impossibility
Theorems

Looming
Shadowof
Impossibility
Theorems
CAP: Consistency is impossible

Looming
Shadowof
Impossibility
Theorems
CAP: Consistency is impossible
FLP: Consensus is impossible

Strong
Consistency
Is the wrong default

We need
Decoupling IN
Time and Space

Resilient Protocols
Depend on
Asynchronous Communication
Eventual Consistency

Resilient Protocols
• are tolerant to
• Message loss
• Message reordering
• Message duplication
Depend on

Resilient Protocols
• are tolerant to
• Message loss
• Message reordering
• Message duplication
• Embrace ACID 2.0
• Associative
• Commutative
• Idempotent
• Distributed
Depend on

Akka Distributed Data
class DataBot extends Actor with ActorLogging {
val replicator = DistributedData(context.system).replicator
implicit val node = Cluster(context.system)
val tickTask = context.system.scheduler.schedule(
5.seconds, 5.seconds, self, Add)
val DataKey = ORSetKey[String]("key")
replicator ! Subscribe(DataKey, self)

Akka Distributed Data
class DataBot extends Actor with ActorLogging {
val replicator = DistributedData(context.system).replicator
implicit val node = Cluster(context.system)
val tickTask = context.system.scheduler.schedule(
5.seconds, 5.seconds, self, Add)
val DataKey = ORSetKey[String]("key")
replicator ! Subscribe(DataKey, self)
def receive = {
case Tick =>
val s = ThreadLocalRandom.current().nextInt(97, 123).toChar.toString
if (ThreadLocalRandom.current().nextBoolean())
replicator ! Update(DataKey, ORSet.empty[String], WriteLocal)(_ + s)
else
replicator ! Update(DataKey, ORSet.empty[String], WriteLocal)(_ - s)
case c @ Changed(DataKey) =>
val data = c.get(DataKey)
case _: UpdateResponse[_] => // ignore
}
}

“An escalator can never break: it can only
become stairs. You should never see an
Escalator Temporarily Out Of Order sign, just
Escalator Temporarily Stairs. Sorry for the
convenience.”
- Mitch Hedberg

Always Rely on
Asynchronous
Communication

Always Rely on
Asynchronous
Communication
…And If not pOssiblE
Always use
Timeouts

Circuit Breaker in Akka
val breaker =
new CircuitBreaker(
context.system.scheduler,
maxFailures = 5,
callTimeout = 10.seconds,
resetTimeout = 1.minute
).onOpen(…)
.onHalfOpen(…)
val result = breaker.withCircuitBreaker(Future(dangerousCall()))

Little’s Law
W: Response Time
L: Queue Length

Little’s Law
Queue Length = Arrival Rate * Response Time
W: Response Time
L: Queue Length

Little’s Law
Response Time = Queue Length / Arrival Rate
W: Response Time
L: Queue Length

Flow Control
Always Apply Back Pressure

Akka Streams
in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out
bcast ~> f4 ~> merge

Akka Streams
val g = FlowGraph.closed() {
implicit builder: FlowGraph.Builder =>
import FlowGraph.Implicits._
val in = Source(1 to 10)
val out = Sink.ignore
val bcast = builder.add(Broadcast[Int](2))
val merge = builder.add(Merge[Int](2))
val f1, f2, f3, f4 = Flow[Int].map(_ + 10)
}

Akka Streams
}
Set up the context

Akka Streams
}
Set up the context
Create the
Source and Sink

Akka Streams
}
Set up the context
Create the
Source and Sink
Create the fan out
and fan in stages

Akka Streams
}
Set up the context
Create the
Source and Sink
Create the fan out
and fan in stages
Create a set of
transformations

Akka Streams
}
Set up the context
Create the
Source and Sink
Create the fan out
and fan in stages
Create a set of
transformations
Define the graph
processing blueprint

What can we learn from Arnold?

What can we learn from Arnold?
Blow things up

Pull the Plug
…and see what happens

ReferencesAntifragile: Things That Gain from Disorder - http://www.amazon.com/Antifragile-Things-that-Gain-Disorder-ebook/dp/B009K6DKTS

Drift into Failure - http://www.amazon.com/Drift-into-Failure-Components-Understanding-ebook/dp/B009KOKXKY

How Complex Systems Fail - http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf

Leverage Points: Places to Intervene in a System - http://www.donellameadows.org/archives/leverage-points-places-to-intervene-in-a-system/
Going Solid: A Model of System Dynamics and Consequences for Patient Safety - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1743994/

Resilience in Complex Adaptive Systems: Operating at the Edge of Failure - https://www.youtube.com/watch?v=PGLYEDpNu60

Dealing in Security - http://resiliencemaps.org/files/Dealing_in_Security.July2010.en.pdf

Principles for Building Resilience: Sustaining Ecosystem Services in Social-Ecological Systems - http://www.amazon.com/Principles-
Building-Resilience-Sustaining-Social-Ecological/dp/110708265X

Puppies! Now that I’ve got your attention, Complexity Theory - https://www.ted.com/talks/
nicolas_perony_puppies_now_that_i_ve_got_your_attention_complexity_theory

How Bacteria Becomes Resistant - http://www.abc.net.au/science/slab/antibiotics/resistance.htm

Towards Resilient Architectures: Biology Lessons - http://www.metropolismag.com/Point-of-View/March-2013/Toward-Resilient-
Architectures-1-Biology-Lessons/

Crash-Only Software - https://www.usenix.org/legacy/events/hotos03/tech/full_papers/candea/candea.pdf

Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel - http://roc.cs.berkeley.edu/papers/recursive_restartability.pdf

Out of the Tar Pit - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.93.8928

Bulkhead Pattern - http://skife.org/architecture/fault-tolerance/2009/12/31/bulkheads.html

Making Reliable Distributed Systems in the Presence of Software Errors - http://www.erlang.org/download/armstrong_thesis_2003.pdf

On Erlang, State and Crashes - http://jlouisramblings.blogspot.be/2010/11/on-erlang-state-and-crashes.html

Akka Supervision - http://doc.akka.io/docs/akka/snapshot/general/supervision.html

Release It!: Design and Deploy Production-Ready Software - https://pragprog.com/book/mnee/release-it

Hystrix - https://github.com/Netflix/Hystrix

Akka Circuit Breaker - http://doc.akka.io/docs/akka/snapshot/common/circuitbreaker.html

Reactive Streams - http://reactive-streams.org

Akka Streams - http://doc.akka.io/docs/akka-stream-and-http-experimental/1.0/scala/stream-introduction.html

RxJava - https://github.com/ReactiveX/RxJava

Feedback Control for Computer Systems - http://www.amazon.com/Feedback-Control-Computer-Systems-Philipp/dp/1449361692

Simian Army - https://github.com/Netflix/SimianArmy

Gatling - http://gatling.io

Akka MultiNode Testing - http://doc.akka.io/docs/akka/snapshot/dev/multi-node-testing.html

EXPERT TRAINING
Delivered On-site For Spark, Scala, Akka And Play
Help is just a click away. Get in touch with
Typesafe about our training courses.
• Intro Workshop to Apache Spark
• Fast Track & Advanced Scala
• Fast Track to Akka with Java or Scala
• Fast Track to Play with Java or Scala
• Advanced Akka with Java or Scala
Ask us about local trainings available by 24
Typesafe partners in 14 countries around
the world.
CONTACT US Learn more about on-site training

Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Delegation and Replication in Reactive Systems

Recommandé

Recommandé

Contenu connexe

Similaire à Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Delegation and Replication in Reactive Systems

Similaire à Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Delegation and Replication in Reactive Systems (6)

Plus de Legacy Typesafe (now Lightbend)

Plus de Legacy Typesafe (now Lightbend) (16)

Dernier

Dernier (20)

Reactive Revealed Part 3 of 3: Resiliency, Failures vs Errors, Isolation, Delegation and Replication in Reactive Systems