SlideShare une entreprise Scribd logo
1  sur  87
Télécharger pour lire hors ligne
The 7 quests of resilient software design
A guide for the adventurous software engineer

Uwe Friedrichsen – codecentric AG – 2012-2017
Uwe Friedrichsen

IT traveller.
Connecting the dots.
Attracted by uncharted territory.
CTO at codecentric.

https://www.slideshare.net/ufried
https://medium.com/@ufried
 @ufried
You want to do resilient software design ...
... and you expect everything to be like this
But somehow it feels more like that ...
... or even that
What the **** went wrong?
The road to resilience is a twisted one
“7 quests you must complete!”
Quest #1
Understand the business case
“How much money will we earn with it?”
“Does it improve our velocity?”
Resilience is not about making money
Resilience is about not losing money
Lack of resilient software design
Reduced system availability
Users cannot do what they intend to do
Less transactions per time period
Immediate lost revenue
Users get annoyed
Churn rate increases
Delayed lost revenue
Due to non-determinism
of distributed systems
This is at most your
resilience budget
Quest #2
Embrace distributed systems
Everything fails, all the time.
-- Werner Vogels
If X then Y
What we learned in our IT education
If X then maybe Y



This changes
everything!
What we need for distributed systems












We are good at this (due to how our brains work)
Inside process thinking
Reasoning about
deterministic behavior
Designing a complicated system












We are poor at that (due to how our brains work)
Reasoning about
non-deterministic behavior
Across process thinking
Designing a complex system
Yet, we usually use deterministic thinking
to reason about distributed systems
Failures in distributed systems ...

•  Crash failure
•  Omission failure
•  Timing failure
•  Response failure
•  Byzantine failure
... turn seemingly simple issues into very hard ones
Time & Ordering

Leslie Lamport

"Time, clocks, and the
ordering of events in
distributed systems"
Consensus

Leslie Lamport

”The part-time
parliament”
(Paxos)
CAP

Eric A. Brewer

"Towards robust
distributed systems"
Faulty processes

Leslie Lamport,
Robert Shostak,
Marshall Pease

"The Byzantine
generals problem"
Consensus

Michael J. Fischer,
Nancy A. Lynch,
Michael S. Paterson

"Impossibility of
distributed consensus
with one faulty
process” (FLP)
Impossibility

Nancy A. Lynch

”A hundred
impossibility proofs
for distributed
computing"
Embrace distributed systems

•  Distributed systems introduce non-determinism regarding
•  Execution completeness
•  Message ordering
•  Communication timing
•  You will be affected by this at the application level
•  Don’t expect your infrastructure to hide all effects from you
•  Better have a plan to detect and recover from inconsistencies
But do I really need to care?
(The system, I am working on, is not a distributed system)
(Almost) every system is a distributed system
-- Chas Emerick


http://www.infoq.com/presentations/problems-distributed-systems
… and it’s getting “worse”


•  Cloud-based systems
•  Microservices
•  Zero Downtime
•  Mobile & IoT
•  Social Web
Quest #3
Avoid the “100% available” trap
The “100% available” trap, version #1

You: “How should the application respond if a technical failure occurs?”

Business owner: “This must not happen! It is your responsibility to make

 
sure that this will not happen.”
The “100% available” trap, version #2

You: “How do you handle the situation if the service you call does not

 
respond (or does not respond timely)?”

Developer 1: “We did not implement any extra measures. The other service

 
is so important and thus needs to be so highly available that it is

 
not worth any extra effort.”

Developer 2: “Actually, if that service should be down, we would not be able

 
to do anything useful anyway. Thus, it just needs to be up.”
The question is not, if a failure will happen
The question is, when a failure will happen
A short note about availability

Assume a service availability of 99,5% (incl. planned downtimes)

•  10 services involved in a request à 95,1% probability of success
•  50 services involved in a request à 77,8% probability of success
Quest #4
Establish the ops-dev feedback loop
The big wall between Dev and Ops
In a distributed environment, you cannot solve
availability issues on an infrastructure level only
Dev
 Ops
“I implemented something to
improve production availability”
“Here are the figures
how it worked”
Continuous improvement cycle
of resilient software design
Dev is where you
implement your
resilience measures
Build
Measure
Learn
Ops is where your
resilience measures
take effect
Dev
 Ops
“I implemented something to
improve production availability”
“Here are the figures
how it worked”
Continuous improvement cycle
of resilient software design
Dev is where you
implement your
resilience measures
Build
Measure
Learn
Ops is where your
resilience measures
take effect
All developer activities towards
improving robustness are basically
“shooting at the dark” which is neither
effective not sustainable
Having a wall between Dev and Ops
breaks the cycle required to implement
effective robustness measures
For effective resilient software design
you need a working ops-dev feedback loop
Establishing the feedback loop


•  Adopt DevOps
•  Adopt Site Reliability Engineering (SRE)
•  Or do it your own way if you know a better way ...
•  ... but make sure you establish the required feedback loops!
Quest #5
Master functional design
Without proper functional design
nothing else matters
Isolation

•  System must not fail as a whole
•  Split system in parts and isolate parts against each other
•  Avoid cascading failures
•  Foundation of resilient software design
Bulkhead

•  Bulkheads implement the “parts” that need to be isolated
•  Core isolation pattern (a.k.a. “failure units” or “units of mitigation”)
•  Diverse implementation choices available, e.g., (micro)services, actors, SCS, ...
•  Shaping good bulkheads is a pure functional design issue (and extremely hard)
Hmm, sound easy. Why should that be hard?
Service A
 Service B
Request
Due to functional design, Service A
always needs backing from Service B
to be able to answer a client request,

i.e. the isolation is broken by design
How do we avoid this …
Service
Request
Due to functional design we need
to call a lot of services to be able
to answer a client request,

i.e. availability is broken by design
... and this ...
Service
Service
Service
 Service
Service
Service
Service
Service
Service
Service
Service
Service
Mothership Service

(a.k.a. Monolith)
Request
By trying to avoid the aforementioned
issues we ended up with cramming all
required functionality in one big service

i.e. the isolation is broken by design
... without ending up with this?
Let us apply our well-known best practices

•  Divide & conquer a.k.a. functional decomposition
•  DRY (Don’t Repeat Yourself)
•  Design for reusability
•  Layered architecture
•  …
Unfortunately ...
Service A
 Service B
Request
Due to functional design, Service A
always needs backing from Service B
to be able to answer a client request,

i.e. the isolation is broken by design
... this usually leads to this …
Service
Request
Due to functional design we need
to call a lot of services to be able
to answer a client request,

i.e. availability is broken by design
... and this ...
Service
Service
Service
 Service
Service
Service
Service
Service
Service
Service
Service
Service
Mothership Service

(a.k.a. Monolith)
Request
By trying to avoid the aforementioned
issues we ended up with cramming all
required functionality in one big service

i.e. the isolation is broken by design
... and in the end also often to this.
Welcome to distributed hell!
Caches to the rescue!
Service A
 Service B
Request
Due to functional design, Service A
always needs backing from Service B
to be able to answer a client request,

i.e. the isolation is broken by design
CacheofB
Break tight service coupling
by caching data/responses
of downstream service
Caches to the rescue?
Do you really think
that copying stale data all over your system
is a suitable measure
to fix an inherently broken design? *

* Side note: Caches are a very important and powerful measure in many places. But they are not suitable as a cheap fix for a broken functional design
We have to re-learn design
for distributed system
No silver bullet
Yet, a few guiding thoughts ...
Foundations of design

•  “High cohesion, low coupling” & “separation of concerns”
•  “Crucial across process boundaries
•  Still poorly understood issue
•  Start with
•  Understanding organizational boundaries
•  Understanding use cases and flows
•  Identifying functional domains (à DDD)
•  Finding areas that change independently
•  Do not start with a data model!
Short activation paths

•  Long activation paths affect availability
•  Increase likelihood of failures
•  Minimize remote calls per request
•  Need to balance opposing forces
•  Avoid monolith à clear separation of concerns
•  Minimize requests à cluster functionality & data
•  Caches can sometimes help, but stale data as trade-off
Be (extremely) wary of reusability

•  Reusability increases coupling
•  Reusability usually leads to bad service design
•  Reusability compromises availability
•  Reusability rarely pays
•  Do not strive for reusable services
•  Strive for replaceable services instead
•  Try to tackle reusability issues with libraries
Quest #6
Know your toolbox
Core
Detect
 Treat
Prevent
Recover
Mitigate
 Complement
Supporting
patterns
Redundancy
Stateless
Idempotency
Escalation
Zero downtime
deployment
Location
transparency
Relaxed
temporal
constraints
Fallback
Shed load
Share load
Marked data
 Queue for
resources
Bounded queue
Finish work in
progress
Fresh work
before stale
Deferrable work
Communication
paradigm
Isolation
Bulkhead
System level
Monitor
Watchdog
Heartbeat
Either level
Voting
Synthetic
transaction
Leaky
bucket
Routine
checks
Health
check
Fail fast
Let sleeping dogs lie
Small releases
Hot deployments
Routine maintenance
Backup request
Anti-fragility
Diversity
 Jitter
Error
injection
Spread the news
Anti-entropy
Backpressure
Retry
Limit retries
Rollback
Roll-forward
Checkpoint
 Safe point
Failover
Read repair
Error
handler
Reset
Restart
Reconnect
Fail silently
Default value
Node level
Timeout
Circuit breaker
Complete
parameter
checking
Checksum
Statically
Dynamically
Confinement
Acknowledgement
Using resilience patterns

•  Patterns are options, not obligations
•  Don’t pick too many patterns
•  Each pattern increases complexity
•  Complexity is the enemy of robustness
•  Each pattern costs money in dev & ops
•  You only have a limited resilience budget
•  Look for complementary patterns
How other people did it
Core
Detect
 Treat
Prevent
Recover
Mitigate
 Complement
Supporting
patterns
Escalation
Communication
paradigm
Isolation
System level
Monitor
Heartbeat
Either level
Hot deployments
Restart
 Let it crash!
Node level
Actor
Messaging
Erlang (Akka)

Core patterns
Core
Detect
 Treat
Prevent
Recover
Mitigate
 Complement
Supporting
patterns
Fallback
Share load
Bounded
queue
Communication
paradigm
Isolation
System level
Monitor
Either level
Error
injection
Retry
Limit retries
Node level
Circuit breaker
Timeout
Zero downtime
deployment
Canary releases
Redundancy
Several variants
(Micro)service
Request/
response
Netflix

Core patterns
Quest #7
Preserve the collective memory
We face a new generation of developers
every 5 years
We loose our collective memory
every 5 years *


* Mean time until a topic discussion in the community starts over form scratch
Time working in IT
Growth of
knowledge
Depth of
insights
What do we do to compensate this effect?
We look for the new & shiny stuff ...
... as anything not new must be useless crap!
We need to rediscover our insights
every 5 years
In IT, we suffer from
continuous collective amnesia
and we are even proud of it
How can we become better?
Wrap-up
The 7 quests at a glance
Wrap-up

•  The road to resilient software design is a twisted one!
•  Most challenges are only indirectly related to RSD
•  Most challenges are not coding related
•  Mastering functional design is extremely hard ...
•  ... while learning the patterns is relatively easy
•  How do we preserve our collective memory?
Uwe Friedrichsen

IT traveller.
Connecting the dots.
Attracted by uncharted territory.
CTO at codecentric.

https://www.slideshare.net/ufried
https://medium.com/@ufried
 @ufried

Contenu connexe

Tendances

Cloud native application 입문
Cloud native application 입문Cloud native application 입문
Cloud native application 입문Seong-Bok Lee
 
Stability Patterns for Microservices
Stability Patterns for MicroservicesStability Patterns for Microservices
Stability Patterns for Microservicespflueras
 
MicroCPH - Managing data consistency in a microservice architecture using Sagas
MicroCPH - Managing data consistency in a microservice architecture using SagasMicroCPH - Managing data consistency in a microservice architecture using Sagas
MicroCPH - Managing data consistency in a microservice architecture using SagasChris Richardson
 
Deploying Confluent Platform for Production
Deploying Confluent Platform for ProductionDeploying Confluent Platform for Production
Deploying Confluent Platform for Productionconfluent
 
Airbnb가 직접 들려주는 Kubernetes 환경 구축 이야기 - Melanie Cebula 소프트웨어 엔지니어, Airbnb :: A...
Airbnb가 직접 들려주는 Kubernetes 환경 구축 이야기 - Melanie Cebula 소프트웨어 엔지니어, Airbnb :: A...Airbnb가 직접 들려주는 Kubernetes 환경 구축 이야기 - Melanie Cebula 소프트웨어 엔지니어, Airbnb :: A...
Airbnb가 직접 들려주는 Kubernetes 환경 구축 이야기 - Melanie Cebula 소프트웨어 엔지니어, Airbnb :: A...Amazon Web Services Korea
 
Presentation citrix cloud platform for infrastructure as a service
Presentation   citrix cloud platform for infrastructure as a servicePresentation   citrix cloud platform for infrastructure as a service
Presentation citrix cloud platform for infrastructure as a servicexKinAnx
 
Reactive programming with examples
Reactive programming with examplesReactive programming with examples
Reactive programming with examplesPeter Lawrey
 
Redis vs Infinispan | DevNation Tech Talk
Redis vs Infinispan | DevNation Tech TalkRedis vs Infinispan | DevNation Tech Talk
Redis vs Infinispan | DevNation Tech TalkRed Hat Developers
 
Messaging in CQRS with MassTransit
Messaging in CQRS with MassTransitMessaging in CQRS with MassTransit
Messaging in CQRS with MassTransitGeorge Tourkas
 
Microservices Part 3 Service Mesh and Kafka
Microservices Part 3 Service Mesh and KafkaMicroservices Part 3 Service Mesh and Kafka
Microservices Part 3 Service Mesh and KafkaAraf Karsh Hamid
 
Microservices Architecture, Monolith Migration Patterns
Microservices Architecture, Monolith Migration PatternsMicroservices Architecture, Monolith Migration Patterns
Microservices Architecture, Monolith Migration PatternsAraf Karsh Hamid
 
Open infradays 2019_msa_k8s
Open infradays 2019_msa_k8sOpen infradays 2019_msa_k8s
Open infradays 2019_msa_k8sHyoungjun Kim
 
MySQL Database Architectures - 2020-10
MySQL Database Architectures -  2020-10MySQL Database Architectures -  2020-10
MySQL Database Architectures - 2020-10Kenny Gryp
 
Microservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SREMicroservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SREAraf Karsh Hamid
 
Dockers and containers basics
Dockers and containers basicsDockers and containers basics
Dockers and containers basicsSourabh Saxena
 
Domain Driven Design - Strategic Patterns and Microservices
Domain Driven Design - Strategic Patterns and MicroservicesDomain Driven Design - Strategic Patterns and Microservices
Domain Driven Design - Strategic Patterns and MicroservicesRadosław Maziarka
 

Tendances (20)

Cloud native application 입문
Cloud native application 입문Cloud native application 입문
Cloud native application 입문
 
Stability Patterns for Microservices
Stability Patterns for MicroservicesStability Patterns for Microservices
Stability Patterns for Microservices
 
MicroCPH - Managing data consistency in a microservice architecture using Sagas
MicroCPH - Managing data consistency in a microservice architecture using SagasMicroCPH - Managing data consistency in a microservice architecture using Sagas
MicroCPH - Managing data consistency in a microservice architecture using Sagas
 
Deploying Confluent Platform for Production
Deploying Confluent Platform for ProductionDeploying Confluent Platform for Production
Deploying Confluent Platform for Production
 
Airbnb가 직접 들려주는 Kubernetes 환경 구축 이야기 - Melanie Cebula 소프트웨어 엔지니어, Airbnb :: A...
Airbnb가 직접 들려주는 Kubernetes 환경 구축 이야기 - Melanie Cebula 소프트웨어 엔지니어, Airbnb :: A...Airbnb가 직접 들려주는 Kubernetes 환경 구축 이야기 - Melanie Cebula 소프트웨어 엔지니어, Airbnb :: A...
Airbnb가 직접 들려주는 Kubernetes 환경 구축 이야기 - Melanie Cebula 소프트웨어 엔지니어, Airbnb :: A...
 
Presentation citrix cloud platform for infrastructure as a service
Presentation   citrix cloud platform for infrastructure as a servicePresentation   citrix cloud platform for infrastructure as a service
Presentation citrix cloud platform for infrastructure as a service
 
Resilience4J
Resilience4J Resilience4J
Resilience4J
 
Reactive programming with examples
Reactive programming with examplesReactive programming with examples
Reactive programming with examples
 
Docker Kubernetes Istio
Docker Kubernetes IstioDocker Kubernetes Istio
Docker Kubernetes Istio
 
Redis vs Infinispan | DevNation Tech Talk
Redis vs Infinispan | DevNation Tech TalkRedis vs Infinispan | DevNation Tech Talk
Redis vs Infinispan | DevNation Tech Talk
 
Messaging in CQRS with MassTransit
Messaging in CQRS with MassTransitMessaging in CQRS with MassTransit
Messaging in CQRS with MassTransit
 
Microservices Part 3 Service Mesh and Kafka
Microservices Part 3 Service Mesh and KafkaMicroservices Part 3 Service Mesh and Kafka
Microservices Part 3 Service Mesh and Kafka
 
Quarkus k8s
Quarkus   k8sQuarkus   k8s
Quarkus k8s
 
Microservices Architecture, Monolith Migration Patterns
Microservices Architecture, Monolith Migration PatternsMicroservices Architecture, Monolith Migration Patterns
Microservices Architecture, Monolith Migration Patterns
 
Open infradays 2019_msa_k8s
Open infradays 2019_msa_k8sOpen infradays 2019_msa_k8s
Open infradays 2019_msa_k8s
 
MySQL Database Architectures - 2020-10
MySQL Database Architectures -  2020-10MySQL Database Architectures -  2020-10
MySQL Database Architectures - 2020-10
 
Microservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SREMicroservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SRE
 
Nutanix basic
Nutanix basicNutanix basic
Nutanix basic
 
Dockers and containers basics
Dockers and containers basicsDockers and containers basics
Dockers and containers basics
 
Domain Driven Design - Strategic Patterns and Microservices
Domain Driven Design - Strategic Patterns and MicroservicesDomain Driven Design - Strategic Patterns and Microservices
Domain Driven Design - Strategic Patterns and Microservices
 

Similaire à The 7 quests of resilient software design

Resilient Functional Service Design
Resilient Functional Service DesignResilient Functional Service Design
Resilient Functional Service DesignUwe Friedrichsen
 
Timeless design in a cloud-native world
Timeless design in a cloud-native worldTimeless design in a cloud-native world
Timeless design in a cloud-native worldUwe Friedrichsen
 
DevOps and Microservice
DevOps and MicroserviceDevOps and Microservice
DevOps and MicroserviceInho Kang
 
Microservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack riskMicroservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack riskUwe Friedrichsen
 
Microsoft Microservices
Microsoft MicroservicesMicrosoft Microservices
Microsoft MicroservicesChase Aucoin
 
Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201
Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201
Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201Amazon Web Services
 
Cloud Native Future
Cloud Native FutureCloud Native Future
Cloud Native FutureJulie Coonce
 
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"Daniel Bryant
 
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve PooleDevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve PooleJAXLondon_Conference
 
AWS Summit Auckland - Smaller is Better - Microservices on AWS
AWS Summit Auckland - Smaller is Better - Microservices on AWSAWS Summit Auckland - Smaller is Better - Microservices on AWS
AWS Summit Auckland - Smaller is Better - Microservices on AWSAmazon Web Services
 
devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!Andrew Shafer
 
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!VMware Tanzu
 
Software Architecture for Agile Development
Software Architecture for Agile DevelopmentSoftware Architecture for Agile Development
Software Architecture for Agile DevelopmentHayim Makabee
 
Microservices - stress-free and without increased heart-attack risk - Uwe Fri...
Microservices - stress-free and without increased heart-attack risk - Uwe Fri...Microservices - stress-free and without increased heart-attack risk - Uwe Fri...
Microservices - stress-free and without increased heart-attack risk - Uwe Fri...distributed matters
 
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig DicksonAWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig DicksonAmazon Web Services Korea
 
Software Architecture and Architectors: useless VS valuable
Software Architecture and Architectors: useless VS valuableSoftware Architecture and Architectors: useless VS valuable
Software Architecture and Architectors: useless VS valuableComsysto Reply GmbH
 
Microservices Gone Wrong!
Microservices Gone Wrong!Microservices Gone Wrong!
Microservices Gone Wrong!Bert Ertman
 
Evolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBayEvolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBayRandy Shoup
 

Similaire à The 7 quests of resilient software design (20)

Resilient Functional Service Design
Resilient Functional Service DesignResilient Functional Service Design
Resilient Functional Service Design
 
Timeless design in a cloud-native world
Timeless design in a cloud-native worldTimeless design in a cloud-native world
Timeless design in a cloud-native world
 
DevOps and Microservice
DevOps and MicroserviceDevOps and Microservice
DevOps and Microservice
 
Microservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack riskMicroservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack risk
 
Microsoft Microservices
Microsoft MicroservicesMicrosoft Microservices
Microsoft Microservices
 
Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201
Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201
Smaller is Better - Exploiting Microservice Architectures on AWS - Technical 201
 
Cloud Native Future
Cloud Native FutureCloud Native Future
Cloud Native Future
 
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
JAXLondon 2015 "DevOps and the Cloud: All Hail the (Developer) King"
 
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve PooleDevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
DevOps and the cloud: all hail the (developer) king - Daniel Bryant, Steve Poole
 
AWS Summit Auckland - Smaller is Better - Microservices on AWS
AWS Summit Auckland - Smaller is Better - Microservices on AWSAWS Summit Auckland - Smaller is Better - Microservices on AWS
AWS Summit Auckland - Smaller is Better - Microservices on AWS
 
devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!devops, microservices, and platforms, oh my!
devops, microservices, and platforms, oh my!
 
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
Cloud Foundry Summit 2015: Devops, microservices and platforms, oh my!
 
Software Architecture for Agile Development
Software Architecture for Agile DevelopmentSoftware Architecture for Agile Development
Software Architecture for Agile Development
 
The Road to SaaS
The Road to SaaSThe Road to SaaS
The Road to SaaS
 
Microservices - stress-free and without increased heart-attack risk - Uwe Fri...
Microservices - stress-free and without increased heart-attack risk - Uwe Fri...Microservices - stress-free and without increased heart-attack risk - Uwe Fri...
Microservices - stress-free and without increased heart-attack risk - Uwe Fri...
 
No silver bullet
No silver bulletNo silver bullet
No silver bullet
 
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig DicksonAWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
AWS Innovate: Smaller IS Better – Exploiting Microservices on AWS, Craig Dickson
 
Software Architecture and Architectors: useless VS valuable
Software Architecture and Architectors: useless VS valuableSoftware Architecture and Architectors: useless VS valuable
Software Architecture and Architectors: useless VS valuable
 
Microservices Gone Wrong!
Microservices Gone Wrong!Microservices Gone Wrong!
Microservices Gone Wrong!
 
Evolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBayEvolving Architecture and Organization - Lessons from Google and eBay
Evolving Architecture and Organization - Lessons from Google and eBay
 

Plus de Uwe Friedrichsen

The hitchhiker's guide for the confused developer
The hitchhiker's guide for the confused developerThe hitchhiker's guide for the confused developer
The hitchhiker's guide for the confused developerUwe Friedrichsen
 
Digitization solutions - A new breed of software
Digitization solutions - A new breed of softwareDigitization solutions - A new breed of software
Digitization solutions - A new breed of softwareUwe Friedrichsen
 
Real-world consistency explained
Real-world consistency explainedReal-world consistency explained
Real-world consistency explainedUwe Friedrichsen
 
Excavating the knowledge of our ancestors
Excavating the knowledge of our ancestorsExcavating the knowledge of our ancestors
Excavating the knowledge of our ancestorsUwe Friedrichsen
 
The truth about "You build it, you run it!"
The truth about "You build it, you run it!"The truth about "You build it, you run it!"
The truth about "You build it, you run it!"Uwe Friedrichsen
 
The promises and perils of microservices
The promises and perils of microservicesThe promises and perils of microservices
The promises and perils of microservicesUwe Friedrichsen
 
DevOps is not enough - Embedding DevOps in a broader context
DevOps is not enough - Embedding DevOps in a broader contextDevOps is not enough - Embedding DevOps in a broader context
DevOps is not enough - Embedding DevOps in a broader contextUwe Friedrichsen
 
Towards complex adaptive architectures
Towards complex adaptive architecturesTowards complex adaptive architectures
Towards complex adaptive architecturesUwe Friedrichsen
 
Conway's law revisited - Architectures for an effective IT
Conway's law revisited - Architectures for an effective ITConway's law revisited - Architectures for an effective IT
Conway's law revisited - Architectures for an effective ITUwe Friedrichsen
 
Modern times - architectures for a Next Generation of IT
Modern times - architectures for a Next Generation of ITModern times - architectures for a Next Generation of IT
Modern times - architectures for a Next Generation of ITUwe Friedrichsen
 
The Next Generation (of) IT
The Next Generation (of) ITThe Next Generation (of) IT
The Next Generation (of) ITUwe Friedrichsen
 
Why resilience - A primer at varying flight altitudes
Why resilience - A primer at varying flight altitudesWhy resilience - A primer at varying flight altitudes
Why resilience - A primer at varying flight altitudesUwe Friedrichsen
 

Plus de Uwe Friedrichsen (20)

Deep learning - a primer
Deep learning - a primerDeep learning - a primer
Deep learning - a primer
 
Life after microservices
Life after microservicesLife after microservices
Life after microservices
 
The hitchhiker's guide for the confused developer
The hitchhiker's guide for the confused developerThe hitchhiker's guide for the confused developer
The hitchhiker's guide for the confused developer
 
Digitization solutions - A new breed of software
Digitization solutions - A new breed of softwareDigitization solutions - A new breed of software
Digitization solutions - A new breed of software
 
Real-world consistency explained
Real-world consistency explainedReal-world consistency explained
Real-world consistency explained
 
Excavating the knowledge of our ancestors
Excavating the knowledge of our ancestorsExcavating the knowledge of our ancestors
Excavating the knowledge of our ancestors
 
The truth about "You build it, you run it!"
The truth about "You build it, you run it!"The truth about "You build it, you run it!"
The truth about "You build it, you run it!"
 
The promises and perils of microservices
The promises and perils of microservicesThe promises and perils of microservices
The promises and perils of microservices
 
Watch your communication
Watch your communicationWatch your communication
Watch your communication
 
Life, IT and everything
Life, IT and everythingLife, IT and everything
Life, IT and everything
 
DevOps is not enough - Embedding DevOps in a broader context
DevOps is not enough - Embedding DevOps in a broader contextDevOps is not enough - Embedding DevOps in a broader context
DevOps is not enough - Embedding DevOps in a broader context
 
Production-ready Software
Production-ready SoftwareProduction-ready Software
Production-ready Software
 
Towards complex adaptive architectures
Towards complex adaptive architecturesTowards complex adaptive architectures
Towards complex adaptive architectures
 
Conway's law revisited - Architectures for an effective IT
Conway's law revisited - Architectures for an effective ITConway's law revisited - Architectures for an effective IT
Conway's law revisited - Architectures for an effective IT
 
Modern times - architectures for a Next Generation of IT
Modern times - architectures for a Next Generation of ITModern times - architectures for a Next Generation of IT
Modern times - architectures for a Next Generation of IT
 
The Next Generation (of) IT
The Next Generation (of) ITThe Next Generation (of) IT
The Next Generation (of) IT
 
Why resilience - A primer at varying flight altitudes
Why resilience - A primer at varying flight altitudesWhy resilience - A primer at varying flight altitudes
Why resilience - A primer at varying flight altitudes
 
No stress with state
No stress with stateNo stress with state
No stress with state
 
Resilience with Hystrix
Resilience with HystrixResilience with Hystrix
Resilience with Hystrix
 
Self healing data
Self healing dataSelf healing data
Self healing data
 

Dernier

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Dernier (20)

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

The 7 quests of resilient software design

  • 1. The 7 quests of resilient software design A guide for the adventurous software engineer Uwe Friedrichsen – codecentric AG – 2012-2017
  • 2. Uwe Friedrichsen IT traveller. Connecting the dots. Attracted by uncharted territory. CTO at codecentric. https://www.slideshare.net/ufried https://medium.com/@ufried @ufried
  • 3. You want to do resilient software design ...
  • 4. ... and you expect everything to be like this
  • 5. But somehow it feels more like that ...
  • 6. ... or even that
  • 7. What the **** went wrong?
  • 8. The road to resilience is a twisted one
  • 9. “7 quests you must complete!”
  • 12. “How much money will we earn with it?”
  • 13. “Does it improve our velocity?”
  • 14. Resilience is not about making money Resilience is about not losing money
  • 15. Lack of resilient software design Reduced system availability Users cannot do what they intend to do Less transactions per time period Immediate lost revenue Users get annoyed Churn rate increases Delayed lost revenue Due to non-determinism of distributed systems This is at most your resilience budget
  • 18. Everything fails, all the time. -- Werner Vogels
  • 19. If X then Y What we learned in our IT education If X then maybe Y This changes everything! What we need for distributed systems We are good at this (due to how our brains work) Inside process thinking Reasoning about deterministic behavior Designing a complicated system We are poor at that (due to how our brains work) Reasoning about non-deterministic behavior Across process thinking Designing a complex system
  • 20. Yet, we usually use deterministic thinking to reason about distributed systems
  • 21. Failures in distributed systems ... •  Crash failure •  Omission failure •  Timing failure •  Response failure •  Byzantine failure
  • 22. ... turn seemingly simple issues into very hard ones Time & Ordering Leslie Lamport "Time, clocks, and the ordering of events in distributed systems" Consensus Leslie Lamport ”The part-time parliament” (Paxos) CAP Eric A. Brewer "Towards robust distributed systems" Faulty processes Leslie Lamport, Robert Shostak, Marshall Pease "The Byzantine generals problem" Consensus Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson "Impossibility of distributed consensus with one faulty process” (FLP) Impossibility Nancy A. Lynch ”A hundred impossibility proofs for distributed computing"
  • 23. Embrace distributed systems •  Distributed systems introduce non-determinism regarding •  Execution completeness •  Message ordering •  Communication timing •  You will be affected by this at the application level •  Don’t expect your infrastructure to hide all effects from you •  Better have a plan to detect and recover from inconsistencies
  • 24. But do I really need to care? (The system, I am working on, is not a distributed system)
  • 25. (Almost) every system is a distributed system -- Chas Emerick http://www.infoq.com/presentations/problems-distributed-systems
  • 26. … and it’s getting “worse” •  Cloud-based systems •  Microservices •  Zero Downtime •  Mobile & IoT •  Social Web
  • 28. Avoid the “100% available” trap
  • 29. The “100% available” trap, version #1 You: “How should the application respond if a technical failure occurs?” Business owner: “This must not happen! It is your responsibility to make sure that this will not happen.”
  • 30. The “100% available” trap, version #2 You: “How do you handle the situation if the service you call does not respond (or does not respond timely)?” Developer 1: “We did not implement any extra measures. The other service is so important and thus needs to be so highly available that it is not worth any extra effort.” Developer 2: “Actually, if that service should be down, we would not be able to do anything useful anyway. Thus, it just needs to be up.”
  • 31. The question is not, if a failure will happen The question is, when a failure will happen
  • 32. A short note about availability Assume a service availability of 99,5% (incl. planned downtimes) •  10 services involved in a request à 95,1% probability of success •  50 services involved in a request à 77,8% probability of success
  • 34. Establish the ops-dev feedback loop
  • 35. The big wall between Dev and Ops
  • 36. In a distributed environment, you cannot solve availability issues on an infrastructure level only
  • 37. Dev Ops “I implemented something to improve production availability” “Here are the figures how it worked” Continuous improvement cycle of resilient software design Dev is where you implement your resilience measures Build Measure Learn Ops is where your resilience measures take effect
  • 38. Dev Ops “I implemented something to improve production availability” “Here are the figures how it worked” Continuous improvement cycle of resilient software design Dev is where you implement your resilience measures Build Measure Learn Ops is where your resilience measures take effect All developer activities towards improving robustness are basically “shooting at the dark” which is neither effective not sustainable Having a wall between Dev and Ops breaks the cycle required to implement effective robustness measures
  • 39. For effective resilient software design you need a working ops-dev feedback loop
  • 40. Establishing the feedback loop •  Adopt DevOps •  Adopt Site Reliability Engineering (SRE) •  Or do it your own way if you know a better way ... •  ... but make sure you establish the required feedback loops!
  • 43. Without proper functional design nothing else matters
  • 44. Isolation •  System must not fail as a whole •  Split system in parts and isolate parts against each other •  Avoid cascading failures •  Foundation of resilient software design
  • 45. Bulkhead •  Bulkheads implement the “parts” that need to be isolated •  Core isolation pattern (a.k.a. “failure units” or “units of mitigation”) •  Diverse implementation choices available, e.g., (micro)services, actors, SCS, ... •  Shaping good bulkheads is a pure functional design issue (and extremely hard)
  • 46. Hmm, sound easy. Why should that be hard?
  • 47. Service A Service B Request Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design How do we avoid this …
  • 48. Service Request Due to functional design we need to call a lot of services to be able to answer a client request, i.e. availability is broken by design ... and this ... Service Service Service Service Service Service Service Service Service Service Service Service
  • 49. Mothership Service (a.k.a. Monolith) Request By trying to avoid the aforementioned issues we ended up with cramming all required functionality in one big service i.e. the isolation is broken by design ... without ending up with this?
  • 50. Let us apply our well-known best practices •  Divide & conquer a.k.a. functional decomposition •  DRY (Don’t Repeat Yourself) •  Design for reusability •  Layered architecture •  …
  • 52. Service A Service B Request Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design ... this usually leads to this …
  • 53. Service Request Due to functional design we need to call a lot of services to be able to answer a client request, i.e. availability is broken by design ... and this ... Service Service Service Service Service Service Service Service Service Service Service Service
  • 54. Mothership Service (a.k.a. Monolith) Request By trying to avoid the aforementioned issues we ended up with cramming all required functionality in one big service i.e. the isolation is broken by design ... and in the end also often to this.
  • 56. Caches to the rescue!
  • 57. Service A Service B Request Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design CacheofB Break tight service coupling by caching data/responses of downstream service
  • 58. Caches to the rescue?
  • 59. Do you really think that copying stale data all over your system is a suitable measure to fix an inherently broken design? * * Side note: Caches are a very important and powerful measure in many places. But they are not suitable as a cheap fix for a broken functional design
  • 60. We have to re-learn design for distributed system
  • 62. Yet, a few guiding thoughts ...
  • 63. Foundations of design •  “High cohesion, low coupling” & “separation of concerns” •  “Crucial across process boundaries •  Still poorly understood issue •  Start with •  Understanding organizational boundaries •  Understanding use cases and flows •  Identifying functional domains (à DDD) •  Finding areas that change independently •  Do not start with a data model!
  • 64. Short activation paths •  Long activation paths affect availability •  Increase likelihood of failures •  Minimize remote calls per request •  Need to balance opposing forces •  Avoid monolith à clear separation of concerns •  Minimize requests à cluster functionality & data •  Caches can sometimes help, but stale data as trade-off
  • 65. Be (extremely) wary of reusability •  Reusability increases coupling •  Reusability usually leads to bad service design •  Reusability compromises availability •  Reusability rarely pays •  Do not strive for reusable services •  Strive for replaceable services instead •  Try to tackle reusability issues with libraries
  • 68. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Redundancy Stateless Idempotency Escalation Zero downtime deployment Location transparency Relaxed temporal constraints Fallback Shed load Share load Marked data Queue for resources Bounded queue Finish work in progress Fresh work before stale Deferrable work Communication paradigm Isolation Bulkhead System level Monitor Watchdog Heartbeat Either level Voting Synthetic transaction Leaky bucket Routine checks Health check Fail fast Let sleeping dogs lie Small releases Hot deployments Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy Backpressure Retry Limit retries Rollback Roll-forward Checkpoint Safe point Failover Read repair Error handler Reset Restart Reconnect Fail silently Default value Node level Timeout Circuit breaker Complete parameter checking Checksum Statically Dynamically Confinement Acknowledgement
  • 69. Using resilience patterns •  Patterns are options, not obligations •  Don’t pick too many patterns •  Each pattern increases complexity •  Complexity is the enemy of robustness •  Each pattern costs money in dev & ops •  You only have a limited resilience budget •  Look for complementary patterns
  • 71. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Escalation Communication paradigm Isolation System level Monitor Heartbeat Either level Hot deployments Restart Let it crash! Node level Actor Messaging Erlang (Akka) Core patterns
  • 72. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Fallback Share load Bounded queue Communication paradigm Isolation System level Monitor Either level Error injection Retry Limit retries Node level Circuit breaker Timeout Zero downtime deployment Canary releases Redundancy Several variants (Micro)service Request/ response Netflix Core patterns
  • 75. We face a new generation of developers every 5 years
  • 76. We loose our collective memory every 5 years * * Mean time until a topic discussion in the community starts over form scratch
  • 77. Time working in IT Growth of knowledge Depth of insights
  • 78. What do we do to compensate this effect?
  • 79. We look for the new & shiny stuff ...
  • 80. ... as anything not new must be useless crap!
  • 81. We need to rediscover our insights every 5 years
  • 82. In IT, we suffer from continuous collective amnesia and we are even proud of it
  • 83. How can we become better?
  • 85. The 7 quests at a glance
  • 86. Wrap-up •  The road to resilient software design is a twisted one! •  Most challenges are only indirectly related to RSD •  Most challenges are not coding related •  Mastering functional design is extremely hard ... •  ... while learning the patterns is relatively easy •  How do we preserve our collective memory?
  • 87. Uwe Friedrichsen IT traveller. Connecting the dots. Attracted by uncharted territory. CTO at codecentric. https://www.slideshare.net/ufried https://medium.com/@ufried @ufried