Resilient Functional Service Design

Resilient Functional Service Design
The usually forgotten parts of resilient software design

Uwe Friedrichsen – codecentric AG – 2015-2017

@ufried
Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com

What’s that “resilience” thing?

Business
Production
Availability

(Almost) every system is a distributed system

Chas Emerick

http://www.infoq.com/presentations/problems-distributed-systems

A distributed system is one in which the failure
of a computer you didn't even know existed
can render your own computer unusable.

Leslie Lamport

Failures in todays complex, distributed and
interconnected systems are not the exception.

•  They are the normal case

•  They are not predictable

… and it’s getting “worse”

•  Cloud-based systems
•  Microservices
•  Zero Downtime
•  Mobile & IoT
•  Social Web

à Ever-increasing complexity and connectivity

Do not try to avoid failures. Embrace them.

resilience (IT)

the ability of a system to handle unexpected situations
-  without the user noticing it (best case)
-  with a graceful degradation of service (worst case)

Beware of the “100% available” trap!

Designing for resilience
The pitfall

First, you learn about resilience …

Complement
Core
Detect
Prevent
Recover
Mitigate
Treat

Core
Detect
Treat
Prevent
Recover
Mitigate
Complement
Supporting
patterns
Redundancy
Stateless
Idempotency
Escalation
Zero downtime
deployment
Location
transparency
Relaxed
temporal
constraints
Fallback
Shed load
Share load
Marked data
Queue for
resources
Bounded queue
Finish work
in progress
Fresh work
before stale
Deferrable work
Communication
paradigm
Isolation
Bulkhead
System level
Monitor
Watchdog
Heartbeat
Acknowledgement
Either level
Voting
Synthetic
transaction
Leaky
bucket
Routine 
checks
Health
check
Fail fast
Let sleeping 
dogs lie
Small 
releases
Hot 
deployments
Routine
maintenance
Backup 
request
Anti-fragility
Diversity
Jitter
Error
injection
Spread
the news
Anti-entropy
Backpressure
Retry
Limit retries
Rollback
Roll-forward
Checkpoint
Safe point
Failover
Read repair
Error
handler
Reset
Restart
Reconnect
Fail silently
Default value
Node level
Timeout
Circuit
breaker
Complete
parameter
checking
Checksum
Statically
Dynamically
Confinement

... then, you digest the stuff just learned

Confinement
Core
Detect
Treat
Prevent
Recover
Mitigate
Complement
Supporting
patterns
Redundancy
Stateless
Idempotency
Escalation
Zero downtime
deployment
Location
transparency
Relaxed
temporal
constraints
Fallback
Shed load
Share load
Marked data
Queue for
resources
Bounded queue
Finish work
in progress
Fresh work
before stale
Deferrable work
Communication
paradigm
Isolation
Bulkhead
System level
Monitor
Watchdog
Heartbeat
Acknowledgement
Either level
Voting
Synthetic
transaction
Leaky
bucket
Routine 
checks
Health
check
Fail fast
Let sleeping 
dogs lie
Small 
releases
Hot 
deployments
Routine
maintenance
Backup 
request
Anti-fragility
Diversity
Jitter
Error
injection
Spread
the news
Anti-entropy
Backpressure
Retry
Limit retries
Rollback
Roll-forward
Checkpoint
Safe point
Failover
Read repair
Error
handler
Reset
Restart
Reconnect
Fail silently
Default value
Node level
Timeout
Circuit
breaker
Complete
parameter
checking
Checksum
Statically
Dynamically
Oh, my! 
Theoretical blah!
Uncool! 

Know that anyway
for eons. So, let’s
move on to the cool
parts …
Ah, now we’re talkin’! Here’s the cool stuff!

That‘s practical, applicable. Don‘t you have more code examples?
Or even better: Can‘t we turn that all into a live hacking session?
Offline
activities? 

Hmm, let‘s
focus on the
other stuff.
Uh, sounds like
one-off, tough
stuff … 

Better start with the
easier stuff, best
with library support
Yeah, more cool stuff!

Aren‘t there more libs like Hystrix
that we can drag into our projects
with a line of configuration?
Well, neat …

I’ll come back to
that stuff whenever
I really need it

Core
Detect
Recover
Mitigate
Treat
Prevent
Complement
Developer
priority
Relevance for
application robustness
Ye be warned!

If you don’t get this part
right, nothing else matters
Here be dragons!

This is extremely hard
and poorly understood

The core parts are

•  extremely important
•  poorly understood
•  massively underestimated

Let’s have a closer look at the core parts

Core
Detect
Treat
Prevent
Recover
Mitigate
Complement
Isolation

Isolation
•  System must not fail as a whole
•  Split system in parts and isolate parts against each other
•  Avoid cascading failures
•  Foundation of resilient software design

Core
Detect
Treat
Prevent
Recover
Mitigate
Complement
Isolation
Bulkhead

Bulkheads are not about thread pools!

Bulkheads

•  Core isolation pattern (a.k.a. “failure units” or “units of mitigation”)
•  Diverse implementation choices available, e.g., µservice, actor, scs, ...
•  Implementation choice impacts system and resilience design a lot
•  Shaping good bulkheads is extremely hard (pure design issue)

Sounds easy. Where is the problem?

Service A
Service B
Request
Due to functional design, Service A
always needs backing from Service B
to be able to answer a client request,

i.e. the isolation is broken by design
How do we avoid this …

Service
Request
Due to functional design we need
to call a lot of services to be able
to answer a client request,

i.e. availability is broken by design
... and this ...
Service
Service
Service
Service
Service
Service
Service
Service
Service
Service
Service
Service

Mothership Service

(a.k.a. Monolith)
Request
By trying to avoid the aforementioned
issues we ended up with cramming all
required functionality in one big service

... without ending up with this?

Let’s use the well-known best practices

•  Divide & conquer a.k.a. functional decomposition
•  DRY (Don’t Repeat Yourself)
•  Design for reusability
•  Layered architecture
•  …

Service A
Service B
Request

... this usually leads to this ...

Mothership Service

(a.k.a. Monolith)
Request
By trying to avoid the aforementioned
issues we ended up with cramming all
required functionality in one big service

... and in the end also often to this.

Service A
Service B
Request

CacheofB
Break tight service coupling
by caching data/responses
of downstream service

Do you really think 
that copying stale data all over your system
is a suitable measure 
to fix an inherently broken design?

We have to re-learn design
for distributed systems!

A works-out-of-the-box-in-all-contexts,
just-add-water-and-stir,
three-bullet-point panacea
for designing perfect bulkheads

Then it is a lot of hard work …

... and there is no silver bullet

Yet, a few guiding thoughts 
about bulkhead design …

Foundations of design

•  High cohesion, low coupling
•  Separation of concerns
•  Crucial across process boundaries
•  Still poorly understood issue
•  Start with
•  Understanding organizational boundaries
•  Understanding use cases and flows
•  Identifying functional domains (à DDD)
•  Finding areas that change independently
•  Do not start with a data model!

Short activation paths

•  Long activation paths affect availability
•  Increase latency and likelihood of failures
•  Minimize remote calls per request
•  Need to balance opposing forces
•  Avoid monolith à clear separation of concerns
•  Minimize requests à cluster functionality & data
•  Caches sometimes help, but stale data as trade-off

Dismiss reusability

•  Reusability increases coupling
•  Reusability leads to bad service design
•  Reusability compromises availability
•  Reusability rarely pays
•  Do not strive for reuse
•  Strive for replaceability instead

Core
Detect
Treat
Prevent
Recover
Mitigate
Complement
Isolation
Communication
paradigm
Bulkhead

Communication paradigm

•  Request-response <-> messaging <-> events <-> …
•  Heavily influences resilience patterns to be used
•  Also heavily influences functional bulkhead design
•  Very fundamental decision which is often underestimated

µS
Request/Response : Horizontal slicing
Flow / Process
µS
µS
µS
µS
µS
µS
Event-driven : Vertical slicing
µS
µS
µS
µS
µS
Flow / Process

Synchronous R/R vs. asynchronous events

•  Decomposition
•  Vertically divide-and-conquer vs. horizontally go-with-the-flow
•  Coordination
•  Coordination logic/services and orchestration vs. event chains and choreography
•  Transactions
•  Built-in transaction handling vs. external supervision
•  Error handling
•  Built into service vs. escalation/supervision strategy
•  Separation of concerns
•  Multiple responsibilities service vs. single responsibility services
•  Encapsulation
•  Domain logic distributed across services vs. domain logic in one place
•  Reusability vs. Replaceability
•  Complexity
•  A draw …

The communication paradigm influences 
the functional service design a lot

and also the resilience patterns to be used

Example: order fulfillment

•  Simple order, credit card, non-digital items
•  Add coupons incl. validation
•  Add promotions incl. usage notification
•  Add bonus card incl. purchase notification
•  Customer accounts as payment type
•  PayPal as payment type
•  Integrate digital music library
•  Integrate digital video library
•  Integrate e-book library

Design exercise – Part 1

Create a bulkhead design for the case study

•  Use one communication paradigm
•  Synchronous request/response (e.g., REST)
•  Asynchronous messaging (e.g., Akka)
•  Asynchronous events (e.g., Pub/Sub)
•  Assume incremental requirements
•  How many services do you need to touch
•  What about the functional isolation of the services
•  How big/maintainable are the resulting services
•  Take a few notes

Online Shop Checkout
Credit Card
Provider
Warehouse
System
Coupon
Management
Campaign
Management
PayPal
Loyalty
Management
Accounts
Receivables
Music Library
E-Book
Library
Video Library
E-Mail Server
Customer
pressed
“Buy now”
?

Order Fulfillment 
Service
Online Shop
Payment 
Service
Credit Card
Provider
Shipment 
Service
Warehouse
System
<Foreign Service>
<Own Service>
Coupon
Management
Promotion
Campaign
Management
Loyalty
Account
Service
Payment
Provider
PayPal
Loyalty
Management
Accounts
Receivables
Music Library
E-Book
Library
Video Library
E-Mail Server
Coupon
Credit Card
Coordinate
Warehouse
Coordinate
Assets
Notify Cust.
PayPal
Coordinate

Order confirmed
Online Shop
Credit Card
Provider
Warehouse
System
<Foreign Service>
<Own Service>
Coupon
Management
Campaign
Management
Account
service
Credit Card
Service
Loyalty
Management
Accounts
Receivables
Music Library
E-Book
Library
Video Library
E-Mail Server
PayPal
PayPal
Service
Warehouse
Service
Promotion
Service
Bonus Card
Service
Coupon
Service
Music Library
Service
Video Library
Service
E-Book Library
Service
Notification
Service
Payment authorized
Digital asset provisioned
Payment failed
<Event>
Order fulfillment
supervisor
Track flow of events
Reschedule events
in case of failure
Services are responsible to eventually succeed
or fail for good, usually incorporating a
supervision/escalation hierarchy for that

Do not limit your design options upfront
without an important reason

Wrap-up

•  Today’s systems are distributed
•  Failures are not avoidable, nor predictable
•  Resilient software design needed
•  Bulkhead design is
•  crucial for application robustness
•  poorly understood
•  massively underrated
•  different from traditional design best practices
•  Communication paradigms broaden 
your bulkhead design options

We have to re-learn design
for distributed systems

Resilient Functional Service Design

Resilient Functional Service Design

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

En vedette

En vedette (20)

Similaire à Resilient Functional Service Design

Similaire à Resilient Functional Service Design (20)

Plus de Uwe Friedrichsen

Plus de Uwe Friedrichsen (9)

Dernier

Dernier (20)

Resilient Functional Service Design