This slide deck addresses the importance of proper functional design for creating resilient distributed systems (not only, but also microservice-based systems).
It starts by explaining the pitfall that many developers fall into when getting started with resilience: Quite often the effects of the fundamentals, i.e., creating bulkheads and choosing the communication paradigm, on system robustness at runtime are heavily underrated. Instead, the pure technical measures like circuit breakers, backpressure, etc. are often overestimated.
Unfortunately, all technical measures will not help to create a robust system if the functional design leads to highly coupled services where the availability of one service functionally totally depends on the availability of another service. The same is true if you need to call many services that all need to be available to answer a client request.
To make it worse, most of the wide-spread design approaches like functional decomposition, DRY (don't repeat yourself), design for re-use or layered architecture exactly lead to those problems, i.e., they are not suitable for designing distributed systems.
This slide deck does not offer any silver bullets to solve the problem (and actually I believe there is no silver bullet), but at least a few guiding principles. Additionally, it shows how the choice of the communication paradigm influences the bulkhead design and this way creates more options to create a good service design that also supports resiliency on a functional level.
As always this slide deck is without the voice track, i.e., most of the information is missing. But I hope that the slides on their own also provide some helpful hints.
Remark: As the "dismiss reusability" slide tends to get quite some attention, one more remark about that slide: On the voice track I usually add "If you find something that is worth being made reusable, i.e., that it satisfies the commercial constraints of Fred Brooks 'Rule of 9' (see https://blog.codecentric.de/en/2015/10/the-broken-promise-of-re-use/ for details), do not put it in a service. Instead create a library and put the functionality in there. And make sure that changes to the library do not mean that all users have to upgrade at the same time, but that any library user can update whenever it fits the user's schedule. Otherwise, you would have introduced tight coupling through the back door again."
5. (Almost) every system is a distributed system
Chas Emerick
http://www.infoq.com/presentations/problems-distributed-systems
6. A distributed system is one in which the failure
of a computer you didn't even know existed
can render your own computer unusable.
Leslie Lamport
7. Failures in todays complex, distributed and
interconnected systems are not the exception.
• They are the normal case
• They are not predictable
8. … and it’s getting “worse”
• Cloud-based systems
• Microservices
• Zero Downtime
• Mobile & IoT
• Social Web
à Ever-increasing complexity and connectivity
10. resilience (IT)
the ability of a system to handle unexpected situations
- without the user noticing it (best case)
- with a graceful degradation of service (worst case)
17. Confinement
Core
Detect
Treat
Prevent
Recover
Mitigate
Complement
Supporting
patterns
Redundancy
Stateless
Idempotency
Escalation
Zero downtime
deployment
Location
transparency
Relaxed
temporal
constraints
Fallback
Shed load
Share load
Marked data
Queue for
resources
Bounded queue
Finish work
in progress
Fresh work
before stale
Deferrable work
Communication
paradigm
Isolation
Bulkhead
System level
Monitor
Watchdog
Heartbeat
Acknowledgement
Either level
Voting
Synthetic
transaction
Leaky
bucket
Routine
checks
Health
check
Fail fast
Let sleeping
dogs lie
Small
releases
Hot
deployments
Routine
maintenance
Backup
request
Anti-fragility
Diversity
Jitter
Error
injection
Spread
the news
Anti-entropy
Backpressure
Retry
Limit retries
Rollback
Roll-forward
Checkpoint
Safe point
Failover
Read repair
Error
handler
Reset
Restart
Reconnect
Fail silently
Default value
Node level
Timeout
Circuit
breaker
Complete
parameter
checking
Checksum
Statically
Dynamically
Oh, my!
Theoretical blah!
Uncool!
Know that anyway
for eons. So, let’s
move on to the cool
parts …
Ah, now we’re talkin’! Here’s the cool stuff!
That‘s practical, applicable. Don‘t you have more code examples?
Or even better: Can‘t we turn that all into a live hacking session?
Offline
activities?
Hmm, let‘s
focus on the
other stuff.
Uh, sounds like
one-off, tough
stuff …
Better start with the
easier stuff, best
with library support
Yeah, more cool stuff!
Aren‘t there more libs like Hystrix
that we can drag into our projects
with a line of configuration?
Well, neat …
I’ll come back to
that stuff whenever
I really need it
25. Isolation
• System must not fail as a whole
• Split system in parts and isolate parts against each other
• Avoid cascading failures
• Foundation of resilient software design
28. Bulkheads
• Core isolation pattern (a.k.a. “failure units” or “units of mitigation”)
• Diverse implementation choices available, e.g., µservice, actor, scs, ...
• Implementation choice impacts system and resilience design a lot
• Shaping good bulkheads is extremely hard (pure design issue)
30. Service A
Service B
Request
Due to functional design, Service A
always needs backing from Service B
to be able to answer a client request,
i.e. the isolation is broken by design
How do we avoid this …
31. Service
Request
Due to functional design we need
to call a lot of services to be able
to answer a client request,
i.e. availability is broken by design
... and this ...
Service
Service
Service
Service
Service
Service
Service
Service
Service
Service
Service
Service
32. Mothership Service
(a.k.a. Monolith)
Request
By trying to avoid the aforementioned
issues we ended up with cramming all
required functionality in one big service
i.e. the isolation is broken by design
... without ending up with this?
33. Let’s use the well-known best practices
• Divide & conquer a.k.a. functional decomposition
• DRY (Don’t Repeat Yourself)
• Design for reusability
• Layered architecture
• …
35. Service A
Service B
Request
Due to functional design, Service A
always needs backing from Service B
to be able to answer a client request,
i.e. the isolation is broken by design
... this usually leads to this ...
36. Service
Request
Due to functional design we need
to call a lot of services to be able
to answer a client request,
i.e. availability is broken by design
... and this ...
Service
Service
Service
Service
Service
Service
Service
Service
Service
Service
Service
Service
37. Mothership Service
(a.k.a. Monolith)
Request
By trying to avoid the aforementioned
issues we ended up with cramming all
required functionality in one big service
i.e. the isolation is broken by design
... and in the end also often to this.
40. Service A
Service B
Request
Due to functional design, Service A
always needs backing from Service B
to be able to answer a client request,
i.e. the isolation is broken by design
CacheofB
Break tight service coupling
by caching data/responses
of downstream service
49. Yet, a few guiding thoughts
about bulkhead design …
50. Foundations of design
• High cohesion, low coupling
• Separation of concerns
• Crucial across process boundaries
• Still poorly understood issue
• Start with
• Understanding organizational boundaries
• Understanding use cases and flows
• Identifying functional domains (à DDD)
• Finding areas that change independently
• Do not start with a data model!
51. Short activation paths
• Long activation paths affect availability
• Increase latency and likelihood of failures
• Minimize remote calls per request
• Need to balance opposing forces
• Avoid monolith à clear separation of concerns
• Minimize requests à cluster functionality & data
• Caches sometimes help, but stale data as trade-off
52. Dismiss reusability
• Reusability increases coupling
• Reusability leads to bad service design
• Reusability compromises availability
• Reusability rarely pays
• Do not strive for reuse
• Strive for replaceability instead
55. Communication paradigm
• Request-response <-> messaging <-> events <-> …
• Heavily influences resilience patterns to be used
• Also heavily influences functional bulkhead design
• Very fundamental decision which is often underestimated
57. Synchronous R/R vs. asynchronous events
• Decomposition
• Vertically divide-and-conquer vs. horizontally go-with-the-flow
• Coordination
• Coordination logic/services and orchestration vs. event chains and choreography
• Transactions
• Built-in transaction handling vs. external supervision
• Error handling
• Built into service vs. escalation/supervision strategy
• Separation of concerns
• Multiple responsibilities service vs. single responsibility services
• Encapsulation
• Domain logic distributed across services vs. domain logic in one place
• Reusability vs. Replaceability
• Complexity
• A draw …
58. The communication paradigm influences
the functional service design a lot
and also the resilience patterns to be used
59. Example: order fulfillment
• Simple order, credit card, non-digital items
• Add coupons incl. validation
• Add promotions incl. usage notification
• Add bonus card incl. purchase notification
• Customer accounts as payment type
• PayPal as payment type
• Integrate digital music library
• Integrate digital video library
• Integrate e-book library
60. Design exercise – Part 1
Create a bulkhead design for the case study
• Use one communication paradigm
• Synchronous request/response (e.g., REST)
• Asynchronous messaging (e.g., Akka)
• Asynchronous events (e.g., Pub/Sub)
• Assume incremental requirements
• How many services do you need to touch
• What about the functional isolation of the services
• How big/maintainable are the resulting services
• Take a few notes
61. Online Shop Checkout
Credit Card
Provider
Warehouse
System
Coupon
Management
Campaign
Management
PayPal
Loyalty
Management
Accounts
Receivables
Music Library
E-Book
Library
Video Library
E-Mail Server
Customer
pressed
“Buy now”
?
62. Order Fulfillment
Service
Online Shop
Payment
Service
Credit Card
Provider
Shipment
Service
Warehouse
System
<Foreign Service>
<Own Service>
Coupon
Management
Promotion
Campaign
Management
Loyalty
Account
Service
Payment
Provider
PayPal
Loyalty
Management
Accounts
Receivables
Music Library
E-Book
Library
Video Library
E-Mail Server
Coupon
Credit Card
Coordinate
Warehouse
Coordinate
Assets
Notify Cust.
PayPal
Coordinate
63. Order confirmed
Online Shop
Credit Card
Provider
Warehouse
System
<Foreign Service>
<Own Service>
Coupon
Management
Campaign
Management
Account
service
Credit Card
Service
Loyalty
Management
Accounts
Receivables
Music Library
E-Book
Library
Video Library
E-Mail Server
PayPal
PayPal
Service
Warehouse
Service
Promotion
Service
Bonus Card
Service
Coupon
Service
Music Library
Service
Video Library
Service
E-Book Library
Service
Notification
Service
Payment authorized
Digital asset provisioned
Payment failed
<Event>
Order fulfillment
supervisor
Track flow of events
Reschedule events
in case of failure
Services are responsible to eventually succeed
or fail for good, usually incorporating a
supervision/escalation hierarchy for that
64. Do not limit your design options upfront
without an important reason
65. Wrap-up
• Today’s systems are distributed
• Failures are not avoidable, nor predictable
• Resilient software design needed
• Bulkhead design is
• crucial for application robustness
• poorly understood
• massively underrated
• different from traditional design best practices
• Communication paradigms broaden
your bulkhead design options
66. We have to re-learn design
for distributed systems