A well-designed internet application should be able to scale seamlessly as demand increases and decreases, and be resilient enough to withstand software and hardware failures. In this talk we will look at different ways internet applications are designed for scalability, use-cases suitable to each approach, and pros and cons of the different approaches.
We will also discuss patterns for bringing resiliency and stability to complex distributed systems.
2. MillionEyes Healthcare Technologies
About Me
● Founder @ MillionEyes Healthcare, AI and wellness
● Was SVP Engineering @ Trusting Social (credit scoring for the unbanked)
● Was VP Engineering @ Flipkart. Ran the Customer Platform (website),
Relevance Platform (search/recommendations) and Data Platform
● Was Head of India Dev Center @ Retrevo (review aggregation and online sentiment
analysis)
● Director of Engineering @ Kosmix (categorisation and search, seeded
WalmartLabs)
Robust architecture isn’t all about software
It starts at the infrastructure layer
Progresses to the network and data
Influences application design
Extends to people and culture.
Scalable + Resilient
Resiliency Pattern #1
Redundancy
availability set → Multi zone/multi-region deployment
Let N be the composite SLA for the application deployed in one region. Assuming the two regions fail independently, the chance
that the application fails in both regions at the same time is (1 − N) × (1 − N). Therefore,
● Combined SLA for both regions = 1 − (1 − N)(1 − N) = N + (1 − N)N
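The formula above can be checked with a few lines of Python. This is a minimal sketch: the function name and the 99.95% single-region figure are assumptions for illustration, and it relies on the regions failing independently.

```python
def multi_region_sla(n: float, regions: int = 2) -> float:
    """Combined SLA when the application runs in several independent regions.

    The application is down only if every region is down at once,
    which happens with probability (1 - n) ** regions.
    """
    return 1 - (1 - n) ** regions


# Assumed single-region composite SLA of 99.95% (illustrative value).
single = 0.9995
combined = multi_region_sla(single)
print(f"one region: {single:.4%}, two regions: {combined:.6%}")
```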
Composition of services
Services in series: the composite SLA is 99.94%. An application that
relies on multiple services has more potential
failure points.
With a queue as a fallback path: the composite SLA for the combined path is 99.99999%.
Trade-offs: the design is more complex, you are paying for the queue, and there
may be data consistency issues to consider.
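A short sketch of the arithmetic behind these two figures. The component SLAs below (99.95%, 99.99% and 99.9%) are assumptions chosen because they reproduce the numbers on the slide; the slide itself does not state them, and the function names are hypothetical.

```python
from math import prod


def series_sla(*slas: float) -> float:
    """Every component must be up, so the availabilities multiply."""
    return prod(slas)


def fallback_sla(*slas: float) -> float:
    """Any one path is enough, so the call fails only if all paths fail."""
    return 1 - prod(1 - s for s in slas)


# Assumed component SLAs (not given in the source).
web, database, queue = 0.9995, 0.9999, 0.999

print(f"web + database in series: {series_sla(web, database):.4%}")
print(f"database with queue fallback: {fallback_sla(database, queue):.5%}")
```

Dependencies in series lower the composite SLA below the weakest component, while a fallback path raises it at the cost of the extra component.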
Retries
Transient failures can be caused by momentary loss of network connectivity, a dropped database connection, or a timeout
when a service is busy. Often, a transient failure can be resolved simply by retrying the request.
Resiliency Pattern #2
Requirements for Retries
Idempotent Operations
For a retry mechanism to be safe, operations must be repeatable without side effects. Use unique, traceable identifiers in
requests to your application, and reject those that have already been processed successfully.
Backoff Algos
To avoid network flooding and congestion, gradually increase the delay between retries (i.e. reduce the rate at which retries are performed)
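Both requirements can be sketched together. The code below is an illustration under assumptions, not a reference implementation: `TransientError`, `Ledger`, and `retry` are hypothetical names, the backoff is capped exponential with random jitter, and the server here reports a duplicate rather than applying the request twice.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


class Ledger:
    """Hypothetical server that applies each request ID at most once."""

    def __init__(self):
        self.balance = 0
        self.seen = set()

    def deposit(self, request_id: str, amount: int) -> str:
        if request_id in self.seen:
            return "duplicate"  # already processed: do not apply again
        self.balance += amount
        self.seen.add(request_id)
        return "applied"


def retry(operation, attempts=4, base=0.1, cap=5.0, sleep=time.sleep):
    """Call operation(); on TransientError wait, then try again.

    The wait grows exponentially (base, 2*base, 4*base, ...) up to cap,
    and is randomised so that many clients do not retry in lockstep.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            delay = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

Because the request carries the same ID on every attempt, a retry after a lost response cannot deposit the amount twice.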
Mechanism to avoid Cascading failure
Throttling
When a single client makes an excessive number of requests, the application throttles that client for a certain period
of time, refusing some or all of its requests. This minimises the impact on other users and so protects the
overall availability of the application.
Timeouts
When slow requests hold on to resources, a pool of connections quickly runs out. Timeouts prevent this from cascading. The
importance of thinking about, planning and implementing timeouts is frequently underestimated.
Rejection
Final act of ‘self-defense’: start dropping requests deliberately when the service begins to overload. Can happen
server side, on load-balancers or even on the client’s side.
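A minimal sketch of throttling and rejection. The token bucket is one common way to throttle per client; it is swapped in here for illustration and is not named in the slides. All class names and parameters are assumptions, and the code is not thread-safe.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        refill = (now - self.updated) * self.rate
        self.tokens = min(self.capacity, self.tokens + refill)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Throttler:
    """Throttling: one bucket per client, so a noisy client only hurts itself."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.buckets = {}

    def allow(self, client_id: str) -> bool:
        if client_id not in self.buckets:
            self.buckets[client_id] = TokenBucket(self.rate, self.capacity, self.clock)
        return self.buckets[client_id].allow()


class LoadShedder:
    """Rejection: refuse new work once too many requests are in flight."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.max_in_flight:
            return False  # drop deliberately instead of queueing
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight -= 1
```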
Strategies for degrading gracefully
Offer a variant of the service that is easier to compute
and deliver to the user:
● Return an estimated value.
● Use locally cached data.
● Put a work item on a queue, to be handled later.
Or drop unimportant traffic.
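The "use locally cached data" strategy might look like the sketch below. The function name, the `stale` flag, and the plain dict used as a cache are all assumptions made for illustration.

```python
def fetch_with_fallback(key, fetch, cache):
    """Return (value, stale).

    Try the live backend first. If it fails and a cached copy exists,
    serve the cached copy and flag it as stale instead of failing.
    """
    try:
        value = fetch(key)
    except Exception:
        if key in cache:
            return cache[key], True  # degraded: possibly out-of-date data
        raise  # nothing to fall back to
    cache[key] = value  # remember the last good answer
    return value, False
```

The caller can use the `stale` flag to tell the user that the data may be out of date.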
Circuit breakers
Applying circuit breakers to potentially failing method calls prevents an application from
repeatedly trying an operation that is likely to fail.
Resiliency Pattern #5
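A minimal circuit breaker sketch, assuming a simple consecutive-failure threshold and a fixed reset timeout. The names and defaults are hypothetical, and production libraries add more (thread safety, metrics, per-error policies).

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the struggling dependency.
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```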
Semantic Logging
Application logs are an important source of diagnostic data, for example to monitor the error rate. Generate structured logs that enable
automated analysis and action.
Resiliency Pattern #6
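One way to produce structured logs with Python's standard `logging` module is a formatter that emits one JSON object per event. This is a sketch: the `JsonFormatter` class and the `fields` key passed through `extra` are conventions assumed here, not part of the logging library.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Structured fields supplied by the caller via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.setLevel(logging.INFO)
log.addHandler(handler)

# The event name is stable and the details are machine-readable fields.
log.info("order_failed", extra={"fields": {"order_id": 42, "error_code": "TIMEOUT"}})
```

A log pipeline can then count `order_failed` events by `error_code` without parsing free text.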
Logging Best practices
Log in production.
Otherwise, you lose insight where you need it most.
Log events at service boundaries.
Include a correlation ID that flows across service boundaries.
Use asynchronous logging
Non-blocking log writes keep requests from backing up behind slow log output.
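A sketch of the last two practices using the standard library: a `contextvars` variable carries the correlation ID, and `QueueHandler`/`QueueListener` move the actual write onto a background thread. The helper name `build_async_logger` and the `correlation_id` variable are assumptions for illustration.

```python
import contextvars
import logging
import logging.handlers
import queue

# Set once per incoming request; "-" when no request is active.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Stamp every record with the current correlation ID."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def build_async_logger(name: str, target_handler: logging.Handler):
    """Return (logger, listener). Request threads only enqueue records;
    the listener thread performs the slow write via target_handler."""
    log_queue = queue.Queue(-1)  # unbounded queue
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addFilter(CorrelationFilter())
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.propagate = False
    listener.start()
    return logger, listener
```

Call `listener.stop()` at shutdown so queued records are flushed. When calling another service, the same ID would be passed along (for example in a request header) so that logs from both services can be joined.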
Automated Deployments
Manual deployments are prone to error. Use an automated, idempotent process that can run on demand and be re-run if
something fails. Immutable infrastructure: avoid modifying infrastructure after it is deployed to production, since ad hoc changes are hard to track and
reason about (e.g. automate LB inclusion).
Resiliency Pattern #7
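What "idempotent" means for a deployment step can be shown with a toy example. The load balancer pool is modelled as a plain set and the function names are hypothetical; a real step would call the load balancer's own API.

```python
def ensure_in_pool(pool: set, node: str) -> bool:
    """Make sure node is in the load balancer pool.

    Describes the desired end state rather than an action, so running
    it twice is harmless. Returns True only if a change was made.
    """
    if node in pool:
        return False
    pool.add(node)
    return True


def deploy(pool: set, nodes: list) -> list:
    """Register all nodes; return the ones that actually changed."""
    return [node for node in nodes if ensure_in_pool(pool, node)]
```

If the run fails halfway, re-running `deploy` registers only the nodes that are still missing.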
Best Practices
Blue-Green Deployment: deploy updates into a production environment separate from the live application, then switch to the
updated deployment after validation. Canary Deployment: make the switch in a rolling manner, shifting traffic gradually, to limit the impact of an erroneous
build. Rolling deployments can also be done using feature flags.
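A percentage rollout behind a feature flag can be sketched with a stable hash of the user ID. The function name and the 0-99 bucketing scheme are assumptions for illustration, not a specific feature-flag product's API.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Decide whether this user gets the new build or feature.

    Each (feature, user) pair hashes to a stable bucket in 0..99, so a
    user stays on the same side as long as percent does not shrink, and
    raising percent only ever adds users.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


# Start the canary at 5% of users, then widen it as confidence grows.
canary_users = [u for u in ("alice", "bob", "carol") if in_rollout(u, "new-checkout", 5)]
```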
Summary
● Resiliency does not happen by accident. It must be designed and built in from the start.
● Resiliency leads to higher availability, and lower mean time to recover from failures.
● Resiliency touches every part of the application lifecycle, from planning and coding to operations.
Load Balancers + Shared-Nothing Units
Shared-nothing units remove contention among nodes, because no data or other resource is shared between them.
Scalability Pattern #1
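A toy sketch of the idea, assuming requests are routed to units by a hash of the key. `Unit` and `Router` are hypothetical names, and the simple modulo scheme would reshuffle keys if the number of units changed.

```python
import hashlib


class Unit:
    """One shared-nothing unit: it owns its data and shares it with nobody."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}

    def put(self, key: str, value) -> None:
        self.data[key] = value

    def get(self, key: str):
        return self.data.get(key)


class Router:
    """Stands in for the load balancer: each key always maps to one unit."""

    def __init__(self, units):
        self.units = units

    def unit_for(self, key: str) -> Unit:
        digest = hashlib.sha256(key.encode()).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.units)
        return self.units[index]
```

Because no two units ever touch the same key, they need no locks or coordination with each other.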
LB + Stateless Nodes + Scalable Storage
Decouple compute and data. Several stateless nodes talk to a scalable
storage layer, and a load balancer distributes load among the nodes.
Scalability Pattern #2
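A small in-memory sketch of this layout. `SharedStore` stands in for the scalable storage layer and round-robin is just one possible balancing policy; all names are assumptions for illustration.

```python
import itertools


class SharedStore:
    """Stand-in for the scalable storage layer that holds all state."""

    def __init__(self):
        self._data = {}

    def incr(self, key: str) -> int:
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]


class StatelessNode:
    """Keeps nothing between requests, so any node can serve any user."""

    def __init__(self, name: str, store: SharedStore):
        self.name = name
        self.store = store

    def handle(self, user_id: str) -> str:
        visits = self.store.incr(f"visits:{user_id}")
        return f"{self.name}: visit {visits} for {user_id}"


class RoundRobinBalancer:
    def __init__(self, nodes):
        self._cycle = itertools.cycle(nodes)

    def route(self, user_id: str) -> str:
        return next(self._cycle).handle(user_id)
```

Adding capacity means adding nodes to the balancer; the visit count stays correct because it lives in the store, not in a node.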
Caching
Cache all static resources (images, stylesheets, JavaScript, etc.), and also dynamic content
that is not user-specific (e.g. landing pages).
Scalability Pattern #5
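For dynamic content that is not user-specific, a time-to-live cache is a simple starting point. This is a sketch with hypothetical names; it has no size limit and is not thread-safe.

```python
import time


class TTLCache:
    """Cache rendered content for ttl_seconds, then reload it on demand."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: no work on the backend
        value = loader(key)  # miss or expired: render once and store
        self._entries[key] = (now + self.ttl, value)
        return value
```

With a 60-second TTL, a landing page is rendered at most once a minute no matter how many users request it.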
Offload work to client
Not every user action should require a request to the server. Handle locally
anything that doesn’t need new data.
Scalability Pattern #7