PUBLIC SECTOR SUMMIT
Washington, DC
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Failure is not an Option
Designing Highly Resilient AWS Systems
Tim Griesbach
Manager, Solutions Architecture
AWS WWPS
302955
Agenda
What are we planning for? Risk and Resiliency requirements.
Think resiliently. Principles of Resiliency
Resilient design. System, Test, and Operations patterns & best practices
“Everything fails, all the time”
- Werner Vogels
(CTO, Amazon.com)
Resiliency is the ability of a system to recover quickly and continue operating even when a failure occurs
Push for resiliency
IT failures lead to broad societal impact
• Government Services, Airline, Financial, Communications
Reputation / Legal
• More and more people depend on IT systems for everything.
$$$
• Lost productivity, idle time of people dependent on system
• Lost productivity, time putting out fires and recovering
• Lost revenue from system
What are we planning for?
Consider each application's significance to your business, and the potential impact if a disruption occurs
Cause | Examples | Probability
Operator error | Manual operator error | HIGH
Deployment induced | Software, hardware, network, or configuration deployment; both automated and manual changes | HIGH
Load induced | Change in behavior, either of a specific caller or aggregate; service reaching a tipping point; load failures can occur in the network; denial of service (DDoS) | HIGH
Data induced | Data accepted by the system that it can’t process (“poison pill”) | MED
Credential expiration | Expiration of a certificate or credentials | MED
Hardware failure | Any hardware component in the system, i.e. hosts, storage, network, or elsewhere | LOW
Infrastructure | Power feed or environmental conditions | LOW
On-premises data center realities
• Traditional DR “weekend tests”
• Connected to the internet? Exposed to the same external attacks
• DR site is always “ACTIVE”, so you are always paying for its resources
• Data is constantly replicated
• Datacenter security compliance is expensive and resource intensive
Think Resiliently. Principles of Resiliency
Availability, Reliability, and Resilience in 21st Century Architectures
Test recovery procedures
Automatically recover from failure
Scale horizontally to improve availability
Stop guessing capacity
Manage change through automation
What do those 9’s really mean?
Availability | Max disruption (per year) | Max disruption (per month) | Max disruption (per day)
99% | 3 days 15 hours | 7.31 hours | 14.4 minutes
99.9% | 8 hours 45 minutes | 43.83 minutes | 1.44 minutes
99.95% | 4 hours 22 minutes | 21.92 minutes | 43.2 seconds
99.99% | 52 minutes | 4.38 minutes | 8.64 seconds
99.999% | 5 minutes | 26.3 seconds | 864 milliseconds
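The disruption budgets above follow from multiplying the unavailable fraction by the length of the period. A minimal sketch in Python, assuming a 365.25-day year and a month of one twelfth of a year (these assumptions reproduce the slide's figures; the function and constant names are illustrative, not from the talk):

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25               # assumption: average year length
DAYS_PER_MONTH = DAYS_PER_YEAR / 12  # assumption: average month length


def max_disruption_seconds(availability_pct: float, period_days: float) -> float:
    """Return the downtime budget, in seconds, for a period of the given length."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_days * SECONDS_PER_DAY


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
        per_year_h = max_disruption_seconds(pct, DAYS_PER_YEAR) / 3600
        per_month_min = max_disruption_seconds(pct, DAYS_PER_MONTH) / 60
        per_day_s = max_disruption_seconds(pct, 1)
        print(f"{pct}%: {per_year_h:.2f} h/year, "
              f"{per_month_min:.2f} min/month, {per_day_s:.3f} s/day")
```

Each additional nine divides the budget by ten, which is why the jump from 99.9% to 99.99% usually calls for automation rather than faster manual response.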
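Recovery Point and Recovery Time Objective
[Figure: timeline around a disaster. The span from the recovery point to the disaster is data loss; the span from the disaster to the recovery time is down time.]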
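To make the two objectives concrete, the sketch below compares a single incident against an RPO and an RTO. It is illustrative only: the function name, field names, and timestamps are hypothetical and do not come from the talk.

```python
from datetime import datetime, timedelta


def recovery_report(recovery_point, disaster, restored, rpo, rto):
    """Compare one incident against its recovery objectives.

    recovery_point: time of the last recoverable copy of the data
    disaster:       time the failure occurred
    restored:       time service was available again
    rpo, rto:       objectives, given as timedelta values
    """
    data_loss = disaster - recovery_point  # work since the last copy is lost
    down_time = restored - disaster        # how long users had no service
    return {
        "data_loss": data_loss,
        "down_time": down_time,
        "rpo_met": data_loss <= rpo,
        "rto_met": down_time <= rto,
    }


if __name__ == "__main__":
    report = recovery_report(
        recovery_point=datetime(2019, 6, 11, 2, 0),
        disaster=datetime(2019, 6, 11, 9, 30),
        restored=datetime(2019, 6, 11, 11, 0),
        rpo=timedelta(hours=4),
        rto=timedelta(hours=2),
    )
    print(report)
```

In this made-up incident the service comes back within the RTO, but the last recoverable copy is older than the RPO allows, so the two objectives have to be checked separately.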
You control how you manage your own risks
Customer Risk Appetite and Desired Control Environment: Business Risks, Sourcing Risks, Technology Risks, Security Risks, Compliance
Customer Provided and Managed Controls: Classification, Security Policy, Encryption, Governance, ITDaM, ITSM, Monitoring, Operations, Malware, Risk Management
AWS Provided, Customer Configured and Managed Controls: Virtual Private Cloud, Key Management, Logging, Other AWS features and services
AWS Managed and Audited Controls: SOC 1, SOC 2, PCI-DSS, NIST 800-53, ISO 27001
Design Systems Resiliently
Gall’s Law
A complex system that works is invariably found to have evolved from a simple system that worked.
It’s not binary. Start somewhere and scale up.
“There is no compression algorithm for experience.”
AWS has had 13+ years to build the world’s most reliable, secure, scalable, and
cost-effective infrastructure.
• Your operational DNA has to be crafted for reliability.
• Service SLAs between 99.9% and 100% availability
• Amazon S3 is designed for 99.999999999% durability
• AWS Availability Zones exist on isolated fault lines, flood plains, networks, and local electrical grids to
substantially reduce the chance of simultaneous failure.
• Disaster is inevitable; automation + redundancy = availability.
We are driven to remove any and all causes of failure. Our goal is to make our operational
performance indistinguishable from perfect.
AWS Regional Expansion
23 Regions and 67 AZs; 2 GovCloud Regions today
Coming soon: 4 new Regions and 12 AZs; new GovCloud, TS, and Secret Regions
AWS Connectivity Options
Amazon Global Network
[Figure: world map. Legend: AWS Region with Multiple Edge Locations; Amazon CloudFront PoPs; AWS Direct Connect Location]
96 AWS Direct Connect locations
Customers can reach every public AWS Region from the local Direct Connect location (except China)
Resilient AWS Cloud
Infrastructure
Regions, AZs, Networking
Service Design
Cell-based architecture
Multi-AZ architecture
Micro-service architecture
Distributed systems best practices
Understand the AWS Services scope
Single AZ, Regional, Global, Cross-Regional capability
Figure 3
Resilient Networking
Networking is the foundation
Packets must get from point A to point B
Ensure the network supporting your applications is appropriately redundant, always available, and seamlessly routed.
AWS provides a global infrastructure with 20 Regions and
61 Availability Zones (at the time of publication)
AWS services
Amazon EC2 networking
Amazon Virtual Private Cloud (VPC), VPC Peering, VPC Sharing
AWS Gateways for external, internal, and back to on-premises routing (VPN, Transit)
DNS (Amazon Route 53)
Elastic Load Balancing (ELB)
Resilient Data
Must have confidence in the resilience of your data
Many forms: filesystem, block storage, databases, in memory caches
Consider how eventual consistency impacts design
AWS services
Amazon S3 cross-region replication
Cross region snapshots (Amazon EBS volumes)
Amazon RDS cross region replicas
AWS Storage & File Gateway
Amazon FSx for Windows and Lustre
Figure 10
Self-Healing applications
Highly resilient applications must be able to self-heal.
How
Leverage Microservices app architecture
Decouple inter-dependencies, loose coupling
Remove state from app components
AWS services
Elastic Load Balancing, AWS Auto Scaling, Amazon Simple Queue Service
Single Region: Multi AZ
Start here before adopting more complex architecture
Only consider multi-region if requirements dictate
Pros
AWS Region-wide services are available, including Amazon S3, Amazon DynamoDB, Amazon EFS, Amazon SQS, and Amazon Kinesis
Much less complexity in design, implementation, and operations
Cons
If you need >99.9% resiliency, consider multi-region.
May not meet needs of regulators
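One way to reason about whether a single Region is enough is to estimate composite availability. The sketch below uses the textbook formulas for redundant and chained components and assumes failures are independent, which real deployments only approximate. The input figures are illustrative and are not AWS SLAs; the function names are hypothetical.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Availability when any one of `copies` identical deployments can serve.

    Assumes independent failures: the system is down only if every copy is down.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


def chained_availability(*components: float) -> float:
    """Availability of a request path that needs every component to be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


if __name__ == "__main__":
    one_copy = 0.99  # illustrative figure only
    print(redundant_availability(one_copy, 2))  # two redundant copies
    print(chained_availability(0.999, 0.999))   # two hard dependencies in series
```

Redundancy raises availability while every hard dependency in the request path lowers it, so loose coupling matters as much as adding copies.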
Multi-Region: Active-Standby
Traditional DR Pattern
Backup env used in event of failure only
Pros
For Apps which cannot use native AWS features
Least # changes to the application
Cons
Delays while Standby becomes Active (hrs)
RPO limited by replication lag
AWS Services
Amazon RDS, Amazon Route 53
Multi-Region: Active-active
Both stacks active, traffic distributed
Data replication critical, must consider latency impacts
Pros
Zero RTO
Works well for apps that can partition users
Cons
Data replication must be handled by Applications
AWS Services
Storage replication from APN partners
Amazon RDS, Amazon DynamoDB, Amazon Aurora, AWS Database Migration Service, Amazon Route 53
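For applications that can partition users, one simple approach is to give each user a stable "home" Region and fall back only when that Region is unhealthy. The sketch below is a hypothetical illustration: the Region list, function names, and hashing scheme are assumptions, and it is not how Amazon Route 53 routes traffic.

```python
import hashlib

REGIONS = ["us-east-1", "us-west-2"]  # hypothetical active-active pair


def home_region(user_id: str, regions=REGIONS) -> str:
    """Map a user to a Region with a stable hash, so repeat requests
    from the same user land on the same copy of the data."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(regions)
    return regions[index]


def route(user_id: str, healthy: set, regions=REGIONS) -> str:
    """Prefer the user's home Region; otherwise use any healthy Region."""
    home = home_region(user_id, regions)
    if home in healthy:
        return home
    for region in regions:
        if region in healthy:
            return region
    raise RuntimeError("no healthy Region available")


if __name__ == "__main__":
    print(route("alice", healthy={"us-east-1", "us-west-2"}))
    print(route("alice", healthy={"us-west-2"}))
```

Pinning a user to one Region during normal operation limits the write conflicts that data replication has to resolve.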
Multi-Region: Dual-write
Shared nothing architecture – all transactions (TX) processed in duplicate/parallel
Good for legacy applications
Pros
Zero RPO
Little/No change to apps in each region
Cons
Requires checkpointing
Reconciliation jobs to ensure sites in sync
Downstream apps must avoid duplicates
AWS Services
AWS Lambda, Amazon Route 53
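Because both Regions process every transaction, a downstream consumer sees each one twice. A minimal sketch of duplicate avoidance follows, assuming every transaction carries a unique ID (`tx_id` is a hypothetical field). The in-memory set stands in for a durable store, which a real system would need so that deduplication survives restarts.

```python
class IdempotentConsumer:
    """Apply each transaction at most once, keyed by its transaction ID."""

    def __init__(self):
        self._seen = set()  # assumption: stand-in for a durable dedupe store
        self.applied = []   # payloads that were actually applied

    def handle(self, tx_id: str, payload: dict) -> bool:
        """Return True if the transaction was applied, False if it was a duplicate."""
        if tx_id in self._seen:
            return False
        self._seen.add(tx_id)
        self.applied.append(payload)
        return True


if __name__ == "__main__":
    consumer = IdempotentConsumer()
    # The same transaction arrives once from each Region.
    print(consumer.handle("tx-1", {"amount": 10}))  # applied
    print(consumer.handle("tx-1", {"amount": 10}))  # duplicate, skipped
```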
Anti-Patterns
• Replicate on-premises problems & patterns to the cloud
• Use of non-redundant architectures to meet schedules
• Single data center (Availability Zone) architectures
• Reusing manual processes
• Data retention practices, Failover & Scaling
• Responding to monitoring alerts and metrics (vs self-healing, auto scaling)
• Assuming data is safe in your data center
Don't sacrifice long-term value
for short-term results
Resilient operations, often overlooked
Operations is a key pillar of resiliency
Success in operations means you
• Implement changes successfully and consistently
• Have insight into operational health
• Have insight into achievement of business outcomes
• Respond in a timely and effective manner to events impacting the application
How?
• Perform operations as code
• Annotated documentation
• Make frequent, small, reversible changes
• Refine operations procedures frequently
• Anticipate failure
• Learn from operational failures
Operations monitoring
Must detect failures fast
Applications emit telemetry so failures can be detected
Processes defined and understood
AWS Services
Amazon CloudWatch
AWS Personal Health Dashboard
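As an illustration of detecting failures fast from application telemetry, the sketch below tracks the error rate over a sliding window of recent requests. It is a generic, hypothetical example and does not use or represent the Amazon CloudWatch API; the class name, window size, and threshold are assumptions.

```python
from collections import deque


class ErrorRateAlarm:
    """Signal when the error rate over the last `window` requests exceeds a threshold."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self._outcomes = deque(maxlen=window)  # True = success, False = failure
        self._threshold = threshold

    def record(self, success: bool) -> bool:
        """Record one request outcome; return True if the alarm is firing."""
        self._outcomes.append(success)
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes) > self._threshold


if __name__ == "__main__":
    alarm = ErrorRateAlarm(window=10, threshold=0.2)
    for _ in range(10):
        alarm.record(True)
    for attempt in range(3):
        print(attempt, alarm.record(False))
```

A short window reacts quickly but is noisy, while a long window is stable but slow, so the window size is itself a trade-off against how fast failures must be detected.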
Operations automation
“A key mechanism to achieve this is to automate
the management as much as possible, removing
error prone, manual operations.” - Werner Vogels
Application deployment
Infrastructure as code – integrates infrastructure and application change processes
Examples include staged deployments, canary deployments, isolation zone deployments, and automatic rollback
AWS Services
AWS CodeBuild
AWS CodeCommit
AWS CodeDeploy
AWS CodePipeline
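The idea behind a canary deployment with automatic rollback can be sketched as a loop that shifts traffic in stages and stops at the first stage where the new version looks unhealthy. This is a simplified illustration, not how AWS CodeDeploy is implemented; `observed_error_rate` is a hypothetical callback standing in for real monitoring.

```python
def canary_deploy(stages, observed_error_rate, max_error_rate=0.01):
    """Shift traffic to a new version in stages, rolling back on errors.

    stages:              increasing traffic percentages, ending at 100
    observed_error_rate: callable taking the current percentage and returning
                         the error rate seen for the new version at that stage
    Returns ("promoted", 100) or ("rolled_back", percentage_where_it_failed).
    """
    for percent in stages:
        if observed_error_rate(percent) > max_error_rate:
            return ("rolled_back", percent)
    return ("promoted", 100)


if __name__ == "__main__":
    healthy = lambda percent: 0.0
    breaks_under_load = lambda percent: 0.05 if percent >= 25 else 0.0
    print(canary_deploy([5, 25, 100], healthy))
    print(canary_deploy([5, 25, 100], breaks_under_load))
```

Small early stages mean that a bad release only affects a small share of traffic before it is rolled back.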
Testing enforces resiliency
Application testing & certification
Resilient systems continue to operate successfully in the presence of failures.
Failure Mode Effect Analysis (FMEA) – industry standard technique
Estimate a risk priority number (RPN) between 1 and 1000
Rank probability, severity, and observability on a 1-10 scale, where 1 is good and 10 is bad, and multiply them.
A perfectly low-probability, low-impact, easy-to-measure risk has an RPN of 1.
An extremely frequent, permanently damaging, impossible-to-detect risk has an RPN of 1000.
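The RPN arithmetic described above can be sketched as follows. The failure modes and their scores are hypothetical examples, included only to show how ranking by RPN works.

```python
def risk_priority_number(probability: int, severity: int, observability: int) -> int:
    """Multiply the three 1-10 scores (1 is good, 10 is bad) into an RPN of 1-1000."""
    for name, score in (("probability", probability),
                        ("severity", severity),
                        ("observability", observability)):
        if not 1 <= score <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {score}")
    return probability * severity * observability


if __name__ == "__main__":
    # Hypothetical failure modes: (probability, severity, observability)
    failure_modes = {
        "expired certificate": (5, 8, 6),
        "disk failure on one host": (3, 2, 2),
        "bad configuration push": (7, 7, 4),
    }
    ranked = sorted(failure_modes.items(),
                    key=lambda item: risk_priority_number(*item[1]),
                    reverse=True)
    for name, scores in ranked:
        print(f"{risk_priority_number(*scores):4d}  {name}")
```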
Failure impact analysis
Failure | Effect | Mitigation | Result
Failure of an AZ | Temporary capacity reduction | Automatic failover to secondary AZ | Temporary performance degradation
Total failure of satellite region | Data replication offline | Repair/reconfigure replication using alternate region | No service interruption
Partition of network between regions | Data replication offline | Auto recovery when network is available | No service interruption
Total failure of primary region | Service offline | Failover to secondary region | Service restored within two hours
Chaos engineering
The cloud has ushered in a new method of testing
Principles of Chaos Engineering – “Chaos Engineering can be thought of as the facilitation of
experiments to uncover systemic weaknesses.” https://principlesofchaos.org/
Principles
Build a hypothesis around steady-state behavior
Apply variations to simulate real-world events
Run experiments in production
Automate the experiments to run continuously
Minimize blast radius of failures
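A toy version of such an experiment is sketched below: state a steady-state hypothesis, inject a failure, cap the blast radius, and check whether the hypothesis held. The replica names and the hypothesis ("the service stays up while at least one replica survives") are hypothetical, and a real experiment would observe live metrics rather than count survivors.

```python
import random


def run_experiment(replicas, kill_count, rng, max_blast_radius=1):
    """Terminate `kill_count` random replicas and test the steady-state hypothesis.

    Hypothesis (an assumption for this sketch): the service remains available
    as long as at least one replica is still running.
    """
    if kill_count > max_blast_radius:
        raise ValueError("experiment exceeds the allowed blast radius")
    victims = rng.sample(list(replicas), kill_count)
    survivors = [replica for replica in replicas if replica not in victims]
    return {"victims": victims, "steady_state_held": len(survivors) > 0}


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so that runs are repeatable
    print(run_experiment(["web-a", "web-b", "web-c"], kill_count=1, rng=rng))
```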
Continuous Testing of Infrastructure
Regularly execute tests in stable, production & production-like test environments.
Treat Infrastructure as Code
• CI/CD Test in Infrastructure Build Pipeline
• Testing of infrastructure during Integration Test
Future Considerations
• Consider serverless: it reduces maintenance and moves responsibility for the resilient design to AWS.
• Take advantage of our distributed systems by building on top of them – Amazon
S3/AWS Lambda/Amazon ECS.
• Break systems down into smaller pieces along logical seams. Reduce the blast
radius of a failure of any individual piece of the system
• Use the AWS Well-Architected Tool to assess your applications: https://aws.amazon.com/well-architected-tool
Additional Resources
• AWS Well-Architected: https://aws.amazon.com/architecture/well-architected
• AWS Whitepaper: Building Mission-Critical Financial Services Applications on AWS, April 2019
• re:Invent 2018: Close Loops & Opening Minds: How to Take Control of Systems, Big & Small: https://www.youtube.com/watch?v=O8xLxNje30M
• Building Microservices: Designing Fine-Grained Systems
• AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)
Thank you!
Tim Griesbach
awstim@amazon.com

Contenu connexe

Tendances

Webinar aws 101 a walk through the aws cloud- introduction to cloud computi...
Webinar aws 101   a walk through the aws cloud- introduction to cloud computi...Webinar aws 101   a walk through the aws cloud- introduction to cloud computi...
Webinar aws 101 a walk through the aws cloud- introduction to cloud computi...Amazon Web Services
 
Azure Container Apps
Azure Container AppsAzure Container Apps
Azure Container AppsKen Sykora
 
Simplify & Standardise Your Migration to AWS with a Migration Landing Zone
Simplify & Standardise Your Migration to AWS with a Migration Landing ZoneSimplify & Standardise Your Migration to AWS with a Migration Landing Zone
Simplify & Standardise Your Migration to AWS with a Migration Landing ZoneAmazon Web Services
 
Microservices Architectures on Amazon Web Services
Microservices Architectures on Amazon Web ServicesMicroservices Architectures on Amazon Web Services
Microservices Architectures on Amazon Web ServicesAmazon Web Services
 
Introduction To AWS & AWS Lambda
Introduction To AWS & AWS LambdaIntroduction To AWS & AWS Lambda
Introduction To AWS & AWS LambdaAn Nguyen
 
AWS Webinar 201: Designing scalable, available & resilient cloud applications
AWS Webinar 201: Designing scalable, available & resilient cloud applicationsAWS Webinar 201: Designing scalable, available & resilient cloud applications
AWS Webinar 201: Designing scalable, available & resilient cloud applicationsAmazon Web Services
 
An Introduction to AWS
An Introduction to AWSAn Introduction to AWS
An Introduction to AWSIan Massingham
 
Introduction to AWS Cloud Computing | AWS Public Sector Summit 2016
Introduction to AWS Cloud Computing | AWS Public Sector Summit 2016Introduction to AWS Cloud Computing | AWS Public Sector Summit 2016
Introduction to AWS Cloud Computing | AWS Public Sector Summit 2016Amazon Web Services
 
Introduction to AWS Cloud Computing
Introduction to AWS Cloud ComputingIntroduction to AWS Cloud Computing
Introduction to AWS Cloud ComputingAmazon Web Services
 
Azure Application Modernization
Azure Application ModernizationAzure Application Modernization
Azure Application ModernizationKarina Matos
 
The Ideal Approach to Application Modernization; Which Way to the Cloud?
The Ideal Approach to Application Modernization; Which Way to the Cloud?The Ideal Approach to Application Modernization; Which Way to the Cloud?
The Ideal Approach to Application Modernization; Which Way to the Cloud?Codit
 
What is Cloud Computing with Amazon Web Services?
What is Cloud Computing with Amazon Web Services?What is Cloud Computing with Amazon Web Services?
What is Cloud Computing with Amazon Web Services?Amazon Web Services
 
Cloud-Native Observability
Cloud-Native ObservabilityCloud-Native Observability
Cloud-Native ObservabilityTyler Treat
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
Azure Arc Overview from Microsoft
Azure Arc Overview from MicrosoftAzure Arc Overview from Microsoft
Azure Arc Overview from MicrosoftDavid J Rosenthal
 

Tendances (20)

Setting Up a Landing Zone
Setting Up a Landing ZoneSetting Up a Landing Zone
Setting Up a Landing Zone
 
Webinar aws 101 a walk through the aws cloud- introduction to cloud computi...
Webinar aws 101   a walk through the aws cloud- introduction to cloud computi...Webinar aws 101   a walk through the aws cloud- introduction to cloud computi...
Webinar aws 101 a walk through the aws cloud- introduction to cloud computi...
 
Azure Container Apps
Azure Container AppsAzure Container Apps
Azure Container Apps
 
Simplify & Standardise Your Migration to AWS with a Migration Landing Zone
Simplify & Standardise Your Migration to AWS with a Migration Landing ZoneSimplify & Standardise Your Migration to AWS with a Migration Landing Zone
Simplify & Standardise Your Migration to AWS with a Migration Landing Zone
 
App Modernization
App ModernizationApp Modernization
App Modernization
 
Microservices Architectures on Amazon Web Services
Microservices Architectures on Amazon Web ServicesMicroservices Architectures on Amazon Web Services
Microservices Architectures on Amazon Web Services
 
Introduction To AWS & AWS Lambda
Introduction To AWS & AWS LambdaIntroduction To AWS & AWS Lambda
Introduction To AWS & AWS Lambda
 
AWS Webinar 201: Designing scalable, available & resilient cloud applications
AWS Webinar 201: Designing scalable, available & resilient cloud applicationsAWS Webinar 201: Designing scalable, available & resilient cloud applications
AWS Webinar 201: Designing scalable, available & resilient cloud applications
 
An Introduction to AWS
An Introduction to AWSAn Introduction to AWS
An Introduction to AWS
 
Cloud Migration Workshop
Cloud Migration WorkshopCloud Migration Workshop
Cloud Migration Workshop
 
Migration Planning
Migration PlanningMigration Planning
Migration Planning
 
Introduction to AWS Cloud Computing | AWS Public Sector Summit 2016
Introduction to AWS Cloud Computing | AWS Public Sector Summit 2016Introduction to AWS Cloud Computing | AWS Public Sector Summit 2016
Introduction to AWS Cloud Computing | AWS Public Sector Summit 2016
 
Cloud Migration Strategy Framework
Cloud Migration Strategy FrameworkCloud Migration Strategy Framework
Cloud Migration Strategy Framework
 
Introduction to AWS Cloud Computing
Introduction to AWS Cloud ComputingIntroduction to AWS Cloud Computing
Introduction to AWS Cloud Computing
 
Azure Application Modernization
Azure Application ModernizationAzure Application Modernization
Azure Application Modernization
 
The Ideal Approach to Application Modernization; Which Way to the Cloud?
The Ideal Approach to Application Modernization; Which Way to the Cloud?The Ideal Approach to Application Modernization; Which Way to the Cloud?
The Ideal Approach to Application Modernization; Which Way to the Cloud?
 
What is Cloud Computing with Amazon Web Services?
What is Cloud Computing with Amazon Web Services?What is Cloud Computing with Amazon Web Services?
What is Cloud Computing with Amazon Web Services?
 
Cloud-Native Observability
Cloud-Native ObservabilityCloud-Native Observability
Cloud-Native Observability
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Azure Arc Overview from Microsoft
Azure Arc Overview from MicrosoftAzure Arc Overview from Microsoft
Azure Arc Overview from Microsoft
 

Similaire à Failure is not an Option - Designing Highly Resilient AWS Systems

Scale - Failure is not an Option: Designing Highly Resilient AWS Systems
Scale - Failure is not an Option: Designing Highly Resilient AWS SystemsScale - Failure is not an Option: Designing Highly Resilient AWS Systems
Scale - Failure is not an Option: Designing Highly Resilient AWS SystemsAmazon Web Services
 
NIST Compliance, AWS Federal Pop-Up Loft
NIST Compliance, AWS Federal Pop-Up LoftNIST Compliance, AWS Federal Pop-Up Loft
NIST Compliance, AWS Federal Pop-Up LoftAmazon Web Services
 
Hybrid Solutions at the Edge – Go Global Faster, Efficiently, and More Secure...
Hybrid Solutions at the Edge – Go Global Faster, Efficiently, and More Secure...Hybrid Solutions at the Edge – Go Global Faster, Efficiently, and More Secure...
Hybrid Solutions at the Edge – Go Global Faster, Efficiently, and More Secure...Amazon Web Services
 
Cybersecurity: A Drive Force Behind Cloud Adoption
Cybersecurity: A Drive Force Behind Cloud AdoptionCybersecurity: A Drive Force Behind Cloud Adoption
Cybersecurity: A Drive Force Behind Cloud AdoptionAmazon Web Services
 
AWS PROTECTED - Why This Matters to Australia.
AWS PROTECTED - Why This Matters to Australia.AWS PROTECTED - Why This Matters to Australia.
AWS PROTECTED - Why This Matters to Australia.Amazon Web Services
 
Innovate - Cybersecurity: A Drive Force Behind Cloud Adoption
Innovate - Cybersecurity: A Drive Force Behind Cloud AdoptionInnovate - Cybersecurity: A Drive Force Behind Cloud Adoption
Innovate - Cybersecurity: A Drive Force Behind Cloud AdoptionAmazon Web Services
 
Innovate - Become Migration Ready: Accelerate and Optimise your Cloud Adoptio...
Innovate - Become Migration Ready: Accelerate and Optimise your Cloud Adoptio...Innovate - Become Migration Ready: Accelerate and Optimise your Cloud Adoptio...
Innovate - Become Migration Ready: Accelerate and Optimise your Cloud Adoptio...Amazon Web Services
 
2. migration, disaster recovery and business continuity in the cloud
2. migration, disaster recovery and business continuity in the cloud2. migration, disaster recovery and business continuity in the cloud
2. migration, disaster recovery and business continuity in the cloudReham Maher El-Safarini
 
Cost Optimization on AWS (REPEAT)
Cost Optimization on AWS (REPEAT)Cost Optimization on AWS (REPEAT)
Cost Optimization on AWS (REPEAT)Amazon Web Services
 
Breaking Up the Monolith with Containers
Breaking Up the Monolith with ContainersBreaking Up the Monolith with Containers
Breaking Up the Monolith with ContainersAmazon Web Services
 
Leaping Over the Skills Gap - Accelerate Your Journey with AMS
Leaping Over the Skills Gap - Accelerate Your Journey with AMSLeaping Over the Skills Gap - Accelerate Your Journey with AMS
Leaping Over the Skills Gap - Accelerate Your Journey with AMSAmazon Web Services
 
Continuous Diagnostics and Mitigation (CDM) at Cloud Scale: How Federal Agenc...
Continuous Diagnostics and Mitigation (CDM) at Cloud Scale: How Federal Agenc...Continuous Diagnostics and Mitigation (CDM) at Cloud Scale: How Federal Agenc...
Continuous Diagnostics and Mitigation (CDM) at Cloud Scale: How Federal Agenc...Amazon Web Services
 
Desktop-as-a-Service: Flexible Application Delivery to Cloud-Native Desktops
Desktop-as-a-Service: Flexible Application Delivery to Cloud-Native DesktopsDesktop-as-a-Service: Flexible Application Delivery to Cloud-Native Desktops
Desktop-as-a-Service: Flexible Application Delivery to Cloud-Native DesktopsAmazon Web Services
 
以容器技術為基礎的混合雲設計架構
以容器技術為基礎的混合雲設計架構以容器技術為基礎的混合雲設計架構
以容器技術為基礎的混合雲設計架構Amazon Web Services
 
How Nubank is building a customer-obsessed bank - FSV201 - New York AWS Summit
How Nubank is building a customer-obsessed bank - FSV201 - New York AWS SummitHow Nubank is building a customer-obsessed bank - FSV201 - New York AWS Summit
How Nubank is building a customer-obsessed bank - FSV201 - New York AWS SummitAmazon Web Services
 

Similaire à Failure is not an Option - Designing Highly Resilient AWS Systems (20)

Scale - Failure is not an Option: Designing Highly Resilient AWS Systems
Scale - Failure is not an Option: Designing Highly Resilient AWS SystemsScale - Failure is not an Option: Designing Highly Resilient AWS Systems
Scale - Failure is not an Option: Designing Highly Resilient AWS Systems
 
NIST Compliance, AWS Federal Pop-Up Loft
NIST Compliance, AWS Federal Pop-Up LoftNIST Compliance, AWS Federal Pop-Up Loft
NIST Compliance, AWS Federal Pop-Up Loft
 
Hybrid Solutions at the Edge – Go Global Faster, Efficiently, and More Secure...
Hybrid Solutions at the Edge – Go Global Faster, Efficiently, and More Secure...Hybrid Solutions at the Edge – Go Global Faster, Efficiently, and More Secure...
Hybrid Solutions at the Edge – Go Global Faster, Efficiently, and More Secure...
 
Cybersecurity: A Drive Force Behind Cloud Adoption
Cybersecurity: A Drive Force Behind Cloud AdoptionCybersecurity: A Drive Force Behind Cloud Adoption
Cybersecurity: A Drive Force Behind Cloud Adoption
 
Automated Security Remediation
Automated Security RemediationAutomated Security Remediation
Automated Security Remediation
 
AWS PROTECTED - Why This Matters to Australia.
AWS PROTECTED - Why This Matters to Australia.AWS PROTECTED - Why This Matters to Australia.
AWS PROTECTED - Why This Matters to Australia.
 
Innovate - Cybersecurity: A Drive Force Behind Cloud Adoption
Innovate - Cybersecurity: A Drive Force Behind Cloud AdoptionInnovate - Cybersecurity: A Drive Force Behind Cloud Adoption
Innovate - Cybersecurity: A Drive Force Behind Cloud Adoption
 
Cost Optimization on AWS
Cost Optimization on AWSCost Optimization on AWS
Cost Optimization on AWS
 
Innovate - Become Migration Ready: Accelerate and Optimise your Cloud Adoptio...
Innovate - Become Migration Ready: Accelerate and Optimise your Cloud Adoptio...Innovate - Become Migration Ready: Accelerate and Optimise your Cloud Adoptio...
Innovate - Become Migration Ready: Accelerate and Optimise your Cloud Adoptio...
 
Cost Optimisation
Cost OptimisationCost Optimisation
Cost Optimisation
 
From Monolith to Microservices
From Monolith to MicroservicesFrom Monolith to Microservices
From Monolith to Microservices
 
2. migration, disaster recovery and business continuity in the cloud
2. migration, disaster recovery and business continuity in the cloud2. migration, disaster recovery and business continuity in the cloud
2. migration, disaster recovery and business continuity in the cloud
 
Keynote: Introduction to AWS
Keynote: Introduction to AWS Keynote: Introduction to AWS
Keynote: Introduction to AWS
 
Cost Optimization on AWS (REPEAT)
Cost Optimization on AWS (REPEAT)Cost Optimization on AWS (REPEAT)
Cost Optimization on AWS (REPEAT)
 
Breaking Up the Monolith with Containers
Breaking Up the Monolith with ContainersBreaking Up the Monolith with Containers
Breaking Up the Monolith with Containers
 
Leaping Over the Skills Gap - Accelerate Your Journey with AMS
Leaping Over the Skills Gap - Accelerate Your Journey with AMSLeaping Over the Skills Gap - Accelerate Your Journey with AMS
Leaping Over the Skills Gap - Accelerate Your Journey with AMS
 
Continuous Diagnostics and Mitigation (CDM) at Cloud Scale: How Federal Agenc...
Continuous Diagnostics and Mitigation (CDM) at Cloud Scale: How Federal Agenc...Continuous Diagnostics and Mitigation (CDM) at Cloud Scale: How Federal Agenc...
Continuous Diagnostics and Mitigation (CDM) at Cloud Scale: How Federal Agenc...
 
Desktop-as-a-Service: Flexible Application Delivery to Cloud-Native Desktops
Desktop-as-a-Service: Flexible Application Delivery to Cloud-Native DesktopsDesktop-as-a-Service: Flexible Application Delivery to Cloud-Native Desktops
Desktop-as-a-Service: Flexible Application Delivery to Cloud-Native Desktops
 
以容器技術為基礎的混合雲設計架構
以容器技術為基礎的混合雲設計架構以容器技術為基礎的混合雲設計架構
以容器技術為基礎的混合雲設計架構
 
How Nubank is building a customer-obsessed bank - FSV201 - New York AWS Summit
How Nubank is building a customer-obsessed bank - FSV201 - New York AWS SummitHow Nubank is building a customer-obsessed bank - FSV201 - New York AWS Summit
How Nubank is building a customer-obsessed bank - FSV201 - New York AWS Summit
 

 


Failure is not an Option - Designing Highly Resilient AWS Systems

  • 1. PUBLIC SECTOR SUMMIT, Washington, DC
  • 2. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. PUBLIC SECTOR SUMMIT. Failure is not an Option: Designing Highly Resilient AWS Systems. Tim Griesbach, Manager, Solutions Architecture, AWS WWPS. 302955
  • 3. Agenda
      • What are we planning for? Risk and resiliency requirements.
      • Think resiliently. Principles of resiliency.
      • Resilient design. System, test, and operations patterns and best practices.
  • 4. “Everything fails, all the time” - Werner Vogels (CTO, Amazon.com)
  • 5. Resiliency is the ability of a system to recover quickly and continue operating even when a failure occurs.
  • 6. Push for resiliency
      • IT failures have broad societal impact: government services, airlines, financial services, communications
      • Reputation / legal: more and more people depend on IT systems for everything
      • $$$: lost productivity (idle time of people who depend on the system), lost productivity (time spent putting out fires and recovering), lost revenue from the system
  • 7. What are we planning for?
  • 8.
  • 9.
  • 10.
  • 11. Consider each application's significance to your business, and the potential impact of a disruption.
  • 12.
  • 13. Cause | Examples | Probability
      Operator error | Manual operator error | HIGH
      Deployment induced | Software, hardware, network, or configuration deployment; both automated and manual changes | HIGH
      Load induced | Change in behavior, either of a specific caller or in aggregate; service reaching a tipping point; load failures can occur in the network; denial of service (DDoS) | HIGH
      Data induced | Data accepted by the system that it can’t process (“poison pill”) | MED
      Credential expiration | Expiration of a certificate or credentials | MED
      Hardware failure | Any hardware component in the system, e.g. hosts, storage, network, or elsewhere | LOW
      Infrastructure | Power feed or environmental conditions | LOW
  • 14. On-premises data center realities
      • Traditional DR “weekend tests”
      • Connected to the internet? Exposed to the same external attacks
      • DR site is always “ACTIVE” and hence you are paying for resources
      • Data is constantly replicated
      • Data center security compliance is expensive and resource intensive
  • 15. Think Resiliently. Principles of Resiliency
  • 16. Availability, Reliability, and Resilience in 21st Century Architectures
      • Test recovery procedures
      • Automatically recover from failure
      • Scale horizontally to improve availability
      • Stop guessing capacity
      • Manage change through automation
  • 17. What do those 9’s really mean?
      Availability | Max disruption (per year) | Max disruption (per month) | Max disruption (per day)
      99% | 3 days 15 hours | 7.31 hours | 14.4 minutes
      99.9% | 8 hours 45 minutes | 43.83 minutes | 1.44 minutes
      99.95% | 4 hours 22 minutes | 21.92 minutes | 43.2 seconds
      99.99% | 52 minutes | 4.38 minutes | 8.64 seconds
      99.999% | 5 minutes | 26.3 seconds | 864 milliseconds
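The figures on this slide follow from one piece of arithmetic: the allowed disruption is the unavailable fraction (1 minus availability) multiplied by the length of the period. A minimal Python sketch of that calculation is below. The function name `max_disruption` and the `PERIODS` table are illustrative, not from the deck, and the month length (1/12 of a 365.25-day year) is an assumption that happens to match the slide's monthly figures.

```python
from datetime import timedelta

# Period lengths. The month is taken as 1/12 of a 365.25-day year, an
# assumption chosen because it reproduces the slide's monthly figures.
PERIODS = {
    "year": timedelta(days=365),
    "month": timedelta(days=365.25 / 12),
    "day": timedelta(days=1),
}


def max_disruption(availability_pct: float, period: str) -> timedelta:
    """Return the downtime budget implied by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return PERIODS[period] * unavailable_fraction


if __name__ == "__main__":
    # Print the budget for each availability target used on the slide.
    for target in (99, 99.9, 99.95, 99.99, 99.999):
        budgets = ", ".join(
            f"{name}: {max_disruption(target, name)}" for name in PERIODS
        )
        print(f"{target}% -> {budgets}")
```

Each additional nine divides the budget by ten, which is why 99.999% leaves well under a second of disruption per day.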
  • 18. Recovery Point and Recovery Time Objective (timeline diagram: recovery point, disaster, and recovery time marked along a time axis; data loss spans recovery point to disaster, down time spans disaster to recovery time)
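The two intervals on this timeline can be computed directly from three timestamps and compared against the objectives. The sketch below is illustrative only: the function names, the timestamps, and the RPO/RTO values in the usage lines are hypothetical and do not come from the deck.

```python
from datetime import datetime, timedelta


def recovery_metrics(recovery_point, disaster, service_restored):
    """Return (data_loss, down_time) for one incident as timedeltas."""
    if not recovery_point <= disaster <= service_restored:
        raise ValueError("expected recovery_point <= disaster <= service_restored")
    data_loss = disaster - recovery_point    # work done since the last recovery point
    down_time = service_restored - disaster  # time until service is available again
    return data_loss, down_time


def meets_objectives(data_loss, down_time, rpo, rto):
    """True when the incident stayed within both the RPO and the RTO."""
    return data_loss <= rpo and down_time <= rto


# Hypothetical incident: last backup at 02:00, failure at 02:45, restored at 04:15.
loss, down = recovery_metrics(
    datetime(2019, 6, 11, 2, 0),
    datetime(2019, 6, 11, 2, 45),
    datetime(2019, 6, 11, 4, 15),
)
within = meets_objectives(loss, down, rpo=timedelta(hours=1), rto=timedelta(hours=2))
```

In this made-up case the data loss is 45 minutes and the down time is 90 minutes, so a 1-hour RPO and a 2-hour RTO are both met.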
  • 19. You control how you manage your own risks (diagram with layers labeled AWS and Customers)
      • Customer Risk Appetite and Desired Control Environment: Business Risks, Sourcing Risks, Technology Risks, Security Risks, Compliance
      • Customer Provided and Managed Controls: Classification, Security Policy, Encryption, Governance, ITDaM, ITSM, Monitoring, Operations, Malware, Risk Management
      • AWS Provided, Customer Configured and Managed Controls: Virtual Private Cloud, Key Management, Logging, other AWS features and services
      • AWS Managed and Audited Controls: SOC 1, SOC 2, PCI-DSS, NIST 800-53, ISO 27001
  • 20. Design Systems Resiliently
  • 21. “A complex system that works is invariably found to have evolved from a simple system that worked.” (Gall’s Law)
  • 22. It’s not binary. Start somewhere and scale up.
  • 23. “There is no compression algorithm for experience.” AWS has had 13+ years to build the world’s most reliable, secure, scalable, and cost-effective infrastructure.
      • Your operational DNA has to be crafted for reliability.
      • Service SLAs between 99.9% and 100% availability
      • Amazon S3 is designed for 99.999999999% durability
      • AWS Availability Zones exist on isolated fault lines, flood plains, networks, and local electrical grids to substantially reduce the chance of simultaneous failure.
      • Disaster is inevitable; automation + redundancy = availability.
      We are driven to remove any and all causes of failure. Our goal is to make our operational performance indistinguishable from perfect.
  • 24. AWS Regional Expansion
      • 23 Regions and 67 AZs
      • 4 new Regions and 12 AZs
      • 2 GovCloud Regions today
      • New GovCloud, TS, and Secret Regions coming soon
  • 25. AWS Connectivity Options: Amazon Global Network (map legend: AWS Region with multiple Edge Locations, Amazon CloudFront PoPs, AWS Direct Connect location)
      • 96 AWS Direct Connect locations
      • Customers can reach every public AWS Region (except China) from the local Direct Connect location
  • 26. Resilient AWS Cloud (Figure 3)
      • Infrastructure: Regions, AZs, networking
      • Service design: cell-based architecture, multi-AZ architecture, microservice architecture, distributed systems best practices
      • Understand the scope of each AWS service: single AZ, Regional, global, cross-Regional capability
  • 27. Resilient Networking
      • Networking is the foundation: packets must get from point A to point B
      • Ensure the network supporting your applications is appropriately redundant, always available, and seamlessly routed.
      • AWS provides a global infrastructure with 20 Regions and 61 Availability Zones (at the time of publication)
      • AWS services: Amazon EC2 networking; Amazon Virtual Private Cloud (VPC), VPC Peering, VPC Sharing; AWS gateways for external, internal, and back-to-on-premises routing (VPN, Transit); DNS (Route 53); Elastic Load Balancing (ELB)
  • 28. Resilient Data (Figure 10)
      • You must have confidence in the resilience of your data
      • Many forms: filesystem, block storage, databases, in-memory caches
      • Consider how eventual consistency impacts design
      • AWS services: Amazon S3 cross-region replication; cross-region snapshots (Amazon EBS volumes); Amazon RDS cross-region replicas; AWS Storage & File Gateway; Amazon FSx for Windows and Lustre
  • 29. Self-Healing applications
      • Highly resilient applications must be able to self-heal.
      • How: leverage a microservices app architecture; decouple inter-dependencies (loose coupling); remove state from app components
      • AWS services: Elastic Load Balancing, AWS Auto Scaling, Amazon Simple Queue Service
  • 30.
  • 31. Single Region: Multi AZ
      • Start here before adopting a more complex architecture
      • Only consider multi-region if requirements dictate
      • Pros: AWS Region-wide services are available, including Amazon S3, Amazon DynamoDB, Amazon EFS, Amazon SQS, and Amazon Kinesis; much less complexity in design, implementation, and operations
      • Cons: if you need >99.9% resiliency, consider multi-region; may not meet the needs of regulators
  • 32. Multi-Region: Active-Standby
      • Traditional DR pattern
      • Backup environment used only in the event of failure
      • Pros: suits apps which cannot use native AWS features; fewest changes to the application
      • Cons: delays while Standby becomes Active (hours); RPO limited by replication lag
      • AWS services: Amazon RDS, Amazon Route 53
  • 33. Multi-Region: Active-Active
      • Both stacks active, traffic distributed
      • Data replication is critical; must consider latency impacts
      • Pros: zero RTO; works well for apps that can partition users
      • Cons: data replication must be handled by applications
      • AWS services: storage replication from APN partners, Amazon RDS, Amazon DynamoDB, Amazon Aurora, AWS Database Migration Service, Amazon Route 53
  • 34. Multi-Region: Dual-write
      • Shared-nothing architecture: all transactions (TX) processed in duplicate/parallel
      • Good for legacy applications
      • Pros: zero RPO; little or no change to apps in each region
      • Cons: requires checkpointing; reconciliation jobs to ensure sites are in sync; downstream apps must avoid duplicates
      • AWS services: AWS Lambda, Amazon Route 53
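Two of the dual-write costs listed here, avoiding duplicates downstream and reconciling the two sites, can be illustrated with a small Python sketch. It assumes every transaction carries a unique ID that is identical in both regions; that ID, the class `IdempotentConsumer`, and the function `reconcile` are hypothetical names, and the in-memory set stands in for what would need to be a durable store in practice.

```python
class IdempotentConsumer:
    """Downstream consumer that applies each transaction ID at most once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # in-memory only; a real system needs durable storage

    def receive(self, tx_id, payload):
        """Apply the payload unless this tx_id was already processed."""
        if tx_id in self._seen:
            return False  # duplicate delivered by the other region; ignore it
        self._handler(payload)
        self._seen.add(tx_id)
        return True


def reconcile(region_a_ids, region_b_ids):
    """Return (missing_in_a, missing_in_b) so a job can replay the gaps."""
    a, b = set(region_a_ids), set(region_b_ids)
    return b - a, a - b


# Both regions forward the same transactions; only the first copy is applied.
applied = []
consumer = IdempotentConsumer(applied.append)
for tx_id, payload in [("tx-1", "debit 10"), ("tx-1", "debit 10"), ("tx-2", "credit 5")]:
    consumer.receive(tx_id, payload)

missing_in_a, missing_in_b = reconcile(["tx-1", "tx-2"], ["tx-1", "tx-2", "tx-3"])
```

Here `applied` ends up holding each payload once, and the reconciliation reports that `tx-3` never reached the first region.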
  • 35. Anti-Patterns
      • Replicating on-premises problems and patterns in the cloud
      • Using non-redundant architectures to meet schedules
      • Single data center (Availability Zone) architectures
      • Reusing manual processes
      • Data retention practices, failover and scaling
      • Responding to monitoring alerts and metrics (vs. self-healing, auto scaling)
      • Assuming data is safe in your data center
      Don't sacrifice long-term value for short-term results.
  • 36. Resilient operations, often overlooked
  • 37. Operations is a key pillar of resiliency
      Success in operations means you:
      • Implement changes successfully and consistently
      • Have insight into operational health
      • Have insight into achievement of business outcomes
      • Respond promptly and effectively to events impacting the application
      How?
      • Perform operations as code
      • Annotate documentation
      • Make frequent, small, reversible changes
      • Refine operations procedures frequently
      • Anticipate failure
      • Learn from operational failures
  • 38. Operations monitoring
      • Must detect failures fast
      • Applications emit telemetry so that failures can be detected
      • Processes defined and understood
      • AWS services: Amazon CloudWatch, AWS Personal Health Dashboard
  • 39. Operations automation: “A key mechanism to achieve this is to automate the management as much as possible, removing error prone, manual operations.” - Werner Vogels
  • 40. Application deployment
      • Infrastructure as code: integrates infrastructure and application change processes
      • Examples include staged deployment, canary deployments, isolation zone deployments, and automatic rollback
      • AWS services: AWS CodeBuild, AWS CodeCommit, AWS CodeDeploy, AWS CodePipeline
  • 41. Testing enforces resiliency
  • 42.
  • 43. Application testing & certification
      • Resilient systems continue to operate successfully in the presence of failures.
      • Failure Mode Effect Analysis (FMEA) is an industry-standard technique.
      • Estimate a risk priority number (RPN) between 1 and 1000: rank probability, severity, and observability on a 1-10 scale, where 1 is good and 10 is bad, and multiply them.
      • A perfectly low-probability, low-impact, easy-to-measure risk has an RPN of 1.
      • An extremely frequent, permanently damaging, impossible-to-detect risk has an RPN of 1000.
      Failure impact analysis
      Failure | Effect | Mitigation | Result
      Failure of an AZ | Temporary capacity reduction | Automatic failover to secondary AZ | Temporary performance degradation
      Total failure of satellite region | Data replication offline | Repair/reconfigure replication using alternate region | No service interruption
      Partition of network between regions | Data replication offline | Auto recovery when network is available | No service interruption
      Total failure of primary region | Service offline | Failover to secondary region | Service restored within two hours
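The RPN calculation described on this slide is a product of three 1-10 scores. A short Python sketch is below; the `FailureMode` class is an illustrative name, and the scores given to the failure modes in the usage lines are invented for the example, since the deck does not assign any.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One FMEA row: three scores, each from 1 (good) to 10 (bad)."""

    name: str
    probability: int
    severity: int
    observability: int

    def __post_init__(self):
        # Reject scores outside the 1-10 scale described on the slide.
        for field_name in ("probability", "severity", "observability"):
            score = getattr(self, field_name)
            if not 1 <= score <= 10:
                raise ValueError(f"{field_name} must be between 1 and 10, got {score}")

    @property
    def rpn(self) -> int:
        """Risk priority number: the product of the three scores (1 to 1000)."""
        return self.probability * self.severity * self.observability


# Hypothetical scores, for illustration only.
modes = [
    FailureMode("Failure of an AZ", probability=3, severity=4, observability=2),
    FailureMode("Total failure of primary region", probability=1, severity=9, observability=2),
    FailureMode("Partition of network between regions", probability=4, severity=3, observability=5),
]
ranked = sorted(modes, key=lambda mode: mode.rpn, reverse=True)
```

Sorting by RPN gives a first-pass ordering for mitigation work; with these made-up scores the network partition (RPN 60) ranks above the AZ failure (24) and the region failure (18).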
  • 44. Chaos engineering
      • The cloud has ushered in a new method of testing
      • Principles of Chaos Engineering: “Chaos Engineering can be thought of as the facilitation of experiments to uncover systemic weaknesses.” https://principlesofchaos.org/
      • Principles: build a hypothesis around steady-state behavior; apply variations that simulate real-world events; run experiments in production; automate the experiments to run continuously; minimize the blast radius of failures
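The shape of a single experiment (check the steady state, inject one failure, check again, always restore) can be sketched in a few lines of Python. Everything here is a toy: `run_experiment` and `ToyFleet` are hypothetical names, and the fleet is an in-process stand-in rather than a real AWS resource or chaos tool.

```python
from typing import Callable


def run_experiment(
    steady_state: Callable[[], bool],
    inject_failure: Callable[[], None],
    restore: Callable[[], None],
) -> str:
    """Run one chaos experiment and report whether the hypothesis held."""
    if not steady_state():
        # Never inject failure into a system that is already unhealthy.
        return "aborted: system not in steady state"
    inject_failure()
    try:
        return "hypothesis held" if steady_state() else "weakness found"
    finally:
        restore()  # always undo the injected failure


class ToyFleet:
    """Hypothetical stand-in for a service: instances behind a load balancer."""

    def __init__(self, size: int, min_healthy: int) -> None:
        self.size = size
        self.healthy = size
        self.min_healthy = min_healthy

    def serving(self) -> bool:
        # Steady-state hypothesis: enough instances are healthy to serve traffic.
        return self.healthy >= self.min_healthy

    def kill_one(self) -> None:
        # Small blast radius: remove a single instance only.
        self.healthy = max(0, self.healthy - 1)

    def heal(self) -> None:
        self.healthy = self.size


redundant = ToyFleet(size=3, min_healthy=2)
result = run_experiment(redundant.serving, redundant.kill_one, redundant.heal)
```

A fleet with one spare instance survives the injected failure, while a fleet sized exactly to its minimum would report a weakness, which is the kind of finding the experiment exists to surface.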
  • 45.
  • 46. Continuous Testing of Infrastructure
      Regularly execute tests in stable, production, and production-like test environments.
      Treat Infrastructure as Code
      • CI/CD test in the infrastructure build pipeline
      • Testing of infrastructure during integration test
  • 47.
  • 48. Future Considerations
      • Consider serverless: it reduces maintenance and moves responsibility for resilient design to AWS.
      • Take advantage of our distributed systems by building on top of them: Amazon S3, AWS Lambda, Amazon ECS.
      • Break systems down into smaller pieces along logical seams. Reduce the blast radius of a failure of any individual piece of the system.
      • Use the AWS Well-Architected Tool to assess your applications: https://aws.amazon.com/well-architected-tool
  • 49. Additional Resources
      • AWS Well-Architected: https://aws.amazon.com/architecture/well-architected
      • AWS Whitepaper: Building Mission-Critical Financial Services Applications on AWS, April 2019
      • AWS re:Invent 2018: Close Loops & Opening Minds: How to Take Control of Systems, Big & Small: https://www.youtube.com/watch?v=O8xLxNje30M
      • Building Microservices: Designing Fine-Grained Systems
      • AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)
  • 50.
  • 51. Thank you! Tim Griesbach, awstim@amazon.com