SlideShare une entreprise Scribd logo
1  sur  8
Télécharger pour lire hors ligne
Operations @ Scale
Anurag Gupta, VP
AWS Database Services
Dev/Ops
How I learned to stop worrying and love my pager
Dev/Ops - your dev org is your ops org
I get a pager! You get a pager! Everyone gets a pager!
Why would I possibly want this?
It motivates design for operability
It aligns your interests w/ your customer experience
It improves the feedback loop to customer needs
Monitor everything
Every API call to your service,
Every API call you make to a dependent service
Canary traffic for things that vary (eg SQL statements)
Most of the metrics won’t be meaningful. That’s OK
Page on your high signal-to-noise metrics
Monitor these metrics during deployments
Median/Average, Fleet-wide, coarse time grain are obscuring
Measure TP90, TP99 (99th percentile response time)
Measure at finer and finer grain
Evaluate per-customer metrics
Look for the needles in the haystack
Correction-of-Error (COE) Reporting
Meet weekly on operations (execs, service operators)
Review each issue that happened.
“Spin the wheel” to review a service’s metrics
Support a “truth-seeking” culture
Looking for data, process improvements
COE
- Customer impact
- Timeline: incidence to detection to response to resolution
- 5 Whys? Get to actionable changes to extinguish cause
- Actions
Ops is Dev
Humans are fallible
circa 1% defect injection rate
Error rate changes based on time of day (3am vs 3pm)
New ones show up, have unique issues
Limit human access to machines
Use code/scripts/tools instead
Scripts are code
unit test, code review, deploy, automate
Ops load correlates to business growth
As your business does well, your
operations needs to become great
Growing 100-200% YoY is hard.
Improving ops 100-200% YoY is really
hard.
Improving ops 2% each week is possible.
Use Pareto analysis to prioritize work
Bonus – each customer gets a better
experience even as your own ops load
stays constant
Amazon Redshift has grown rapidly since it became generally
available in February 2013. While our guiding principles have
served us well over the past two years, we now manage many
thousands of database instances and below offer some lessons we
have learned from operating databases at scale.
Design escalators, not elevators: Failures are common when
operating large fleets with many service dependencies. A key
lesson for us has been to design systems that degrade on failures
rather than losing outright availability. These are a common
design pattern when working with hardware failures, for example,
replicating data blocks to mask issues with disks. They are less
common when working with software or service dependencies,
though still necessary when operating in a dynamic environment.
Amazon overall (including AWS) had 50 million code
deployments over the past 12 months. Inevitably, at this scale, a
small number of regressions will occur and cause issues until
reverted. It is helpful to make one’s own service resilient to an
underlying service outage. For example, we support the ability to
preconfigure nodes in each data center, allowing us to continue to
provision and replace nodes for a period of time if there is an
Amazon EC2 provisioning interruption. One can locally increase
replication to withstand an Amazon S3 or network interruption.
We are adding similar mitigation strategies for other external
understanding that, even if not a widespread concern, each issue is
meaningful to the customer experiencing it. In Figure 5, Sev 2
refers to a severity 2 alarm that causes an engineer to get paged.
This means operational load roughly correlates to business
success. Within Amazon Redshift, we collect error logs across our
fleet and monitor tickets to understand top ten causes of error,
with the aim of extinguishing one of the top ten causes of error
each week.
Figure 5: Tickets per cluster over time
Pareto analysis is equally useful in understanding customer
functional requirements. However, it is more difficult to collect.
Escalators, not elevators
Failures happen.
Durability failures are “easy”
mirroring, quorums, well understood techniques
Availability failures are “hard” –
want to degrade on unavailability not cascade failures
tolerate 1-2 hours of unavailability (time to detect, fix)
- eg caching IP addresses when DNS is unavailable
- eg maintaining instance warm pools rather than provisioning
- eg losing the ability to restore a backup, not lose writes
Ship often
Continuous delivery should be to the
customer
Benefits
Customers prefer small patches
Rollback is easier
Rollback is less likely
Faster response to customer issues
We push a new database engine
version, including both features and
bug fixes, every two weeks.
dependencies that can fail independently from the database itself.
Continuous delivery should be to the customer: Many
engineering organizations now use continuous build and
automated test pipelines to a releasable staging environment.
However, few actually push the release itself at a frequent pace.
While customers would prefer small patches to large ones for the
same reasons engineering organizations prefer to build and test
continuously, patching is an onerous process. This often leads to
special-case, one-off patches per customer that are limited in
scope – while necessary, they make patching yet more fragile.
Figure 4: Cumulative features deployed over time
Amazon Redshift is set up to automatically patch customer
clusters on a weekly basis in a 30-minute window specified by the
Cumulative features deployed over time

Contenu connexe

Tendances

Five (easy?) Steps Towards Continuous Delivery
Five (easy?) Steps Towards Continuous DeliveryFive (easy?) Steps Towards Continuous Delivery
Five (easy?) Steps Towards Continuous DeliveryEberhard Wolff
 
AWS Customer Presentation - How Runa uses AWS
AWS Customer Presentation - How Runa uses AWS AWS Customer Presentation - How Runa uses AWS
AWS Customer Presentation - How Runa uses AWS Amazon Web Services
 
4 extreme performance - part ii
4   extreme performance - part ii4   extreme performance - part ii
4 extreme performance - part iisqlserver.co.il
 
Serverless lessons learned #4 circuit breaker
Serverless lessons learned #4 circuit breakerServerless lessons learned #4 circuit breaker
Serverless lessons learned #4 circuit breakerMaik Wiesmüller
 
Veeam Using cloud connect in 3 unexpected, awesome ways
Veeam Using cloud connect in 3 unexpected, awesome waysVeeam Using cloud connect in 3 unexpected, awesome ways
Veeam Using cloud connect in 3 unexpected, awesome waysTanawit Chansuchai
 
MySQL HA Presentation
MySQL HA PresentationMySQL HA Presentation
MySQL HA Presentationpapablues
 
Serverless lessons learned #8 backoff
Serverless lessons learned #8 backoffServerless lessons learned #8 backoff
Serverless lessons learned #8 backoffMaik Wiesmüller
 
Oregon State Solves Critical Storage Pain Points with a Simple, Scalable Solu...
Oregon State Solves Critical Storage Pain Points with a Simple, Scalable Solu...Oregon State Solves Critical Storage Pain Points with a Simple, Scalable Solu...
Oregon State Solves Critical Storage Pain Points with a Simple, Scalable Solu...VMware
 
SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLADr Ganesh Iyer
 
Rapidly Deploy Enterprise Cloud Sandboxes
Rapidly Deploy Enterprise Cloud SandboxesRapidly Deploy Enterprise Cloud Sandboxes
Rapidly Deploy Enterprise Cloud SandboxesElastra
 
Managing RightScale on RightScale
Managing RightScale on RightScaleManaging RightScale on RightScale
Managing RightScale on RightScaleRightScale
 
Managing RightScale on RightScale
Managing RightScale on RightScaleManaging RightScale on RightScale
Managing RightScale on RightScaleRightScale
 
How To Combine Back-End 
 & Front-End Testing with BlazeMeter & Sauce Labs
How To Combine Back-End 
 & Front-End Testing with BlazeMeter & Sauce LabsHow To Combine Back-End 
 & Front-End Testing with BlazeMeter & Sauce Labs
How To Combine Back-End 
 & Front-End Testing with BlazeMeter & Sauce LabsSauce Labs
 
Continuously Delivering: Compress the time from committed to consumed
Continuously Delivering: Compress the time from committed to consumedContinuously Delivering: Compress the time from committed to consumed
Continuously Delivering: Compress the time from committed to consumedAtlassian
 
Aug NYC July 12 event
Aug NYC July 12 eventAug NYC July 12 event
Aug NYC July 12 eventAUGNYC
 
Serverless lessons learned #3 reserved concurrency
Serverless lessons learned #3 reserved concurrencyServerless lessons learned #3 reserved concurrency
Serverless lessons learned #3 reserved concurrencyMaik Wiesmüller
 
Divide and Conquer: Easier Continuous Delivery using Micro-Services
Divide and Conquer: Easier Continuous Delivery using Micro-ServicesDivide and Conquer: Easier Continuous Delivery using Micro-Services
Divide and Conquer: Easier Continuous Delivery using Micro-ServicesCarlos Sanchez
 
BlazeMeter Presents at the High Performance Drupal Meetup
BlazeMeter Presents at the High Performance Drupal MeetupBlazeMeter Presents at the High Performance Drupal Meetup
BlazeMeter Presents at the High Performance Drupal MeetupBlazeMeter
 
Serverless lessons learned #1 custom sdk timeouts
Serverless lessons learned #1 custom sdk timeoutsServerless lessons learned #1 custom sdk timeouts
Serverless lessons learned #1 custom sdk timeoutsMaik Wiesmüller
 

Tendances (20)

Five (easy?) Steps Towards Continuous Delivery
Five (easy?) Steps Towards Continuous DeliveryFive (easy?) Steps Towards Continuous Delivery
Five (easy?) Steps Towards Continuous Delivery
 
AWS Customer Presentation - How Runa uses AWS
AWS Customer Presentation - How Runa uses AWS AWS Customer Presentation - How Runa uses AWS
AWS Customer Presentation - How Runa uses AWS
 
4 extreme performance - part ii
4   extreme performance - part ii4   extreme performance - part ii
4 extreme performance - part ii
 
Serverless lessons learned #4 circuit breaker
Serverless lessons learned #4 circuit breakerServerless lessons learned #4 circuit breaker
Serverless lessons learned #4 circuit breaker
 
Veeam Using cloud connect in 3 unexpected, awesome ways
Veeam Using cloud connect in 3 unexpected, awesome waysVeeam Using cloud connect in 3 unexpected, awesome ways
Veeam Using cloud connect in 3 unexpected, awesome ways
 
MySQL HA Presentation
MySQL HA PresentationMySQL HA Presentation
MySQL HA Presentation
 
Serverless lessons learned #8 backoff
Serverless lessons learned #8 backoffServerless lessons learned #8 backoff
Serverless lessons learned #8 backoff
 
Oregon State Solves Critical Storage Pain Points with a Simple, Scalable Solu...
Oregon State Solves Critical Storage Pain Points with a Simple, Scalable Solu...Oregon State Solves Critical Storage Pain Points with a Simple, Scalable Solu...
Oregon State Solves Critical Storage Pain Points with a Simple, Scalable Solu...
 
SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLA
 
Rapidly Deploy Enterprise Cloud Sandboxes
Rapidly Deploy Enterprise Cloud SandboxesRapidly Deploy Enterprise Cloud Sandboxes
Rapidly Deploy Enterprise Cloud Sandboxes
 
Managing RightScale on RightScale
Managing RightScale on RightScaleManaging RightScale on RightScale
Managing RightScale on RightScale
 
Managing RightScale on RightScale
Managing RightScale on RightScaleManaging RightScale on RightScale
Managing RightScale on RightScale
 
Harper Reed: Cloud Contraints
Harper Reed: Cloud ContraintsHarper Reed: Cloud Contraints
Harper Reed: Cloud Contraints
 
How To Combine Back-End 
 & Front-End Testing with BlazeMeter & Sauce Labs
How To Combine Back-End 
 & Front-End Testing with BlazeMeter & Sauce LabsHow To Combine Back-End 
 & Front-End Testing with BlazeMeter & Sauce Labs
How To Combine Back-End 
 & Front-End Testing with BlazeMeter & Sauce Labs
 
Continuously Delivering: Compress the time from committed to consumed
Continuously Delivering: Compress the time from committed to consumedContinuously Delivering: Compress the time from committed to consumed
Continuously Delivering: Compress the time from committed to consumed
 
Aug NYC July 12 event
Aug NYC July 12 eventAug NYC July 12 event
Aug NYC July 12 event
 
Serverless lessons learned #3 reserved concurrency
Serverless lessons learned #3 reserved concurrencyServerless lessons learned #3 reserved concurrency
Serverless lessons learned #3 reserved concurrency
 
Divide and Conquer: Easier Continuous Delivery using Micro-Services
Divide and Conquer: Easier Continuous Delivery using Micro-ServicesDivide and Conquer: Easier Continuous Delivery using Micro-Services
Divide and Conquer: Easier Continuous Delivery using Micro-Services
 
BlazeMeter Presents at the High Performance Drupal Meetup
BlazeMeter Presents at the High Performance Drupal MeetupBlazeMeter Presents at the High Performance Drupal Meetup
BlazeMeter Presents at the High Performance Drupal Meetup
 
Serverless lessons learned #1 custom sdk timeouts
Serverless lessons learned #1 custom sdk timeoutsServerless lessons learned #1 custom sdk timeouts
Serverless lessons learned #1 custom sdk timeouts
 

Similaire à Anurag Gupta's talk on DevOps at AWS. Nov 17 at the Palo Alto AWS Big Data Meetup

Web Speed And Scalability
Web Speed And ScalabilityWeb Speed And Scalability
Web Speed And ScalabilityJason Ragsdale
 
Database performance management
Database performance managementDatabase performance management
Database performance managementscottaver
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015Christopher Curtin
 
5 Quick Wins for the Cloud
5 Quick Wins for the Cloud5 Quick Wins for the Cloud
5 Quick Wins for the CloudRightScale
 
Synopsis cloud scalability_jatinchauhan
Synopsis cloud scalability_jatinchauhanSynopsis cloud scalability_jatinchauhan
Synopsis cloud scalability_jatinchauhanJatin Chauhan
 
Operations: Production Readiness Review – How to stop bad things from Happening
Operations: Production Readiness Review – How to stop bad things from HappeningOperations: Production Readiness Review – How to stop bad things from Happening
Operations: Production Readiness Review – How to stop bad things from HappeningAmazon Web Services
 
Enterprise applications in the cloud - are providers ready?
Enterprise applications in the cloud - are providers ready?Enterprise applications in the cloud - are providers ready?
Enterprise applications in the cloud - are providers ready?Leonid Grinshpan, Ph.D.
 
ConFoo 2017: Introduction to performance optimization of .NET web apps
ConFoo 2017: Introduction to performance optimization of .NET web appsConFoo 2017: Introduction to performance optimization of .NET web apps
ConFoo 2017: Introduction to performance optimization of .NET web appsPierre-Luc Maheu
 
Nfr testing(performance)
Nfr testing(performance)Nfr testing(performance)
Nfr testing(performance)Dilip Sharma
 
Understanding AWS Database Options (DAT201) | AWS re:Invent 2013
Understanding AWS Database Options (DAT201) | AWS re:Invent 2013Understanding AWS Database Options (DAT201) | AWS re:Invent 2013
Understanding AWS Database Options (DAT201) | AWS re:Invent 2013Amazon Web Services
 
Building a Scalable Architecture for web apps
Building a Scalable Architecture for web appsBuilding a Scalable Architecture for web apps
Building a Scalable Architecture for web appsDirecti Group
 
ScalabilityAvailability
ScalabilityAvailabilityScalabilityAvailability
ScalabilityAvailabilitywebuploader
 
Domino server and application performance in the real world
Domino server and application performance in the real worldDomino server and application performance in the real world
Domino server and application performance in the real worlddominion
 
Handling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsHandling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsDirecti Group
 
An introduction to Workload Modelling for Cloud Applications
An introduction to Workload Modelling for Cloud ApplicationsAn introduction to Workload Modelling for Cloud Applications
An introduction to Workload Modelling for Cloud ApplicationsRavi Yogesh
 
Starting Your DevOps Journey – Practical Tips for Ops
Starting Your DevOps Journey – Practical Tips for OpsStarting Your DevOps Journey – Practical Tips for Ops
Starting Your DevOps Journey – Practical Tips for OpsDynatrace
 
ServerTemplate Deep Dive
ServerTemplate Deep DiveServerTemplate Deep Dive
ServerTemplate Deep DiveRightScale
 
Start Up Austin 2017: Production Preview - How to Stop Bad Things From Happening
Start Up Austin 2017: Production Preview - How to Stop Bad Things From HappeningStart Up Austin 2017: Production Preview - How to Stop Bad Things From Happening
Start Up Austin 2017: Production Preview - How to Stop Bad Things From HappeningAmazon Web Services
 
7 Stages of Scaling Web Applications
7 Stages of Scaling Web Applications7 Stages of Scaling Web Applications
7 Stages of Scaling Web ApplicationsDavid Mitzenmacher
 
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...Prolifics
 

Similaire à Anurag Gupta's talk on DevOps at AWS. Nov 17 at the Palo Alto AWS Big Data Meetup (20)

Web Speed And Scalability
Web Speed And ScalabilityWeb Speed And Scalability
Web Speed And Scalability
 
Database performance management
Database performance managementDatabase performance management
Database performance management
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
5 Quick Wins for the Cloud
5 Quick Wins for the Cloud5 Quick Wins for the Cloud
5 Quick Wins for the Cloud
 
Synopsis cloud scalability_jatinchauhan
Synopsis cloud scalability_jatinchauhanSynopsis cloud scalability_jatinchauhan
Synopsis cloud scalability_jatinchauhan
 
Operations: Production Readiness Review – How to stop bad things from Happening
Operations: Production Readiness Review – How to stop bad things from HappeningOperations: Production Readiness Review – How to stop bad things from Happening
Operations: Production Readiness Review – How to stop bad things from Happening
 
Enterprise applications in the cloud - are providers ready?
Enterprise applications in the cloud - are providers ready?Enterprise applications in the cloud - are providers ready?
Enterprise applications in the cloud - are providers ready?
 
ConFoo 2017: Introduction to performance optimization of .NET web apps
ConFoo 2017: Introduction to performance optimization of .NET web appsConFoo 2017: Introduction to performance optimization of .NET web apps
ConFoo 2017: Introduction to performance optimization of .NET web apps
 
Nfr testing(performance)
Nfr testing(performance)Nfr testing(performance)
Nfr testing(performance)
 
Understanding AWS Database Options (DAT201) | AWS re:Invent 2013
Understanding AWS Database Options (DAT201) | AWS re:Invent 2013Understanding AWS Database Options (DAT201) | AWS re:Invent 2013
Understanding AWS Database Options (DAT201) | AWS re:Invent 2013
 
Building a Scalable Architecture for web apps
Building a Scalable Architecture for web appsBuilding a Scalable Architecture for web apps
Building a Scalable Architecture for web apps
 
ScalabilityAvailability
ScalabilityAvailabilityScalabilityAvailability
ScalabilityAvailability
 
Domino server and application performance in the real world
Domino server and application performance in the real worldDomino server and application performance in the real world
Domino server and application performance in the real world
 
Handling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsHandling Data in Mega Scale Systems
Handling Data in Mega Scale Systems
 
An introduction to Workload Modelling for Cloud Applications
An introduction to Workload Modelling for Cloud ApplicationsAn introduction to Workload Modelling for Cloud Applications
An introduction to Workload Modelling for Cloud Applications
 
Starting Your DevOps Journey – Practical Tips for Ops
Starting Your DevOps Journey – Practical Tips for OpsStarting Your DevOps Journey – Practical Tips for Ops
Starting Your DevOps Journey – Practical Tips for Ops
 
ServerTemplate Deep Dive
ServerTemplate Deep DiveServerTemplate Deep Dive
ServerTemplate Deep Dive
 
Start Up Austin 2017: Production Preview - How to Stop Bad Things From Happening
Start Up Austin 2017: Production Preview - How to Stop Bad Things From HappeningStart Up Austin 2017: Production Preview - How to Stop Bad Things From Happening
Start Up Austin 2017: Production Preview - How to Stop Bad Things From Happening
 
7 Stages of Scaling Web Applications
7 Stages of Scaling Web Applications7 Stages of Scaling Web Applications
7 Stages of Scaling Web Applications
 
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
Architecting and Tuning IIB/eXtreme Scale for Maximum Performance and Reliabi...
 

Dernier

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Dernier (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

Anurag Gupta's talk on DevOps at AWS. Nov 17 at the Palo Alto AWS Big Data Meetup

  • 1. Operations @ Scale Anurag Gupta, VP AWS Database Services
  • 2. Dev/Ops How I learned to stop worrying and love my pager Dev/Ops - your dev org is your ops org I get a pager! You get a pager! Everyone gets a pager! Why would I possibly want this? It motivates design for operability It aligns your interests w/ your customer experience It improves the feedback loop to customer needs
  • 3. Monitor everything Every API call to your service, Every API call you make to a dependent service Canary traffic for things that vary (eg SQL statements) Most of the metrics won’t be meaningful. That’s OK Page on your high signal-to-noise metrics Monitor these metrics during deployments Median/Average, Fleet-wide, coarse time grain are obscuring Measure TP90, TP99 (99th percentile response time) Measure at finer and finer grain Evaluate per-customer metrics Look for the needles in the haystack
  • 4. Correction-of-Error (COE) Reporting Meet weekly on operations (execs, service operators) Review each issue that happened. “Spin the wheel” to review a service’s metrics Support a “truth-seeking” culture Looking for data, process improvements COE - Customer impact - Timeline: incidence to detection to response to resolution - 5 Whys? Get to actionable changes to extinguish cause - Actions
  • 5. Ops is Dev Humans are fallible circa 1% defect injection rate Error rate changes based on time of day (3am vs 3pm) New ones show up, have unique issues Limit human access to machines Use code/scripts/tools instead Scripts are code unit test, code review, deploy, automate
  • 6. Ops load correlates to business growth As your business does well, your operations needs to become great Growing 100-200% YoY is hard. Improving ops 100-200% YoY is really hard. Improving ops 2% each week is possible. Use Pareto analysis to prioritize work Bonus – each customer gets a better experience even as your own ops load stays constant Amazon Redshift has grown rapidly since it became generally available in February 2013. While our guiding principles have served us well over the past two years, we now manage many thousands of database instances and below offer some lessons we have learned from operating databases at scale. Design escalators, not elevators: Failures are common when operating large fleets with many service dependencies. A key lesson for us has been to design systems that degrade on failures rather than losing outright availability. These are a common design pattern when working with hardware failures, for example, replicating data blocks to mask issues with disks. They are less common when working with software or service dependencies, though still necessary when operating in a dynamic environment. Amazon overall (including AWS) had 50 million code deployments over the past 12 months. Inevitably, at this scale, a small number of regressions will occur and cause issues until reverted. It is helpful to make one’s own service resilient to an underlying service outage. For example, we support the ability to preconfigure nodes in each data center, allowing us to continue to provision and replace nodes for a period of time if there is an Amazon EC2 provisioning interruption. One can locally increase replication to withstand an Amazon S3 or network interruption. We are adding similar mitigation strategies for other external understanding that, even if not a widespread concern, each issue is meaningful to the customer experiencing it. In Figure 5, Sev 2 refers to a severity 2 alarm that causes an engineer to get paged. This means operational load roughly correlates to business success. Within Amazon Redshift, we collect error logs across our fleet and monitor tickets to understand top ten causes of error, with the aim of extinguishing one of the top ten causes of error each week. Figure 5: Tickets per cluster over time Pareto analysis is equally useful in understanding customer functional requirements. However, it is more difficult to collect.
  • 7. Escalators, not elevators Failures happen. Durability failures are “easy” mirroring, quorums, well understood techniques Availability failures are “hard” – want to degrade on unavailability not cascade failures tolerate 1-2 hours of unavailability (time to detect, fix) - eg caching IP addresses when DNS is unavailable - eg maintaining instance warm pools rather than provisioning - eg losing the ability to restore a backup, not lose writes
  • 8. Ship often Continuous delivery should be to the customer Benefits Customers prefer small patches Rollback is easier Rollback is less likely Faster response to customer issues We push a new database engine version, including both features and bug fixes, every two weeks. dependencies that can fail independently from the database itself. Continuous delivery should be to the customer: Many engineering organizations now use continuous build and automated test pipelines to a releasable staging environment. However, few actually push the release itself at a frequent pace. While customers would prefer small patches to large ones for the same reasons engineering organizations prefer to build and test continuously, patching is an onerous process. This often leads to special-case, one-off patches per customer that are limited in scope – while necessary, they make patching yet more fragile. Figure 4: Cumulative features deployed over time Amazon Redshift is set up to automatically patch customer clusters on a weekly basis in a 30-minute window specified by the Cumulative features deployed over time