© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building SRE from Scratch at
Coinbase during Hypergrowth
Niall O’Higgins
Engineering Manager, SRE
Coinbase, Inc.
DEV315-S
Daniel Maher
Developer Advocate
Datadog
“Our goal is to make Coinbase the
most trusted and easiest to use digital
currency exchange.”
Brian Armstrong
Co-Founder & CEO of Coinbase
What is SRE?
• New field
• Definitions vary
• Many misconceptions
Is this SRE?
• Endless firefighting?
• Being on-call?
• Operational toil?
Those are the symptoms; SRE is the cure.
What about DevOps?
SRE satisfies many, if not all, of the operational and
cultural elements of DevOps.
Key SRE insight #1
Measure and improve human,
organisational, and machine
systems.
Key SRE insight #2
Move from reactive to
proactive. Go find sources of
toil and eliminate them!
Key SRE insight #3
Provide an organisational back-pressure mechanism.
Set early expectations
• New language and fresh set of concepts.
• Takes time to absorb – no instant results!
• The best way to begin is to begin.
The Coinbase strategy
• Service-level Indicators
• The “Four Golden Signals”
Metrics: The core of SLIs
• Natural tendency to over-engineer.
• Lots of data, none of it actionable or useful.
Optimise for KPIs, or high signal/noise.
The Four Golden
Signals
• Latency
• Traffic
• Errors
• Saturation
Latency
• Direct impact on customer experience.
• Where and how you measure it is important.
Traffic
• The amount of work being done – or attempted.
• Direct relationship with business value.
Errors
• A nice, defined target to aim at.
• Direct impact on customer experience.
Saturation
• Real talk: This is a tricky one.
• Direct relationship to both scaling and capacity planning.
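
To make the four signals concrete, here is a minimal sketch of how one observation window might be summarised. The `Request` record, the `golden_signals` function, the choice of a nearest-rank 95th percentile, and the sample numbers are all assumptions made for illustration; the talk does not prescribe any of them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    ok: bool            # False if the request ended in an error


def golden_signals(requests: List[Request], window_s: float,
                   used_capacity: float, total_capacity: float) -> dict:
    """Summarise one observation window into the four golden signals."""
    durations = sorted(r.duration_ms for r in requests)
    if durations:
        # Nearest-rank 95th percentile: ceil(0.95 * n), as a 1-based rank.
        rank = max(1, -(-95 * len(durations) // 100))
        p95 = durations[rank - 1]
    else:
        p95 = 0.0
    errors = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": p95,                              # Latency
        "traffic_rps": len(requests) / window_s,            # Traffic
        "error_rate": errors / len(requests) if requests else 0.0,  # Errors
        "saturation": used_capacity / total_capacity,       # Saturation
    }


# Hypothetical two-second window with four requests, one of which failed.
sample = [Request(20, True), Request(35, True),
          Request(40, False), Request(300, True)]
signals = golden_signals(sample, window_s=2.0,
                         used_capacity=6.0, total_capacity=8.0)
```

Saturation is passed in as a used/total pair because, as the slide says, it is the tricky one: which resource to measure (disc, memory, connections) differs per service.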
Humans and the Four
Golden Signals
• Latency
• Traffic
• Errors
• Saturation
Start, then iterate
• Start with an initial specification – even if it’s not ideal.
• Iterating on feedback is the key to getting it right.
• Keep it simple!
Spreadsheets are simple, right?
Service  Latency            Errors          Saturation  Traffic
Foo      foo.latency        foo.error_rate  Disc space  TPS
Bar      bar.response_time  bar.error_rate  Memory      TPS
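
The same specification can also be kept as data, which makes it easy to check that every service has all four signals filled in. This is only a sketch: the `SliSpec` type and the `missing_indicators` helper are hypothetical names, and the rows simply mirror the spreadsheet above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliSpec:
    service: str
    latency: str
    errors: str
    saturation: str
    traffic: str


# Rows copied from the spreadsheet example.
SPECS = [
    SliSpec("Foo", "foo.latency", "foo.error_rate", "Disc space", "TPS"),
    SliSpec("Bar", "bar.response_time", "bar.error_rate", "Memory", "TPS"),
]


def missing_indicators(spec: SliSpec) -> list:
    """Return the names of any golden signals left blank for a service."""
    return [name for name in ("latency", "errors", "saturation", "traffic")
            if not getattr(spec, name).strip()]


# Services whose specification is not yet complete (empty for the rows above).
incomplete = {s.service: missing_indicators(s)
              for s in SPECS if missing_indicators(s)}
```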
SLIs: Defining “done”
• Per-service dashboard in Datadog with timeseries chart for each
indicator.
• Document describing the indicators and why they are important.
Datadog dashboards
Specification
documentation
• Spec vs. implementation
• Where do you want to
instrument?
• Where is it easy to
instrument?
First SLIs, then promises
• Plain-language statements; easy to parse, easy to understand.
• Plenty of potential stakeholders.
• Start simple!
“You can rely on us to buy or sell
crypto whenever you want.”
Coinbase’s “Prime Promise”
Essential reading
Thinking in Promises
by Mark Burgess
Concerning promises
• Promises have two parties.
• Promises can be human to machine, human to human, or machine to
machine.
Example promises
• Your service promises to respond to clients within 50ms.
• A service you depend on promises that its error rate will be < 1%.
• On-call promises they will engage an incident within 15 minutes.
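
One way to read these examples is that each promise pairs a plain-language statement with a measurable limit. The sketch below encodes the three promises that way and checks them against measured values. The `Promise` type, its field names, the reading of "within" as an inclusive limit, and the measured numbers are assumptions for illustration, not something the talk specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Promise:
    promiser: str     # who makes the promise
    statement: str    # plain-language wording
    limit: float      # threshold the measured value is compared against
    unit: str
    inclusive: bool   # True: "within N" (<=); False: "less than N" (<)

    def kept(self, measured: float) -> bool:
        """True if the measured value satisfies the promise."""
        if self.inclusive:
            return measured <= self.limit
        return measured < self.limit


PROMISES = [
    Promise("your service", "respond to clients within 50ms", 50.0, "ms", True),
    Promise("a dependency", "error rate below 1%", 1.0, "%", False),
    Promise("on-call", "engage an incident within 15 minutes", 15.0, "min", True),
]

# Hypothetical measurements, in the same order as PROMISES.
measured = [42.0, 1.0, 15.0]
broken = [p.statement for p, m in zip(PROMISES, measured) if not p.kept(m)]
```

With these numbers only the error-rate promise is broken: an error rate of exactly 1% does not satisfy "< 1%", while 15 minutes still counts as "within 15 minutes".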
Promise enumeration
• Each team must formalise the promises they are willing to keep.
• They must also understand the promises they rely upon to function.
• When a promise you rely upon is broken, what should you do? Who
should you contact?
Promises:
Defining “done”
Promises are done when they
have a Datadog monitor (alert).
When promises are broken …
• It is inevitable that a promise will be broken at some point.
• What to do when that happens?
Blameless post-mortems
• Is “blameless post-mortem” a real thing?
• What about data-driven post-mortems?
• How does this relate to promises – specifically broken ones?
• https://v.gd/jpr_post_mortems
• https://v.gd/jyee_datadriven
Interpreting incidents
• Build a shared language.
• Practise communicating.
• Understand that incidents
and outages are broken
promises.
Measuring incident response
• Quantify and measure the quality of your incident responses.
• Quantitative: Time to detect, time to engage, time to fix.
• Qualitative: Quality of communication.
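
The quantitative measures can be derived from four timestamps per incident. The sketch below assumes one possible set of definitions (detect is measured from the start of impact, engage from detection, fix from the start of impact); the `Incident` type and the example times are hypothetical, and teams may reasonably define the intervals differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    started: datetime   # when the promise was first broken
    detected: datetime  # when an alert fired or someone noticed
    engaged: datetime   # when a responder began working on it
    fixed: datetime     # when the promise was being kept again


def response_times(incident: Incident) -> dict:
    """Compute the three quantitative measures for one incident."""
    return {
        "time_to_detect": incident.detected - incident.started,
        "time_to_engage": incident.engaged - incident.detected,
        "time_to_fix": incident.fixed - incident.started,
    }


# Hypothetical incident timeline.
day = datetime(2018, 11, 27)
example = Incident(
    started=day.replace(hour=10, minute=0),
    detected=day.replace(hour=10, minute=4),
    engaged=day.replace(hour=10, minute=10),
    fixed=day.replace(hour=10, minute=55),
)
times = response_times(example)
```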
The end game
• Have a clear answer for
“Why SRE?”
• Start with instrumentation –
Keep it simple to start.
• Enumerate your promises.
• Measure your response when
promises are broken.
• Transparency
• Understanding
• Confidence
Thank you!
Niall O’Higgins Daniel Maher
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Contenu connexe

Tendances

Exploiting IoT & Machine Learning to transform Power and Utilities
Exploiting IoT & Machine Learning to transform Power and UtilitiesExploiting IoT & Machine Learning to transform Power and Utilities
Exploiting IoT & Machine Learning to transform Power and UtilitiesAmazon Web Services
 
Digital Transformation Through APIs (SRV323) - AWS re:Invent 2018
Digital Transformation Through APIs (SRV323) - AWS re:Invent 2018Digital Transformation Through APIs (SRV323) - AWS re:Invent 2018
Digital Transformation Through APIs (SRV323) - AWS re:Invent 2018Amazon Web Services
 
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018Amazon Web Services
 
ENT205 Preparing Your Team for a Cloud Transformation
ENT205 Preparing Your Team for a Cloud TransformationENT205 Preparing Your Team for a Cloud Transformation
ENT205 Preparing Your Team for a Cloud TransformationAmazon Web Services
 
Five New Security Automations Using AWS Security Services & Open Source (SEC4...
Five New Security Automations Using AWS Security Services & Open Source (SEC4...Five New Security Automations Using AWS Security Services & Open Source (SEC4...
Five New Security Automations Using AWS Security Services & Open Source (SEC4...Amazon Web Services
 
Trends in Digital Transformation (ARC212) - AWS re:Invent 2018
Trends in Digital Transformation (ARC212) - AWS re:Invent 2018Trends in Digital Transformation (ARC212) - AWS re:Invent 2018
Trends in Digital Transformation (ARC212) - AWS re:Invent 2018Amazon Web Services
 
Building transformational business value through broad organizational engagem...
Building transformational business value through broad organizational engagem...Building transformational business value through broad organizational engagem...
Building transformational business value through broad organizational engagem...Amazon Web Services
 
Security & Compliance in the Cloud
Security & Compliance in the CloudSecurity & Compliance in the Cloud
Security & Compliance in the CloudAmazon Web Services
 
Serverless Architectural Patterns and Best Practices (ARC305-R2) - AWS re:Inv...
Serverless Architectural Patterns and Best Practices (ARC305-R2) - AWS re:Inv...Serverless Architectural Patterns and Best Practices (ARC305-R2) - AWS re:Inv...
Serverless Architectural Patterns and Best Practices (ARC305-R2) - AWS re:Inv...Amazon Web Services
 
AWS and Symantec: Cyber Defense at Scale (SEC311-S) - AWS re:Invent 2018
AWS and Symantec: Cyber Defense at Scale (SEC311-S) - AWS re:Invent 2018AWS and Symantec: Cyber Defense at Scale (SEC311-S) - AWS re:Invent 2018
AWS and Symantec: Cyber Defense at Scale (SEC311-S) - AWS re:Invent 2018Amazon Web Services
 
Continuous Integration Best Practices (DEV319-R1) - AWS re:Invent 2018
Continuous Integration Best Practices (DEV319-R1) - AWS re:Invent 2018Continuous Integration Best Practices (DEV319-R1) - AWS re:Invent 2018
Continuous Integration Best Practices (DEV319-R1) - AWS re:Invent 2018Amazon Web Services
 
Manage IoT Devices throughout Their Lifecycle - AWS Online Tech Talks
Manage IoT Devices throughout Their Lifecycle - AWS Online Tech TalksManage IoT Devices throughout Their Lifecycle - AWS Online Tech Talks
Manage IoT Devices throughout Their Lifecycle - AWS Online Tech TalksAmazon Web Services
 
Cloud Choices- Quantifying the Cost and Risk Implications of Cloud.pdf
Cloud Choices- Quantifying the Cost and Risk Implications of Cloud.pdfCloud Choices- Quantifying the Cost and Risk Implications of Cloud.pdf
Cloud Choices- Quantifying the Cost and Risk Implications of Cloud.pdfAmazon Web Services
 
Building Volkswagen Group's Digital Ecosystem (AMT304) - AWS re:Invent 2018
Building Volkswagen Group's Digital Ecosystem (AMT304) - AWS re:Invent 2018Building Volkswagen Group's Digital Ecosystem (AMT304) - AWS re:Invent 2018
Building Volkswagen Group's Digital Ecosystem (AMT304) - AWS re:Invent 2018Amazon Web Services
 
Foundations of AWS Global Cloud Infrastructure (ARC217) - AWS re:Invent 2018
Foundations of AWS Global Cloud Infrastructure (ARC217) - AWS re:Invent 2018Foundations of AWS Global Cloud Infrastructure (ARC217) - AWS re:Invent 2018
Foundations of AWS Global Cloud Infrastructure (ARC217) - AWS re:Invent 2018Amazon Web Services
 
人工智能 (AI) 與機器學習概覽 (Level 200)
人工智能 (AI) 與機器學習概覽 (Level 200)人工智能 (AI) 與機器學習概覽 (Level 200)
人工智能 (AI) 與機器學習概覽 (Level 200)Amazon Web Services
 
Leadership Session: The Future of Enterprise IT (ENT220-L) - AWS re:Invent 2018
Leadership Session:  The Future of Enterprise IT (ENT220-L) - AWS re:Invent 2018Leadership Session:  The Future of Enterprise IT (ENT220-L) - AWS re:Invent 2018
Leadership Session: The Future of Enterprise IT (ENT220-L) - AWS re:Invent 2018Amazon Web Services
 
Find All the Threats: AWS Threat Detection and Remediation (SEC331) - AWS re:...
Find All the Threats: AWS Threat Detection and Remediation (SEC331) - AWS re:...Find All the Threats: AWS Threat Detection and Remediation (SEC331) - AWS re:...
Find All the Threats: AWS Threat Detection and Remediation (SEC331) - AWS re:...Amazon Web Services
 
Improve Productivity with Continuous Integration & Delivery
Improve Productivity with Continuous Integration & DeliveryImprove Productivity with Continuous Integration & Delivery
Improve Productivity with Continuous Integration & DeliveryAmazon Web Services
 

Tendances (20)

Exploiting IoT & Machine Learning to transform Power and Utilities
Exploiting IoT & Machine Learning to transform Power and UtilitiesExploiting IoT & Machine Learning to transform Power and Utilities
Exploiting IoT & Machine Learning to transform Power and Utilities
 
Digital Transformation Through APIs (SRV323) - AWS re:Invent 2018
Digital Transformation Through APIs (SRV323) - AWS re:Invent 2018Digital Transformation Through APIs (SRV323) - AWS re:Invent 2018
Digital Transformation Through APIs (SRV323) - AWS re:Invent 2018
 
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018
 
DevOps: The Amazon Story
DevOps: The Amazon StoryDevOps: The Amazon Story
DevOps: The Amazon Story
 
ENT205 Preparing Your Team for a Cloud Transformation
ENT205 Preparing Your Team for a Cloud TransformationENT205 Preparing Your Team for a Cloud Transformation
ENT205 Preparing Your Team for a Cloud Transformation
 
Five New Security Automations Using AWS Security Services & Open Source (SEC4...
Five New Security Automations Using AWS Security Services & Open Source (SEC4...Five New Security Automations Using AWS Security Services & Open Source (SEC4...
Five New Security Automations Using AWS Security Services & Open Source (SEC4...
 
Trends in Digital Transformation (ARC212) - AWS re:Invent 2018
Trends in Digital Transformation (ARC212) - AWS re:Invent 2018Trends in Digital Transformation (ARC212) - AWS re:Invent 2018
Trends in Digital Transformation (ARC212) - AWS re:Invent 2018
 
Building transformational business value through broad organizational engagem...
Building transformational business value through broad organizational engagem...Building transformational business value through broad organizational engagem...
Building transformational business value through broad organizational engagem...
 
Security & Compliance in the Cloud
Security & Compliance in the CloudSecurity & Compliance in the Cloud
Security & Compliance in the Cloud
 
Serverless Architectural Patterns and Best Practices (ARC305-R2) - AWS re:Inv...
Serverless Architectural Patterns and Best Practices (ARC305-R2) - AWS re:Inv...Serverless Architectural Patterns and Best Practices (ARC305-R2) - AWS re:Inv...
Serverless Architectural Patterns and Best Practices (ARC305-R2) - AWS re:Inv...
 
AWS and Symantec: Cyber Defense at Scale (SEC311-S) - AWS re:Invent 2018
AWS and Symantec: Cyber Defense at Scale (SEC311-S) - AWS re:Invent 2018AWS and Symantec: Cyber Defense at Scale (SEC311-S) - AWS re:Invent 2018
AWS and Symantec: Cyber Defense at Scale (SEC311-S) - AWS re:Invent 2018
 
Continuous Integration Best Practices (DEV319-R1) - AWS re:Invent 2018
Continuous Integration Best Practices (DEV319-R1) - AWS re:Invent 2018Continuous Integration Best Practices (DEV319-R1) - AWS re:Invent 2018
Continuous Integration Best Practices (DEV319-R1) - AWS re:Invent 2018
 
Manage IoT Devices throughout Their Lifecycle - AWS Online Tech Talks
Manage IoT Devices throughout Their Lifecycle - AWS Online Tech TalksManage IoT Devices throughout Their Lifecycle - AWS Online Tech Talks
Manage IoT Devices throughout Their Lifecycle - AWS Online Tech Talks
 
Cloud Choices- Quantifying the Cost and Risk Implications of Cloud.pdf
Cloud Choices- Quantifying the Cost and Risk Implications of Cloud.pdfCloud Choices- Quantifying the Cost and Risk Implications of Cloud.pdf
Cloud Choices- Quantifying the Cost and Risk Implications of Cloud.pdf
 
Building Volkswagen Group's Digital Ecosystem (AMT304) - AWS re:Invent 2018
Building Volkswagen Group's Digital Ecosystem (AMT304) - AWS re:Invent 2018Building Volkswagen Group's Digital Ecosystem (AMT304) - AWS re:Invent 2018
Building Volkswagen Group's Digital Ecosystem (AMT304) - AWS re:Invent 2018
 
Foundations of AWS Global Cloud Infrastructure (ARC217) - AWS re:Invent 2018
Foundations of AWS Global Cloud Infrastructure (ARC217) - AWS re:Invent 2018Foundations of AWS Global Cloud Infrastructure (ARC217) - AWS re:Invent 2018
Foundations of AWS Global Cloud Infrastructure (ARC217) - AWS re:Invent 2018
 
人工智能 (AI) 與機器學習概覽 (Level 200)
人工智能 (AI) 與機器學習概覽 (Level 200)人工智能 (AI) 與機器學習概覽 (Level 200)
人工智能 (AI) 與機器學習概覽 (Level 200)
 
Leadership Session: The Future of Enterprise IT (ENT220-L) - AWS re:Invent 2018
Leadership Session:  The Future of Enterprise IT (ENT220-L) - AWS re:Invent 2018Leadership Session:  The Future of Enterprise IT (ENT220-L) - AWS re:Invent 2018
Leadership Session: The Future of Enterprise IT (ENT220-L) - AWS re:Invent 2018
 
Find All the Threats: AWS Threat Detection and Remediation (SEC331) - AWS re:...
Find All the Threats: AWS Threat Detection and Remediation (SEC331) - AWS re:...Find All the Threats: AWS Threat Detection and Remediation (SEC331) - AWS re:...
Find All the Threats: AWS Threat Detection and Remediation (SEC331) - AWS re:...
 
Improve Productivity with Continuous Integration & Delivery
Improve Productivity with Continuous Integration & DeliveryImprove Productivity with Continuous Integration & Delivery
Improve Productivity with Continuous Integration & Delivery
 

Similaire à Building SRE from Scratch at Coinbase during Hypergrowth (DEV315-S) - AWS re:Invent 2018

Operating at Scale: Preparing for the Journey
Operating at Scale: Preparing for the JourneyOperating at Scale: Preparing for the Journey
Operating at Scale: Preparing for the JourneyAmazon Web Services
 
Operating at Scale- Preparing for the Journey [Portuguese]
Operating at Scale- Preparing for the Journey [Portuguese]Operating at Scale- Preparing for the Journey [Portuguese]
Operating at Scale- Preparing for the Journey [Portuguese]Amazon Web Services
 
The seven habits of highly successful builders - AWS Summit Cape Town 2018
The seven habits of highly successful builders - AWS Summit Cape Town 2018The seven habits of highly successful builders - AWS Summit Cape Town 2018
The seven habits of highly successful builders - AWS Summit Cape Town 2018Amazon Web Services
 
Culture of Innovation - AWS Transformation Day Boston 2018
Culture of Innovation - AWS Transformation Day Boston 2018Culture of Innovation - AWS Transformation Day Boston 2018
Culture of Innovation - AWS Transformation Day Boston 2018Amazon Web Services
 
Continuously Delivering Your Software on AWS - Adrian White - AWS TechShift A...
Continuously Delivering Your Software on AWS - Adrian White - AWS TechShift A...Continuously Delivering Your Software on AWS - Adrian White - AWS TechShift A...
Continuously Delivering Your Software on AWS - Adrian White - AWS TechShift A...Amazon Web Services
 
AI and IoT innovation - an industry focus
AI and IoT innovation - an industry focusAI and IoT innovation - an industry focus
AI and IoT innovation - an industry focusAmazon Web Services
 
From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...
From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...
From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...Amazon Web Services
 
Security Observability: Democratizing Security in the Cloud (DEV206-S) - AWS ...
Security Observability: Democratizing Security in the Cloud (DEV206-S) - AWS ...Security Observability: Democratizing Security in the Cloud (DEV206-S) - AWS ...
Security Observability: Democratizing Security in the Cloud (DEV206-S) - AWS ...Amazon Web Services
 
Dev348 ReInvent Corteva Agriscience
Dev348   ReInvent Corteva AgriscienceDev348   ReInvent Corteva Agriscience
Dev348 ReInvent Corteva AgriscienceRandy Black
 
Innovation for Everyone - Transformation Day Montreal 2018
Innovation for Everyone - Transformation Day Montreal 2018Innovation for Everyone - Transformation Day Montreal 2018
Innovation for Everyone - Transformation Day Montreal 2018Amazon Web Services
 
Implementing Microservices by DDD
Implementing Microservices by DDDImplementing Microservices by DDD
Implementing Microservices by DDDAmazon Web Services
 
2019 03-13-implementing microservices by ddd
2019 03-13-implementing microservices by ddd2019 03-13-implementing microservices by ddd
2019 03-13-implementing microservices by dddKim Kao
 
An Agile Approach to Cloud Adoption
An Agile Approach to Cloud AdoptionAn Agile Approach to Cloud Adoption
An Agile Approach to Cloud AdoptionAmazon Web Services
 
Life of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Life of a Code Change to a Tier 1 Service - AWS Online Tech TalksLife of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Life of a Code Change to a Tier 1 Service - AWS Online Tech TalksAmazon Web Services
 
AWS Startup Day Kyiv: AWS Security Best Practices
AWS Startup Day Kyiv: AWS Security Best PracticesAWS Startup Day Kyiv: AWS Security Best Practices
AWS Startup Day Kyiv: AWS Security Best PracticesAmazon Web Services
 
Leading Your Team Through a Cloud Transformation - Virtual Transformation Day...
Leading Your Team Through a Cloud Transformation - Virtual Transformation Day...Leading Your Team Through a Cloud Transformation - Virtual Transformation Day...
Leading Your Team Through a Cloud Transformation - Virtual Transformation Day...Amazon Web Services
 

Similaire à Building SRE from Scratch at Coinbase during Hypergrowth (DEV315-S) - AWS re:Invent 2018 (20)

Operating at Scale: Preparing for the Journey
Operating at Scale: Preparing for the JourneyOperating at Scale: Preparing for the Journey
Operating at Scale: Preparing for the Journey
 
Operating at Scale- Preparing for the Journey [Portuguese]
Operating at Scale- Preparing for the Journey [Portuguese]Operating at Scale- Preparing for the Journey [Portuguese]
Operating at Scale- Preparing for the Journey [Portuguese]
 
The seven habits of highly successful builders - AWS Summit Cape Town 2018
The seven habits of highly successful builders - AWS Summit Cape Town 2018The seven habits of highly successful builders - AWS Summit Cape Town 2018
The seven habits of highly successful builders - AWS Summit Cape Town 2018
 
Culture of Innovation - AWS Transformation Day Boston 2018
Culture of Innovation - AWS Transformation Day Boston 2018Culture of Innovation - AWS Transformation Day Boston 2018
Culture of Innovation - AWS Transformation Day Boston 2018
 
Continuously Delivering Your Software on AWS - Adrian White - AWS TechShift A...
Continuously Delivering Your Software on AWS - Adrian White - AWS TechShift A...Continuously Delivering Your Software on AWS - Adrian White - AWS TechShift A...
Continuously Delivering Your Software on AWS - Adrian White - AWS TechShift A...
 
AI and IoT innovation - an industry focus
AI and IoT innovation - an industry focusAI and IoT innovation - an industry focus
AI and IoT innovation - an industry focus
 
From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...
From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...
From Idea to Customers: Developing Modern Cloud-Enabled Apps with AWS (MOB201...
 
Security Observability: Democratizing Security in the Cloud (DEV206-S) - AWS ...
Security Observability: Democratizing Security in the Cloud (DEV206-S) - AWS ...Security Observability: Democratizing Security in the Cloud (DEV206-S) - AWS ...
Security Observability: Democratizing Security in the Cloud (DEV206-S) - AWS ...
 
Dev348 ReInvent Corteva Agriscience
Dev348   ReInvent Corteva AgriscienceDev348   ReInvent Corteva Agriscience
Dev348 ReInvent Corteva Agriscience
 
Innovation for Everyone - Transformation Day Montreal 2018
Innovation for Everyone - Transformation Day Montreal 2018Innovation for Everyone - Transformation Day Montreal 2018
Innovation for Everyone - Transformation Day Montreal 2018
 
Containers for Startups
Containers for StartupsContainers for Startups
Containers for Startups
 
Culture of Innovation at Amazon
Culture of Innovation at AmazonCulture of Innovation at Amazon
Culture of Innovation at Amazon
 
Implementing Microservices by DDD
Implementing Microservices by DDDImplementing Microservices by DDD
Implementing Microservices by DDD
 
2019 03-13-implementing microservices by ddd
2019 03-13-implementing microservices by ddd2019 03-13-implementing microservices by ddd
2019 03-13-implementing microservices by ddd
 
An Agile Approach to Cloud Adoption
An Agile Approach to Cloud AdoptionAn Agile Approach to Cloud Adoption
An Agile Approach to Cloud Adoption
 
Life of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Life of a Code Change to a Tier 1 Service - AWS Online Tech TalksLife of a Code Change to a Tier 1 Service - AWS Online Tech Talks
Life of a Code Change to a Tier 1 Service - AWS Online Tech Talks
 
AWS Security Best Practices
AWS Security Best PracticesAWS Security Best Practices
AWS Security Best Practices
 
AWS Startup Day Kyiv: AWS Security Best Practices
AWS Startup Day Kyiv: AWS Security Best PracticesAWS Startup Day Kyiv: AWS Security Best Practices
AWS Startup Day Kyiv: AWS Security Best Practices
 
Leading Your Team Through a Cloud Transformation - Virtual Transformation Day...
Leading Your Team Through a Cloud Transformation - Virtual Transformation Day...Leading Your Team Through a Cloud Transformation - Virtual Transformation Day...
Leading Your Team Through a Cloud Transformation - Virtual Transformation Day...
 
TECHTalks - Boston MA - Tim Harney
TECHTalks - Boston MA - Tim HarneyTECHTalks - Boston MA - Tim Harney
TECHTalks - Boston MA - Tim Harney
 

Plus de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building SRE from Scratch at Coinbase during Hypergrowth (DEV315-S) - AWS re:Invent 2018

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Building SRE from Scratch at Coinbase during Hypergrowth (DEV315-S). Niall O’Higgins, Engineering Manager, SRE, Coinbase, Inc. Daniel Maher, Developer Advocate, Datadog.
  • 3. “Our goal is to make Coinbase the most trusted and easiest to use digital currency exchange.” Brian Armstrong Co-Founder & CEO of Coinbase
  • 4. What is SRE? • New field • Definitions vary • Many misconceptions
  • 5. Is this SRE? • Endless firefighting? • Being on-call? • Operational toil? Those are the symptoms; SRE is the cure.
  • 6. What about DevOps? SRE satisfies many, if not all, of the operational and cultural elements of DevOps.
  • 7. Key SRE insight #1 Measure and improve human, organisational, and machine systems.
  • 8. Key SRE insight #2 Move from reactive to proactive. Go find sources of toil and eliminate them!
  • 9. Key SRE insight #3 Provide an organisational back-pressure mechanism.
  • 10. Set early expectations • New language and fresh set of concepts. • Takes time to absorb – no instant results! • The best way to begin, is to begin.
  • 11. The Coinbase strategy • Service-level Indicators • The “Four Golden Signals”
  • 12. Metrics: The core of SLIs • Natural tendency to over-engineer. • Lots of data, none of it actionable or useful. Optimise for KPIs, or high signal/noise.
  • 13. The Four Golden Signals • Latency • Traffic • Errors • Saturation
  • 14. Latency • Direct impact on customer experience. • Where and how you measure it is important.
  • 15. Traffic • The amount of work being done – or attempted. • Direct relationship with business value.
  • 16. Errors • A nice, defined target to aim at. • Direct impact on customer experience.
  • 17. Saturation • Real talk: This is a tricky one. • Direct relationship to both scaling and capacity planning.
  • 18. Humans and the Four Golden Signals • Latency • Traffic • Errors • Saturation
  • 19. Start, then iterate • Start with an initial specification – even if it’s not ideal. • Iterating on feedback is the key to getting it right. • Keep it simple!
  • 20. Spreadsheets are simple, right?
        Service  Latency            Errors          Saturation  Traffic
        Foo      foo.latency        foo.error_rate  Disc space  TPS
        Bar      bar.response_time  bar.error_rate  Memory      TPS
  • 21. SLIs: Defining “done” • Per-service dashboard in Datadog with timeseries chart for each indicator. • Document describing the indicators and why they are important.
  • 22. Datadog dashboards
  • 23. Datadog dashboards
  • 24. Specification documentation • Spec vs. implementation • Where do you want to instrument? • Where is it easy to instrument?
  • 25. First SLIs, then promises • Plain-language statements; easy to parse, easy to understand. • Plenty of potential stakeholders. • Start simple!
  • 26. “You can rely on us to buy or sell crypto whenever you want.” Coinbase’s “Prime Promise”
  • 27. Essential reading Thinking in Promises by Mark Burgess
  • 28. Concerning promises • Promises have two parties. • Promises can be human to machine, human to human, or machine to machine.
  • 29. Example promises • Your service promises to respond to clients within 50ms. • A service you depend on promises that its error rate will be < 1%. • On-call promises they will engage an incident within 15 minutes.
  • 30. Promise enumeration • Each team must formalise the promises they are willing to keep. • They must also understand the promises they rely upon to function. • When a promise you rely upon is broken, what should you do? Who should you contact?
  • 31. Promises: Defining “done” Promises are done when they have a Datadog monitor (alert).
  • 32. When promises are broken … • It is inevitable that a promise will be broken at some point. • What to do when that happens?
  • 33. Blameless post-mortems • Is “blameless post-mortem” a real thing? • What about data-driven post-mortems? • How does this relate to promises – specifically broken ones? • https://v.gd/jpr_post_mortems • https://v.gd/jyee_datadriven
  • 34. Interpreting incidents • Build a shared language. • Practise communicating. • Understand that incidents and outages are broken promises.
  • 35. Measuring incident response • Quantify and measure the quality of your incident responses. • Quantitative: Time to detect, time to engage, time to fix. • Qualitative: Quality of communication.
  • 36. The end game • Have a clear answer for “Why SRE?” • Start with instrumentation – Keep it simple to start. • Enumerate your promises. • Measure your response when promises are broken. • Transparency • Understanding • Confidence
  • 37. Thank you! Niall O’Higgins, Daniel Maher
  • 38. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
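
The deck describes promises as plain-language statements backed by an indicator (slides 25 to 31) and suggests measuring time to detect, time to engage, and time to fix (slide 35). One way this might be sketched in Python is shown below. It is an illustration under stated assumptions, not the speakers' implementation: the deck says promises are "done" when they have a Datadog monitor, and this sketch does not call Datadog at all. The metric names `foo.latency` and `bar.error_rate` come from the slide 20 spreadsheet; the class names, the units (milliseconds, error rate as a fraction), and the choice to measure time to fix from the start of the incident are all assumptions.

```python
import operator
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Promise:
    """A plain-language promise backed by one measurable indicator (hypothetical type)."""
    statement: str
    sli: str                                 # metric name from the SLI spreadsheet
    limit: float                             # threshold; units are an assumption
    compare: Callable[[float, float], bool]  # promise is kept when compare(observed, limit)

    def is_kept(self, observed: float) -> bool:
        return self.compare(observed, self.limit)


# Two of the example promises from slide 29, attached to slide 20 metric names.
PROMISES: List[Promise] = [
    Promise("Foo responds to clients within 50 ms", "foo.latency", 50.0, operator.le),
    Promise("Bar's error rate stays below 1%", "bar.error_rate", 0.01, operator.lt),
]


def broken_promises(promises: List[Promise], observed: Dict[str, float]) -> List[Promise]:
    """Return promises whose indicator was observed and violates its limit.

    Indicators with no observation are skipped rather than treated as broken.
    """
    return [p for p in promises if p.sli in observed and not p.is_kept(observed[p.sli])]


@dataclass(frozen=True)
class Incident:
    """Timestamps for one incident (hypothetical type)."""
    started: datetime
    detected: datetime
    engaged: datetime
    fixed: datetime

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_engage(self) -> timedelta:
        return self.engaged - self.detected

    @property
    def time_to_fix(self) -> timedelta:
        # Assumption: measured from the start of the incident, not from engagement.
        return self.fixed - self.started


# The on-call promise from slide 29: engage an incident within 15 minutes.
ENGAGE_PROMISE = timedelta(minutes=15)


def engaged_in_time(incident: Incident) -> bool:
    """True when the on-call engagement promise was kept for this incident."""
    return incident.time_to_engage <= ENGAGE_PROMISE


if __name__ == "__main__":
    observed = {"foo.latency": 42.0, "bar.error_rate": 0.03}
    for promise in broken_promises(PROMISES, observed):
        print("Broken promise:", promise.statement)
```

Keeping each promise as a statement plus a single threshold mirrors the deck's advice to start simple; in practice the threshold would live in a monitor definition rather than in application code.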