Operational Insight
June 15, 2015
Roy Rapoport
@royrapoport / linkedin.com/in/royrapoport / rrapoport@netflix.com
Oh, The Places
We’ll Go!
Today, I want to propose a general framework for how to think about operational insight products and features. I’m hoping that this framework is applicable to anyone who performs operations in production. After I propose thinking about operational insight this way, I’ll demonstrate some applications of it within our own operational environments at Netflix.
The template we were supposed to use had me start with a slide with the speaker bio, but I want to start with something more relevant and interesting to you: The Korean War, and specifically dogfights during the war.
John Boyd
John Boyd was an Air Force pilot at the time; he studied dogfights and concluded that every fighter pilot went through the same four steps:
Observe
Orient
Decide
Act
OODA
“This approach favors agility over raw power in dealing with human opponents in any endeavor” - Wikipedia
Observe your environment, orient (figure out what it means), decide what to do, execute that decision, and go back to observing. The pilot who did that faster than their opponent got to go home. OODA’s been used as a general framework for dealing with conflict against other humans, but I’d like to suggest it has much broader applicability.
This Is What We
Do
Because even when not dealing with human opponents, anyone dealing with any aspect of operations — dealing with availability events, making decisions about promoting software in production, or … well, making decisions in general — does this all. the. time.
For example, this pair of graphs represents the two KPIs by which we know if we have a high-level serious problem. The top one is the rate of calls into our customer service group; the second one is the rate at which people are actually streaming. Both are over the last seven days. When these dip …
Like here, for example.
We know we have a problem. We don’t exactly know what’s causing it, or what we’ll do to fix it. We’ll need to understand more about the problem to come to a decision, and then execute on that decision — OODA.
OODA KPI
Speed Effort Reliability
So if we do OODA all the time, how can we think about doing OODA better or worse? There are three facets we can look at: how fast we execute the loop, how much effort it takes to execute the loop, and how reliably we execute the loop (make the right decision, execute it well).
Winning
Speed
Effort
Reliability
So if we were trying to optimize the loop, we’d try to raise speed, drop effort, and raise reliability. Speed and reliability are directly relevant to the consistency of your product’s delivery; effort is more relevant to whether your people are doing high-value work, and whether they’re likely to continue to be happy working for you.
Implications …
for Observation (aka measurement, telemetry, metrics)
• Make It Easy
• Make It Scalable
• Make It Pluggable
• (Eventually) Ruthlessly Cull
“What decision will this help me make?”
A Joke
I’d like to tell a very, very long joke. It started at Velocity 2011, when I heard someone at a presentation say, “Monitor all the things, because you never know what you might find useful one of these days.”
This is a graph representing about 380K datapoints, collected once every five minutes since June 2011. It’s a bit mysterious, I know.
It may help you to see the lower and upper bounds of this graph are 48 to 52.
% of servers in major region
with an even IP address
This graph represents the percent of our cloud instances in a given production region which had an even IP address.
We can, and should (and I hope we do), laugh about this graph, but I’d bet you your monitoring system is chock-full of similarly useless data; I know mine is. It impacts the cost of the system, but it also literally makes your job (and your customers’ jobs, if you’re responsible for the telemetry system) harder, because there’s much, much more chaff to wade through.
Implications …
for Orientation (aka graphing, visualization)
• First-class product
• Different decisions require different viz
• Low cognitive load better than
  • High refresh rates
  • Deep data density
Better Like This …
Or Better Like That …
Implications …
for Decisions (aka alerting, real-time analytics, etc)
• You already have (some of) this
• Incremental improvement
• Sky’s the limit
  • For benefits
  • For cost
Alerts are a basic, primitive decision. Build on that.
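As a minimal sketch of what "build on that" might look like, the first function below is a plain threshold alert and the second is one incremental step up: it compares the current value against recent history and attaches a suggested action. Everything here is an assumption made for illustration: the names `Decision`, `threshold_alert` and `relative_dip_alert`, the 20% drop limit, and the "page on-call" action are not taken from any real system.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class Decision:
    alert: bool
    reason: str
    suggested_action: Optional[str] = None  # hypothetical field, for illustration


def threshold_alert(current: float, floor: float) -> Decision:
    """The primitive decision: is the value below a fixed floor?"""
    if current < floor:
        return Decision(True, f"{current} is below floor {floor}")
    return Decision(False, "ok")


def relative_dip_alert(current: float, history: Sequence[float],
                       max_drop: float = 0.2) -> Decision:
    """An incremental improvement: compare against recent history, not a constant.

    Assumes `history` is non-empty and its mean is positive.
    """
    baseline = mean(history)
    drop = (baseline - current) / baseline
    if drop > max_drop:
        return Decision(True, f"{drop:.0%} below recent average",
                        suggested_action="page on-call")
    return Decision(False, "ok")
```

The point of returning a `Decision` rather than a bare boolean is that later steps (richer analysis, an automated action) can be added without changing the callers.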
Implications …
for Action
1. Humans beat bureaucracy
2. Machines beat humans
3. Repeatability beats one-offs
4. Start with humans
5. If IFTTT, deprecate humans
Repeatable machine processes TROUNCE one-off human bureaucracy
If you’re thinking of creating a runbook, AUTOMATE IT.
Decision:
Do I Have Enough
Instances?
So let’s talk about a basic capacity quandary: Do I have enough instances in my cluster?
I showed this graph earlier. Our work volume is highly diurnal. We could, if we wanted, make sure our cluster sizes are big enough to support peak workload and just deal with the waste when the workload decreases; instead, we use Amazon’s Auto Scaling group feature to automatically scale the clusters up and down in response to demand. So instead of trying to give users better telemetry on utilization, and making it easier for them to see if they need to increase capacity, we just automate that decision (allowing them, of course, to override it whenever they want to).
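To make the automated decision concrete, here is a rough sketch of the kind of calculation a scaling policy performs: turn current demand into a desired instance count, clamp it to the group's bounds, and let a human override win. This is not Amazon's API or our actual policy; the function name, the 60% target utilization, and the size limits are all assumptions chosen for illustration.

```python
import math
from typing import Optional


def desired_instances(requests_per_sec: float,
                      capacity_per_instance: float,
                      target_utilization: float = 0.6,
                      min_size: int = 2,
                      max_size: int = 100,
                      override: Optional[int] = None) -> int:
    """Decide how many instances a cluster should have right now.

    All parameters are hypothetical: `capacity_per_instance` is the request
    rate one instance can serve at 100% utilization, and `override` is a
    manually pinned size that bypasses the calculation.
    """
    if override is not None:
        return override  # the human always wins
    # Size the cluster so each instance runs at roughly the target utilization.
    needed = math.ceil(requests_per_sec / (capacity_per_instance * target_utilization))
    # Never go below the floor or above the ceiling configured for the group.
    return max(min_size, min(max_size, needed))
```

Called on every evaluation interval, a function like this follows the diurnal curve down as well as up, which is where the savings over sizing for peak come from.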
Decision:
Is My Canary Good?
We use a deployment pattern called canary, where we compare the new version of the software to the baseline, also in production, and seek to answer a very simple question: Is our canary at least as good as our baseline system?
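The question "is the canary at least as good as the baseline?" can be sketched as a per-metric comparison rolled up into a score. The code below is a deliberately simple illustration, not the analysis we run: it assumes every metric is "lower is better" (CPU, error rate), compares means with a 10% tolerance, and passes the canary if 90% of metrics pass. The function names, the tolerance, and the pass score are all assumptions.

```python
from statistics import mean
from typing import Dict, Sequence


def metric_ok(baseline: Sequence[float], canary: Sequence[float],
              tolerance: float = 0.1) -> bool:
    """Lower is better: pass if the canary mean is within `tolerance` of baseline."""
    return mean(canary) <= mean(baseline) * (1 + tolerance)


def canary_score(baseline: Dict[str, Sequence[float]],
                 canary: Dict[str, Sequence[float]]) -> float:
    """Fraction of metrics on which the canary is at least as good as baseline."""
    results = [metric_ok(baseline[name], canary[name]) for name in baseline]
    return sum(results) / len(results)


def canary_is_good(baseline: Dict[str, Sequence[float]],
                   canary: Dict[str, Sequence[float]],
                   pass_score: float = 0.9) -> bool:
    """The decision itself: promote only if the score clears the bar."""
    return canary_score(baseline, canary) >= pass_score
```

A real analysis would need to handle metrics where higher is better, noisy data, and metrics of differing importance; the sketch only shows the shape of the decision.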
Been there.
• Started in the Data Center
• Manual, dashboard-driven
Done that.
Manually. Artisanally.
[Graphs: CPU, Requests, Errors]
Been there.
Done that.
Manually.
• Context vs Precision
• No …
  • Repeatability
  • Trending
• Manual effort is manual
So Now What?
• Automate Analysis
• Took Some Effort
• Approach and analytics
• Presentation matters
Pretty Pictures
[Diagram built up across several slides. Components: Version Control System; Build & Deployment System; Automated Canary Analysis; Customers; 1000 servers @ 1.0.1. Successive builds add 1 server @ 1.0.2, then 10 servers @ 1.0.2, then 1000 servers @ 1.0.2.]
Just The Stats
4-Week View
6309 canary analysis cycles
16% canaries failed
Decision:
Do I Have an Outlier?
Outlier Detection
In an environment where you have a bunch of potentially undifferentiated resources that should all behave approximately similarly, it becomes easy (and, in a sufficiently large ecosystem, necessary) to notice outliers. If your cost for culling the outliers is low, you can also do it automatically. If not, you can at least alert that One Of These Things Is No Longer Like The Others.
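One common way to spot "the one that is not like the others" is the median absolute deviation (MAD), sketched below. This is a stand-in technique chosen for illustration, not a description of the analytics we actually run; the function names, the 3.5 cutoff and the 0.6745 scaling constant (the conventional "modified z-score" values) are assumptions. The second function mirrors the choice above: terminate when culling is cheap, otherwise alert.

```python
from statistics import median
from typing import Dict, List


def find_outliers(values: Dict[str, float], threshold: float = 3.5) -> List[str]:
    """Return the names whose modified z-score exceeds `threshold`.

    `values` maps a resource name to one measurement (say, load average).
    """
    center = median(values.values())
    mad = median(abs(v - center) for v in values.values())
    if mad == 0:
        # More than half the values are identical: flag anything that differs.
        return sorted(name for name, v in values.items() if v != center)
    return sorted(name for name, v in values.items()
                  if 0.6745 * abs(v - center) / mad > threshold)


def action_for(outlier: str, culling_is_cheap: bool) -> str:
    """Cheap to replace? Terminate automatically. Otherwise, tell a human."""
    return f"terminate {outlier}" if culling_is_cheap else f"alert on {outlier}"
```

Medians are used rather than mean and standard deviation because a single wild value drags the mean toward itself and can hide the very outlier being looked for.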
Would You Like to Play a
Game?
Can I have a volunteer from the audience to run an experiment with me?
Spot the Outlier
So for training, imagine I’m giving you this information about nine servers, named A through I. Each row is a minute’s data for these servers; let’s say it’s load average, or error rates. I’m going to ask you to point out the server (or column) that looks materially different from the others. This should be a relatively easy case, of course. Can you pick the server?
OK. Now, I’m going to time you doing the same with more interesting data.
Didn’t work so well? OK, let’s make it easier to orient and understand the numbers you’re looking at by showing this to you graphed.
It probably is easier, isn’t it? Can you easily point out the outlier?
OK, one last test. At the next slide, I’m going to show you some information (you can assume it’s true) and I want you to tell me which is the outlier, OK?
The Outlier Is “A”
That was … much easier, wasn’t it?
This is what happens when we let computers do this work. We could have spent more time and effort to give you a more powerful visualization that would have made it easier to notice the outlier, but we instead built an analytics system that automatically determines outliers. It doesn’t make it easier for you to do the work; it does the work for you.
Just The Stats
4-Week View
739 Server Terminations
We can use this for anything: pieces of content, or devices, or ISPs. Right now, we’re using it for about ten or so clusters of servers, and in the last four weeks it has automatically identified and terminated 739 outliers.
In a Nutshell
Observe
Orient
Decide
Act
Need This First
http://bit.ly/nflx-atlas-2013
http://metrics20.org
Understand the decision
http://bit.ly/nflx-qcon-aca-2014
Make it easier for humans
Make machines do it
Higher speed
Lower effort
Higher reliability
So really, that’s all there is: Start with observation; you need this in order to get anything done. Then focus on the kinds of decisions that need to take place in your operational environment. Where you continue to need humans to make these decisions, work hard to help them orient by providing better ways to understand the data; where you don’t … have machines do it.
Questions, Attributions, Feedback
@royrapoport
rsr@netflix.com
linkedin.com/in/royrapoport
?42

Contenu connexe

Tendances

Testing within an Agile Environment - Beyza Sakir and Chris Gollop
Testing within an Agile Environment - Beyza Sakir and Chris GollopTesting within an Agile Environment - Beyza Sakir and Chris Gollop
Testing within an Agile Environment - Beyza Sakir and Chris GollopJAXLondon2014
 
The Art of Better
The Art of BetterThe Art of Better
The Art of BetterArty Starr
 
Esteem and Estimates (Ti Stimo Fratello)
Esteem and Estimates (Ti Stimo Fratello)Esteem and Estimates (Ti Stimo Fratello)
Esteem and Estimates (Ti Stimo Fratello)Gaetano Mazzanti
 
Data-Driven Software Mastery @Open Mastery Austin
Data-Driven Software Mastery @Open Mastery AustinData-Driven Software Mastery @Open Mastery Austin
Data-Driven Software Mastery @Open Mastery AustinArty Starr
 
The Pursuit of Quality - Chasing Tornadoes or Just Hot Air?
The Pursuit of Quality - Chasing Tornadoes or Just Hot Air?The Pursuit of Quality - Chasing Tornadoes or Just Hot Air?
The Pursuit of Quality - Chasing Tornadoes or Just Hot Air?Paul Gerrard
 
Sww 2006 Redesigning Processes For Solid Works
Sww 2006   Redesigning Processes For Solid WorksSww 2006   Redesigning Processes For Solid Works
Sww 2006 Redesigning Processes For Solid WorksRazorleaf Corporation
 
Let's Make the PAIN Visible!
Let's Make the PAIN Visible!Let's Make the PAIN Visible!
Let's Make the PAIN Visible!Arty Starr
 
Agile tales of creative customer collaboration
Agile tales of creative customer collaborationAgile tales of creative customer collaboration
Agile tales of creative customer collaborationClaudio Perrone
 
Devops at scale is a hard problem challenges, insights and lessons learned
Devops at scale is a hard problem  challenges, insights and lessons learnedDevops at scale is a hard problem  challenges, insights and lessons learned
Devops at scale is a hard problem challenges, insights and lessons learnedkjalleda
 
Agile Intro and 2014 trends for AgileSparks open day at John-Bryce - March 2014
Agile Intro and 2014 trends for AgileSparks open day at John-Bryce - March 2014Agile Intro and 2014 trends for AgileSparks open day at John-Bryce - March 2014
Agile Intro and 2014 trends for AgileSparks open day at John-Bryce - March 2014Yuval Yeret
 
The Lego Lean Game (XP 2009 version)
The Lego Lean Game (XP 2009 version)The Lego Lean Game (XP 2009 version)
The Lego Lean Game (XP 2009 version)frankmt
 
141015 Discovering Scrum at Scrum Roma
141015 Discovering Scrum at Scrum Roma141015 Discovering Scrum at Scrum Roma
141015 Discovering Scrum at Scrum RomaPeter Stevens
 
Lean & agile 101 for Astute Entrepreneurs
Lean & agile 101 for Astute EntrepreneursLean & agile 101 for Astute Entrepreneurs
Lean & agile 101 for Astute EntrepreneursClaudio Perrone
 
Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...
Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...
Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...dev2ops
 
Innovation, Lean, Agile. Myths and Misconception
Innovation, Lean, Agile. Myths and MisconceptionInnovation, Lean, Agile. Myths and Misconception
Innovation, Lean, Agile. Myths and MisconceptionGaetano Mazzanti
 
Lean Startup for Smart Entrepreneurs
Lean Startup for Smart EntrepreneursLean Startup for Smart Entrepreneurs
Lean Startup for Smart EntrepreneursClaudio Perrone
 

Tendances (19)

Testing within an Agile Environment - Beyza Sakir and Chris Gollop
Testing within an Agile Environment - Beyza Sakir and Chris GollopTesting within an Agile Environment - Beyza Sakir and Chris Gollop
Testing within an Agile Environment - Beyza Sakir and Chris Gollop
 
The Art of Better
The Art of BetterThe Art of Better
The Art of Better
 
Value stream mapping
Value stream mapping  Value stream mapping
Value stream mapping
 
Esteem and Estimates (Ti Stimo Fratello)
Esteem and Estimates (Ti Stimo Fratello)Esteem and Estimates (Ti Stimo Fratello)
Esteem and Estimates (Ti Stimo Fratello)
 
Data-Driven Software Mastery @Open Mastery Austin
Data-Driven Software Mastery @Open Mastery AustinData-Driven Software Mastery @Open Mastery Austin
Data-Driven Software Mastery @Open Mastery Austin
 
The Pursuit of Quality - Chasing Tornadoes or Just Hot Air?
The Pursuit of Quality - Chasing Tornadoes or Just Hot Air?The Pursuit of Quality - Chasing Tornadoes or Just Hot Air?
The Pursuit of Quality - Chasing Tornadoes or Just Hot Air?
 
Sww 2006 Redesigning Processes For Solid Works
Sww 2006   Redesigning Processes For Solid WorksSww 2006   Redesigning Processes For Solid Works
Sww 2006 Redesigning Processes For Solid Works
 
ABC's of Problem Solving
ABC's of Problem SolvingABC's of Problem Solving
ABC's of Problem Solving
 
Let's Make the PAIN Visible!
Let's Make the PAIN Visible!Let's Make the PAIN Visible!
Let's Make the PAIN Visible!
 
Agile tales of creative customer collaboration
Agile tales of creative customer collaborationAgile tales of creative customer collaboration
Agile tales of creative customer collaboration
 
Devops at scale is a hard problem challenges, insights and lessons learned
Devops at scale is a hard problem  challenges, insights and lessons learnedDevops at scale is a hard problem  challenges, insights and lessons learned
Devops at scale is a hard problem challenges, insights and lessons learned
 
Agile Intro and 2014 trends for AgileSparks open day at John-Bryce - March 2014
Agile Intro and 2014 trends for AgileSparks open day at John-Bryce - March 2014Agile Intro and 2014 trends for AgileSparks open day at John-Bryce - March 2014
Agile Intro and 2014 trends for AgileSparks open day at John-Bryce - March 2014
 
The Lego Lean Game (XP 2009 version)
The Lego Lean Game (XP 2009 version)The Lego Lean Game (XP 2009 version)
The Lego Lean Game (XP 2009 version)
 
141015 Discovering Scrum at Scrum Roma
141015 Discovering Scrum at Scrum Roma141015 Discovering Scrum at Scrum Roma
141015 Discovering Scrum at Scrum Roma
 
Lean & agile 101 for Astute Entrepreneurs
Lean & agile 101 for Astute EntrepreneursLean & agile 101 for Astute Entrepreneurs
Lean & agile 101 for Astute Entrepreneurs
 
Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...
Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...
Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...
 
Innovation, Lean, Agile. Myths and Misconception
Innovation, Lean, Agile. Myths and MisconceptionInnovation, Lean, Agile. Myths and Misconception
Innovation, Lean, Agile. Myths and Misconception
 
CTQ Tree Webinar 11-17-2020
CTQ Tree Webinar 11-17-2020CTQ Tree Webinar 11-17-2020
CTQ Tree Webinar 11-17-2020
 
Lean Startup for Smart Entrepreneurs
Lean Startup for Smart EntrepreneursLean Startup for Smart Entrepreneurs
Lean Startup for Smart Entrepreneurs
 

En vedette

Pascal von Rickenbach (GetYourGuide) – Product versus Engineering – Dawn of J...
Pascal von Rickenbach (GetYourGuide) – Product versus Engineering – Dawn of J...Pascal von Rickenbach (GetYourGuide) – Product versus Engineering – Dawn of J...
Pascal von Rickenbach (GetYourGuide) – Product versus Engineering – Dawn of J...Techsylvania
 
Keeping Movies Running Amid Thunderstorms!
Keeping Movies Running Amid Thunderstorms!Keeping Movies Running Amid Thunderstorms!
Keeping Movies Running Amid Thunderstorms!Sid Anand
 
Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
Gluecon 2013 - NetflixOSS Cloud Native Tutorial IntroductionGluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
Gluecon 2013 - NetflixOSS Cloud Native Tutorial IntroductionAdrian Cockcroft
 
SV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source PlatformSV Forum Platform Architecture SIG - Netflix Open Source Platform
SV Forum Platform Architecture SIG - Netflix Open Source PlatformAdrian Cockcroft
 
Canary Analyze All the Things
Operational Insight: Concepts and Examples

  • 1. Operational Insight, June 15, 2015. Roy Rapoport, @royrapoport / linkedin.com/in/royrapoport / rrapoport@netflix.com
  • 2. Oh, The Places We’ll Go! Today, I want to propose a general framework for how to think about operational insight products and features. I’m hoping that this framework is applicable to anyone who performs operations in production. After I propose thinking about operational insight this way, I’ll demonstrate some applications of it within our own operational environments at Netflix.
  • 3. The template we were supposed to use had me start with a slide with the speaker bio, but I want to start with something more relevant and interesting to you: The Korean War, and specifically dogfights during the war.
  • 4. John Boyd. John Boyd was an Air Force pilot at the time; he studied dogfights and concluded that every fighter pilot goes through the same four steps:
  • 5. Observe your environment, orient (figure out what it means), decide what to do, execute that decision, and go back to observing. The pilot who did that faster than their opponent got to go home. OODA’s been used as a general framework for dealing with conflict against other humans, but I’d like to suggest it has much broader applicability.
  • 6. Observe
  • 7. Observe Orient
  • 8. Observe Orient Decide
  • 9. Observe Orient Decide Act
  • 10. Observe Orient Decide Act OODA
  • 11. Observe Orient Decide Act OODA “This approach favors agility over raw power in dealing with human opponents in any endeavor” - Wikipedia
  • 12. This Is What We Do Because even when not dealing with human opponents, anyone dealing with any aspect of operations — dealing with availability events, making decisions about promoting software in production, or … well, making decisions in general — does this all. the. time.
  • 13. For example, this pair of graphs represents the two KPIs by which we know if we have a serious, high-level problem. The top one is the rate of calls into our customer service group; the second one is the rate at which people are actually streaming. Both are over the last seven days. When these dip …
  • 14. Like here, for example.
  • 15. We know we have a problem. We don’t exactly know what’s causing it, or what we’ll do to fix it. We’ll need to understand more about the problem to come to a decision, and then execute on that decision — OODA.
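One pass through the loop described on slides 13 to 15 could be sketched roughly as follows. This is only an illustration, not the tooling used at Netflix: the function names, the 20% tolerance, and the idea of comparing against a baseline window are all assumptions made for the example.

```python
from statistics import mean


def dipped(current, baseline, max_drop=0.20):
    """Orient: is the current KPI window well below its baseline window?

    `current` and `baseline` are lists of samples for the same KPI.
    `max_drop` is a hypothetical tolerance, not a value from the talk.
    """
    base = mean(baseline)
    if base == 0:
        return False  # no baseline traffic, nothing to compare against
    drop = (base - mean(current)) / base
    return drop > max_drop


def ooda_step(observe, act, baseline):
    """Run one Observe / Orient / Decide / Act pass and return the decision."""
    samples = observe()                                # Observe
    problem = dipped(samples, baseline)                # Orient
    decision = "investigate" if problem else "none"    # Decide
    if problem:
        act(decision)                                  # Act
    return decision
```

Here `observe` and `act` are plain callables, so the same loop can be driven by a real metrics query or by canned data in a test.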
  • 16. OODA KPI So if we do OODA all the time, how can we think about doing OODA better or worse? There are three facets we can look at: How fast we execute the loop, how much effort it takes to execute the loop, and how reliably we execute the loop (make the right decision, execute it well).
  • 17. OODA KPI Speed
  • 18. OODA KPI Speed Effort
  • 19. OODA KPI Speed Effort Reliability
  • 20. Winning Speed Effort Reliability So if we were trying to optimize the loop, we’d try to raise speed, drop effort, and raise reliability. Speed and reliability are directly relevant to the consistency of your product’s delivery; effort is more relevant to whether your people are doing high-value work, and whether they’re likely to continue to be happy working for you.
  • 21. Winning Speed Effort Reliability
  • 22. Winning Speed Effort Reliability
  • 23. Winning Speed Effort Reliability
  • 24. Implications … for Observation (aka measurement, telemetry, metrics)
  • 25. Implications … for Observation (aka measurement, telemetry, metrics) • Make It Easy
  • 26. Implications … for Observation (aka measurement, telemetry, metrics) • Make It Easy • Make It Scalable
  • 27. Implications … for Observation (aka measurement, telemetry, metrics) • Make It Easy • Make It Scalable • Make it pluggable
  • 28. Implications … for Observation (aka measurement, telemetry, metrics) • Make It Easy • Make It Scalable • Make it pluggable • (Eventually) Ruthlessly Cull
  • 29. Implications … for Observation (aka measurement, telemetry, metrics) • Make It Easy • Make It Scalable • Make it pluggable • (Eventually) Ruthlessly Cull “What decision will this help me make?”
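The culling question on slide 29 can be made mechanical. The sketch below is an assumption about how one might do it, not something described in the talk: each metric carries a note saying which decision it informs, and anything without one becomes a candidate for removal. The `Metric` class and the metric names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metric:
    name: str
    # The decision this metric helps someone make, if anyone can name one.
    decision: Optional[str] = None


def cull(metrics):
    """Split metrics into (keep, drop) by whether they inform a decision."""
    keep = [m for m in metrics if m.decision]
    drop = [m for m in metrics if not m.decision]
    return keep, drop


keep, drop = cull([
    Metric("streams_per_second", "page on-call when it dips"),
    Metric("mystery_counter"),  # nobody can say what it is for
])
```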
  • 30. A Joke I’d like to tell a very, very long joke. It started at Velocity 2011, when I heard someone at a presentation say “monitor all the things, because you never know what you might find useful one of these days.”
  • 31. This is a graph representing about 380K datapoints, collected once every five minutes since June 2011. It’s a bit mysterious, I know.
  • 32. It may help you to see that the lower and upper bounds of this graph are 48 and 52.
  • 33. % of servers in major region with an even IP address. This graph represents the percent of our cloud instances in a given production region which had an even IP address. We can (and should, and I hope we do) laugh about this graph, but I’d bet you your monitoring system is chock full of similarly useless data; I know mine is. It impacts the cost of the system, but also literally makes your job, and your customers’ jobs if you’re responsible for the telemetry system, harder, because there’s much, much more chaff to wade through.
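To see why a graph like this sits between 48 and 52 and never tells you anything, here is a small sketch that computes the same percentage over a simulated fleet. The helper names and the randomly generated addresses are assumptions for illustration; real instance addresses are not uniformly random, but the low bit of an address is close enough to a coin flip for the point to hold.

```python
import ipaddress
import random


def percent_even(addresses):
    """Percent of IPv4 addresses whose integer value is even."""
    if not addresses:
        return 0.0
    even = sum(
        1 for a in addresses if int(ipaddress.IPv4Address(a)) % 2 == 0
    )
    return 100.0 * even / len(addresses)


def random_fleet(size, seed=0):
    """Generate `size` pseudo-random IPv4 addresses (illustrative only)."""
    rng = random.Random(seed)
    return [str(ipaddress.IPv4Address(rng.getrandbits(32))) for _ in range(size)]


# With thousands of instances the result hovers near 50 no matter what
# the fleet is doing, so no decision can ever depend on it.
print(round(percent_even(random_fleet(10_000)), 1))
```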
  • 34. Implications … for Orientation (aka graphing, visualization)
  • 35. Implications … for Orientation (aka graphing, visualization) • First-class product
  • 36. Implications … for Orientation (aka graphing, visualization) • First-class product • Different decisions require different viz
  • 37. Implications … for Orientation (aka graphing, visualization) • First-class product • Different decisions require different viz • Low cognitive load better than
  • 38. Implications … for Orientation (aka graphing, visualization) • First-class product • Different decisions require different viz • Low cognitive load better than • High refresh rates
  • 39. Implications … for Orientation (aka graphing, visualization) • First-class product • Different decisions require different viz • Low cognitive load better than • High refresh rates • Deep data density
  • 41. Or Better Like That …
  • 42. Implications … for Decisions (aka alerting, real-time analytics, etc) Alerts are a basic, primitive decision. Build on that.
  • 43. Implications … for Decisions (aka alerting, real-time analytics, etc) • You already have (some of) this Alerts are a basic, primitive decision. Build on that.
  • 44. Implications … for Decisions (aka alerting, real-time analytics, etc) • You already have (some of) this • Incremental improvement Alerts are a basic, primitive decision. Build on that.
  • 45. Implications … for Decisions (aka alerting, real-time analytics, etc) • You already have (some of) this • Incremental improvement • Sky’s the limit Alerts are a basic, primitive decision. Build on that.
  • 46. Implications … for Decisions (aka alerting, real-time analytics, etc) • You already have (some of) this • Incremental improvement • Sky’s the limit • For benefits Alerts are a basic, primitive decision. Build on that.
  • 47. Implications … for Decisions (aka alerting, real-time analytics, etc) • You already have (some of) this • Incremental improvement • Sky’s the limit • For benefits • For cost Alerts are a basic, primitive decision. Build on that.
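One way to read “Alerts are a basic, primitive decision. Build on that.” is sketched below. The threshold check is the primitive; the small class layers one incremental improvement on top of it (only fire after several consecutive breaches). All names and numbers are illustrative assumptions, not part of any particular alerting product.

```python
from collections import deque


def threshold_alert(value, limit):
    """The primitive decision: is this single observation over the limit?"""
    return value > limit


class SustainedAlert:
    """Incremental improvement: fire only after `window` consecutive breaches."""

    def __init__(self, limit, window):
        self.limit = limit
        self.recent = deque(maxlen=window)

    def observe(self, value):
        self.recent.append(threshold_alert(value, self.limit))
        # Fire only when the window is full and every entry is a breach.
        return len(self.recent) == self.recent.maxlen and all(self.recent)


# Hypothetical error-rate readings, one per minute.
alert = SustainedAlert(limit=0.05, window=3)
decisions = [alert.observe(v) for v in [0.01, 0.09, 0.08, 0.07, 0.02]]
print(decisions)
```

The same shape extends to richer decisions: swap the primitive for a smarter one and keep the rest.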
  • 48. Implications … for Action If you’re thinking of creating a runbook, AUTOMATE IT.
  • 49. Implications … for Action 1. Humans beat bureaucracy If you’re thinking of creating a runbook, AUTOMATE IT.
  • 50. Implications … for Action 1. Humans beat bureaucracy 2. Machines beat humans If you’re thinking of creating a runbook, AUTOMATE IT.
  • 51. Implications … for Action 1. Humans beat bureaucracy 2. Machines beat humans 3. Repeatability beats one-offs If you’re thinking of creating a runbook, AUTOMATE IT.
  • 52. Implications … for Action 1. Humans beat bureaucracy 2. Machines beat humans 3. Repeatability beats one-offs Repeatable machine processes TROUNCE one-off human bureaucracy If you’re thinking of creating a runbook, AUTOMATE IT.
  • 53. Implications … for Action 1. Humans beat bureaucracy 2. Machines beat humans 3. Repeatability beats one-offs 4. Start with humans Repeatable machine processes TROUNCE one-off human bureaucracy If you’re thinking of creating a runbook, AUTOMATE IT.
  • 54. Implications … for Action 1. Humans beat bureaucracy 2. Machines beat humans 3. Repeatability beats one-offs 4. Start with humans 5. If IFTTT, deprecate humans Repeatable machine processes TROUNCE one-off human bureaucracy If you’re thinking of creating a runbook, AUTOMATE IT.
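If a runbook really is just “if this, then that”, it can be written down as data plus a tiny dispatcher. The sketch below assumes hypothetical observation keys (`disk_full`, `error_rate`) and stand-in actions that only return a description of what they would do; a real version would call whatever remediation tooling you already have.

```python
def restart_service(host):
    # Stand-in action: a real one would call your orchestration tooling.
    return f"restart service on {host}"


def terminate_instance(host):
    # Stand-in action, same caveat as above.
    return f"terminate {host}"


# Each runbook entry pairs a condition ("if this") with an action ("then that").
RUNBOOK = [
    (lambda obs: obs["disk_full"], terminate_instance),
    (lambda obs: obs["error_rate"] > 0.5, restart_service),
]


def act(host, obs):
    """Run the first matching runbook entry; return None if nothing matches."""
    for condition, action in RUNBOOK:
        if condition(obs):
            return action(host)
    return None


print(act("host-1", {"disk_full": False, "error_rate": 0.9}))
```

Starting with humans still makes sense: the entries in the table are whatever the humans found themselves doing repeatedly.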
  • 55. Decision: Do I Have Enough Instances? So let’s talk about a basic capacity quandary: Do I have enough instances in my cluster?
  • 56. I showed this graph earlier. Our work volume is highly diurnal. So we could, if we wanted, make sure our cluster sizes are big enough to support peak workload and just deal with the waste when the workload decreases; instead, we use Amazon’s auto-scaling group feature to automatically scale the clusters up and down in response to demand. So instead of trying to give users better telemetry on utilization, and making it easier for them to see if they need to increase capacity, we just automate that decision (allowing them, of course, to override it whenever they want to).
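The capacity decision described above can be sketched as a simple proportional rule with a manual override. This is an illustrative assumption, not Amazon’s auto-scaling algorithm or our configuration: the function name, the percent-based inputs, and the bounds are all hypothetical.

```python
import math


def desired_instances(current, utilization_pct, target_pct,
                      minimum, maximum, override=None):
    """Scale the cluster so utilization moves toward the target.

    `override` lets a human pin the size, mirroring the "override it
    whenever they want to" escape hatch.
    """
    if override is not None:
        return override
    if current <= 0 or target_pct <= 0:
        raise ValueError("current and target_pct must be positive")
    wanted = math.ceil(current * utilization_pct / target_pct)
    # Never go outside the configured bounds.
    return max(minimum, min(maximum, wanted))


# Hypothetical evening peak: 10 instances running at 90% against a 60% target.
print(desired_instances(10, 90, 60, minimum=2, maximum=100))
```

The same rule scales the cluster back down when the diurnal load drops, which is where the savings over “size for peak” come from.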
  • 58. Decision: Is My Canary Good? We use a deployment pattern called canary, where we compare the new version of the software to the baseline, also in production, and seek to answer a very simple question: Is our canary at least as good as our baseline system?
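A deliberately simplified sketch of the “is the canary at least as good as the baseline?” question follows. It is not the analysis we actually run: the metric names, the 10% tolerance, the 90% pass mark, and the assumption that lower is better for every metric are all made up for illustration.

```python
def canary_score(baseline, canary, tolerance=0.10):
    """Fraction of shared metrics where the canary is at most `tolerance`
    worse than the baseline. Assumes lower values are better."""
    shared = baseline.keys() & canary.keys()
    if not shared:
        raise ValueError("no metrics in common")
    passing = sum(
        1 for name in shared
        if canary[name] <= baseline[name] * (1 + tolerance)
    )
    return passing / len(shared)


def canary_is_good(baseline, canary, required=0.9):
    """The decision itself: pass only if enough metrics held up."""
    return canary_score(baseline, canary) >= required


# Hypothetical per-cluster averages over the same time window.
baseline = {"error_rate": 0.010, "latency_ms": 120.0, "cpu": 40.0}
canary = {"error_rate": 0.0105, "latency_ms": 118.0, "cpu": 55.0}
print(canary_score(baseline, canary), canary_is_good(baseline, canary))
```

Here the canary holds up on errors and latency but burns noticeably more CPU, so it scores 2 out of 3 and fails.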
  • 61. Been there. Done that. Manually. Artisanally. • Started in the Data Center
  • 62. Been there. Done that. Manually. Artisanally. • Started in the Data Center • Manual, dashboard-driven
  • 65. Been there. Done that. Manually. • Context vs Precision
  • 66. Been there. Done that. Manually. • Context vs Precision • No …
  • 67. Been there. Done that. Manually. • Context vs Precision • No … • Repeatability
  • 68. Been there. Done that. Manually. • Context vs Precision • No … • Repeatability • Trending
  • 69. Been there. Done that. Manually. • Context vs Precision • No … • Repeatability • Trending • Manual effort is manual
  • 71. So Now What? • Automate Analysis
  • 72. So Now What? • Automate Analysis • Took Some Effort
  • 73. So Now What? • Automate Analysis • Took Some Effort • Approach and analytics
  • 74. So Now What? • Automate Analysis • Took Some Effort • Approach and analytics • Presentation matters
  • 76. [Diagram] Version Control System; Build & Deployment System; Automated Canary Analysis; Pretty Pictures; Customers; 1000 servers @ 1.0.1; 1 server @ 1.0.2
  • 77. [Diagram] Version Control System; Build & Deployment System; Automated Canary Analysis; Pretty Pictures; Customers; 1000 servers @ 1.0.1; 10 servers @ 1.0.2
  • 78. [Diagram] Version Control System; Build & Deployment System; Automated Canary Analysis; Pretty Pictures; Customers; 1000 servers @ 1.0.1; 1000 servers @ 1.0.2
  • 79. [Diagram] Version Control System; Build & Deployment System; Automated Canary Analysis; Pretty Pictures; Customers; 1000 servers @ 1.0.1; 1000 servers @ 1.0.2
  • 80. [Diagram] Version Control System; Build & Deployment System; Automated Canary Analysis; Pretty Pictures; Customers; 1000 servers @ 1.0.2
  • 81. [Diagram] Version Control System; Build & Deployment System; Automated Canary Analysis; Pretty Pictures; Customers; 1000 servers @ 1.0.1; 1000 servers @ 1.0.2
  • 82. [Diagram] Version Control System; Build & Deployment System; Automated Canary Analysis; Pretty Pictures; Customers; 1000 servers @ 1.0.1; 1000 servers @ 1.0.2
  • 84. Just The Stats 4-Week View 6309 canary analysis cycles
  • 85. Just The Stats 4-Week View 6309 canary analysis cycles 16% canaries failed
  • 86. Decision: Do I Have an Outlier?
  • 87. Outlier Detection In an environment where you have a bunch of potentially-undifferentiated resources that should all behave approximately similarly, it becomes easy — and necessary, in a sufficiently large ecosystem — to notice outliers. If your cost for culling the outliers is low, you can also do it automatically. If not, you can at least alert that One Of These Things Is No Longer Like The Others.
  • 88. Would You Like to Play a Game? Can I have a volunteer from the audience to run an experiment with me?
  • 89. Spot the Outlier So for training, imagine I’m giving you this information about nine servers, named A through I. Each row is a minute’s data for these servers — let’s say it’s load average, or error rates. I’m going to ask you to point out the server — or column — that looks materially different from the others. This should be a relatively easy case, of course. Can you pick the server?
  • 90. OK. Now, I’m going to time you doing the same with more interesting data. Didn’t work so well? OK, let’s make it easier to orient and understand the numbers you’re looking at by showing this to you graphed.
  • 92. It probably is easier, isn’t it? Can you easily point out the outlier? OK, one last test. At the next slide, I’m going to show you some information (you can assume it’s true) and I want you to tell me which is the outlier, OK?
  • 93. The Outlier Is “A” That was … much easier, wasn’t it? This is what happens when we let computers do this work. We could have spent more time and effort to give you a more powerful visualization that would have made it easier to notice the outlier, but we instead built the analytics system that lets us automatically determine outliers, so it won’t make it easier for you to do the work; it will do it for you.
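For a sense of what “let computers do this work” can look like, here is a minimal sketch using a median and median-absolute-deviation (MAD) test across servers. This is a generic technique chosen for illustration, not a description of our analytics system; the readings for servers A through I, the 3.5 cutoff, and the function name are all assumptions.

```python
from statistics import median


def find_outliers(samples, threshold=3.5):
    """Return the names of servers whose mean reading sits far from the pack.

    `samples` maps server name -> list of readings. Uses the modified
    z-score (0.6745 * |x - median| / MAD) as the distance measure.
    """
    means = {name: sum(vals) / len(vals) for name, vals in samples.items()}
    center = median(means.values())
    mad = median(abs(m - center) for m in means.values())
    if mad == 0:
        # Most servers are identical; anything different is an outlier.
        return sorted(n for n, m in means.items() if m != center)
    return sorted(
        n for n, m in means.items()
        if 0.6745 * abs(m - center) / mad > threshold
    )


# Hypothetical load averages: three one-minute readings per server.
readings = {
    "A": [9.0, 9.5, 10.0],
    "B": [1.0, 1.1, 0.9],
    "C": [1.2, 1.0, 1.1],
    "D": [0.9, 1.0, 1.0],
    "E": [1.1, 1.2, 1.0],
    "F": [1.0, 0.9, 1.1],
    "G": [1.0, 1.0, 1.2],
    "H": [0.8, 1.0, 0.9],
    "I": [1.1, 1.0, 1.0],
}
print(find_outliers(readings))
```

Median-based statistics are used here because a single wild server barely moves them, whereas it would drag a mean and standard deviation toward itself.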
  • 94. Just The Stats 4-Week View We can use this for anything: pieces of content, or devices, or ISPs. So far, we’ve been using it for about ten or so clusters of servers, and in the last four weeks we have automatically identified, and terminated, 739 outliers.
  • 95. Just The Stats 4-Week View 739 Server Terminations We can use this for anything: pieces of content, or devices, or ISPs. So far, we’ve been using it for about ten or so clusters of servers, and in the last four weeks we have automatically identified, and terminated, 739 outliers.
  • 96. In a Nutshell Observe Orient Decide Act So really, that’s all there is: Start with observation — you need this in order to get anything done. Then focus on the kinds of decisions that need to take place in your operational environment. Where you continue to need humans to make these decisions, work hard to help them orient by providing better ways to understand the data; where you don’t … have machines do it.
  • 97. In a Nutshell Observe Orient Decide Act Need This First http://bit.ly/nflx-atlas-2013 http://metrics20.org So really, that’s all there is: Start with observation — you need this in order to get anything done. Then focus on the kinds of decisions that need to take place in your operational environment. Where you continue to need humans to make these decisions, work hard to help them orient by providing better ways to understand the data; where you don’t … have machines do it.
  • 98. In a Nutshell Observe Orient Decide Act Need This First http://bit.ly/nflx-atlas-2013 http://metrics20.org Understand the decision http://bit.ly/nflx-qcon-aca-2014 So really, that’s all there is: Start with observation — you need this in order to get anything done. Then focus on the kinds of decisions that need to take place in your operational environment. Where you continue to need humans to make these decisions, work hard to help them orient by providing better ways to understand the data; where you don’t … have machines do it.
  • 99. In a Nutshell Observe Orient Decide Act Need This First http://bit.ly/nflx-atlas-2013 http://metrics20.org Understand the decision http://bit.ly/nflx-qcon-aca-2014 Make it easier for humans So really, that’s all there is: Start with observation — you need this in order to get anything done. Then focus on the kinds of decisions that need to take place in your operational environment. Where you continue to need humans to make these decisions, work hard to help them orient by providing better ways to understand the data; where you don’t … have machines do it.
  • 100. In a Nutshell Observe Orient Decide Act Need This First http://bit.ly/nflx-atlas-2013 http://metrics20.org Understand the decision http://bit.ly/nflx-qcon-aca-2014 Make it easier for humans Make machines do it So really, that’s all there is: Start with observation; you need this in order to get anything done. Then focus on the kinds of decisions that need to take place in your operational environment. Where you continue to need humans to make these decisions, work hard to help them orient by providing better ways to understand the data; where you don’t … have machines do it.
  • 101. In a Nutshell Observe Orient Decide Act Need This First http://bit.ly/nflx-atlas-2013 http://metrics20.org Understand the decision http://bit.ly/nflx-qcon-aca-2014 Make it easier for humans Make machines do it Higher speed Lower effort Higher reliability So really, that’s all there is: Start with observation; you need this in order to get anything done. Then focus on the kinds of decisions that need to take place in your operational environment. Where you continue to need humans to make these decisions, work hard to help them orient by providing better ways to understand the data; where you don’t … have machines do it.