When things go wrong, our judgement is clouded at best, blinded at worst.
To successfully navigate a large-scale outage, being aware of potential gaps in knowledge and context can help produce a better outcome. The Human Factors and Systems Safety community has been studying how people situate themselves, coordinate within a team, use tooling, make decisions, and keep their cool under sometimes very stressful and escalating scenarios. We can learn from this research and adopt a more mature stance for when the s*#t hits the fan.
We’re going to look closely at how people behave under these circumstances, using real-world examples, and survey what we can learn from High Reliability Organizations (HROs) and fields such as aviation, the military, and trauma healthcare.
Velocity EU 2012 Escalating Scenarios: Outage Handling Pitfalls
1. Escalating Scenarios: A Deep Dive Into Outage Pitfalls
John Allspaw
Velocity London 2012
Wednesday, October 3, 12
2. TROUBLESHOOTING
This is NOT about troubleshooting
Or, not just about troubleshooting
3. LAYOUT
• Criteria
• Situational Awareness
• HROs
• Decision Making
• Communication
• Team Coordination
• A little bit of psychology
22. Dr. Richard Cook, Velocity US 2012
http://www.youtube.com/watch?v=R_PDc0HFdP0
23. “The Self-Designing High-Reliability Organization:
Aircraft Carrier Flight Operations at Sea”
Rochlin, La Porte, and Roberts. Naval War College Review 1987
http://govleaders.org/reliability.htm
26. What Goes On In Our Heads?
27. Jens Rasmussen, Senior Member, IEEE
“Skills, Rules, and Knowledge; Signals, Signs, and Symbols, and Other Distinctions in Human Performance Models”
IEEE Transactions on Systems, Man, and Cybernetics, May 1983
28. SKILL-BASED: Simple, routine
RULE-BASED: Knowable, but unfamiliar
KNOWLEDGE-BASED: WTF IS GOING ON?
(Reason, 1990)
29. Situational Awareness
"the perception of elements in the environment within a volume of
time and space, the comprehension of their meaning, and the
projection of their status in the near future,” - (Endsley, 1995)
"keeping track of what is going on around you in a complex,
dynamic environment" (Moray, 2005, p. 4)
"knowing what is going on so you can figure out what to
do" (Adam, 1993)
31. Canonical Work
“Towards a Theory of Situational Awareness”
Mica Endsley, Human Factors (1995)
http://www.satechnologies.com/Papers/pdf/Toward%20a%20Theory%20of%20SA.pdf
32. Situational Awareness
Level I: Perception
Level II: Comprehension
Level III: Projection
33. [Diagram: Endsley's model of situational awareness. Task/System Factors (system capability, interface design, stress and workload, complexity, automation) and Individual Factors (goals and objectives, preconceptions and expectations, abilities, experience, training, long-term memory stores, automaticity, information processing mechanisms) shape the three levels of SA: Level I, perception of elements in the current environment; Level II, comprehension of the current situation; Level III, projection of future status. SA drives decisions and performance of actions, with feedback from the state of the environment. (Endsley)]
46. Level Three
Common Clues you’re losing SA at this level
• Ambiguity
• Fixation
• Confusion
• Lack of Information
• Failure to maintain
• Failure to meet expected checkpoint or target
• Failure to resolve discrepancies
• A bad gut feeling that things are not quite right
48. Characteristics of response to
escalating scenarios
...tend to neglect how processes
develop within time (awareness of
rates) versus assessing how things
are in the moment
“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980
49. Characteristics of response to
escalating scenarios
...have difficulty in dealing with
exponential developments (hard to
imagine how fast something can
change, or accelerate)
“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980
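Doerner's point about exponential developments can be made concrete with a toy sketch (my own illustration, not from the talk): extrapolating the most recent per-interval delta linearly badly underestimates a metric that is actually doubling each interval.

```python
# Hypothetical illustration: linear extrapolation of an exponentially growing
# metric (e.g., a queue depth doubling every interval) underestimates where
# the system will be only a few steps out.

def linear_forecast(samples, steps):
    """Naive forecast: extend the most recent per-interval delta."""
    rate = samples[-1] - samples[-2]
    return samples[-1] + rate * steps

# Queue depth doubling each interval.
observed = [100 * 2**i for i in range(4)]       # [100, 200, 400, 800]

predicted = linear_forecast(observed, steps=3)  # 800 + 400*3 = 2000
actual = 100 * 2**6                             # 6400

print(predicted, actual)  # linear guess is off by more than 3x
```

The gap widens with every additional step, which is exactly why "how fast something can change, or accelerate" is hard to imagine from a couple of recent data points.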
50. Characteristics of response to
escalating scenarios
...inclined to think in causal SERIES,
instead of causal NETS.
A therefore B,
instead of
A, therefore B and C (therefore D and
E), etc.
“On the Difficulties People Have in Dealing With Complexity” Dietrich Doerner, 1980
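A hypothetical sketch (the graph and names are invented for illustration) of thinking in causal nets rather than series: walk the transitive blast radius of a failing component instead of following a single A-therefore-B chain.

```python
from collections import deque

# Invented service dependency graph: a failure in "A" fans out to several
# effects (B and C), which fan out further (D and E) -- a causal NET, not a
# single causal SERIES.
impacts = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
}

def blast_radius(start, impacts):
    """Breadth-first walk of everything transitively affected by `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in impacts.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}

print(sorted(blast_radius("A", impacts)))  # ['B', 'C', 'D', 'E']
```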
51. SA Pitfalls: Requisite Memory Trap
52. SA Pitfalls: Workload, anxiety, fatigue, other stressors
53. SA Pitfalls: Data Overload
54. SA Pitfalls: Misplaced Salience
65. TEAMS
• Divide and conquer applied to problem space,
division of labor
• Incident resolution vs. Problem resolution
• Reproducibility
• Fault Tolerance Effects
66. TEAMS
Shotgun debugging
67. JOINT ACTIVITY
• Interpredictability
• Common Ground
• Directability
http://csel.eng.ohio-state.edu/woods/distributed/CG%20final.pdf
74. Improvisation
“...you can’t improvise on nothing; you got to
improvise on something.”
Charles Mingus
75. [Diagram: a problem-handling cycle. Detect the Problem/Opportunity; Diagnose the problem; Represent the problem; Generate a course of action; Apply Leverage Points; Evaluate.]
76. Communication Recommendations
• Explicitness
• Assertiveness
• Timing
104. ALERT DESIGN
• Signal:Noise can be difficult
• Easy to err on more false alarms
• Decay in trust
• Origins: Undetectable conditions
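The "decay in trust" point follows directly from base rates. A hedged, made-up example: when real incidents are rare, even a detector with good sensitivity and a small false-alarm rate produces mostly false pages.

```python
# Illustration with invented numbers: with rare real incidents, a fairly
# accurate detector still produces mostly false alarms, eroding trust.

def alert_precision(p_incident, sensitivity, false_alarm_rate):
    """P(real incident | alert fired), via Bayes' rule."""
    true_alerts = p_incident * sensitivity
    false_alerts = (1 - p_incident) * false_alarm_rate
    return true_alerts / (true_alerts + false_alerts)

# 1 check in 1000 covers a real incident; detector catches 99% of them,
# but also fires on 2% of healthy checks.
precision = alert_precision(p_incident=0.001, sensitivity=0.99,
                            false_alarm_rate=0.02)
print(round(precision, 3))  # roughly 0.047: over 95% of pages are false
```

This is why "easy to err on more false alarms" is so costly: the responder's prior becomes "it's probably nothing."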
105. ALERT DESIGN
Confirmation
106. ALERT DESIGN
Expectancy
108. ALERT DESIGN
• Don’t make people singularly reliant on alarms
• Support alarm confirmation activities
• Make alarms unambiguous
• Reduce, reduce, reduce false alerts
• Set missed/false alert trade-offs appropriately
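One way to "support alarm confirmation activities" and reduce false alerts, sketched under my own assumptions (this is not code from the talk): require several consecutive threshold breaches before the alarm fires, so a single noisy sample does not page anyone.

```python
# Minimal sketch of a confirmation gate for alerts: fire only after N
# consecutive threshold breaches, suppressing one-off transient spikes.

class ConfirmedAlert:
    def __init__(self, threshold, confirmations):
        self.threshold = threshold
        self.confirmations = confirmations
        self.breaches = 0

    def observe(self, value):
        """Return True only once `confirmations` breaches occur in a row."""
        if value > self.threshold:
            self.breaches += 1
        else:
            self.breaches = 0  # any healthy sample resets the count
        return self.breaches >= self.confirmations

alert = ConfirmedAlert(threshold=90, confirmations=3)
samples = [95, 40, 92, 93, 96]   # one transient spike, then a sustained breach
fired = [alert.observe(s) for s in samples]
print(fired)  # [False, False, False, False, True]
```

The trade-off is deliberate: confirmation delays detection by a few samples in exchange for fewer false alarms, which is one way to "set missed/false alert trade-offs appropriately."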
109. ALERT DESIGN
• Use multiple modalities
• Minimize alarm disruptions to ongoing activities
• Support the assessment/diagnosis of multiple alerts
• Support global SA of systems in an alarm state
110. Mature Role of Automation
“Ironies of Automation” - Lisanne Bainbridge
http://www.bainbrdg.demon.co.uk/Papers/Ironies.html
111. Mature Role of Automation
• Moves humans from manual operator to supervisor
• Extends and augments human abilities, doesn't replace them
• Doesn't remove “human error”
• Is brittle
• Recognizes that there is always discretionary space for humans
• Recognizes the Law of Stretched Systems
113. So what can we do?
“In preparing for battle, I have always
found that plans are useless but planning
is indispensable.”
- Eisenhower
114. So what can we do?
We develop our Non-Technical Skills
• Situational Awareness
• Communication
• Decision Making
• Improvisation
• Crew Resource Management (CRM)
115. So what can we do?
We tailor our environment to adapt
• Tooling to support SA
• Learning from outages (PostMortem)
• Anticipating problems (PreMortem)
• Gather Meta-Metrics