http://dius.com.au/resources/game-day/
Agility has brought us iterative software development, independent feature teams, nimble architectures and distributed, scalable infrastructure. But how do you maintain confidence in these systems in the face of this emergent complexity and fast-paced change? The answer is to anticipate and practice failure!
In this session we explore GameDays, a collaborative exercise where teams safely introduce chaos into their systems, in order to make them better.
12. [Architecture diagram: a vertical slice through the stack - mobile user interface, API gateway, middleware/APIs, mainframe/DB - highlighting integration issues between the layers]
13. [Architecture diagram: the same vertical slice (mobile UI, API gateway, middleware/APIs, mainframe/DB), this time highlighting distributed failures]
14. [Architecture diagram: the vertical slice again, now showing catastrophes - bugs, integration and distributed failures compounding - rippling out to customers, engineers, the call centre and public relations]
15. Classes of issues
■ Bugs
■ Integration issues
■ Distributed failure
■ The squishy stuff: People + Process
16. So how do we avoid becoming front page news?
(the bad kind)
19. Embracing Failure
■ We need to practice failure
■ Software Engineering needs its Fire Drill
20. GameDay: an exercise where we place our systems - technology, people + processes - under stress in order to learn and improve resilience.
21. A GameDay manifesto?

              DR                                   GameDays
Driver        Process                              Continuous improvement
Approach      Run sheet + requirements             Loose plan + a little chaos
Focus         Infrastructure                       Customer
Who           Operations                           Cross-functional, multi-disciplinary team
Assumption    System is built to a robust design   System is hazardous
22. Once you finally start succeeding at agile…
Iterative software development
Independent feature teams
Nimble architectures
Distributed, scalable infrastructure
27. Logistics - how to plan a GameDay
dius.com.au/resources/game-day
■ People and roles to get involved
■ Preparation workshops and planning
■ Templates and checklists
■ Physical space set up
34. Post Mortem
[Diagram: the release dashboard shows a green tick from every load balancer, yet the APIs behind them are failing - no visibility into the real fault]
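The visibility gap on this slide comes from trusting the load balancer's aggregate green light instead of checking each instance directly. A minimal sketch of the alternative, assuming instances expose an HTTP health endpoint (the names and status-code convention here are illustrative, not from the talk):

```python
def aggregate_health(statuses):
    """statuses maps instance name -> HTTP status code returned by its
    health endpoint. The dashboard should only go green when *every*
    instance behind the balancer is healthy, so a single failing API
    is surfaced instead of being averaged away."""
    unhealthy = [name for name, code in sorted(statuses.items()) if code != 200]
    return {"healthy": not unhealthy, "unhealthy_instances": unhealthy}

# One bad instance flips the whole dashboard, naming the culprit.
result = aggregate_health({"api-1": 200, "api-2": 500, "api-3": 200})
```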
35. Ingredients for catastrophe
✓Introduction of a change to the system
✓Human error
✓Missing local controls (tests) to prevent syntax issues
✓Lack of salient information for operator (monitoring and alerting)
✓Opportunity to misinterpret data
✓Distance between expert and operator (process)
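The "missing local controls" ingredient above is the cheapest one to fix. A hedged sketch of such a control - a pre-deploy check that parses a config file and fails fast - where the JSON format and the required keys are purely illustrative, not from the talk:

```python
import json

def validate_config(text, required_keys=("service", "port")):
    """Catch syntax errors and missing fields at commit time, instead of
    discovering them in production during an incident."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return False, f"syntax error: {err}"
    missing = set(required_keys) - config.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"
```

Wired into CI, a check like this removes one of the six ingredients before an operator is ever involved.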
36. What did we learn?
■ Just getting teams together to discuss resilience
was worthwhile
■ We always found something
■ Our experiments reduced the impact of
hindsight bias
37. What matters:
■ Cross-functional team
■ Planning
■ Open to exposing failure
■ Customer focus
■ Bake it in - do GameDays frequently
What doesn’t matter:
■ Size of team/company
■ Waterfall/Agile
■ Language, technology...
38. Are GameDays the new hack days?
■ Collaboration
■ Problem solving
■ Creates business value
41. The journey towards automated resilience testing
Pre-production:
■ Create local experiments in Docker
■ Manual chaos in integrated environments
Production:
■ Start small!
■ Metrics-driven approach
Tools: Chaos Kong, pumba
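The "start small, metrics-driven" advice above can be sketched as a guardrail around any chaos injection (whether pumba killing a container or something larger): watch a customer-facing metric and back out the moment it drifts. Every name below is hypothetical; in practice `sample_error_rate` would wrap your real monitoring query.

```python
def should_abort(error_rates, baseline, tolerance=0.02):
    """Abort if any sampled error rate drifts beyond the agreed
    tolerance above the pre-experiment baseline."""
    return any(rate > baseline + tolerance for rate in error_rates)

def run_experiment(inject_chaos, revert_chaos, sample_error_rate,
                   baseline, samples=5):
    """Inject one small failure, watch the metric, and always revert -
    returns True if the system stayed within tolerance."""
    inject_chaos()
    try:
        observed = [sample_error_rate() for _ in range(samples)]
        return not should_abort(observed, baseline)
    finally:
        revert_chaos()  # restore the system no matter what happened
```

The point of the guardrail is that automated chaos in production is only safe when the abort condition is defined in customer terms before the experiment starts.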
42. Matt Fellows @matthewfellows mfellows@dius.com.au
Pete Cohen @petecohen pcohen@dius.com.au
For links, references, templates and your GameDay toolkit, head to:
dius.com.au/resources/game-day
Thank you!