** Recording available at https://www.youtube.com/watch?v=sHNOjUtbq2s **
Failure happens! It's our job to turn these disruptions into learning opportunities. As our software has become more distributed and complex, the "shift-left" movement brings reliability testing to earlier stages of development. Ensuring reliability goes beyond simple end-to-end tests.
To ensure the highest levels of reliability, you must perform a suite of testing types: contract tests to validate APIs, load tests for predictable scaling. Let's learn from Chaos Engineering principles by incorporating disruptive behavior into your system _before_ production.
Join Paul as we learn ways to incorporate a variety of testing types into your software development pipeline. We'll discuss the pros and cons of each and what you can do to add these to your processes.
By embracing a little disruption, you can significantly improve the reliability of your system.
9. Overview
1. Why are we here?
2. A brief history of how we test
3. How fault injection can help us
4. Where do we go from here?
10. The way we test
Checklist / OLD WAY
● Release frequency: Before releases
● How it's initiated: Manually
● Testing environment: Test and Production
● Testing frequency: Quarterly or biannually
Consequences:
● QA bottleneck
● Lower coverage
● Late in process
11. The way we test
DevOps / MODERN WAY
● Release frequency: Weekly, daily, or as needed
● How it's initiated: Scheduled; automatically as part of CI/CD
● Testing environment: Staging (long-lived) and ephemeral environments (short-lived)
● Testing frequency: Nightly, on feature branches, and continuously with synthetic monitoring
Checklist / OLD WAY
● Release frequency: Before releases
● How it's initiated: Manually
● Testing environment: Test and Production
● Testing frequency: Quarterly or biannually
12. The way we test
● Unit testing
● Integration testing
● Contract testing
● Functional testing
● E2E testing
● Load testing
13. The way we test
● Applications instrumented
● Observability platform available
14. The way we test
● Start simple
● Test frequently
● Continually expand
● Evolve over time
19. Build more confidence to withstand failures?
[Diagram: "Learn from Incidents" applies to production systems; "Chaos Testing" shifts that learning left into development]
20. "From the distributed system perspective, almost all interesting availability experiments can be driven by affecting latency or response type."
Nora Jones and Casey Rosenthal, Chaos Engineering (O'Reilly)
21. Introducing k6 and xk6-disruptor
k6, a reliability testing tool
● Formerly known as Load Impact
● Open source since 2016
● ~22.4k GitHub stars (as of January 2024)
● Promotes "shift-left" testing
● Acquired by Grafana Labs in 2021
github.com/grafana/k6
xk6-disruptor for fault injection
● Became a project in August 2022, evolved from previous experiments
github.com/grafana/xk6-disruptor
22. k6: a reliability testing tool
● Open source: OSS is at the heart of what we do and helps leave the world a little better than we found it
● Scriptable: CLI and API designed for automating your tests with pass/fail criteria using JavaScript syntax
● Performant: a k6 engine written in Go, making it one of the best-performing tools available
● Extensible: use Go(lang) code to add support for new outputs, protocols, and products from within your test scripts
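The scriptable pass/fail criteria mentioned above can be sketched as a minimal k6 script. The target URL, virtual-user count, and threshold values below are illustrative placeholders, not recommendations; the script requires the `k6` CLI to run.

```javascript
// Minimal k6 load test sketch -- URL and threshold values are placeholders.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 10,          // 10 virtual users
  duration: '30s',
  thresholds: {
    // Fail the run if the 95th percentile response time exceeds 500ms
    http_req_duration: ['p(95)<500'],
    // Fail the run if more than 1% of requests fail
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  const res = http.get('https://test.k6.io'); // placeholder endpoint
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```

Run it with `k6 run script.js`; the thresholds turn the run into an automatable pass/fail gate for CI/CD.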
24. How effective is testing known errors?
92% of catastrophic system failures are the result of incorrect handling of non-fatal errors.
In 58% of the cases, the resulting faults could have been detected through simple testing of error-handling code.
"Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems", Yuan et al., USENIX OSDI 2014
25. How to improve error handling?
In 35% of the cases, error-handling code falls into one of three patterns:
● Overreactive: aborts the system under non-fatal errors
● Low context: empty, or contains only a log-printing statement
● Incomplete: related comments like "FIXME" or "TODO"
26. Introduce chaos testing
● Incorporate chaos engineering principles early in the development process
● Emphasize verification over experimentation
● Change focus from uncovering unknown faults to ensuring proper handling of known faults
33. Chaos testing principles in action
● Tests can be reused to validate the system under turbulent conditions
● Conditions are defined in familiar terms: latency and error rate
● Tests have a controlled effect on the target service
● Tests are repeatable, with predictable results
● Fault injection is coordinated from the test code
● Fault injection should not add any operational complexity
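These principles can be sketched with xk6-disruptor, which coordinates fault injection from the test script itself. The service name, namespace, endpoint, and fault values below are placeholders; this requires a k6 binary built with the disruptor extension and access to a Kubernetes cluster running the target service.

```javascript
// Sketch: fault injection coordinated from the test code with xk6-disruptor.
// Service name, namespace, URL, and fault values are placeholders.
import { ServiceDisruptor } from 'k6/x/disruptor';
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    disrupt: {
      executor: 'shared-iterations',
      iterations: 1,
      exec: 'disrupt',
    },
    load: {
      executor: 'constant-vus',
      vus: 5,
      duration: '30s',
      exec: 'load',
      startTime: '5s', // start the load once faults are being injected
    },
  },
};

export function disrupt() {
  const disruptor = new ServiceDisruptor('my-service', 'default'); // placeholders
  // Conditions in familiar terms: add latency and an error rate, for 30s
  disruptor.injectHTTPFaults(
    { averageDelay: '100ms', errorRate: 0.1, errorCode: 500 },
    '30s'
  );
}

export function load() {
  const res = http.get('http://my-service.default.svc/'); // placeholder endpoint
  check(res, { 'request handled': (r) => r.status === 200 });
}
```

The same load test is reused under turbulent conditions, and the blast radius stays limited to the one targeted service.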
37. Final remarks
● The ability to operate reliably should not be a privilege of the technology elite
● Chaos Engineering can be democratized by promoting the adoption of Chaos Testing
● To be effective, Chaos Testing must be compatible with the existing testing practices used by development teams
38. Our Goal
Make Chaos Engineering practices accessible to a broad spectrum of organizations by building a solid foundation from which they can progress towards more reliable applications.
39. Thanks for participating!
Connect with Paul as @javaducky or linkedin/in/pabalogh
k6.io/slack | grafana/xk6-disruptor