** Recording available at https://www.youtube.com/watch?v=sHNOjUtbq2s **
Failure happens! It's our job to turn these disruptions into learning opportunities. As our software has become more distributed and complex, the "shift-left" movement brings reliability testing to earlier stages of development. Ensuring reliability goes beyond simple end-to-end tests.
To ensure the highest levels of reliability, you must perform a suite of testing types: contract tests to validate APIs, load tests for predictable scaling. Let's learn from Chaos Engineering principles by incorporating disruptive behavior into your system _before_ production.
Join Paul as we learn ways to incorporate a variety of testing types into your software development pipeline. We'll discuss the pros and cons of each and what you can do to add these to your processes.
By embracing a little disruption, you can significantly improve the reliability of your system.
9. Overview
1. Why are we here?
2. A brief history of how we test
3. How fault injection can help us
4. Where do we go from here?
10. The way we test
Checklist / OLD WAY
● Release frequency: Before releases
● How it's initiated: Manually
● Testing environment: Test and Production
● Testing frequency: Quarterly or biannually
Consequences:
● QA bottleneck
● Lower coverage
● Late in process
11. The way we test
DevOps / MODERN WAY
● Release frequency: Weekly, daily, or as needed
● How it's initiated: Scheduled; automatically as part of CI/CD
● Testing environment: Staging (long-lived) and ephemeral environments (short-lived)
● Testing frequency: Nightly, on feature branches, and continuously with synthetic monitoring
Checklist / OLD WAY
● Release frequency: Before releases
● How it's initiated: Manually
● Testing environment: Test and Production
● Testing frequency: Quarterly or biannually
12. The way we test
● Unit testing
● Integration testing
● Contract testing
● Functional testing
● E2E testing
● Load testing
13. The way we test
● Applications instrumented
● Observability platform available
14. The way we test
● Start simple
● Test frequently
● Continually expand
● Evolve over time
19. Build more confidence to withstand failures?
[Diagram: "Learn from Incidents" applies to production systems; "Chaos Testing" shifts that learning left into development]
20. "From the distributed system perspective, almost all interesting availability experiments can be driven by affecting latency or response type."
Nora Jones and Casey Rosenthal, Chaos Engineering (O'Reilly)
21. Introducing k6 and xk6-disruptor
k6, a reliability testing tool
● Formerly known as Load Impact
● Open source since 2016
● ~22.4k GitHub stars (as of January 2024)
● Promotes "shift-left" testing
● Acquired by Grafana Labs in 2021
github.com/grafana/k6
xk6-disruptor for fault injection
● Became a project in August 2022, evolved from previous experiments
github.com/grafana/xk6-disruptor
22. k6: a reliability testing tool
● Open source: OSS is at the heart of what we do and helps leave the world a little better than we found it
● Scriptable: CLI and API designed for automating your tests with pass/fail criteria using JavaScript syntax
● Performant: a k6 engine written in Go, making it one of the best-performing tools available
● Extensible: use Go(lang) code to add support for new outputs, protocols, and products from within your test scripts
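The scriptable pass/fail criteria mentioned above can be sketched as a minimal k6 script. The target URL, virtual-user count, and threshold values below are illustrative placeholders, not recommendations; the script requires the `k6` CLI to run.

```javascript
// Minimal k6 load test sketch -- URL and threshold values are placeholders.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 10,          // 10 virtual users
  duration: '30s',
  thresholds: {
    // Fail the run if the 95th percentile response time exceeds 500ms
    http_req_duration: ['p(95)<500'],
    // Fail the run if more than 1% of requests fail
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  const res = http.get('https://test.k6.io'); // placeholder endpoint
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```

Run it with `k6 run script.js`; the thresholds turn the run into an automatable pass/fail gate for CI/CD.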
24. How effective is testing known errors?
92% of catastrophic system failures are the result of incorrect handling of non-fatal errors.
In 58% of the cases, the resulting faults could have been detected through simple testing of error-handling code.
"Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems", Yuan et al., USENIX OSDI 2014
25. How to improve error handling?
In 35% of the cases, error-handling code falls into one of three patterns:
● Overreactive: aborts the system under non-fatal errors
● Low context: empty, or contains only a log-printing statement
● Incomplete: related comments like "FIXME" or "TODO"
26. Introduce chaos testing
● Incorporate chaos engineering principles early in the development process
● Emphasize verification over experimentation
● Change focus from uncovering unknown faults to ensuring proper handling of known faults
33. Chaos testing principles in action
● Tests can be reused to validate the system under turbulent conditions
● Conditions are defined in familiar terms: latency and error rate
● Tests have a controlled effect on the target service
● Tests are repeatable, with predictable results
● Fault injection is coordinated from the test code
● Fault injection should not add any operational complexity
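These principles can be sketched with xk6-disruptor, which coordinates fault injection from the test script itself. The service name, namespace, endpoint, and fault values below are placeholders; this requires a k6 binary built with the disruptor extension and access to a Kubernetes cluster running the target service.

```javascript
// Sketch: fault injection coordinated from the test code with xk6-disruptor.
// Service name, namespace, URL, and fault values are placeholders.
import { ServiceDisruptor } from 'k6/x/disruptor';
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    disrupt: {
      executor: 'shared-iterations',
      iterations: 1,
      exec: 'disrupt',
    },
    load: {
      executor: 'constant-vus',
      vus: 5,
      duration: '30s',
      exec: 'load',
      startTime: '5s', // start the load once faults are being injected
    },
  },
};

export function disrupt() {
  const disruptor = new ServiceDisruptor('my-service', 'default'); // placeholders
  // Conditions in familiar terms: add latency and an error rate, for 30s
  disruptor.injectHTTPFaults(
    { averageDelay: '100ms', errorRate: 0.1, errorCode: 500 },
    '30s'
  );
}

export function load() {
  const res = http.get('http://my-service.default.svc/'); // placeholder endpoint
  check(res, { 'request handled': (r) => r.status === 200 });
}
```

The same load test is reused under turbulent conditions, and the blast radius stays limited to the one targeted service.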
37. Final remarks
● The ability to operate reliably should not be a privilege of the technology elite
● Chaos Engineering can be democratized by promoting the adoption of Chaos Testing
● To be effective, Chaos Testing must be compatible with the existing testing practices used by development teams
38. Our Goal
Make Chaos Engineering practices accessible to a broad spectrum of organizations by building a solid foundation from which they can progress towards more reliable applications.
39. Thanks for participating!
Connect with Paul as @javaducky or linkedin/in/pabalogh
k6.io/slack | grafana/xk6-disruptor