Service Management in a DevOps World - by Helen Beal

A Guide to Evolving Light-Weight Service Management
Processes for Value Streams Using DevOps Principles

Table of Contents
The Origins of Services
Where ITSM Has Been Painful
Value Stream Centric Service Management
Identifying Current Condition: Value Stream Mapping
1.0 The DevOps Approach to Change
1.1 Long Term Vision: Lightweight, peer-reviewed change
1.2 Current Condition: Change Advisory/Approval Boards
1.3 Next Target Conditions
1.3.1 Reduce Batch Size
1.3.2 Classify Changes and Make Them Visible
1.3.3 Automate the Change ‘Checklist’
1.3.4 Limit the Blast Radius: Canary Testing/Deployment
1.3.5 Address System Dependencies
1.4 Example Experiments
2.0 The DevOps Approach to Release
2.1 Long Term Goal: Teams autonomously release on demand (CD)
2.2 Current Condition: Release weekends, calendars and managers
2.3.1 Define Release and Deploy and Reduce Batch Size
2.3.2 Defer the Release Management Role to the Team
2.3.3 Increase the Availability of Release Slots
2.3.4 Automate the Release ‘Checklist’ and Deployment Process
2.3.5 Limit the Deployment Blast Radius: Blue/Green Deployments
2.3.6 Reducing the Route to Live, Leveraging Cloud
3.0 The DevOps Approach to Security
3.1 Long Term Goal: Checks in the IDE
3.2 Current Condition: Pen tests in Prod
3.3.1 Shifting Left and Automation
3.3.2 DevSecOps Culture and Behaviors
3.3.3 Customer Feedback and Bug Bounties
3.3.4 Software is Like Milk, Not Wine
3.3.5 Belt and Braces
4.0 The DevOps Approach to Support
4.1 Long Term Goal: You build it, you own it, and/or swarming
4.2 Current Condition: 3 tiers
4.3.1 Arrange Around Products
4.3.2 Automating Support: ChatOps and Bots
4.3.3 Automating Support: Knowledge and Self-Service
4.3.4 Telemetry Everywhere and Viewability/Observability
4.3.5 CICD and Intelligent Risk Management
5.0 The DevOps Approach to Incidents
5.1 Long Term Goal: ChatOps in/across teams
5.2 Current Condition: War rooms, incident managers
5.3.1 Blameless Retrospectives and Experimentation
5.3.2 Reframing Failure, Safety Culture and The Andon Cord
5.3.3 Automation: Site Reliability and Chaos Engineering
Conclusion: Flow & Value Stream Management
Further Reading

The Origins of Services
The origins of DevOps lie in agile system administration
and the recognition that whilst software development
teams were taking advantage of agile methodologies
to become more responsive to change and uncertainty,
the IT Operations people were not.
Sometimes they were even oblivious to what was happening on
the other side of the ‘wall of confusion’ and painful tensions and
misunderstandings occurred between the two technology teams: IT
Ops guys grumbled about developers wanting administrator access to
production machines, the developers moaned that IT Ops guys took too
long to provision environments, and releasing an update was always
a ‘hair on fire’ moment that frequently resulted in blame games and
mostly happened at weekends.

The battle between change and stability seemed as if it would rage on.
But DevOps principles have taught us how to balance throughput and
reliability without compromise to either. DevOps is not just practiced by
the ‘born on the web’ behemoths but by an ever increasing number of
traditional enterprises, those who in the past have embraced IT Service
Management (ITSM) approaches to service delivery.
Whilst some organizations consciously or unconsciously drive their
DevOps evolution from their development teams, there always
comes a time when they seek to understand how to optimize service
management activities as part of the end-to-end technology delivery
value stream. Development teams are about agile and IT Ops are about
ITSM, and we can use lean tools to marry the two and create lightweight,
“just-enough” processes that allow both teams to work at the same
cadence.
DevOps has evolved to focus on the end-to-end optimization of the
value stream, accelerating flow from idea to value realization. How we
handle and manage key technology delivery services changes when our
primary goals are to optimize the flow of value and system integrity.

Where ITSM Has Been Painful
Traditional ITSM processes were designed for all the right reasons:
to protect us, to improve our predictability and to enable common
understanding. Yet they are frequently accused of being onerous and of
blocking the flow of value from idea to realization for the customer.
In the past, when we have experienced an issue or problem, a typical
response has been to add a control; this is why large enterprises that
have operated for a significant amount of time are often bogged
down in bureaucracy - layers of process that have built up over time.
Additionally, these types of organizations have evolved system and
organizational designs (not necessarily by intention) that contain large
numbers of (sometimes unknown) dependencies, exacerbating the
sense, and actuality, of fragility.
DevOps seeks to improve sustainable working practices and reduce
workplace burnout and stress. It remodels the ways of working
to improve velocity, consistency and predictability, visualizing the
flow of work and removing constraints. Our focus moves from
managing dependencies, to breaking them to create loosely coupled
organizational and technology systems that allow us to build, test and
deploy in small increments.

ITSM has, inadvertently, caused some key constraints which have
led to working practices that frustrate people because they slow
people down. But these same working practices were introduced to
avoid catastrophic failures caused by chaos and unknown or unseen
dependencies.
Take the Change Advisory Board, which in many organizations has
morphed into the Change Approval Board (the difference is subtle, but
palpable): it adds wait time to value streams and is often perceived
to add no real value to the customer experience. The change and
release calendars and checklists are often similarly reviled and
not valued.
Painful working practices related to these are release weekends and
nights, the subsequent war room when a large batch release goes bad,
and project-centric cultures that are typified by a culture of
meetings, irregular demand flows (feast and famine) and spiralling
technical debt.

Value Stream Centric Service
Management
Approaching service management activities such as
change, release, security, support and incident with a
DevOps hat on changes the way we work to allow for
improved adaptability whilst not forgetting what we’ve
learned about ensuring customer experience.
Underpinning this is the principle of little and often;
more frequent inspection and smaller work packages
allowing us to receive feedback more often to course
correct more frequently and regularly.

The following table summarizes how these activities change in a value
stream centric world and describes a midway step to consider as an
organization transitions from one capability to another.

Activity | Current Condition                      | Next Target Condition                   | Long Term Vision
Change   | CABs                                   | Typifying change, automating checklists | Teams peer review their own changes
Release  | Release weekends, calendars & managers | More frequent, automated releases       | Teams autonomously release on demand
Security | Pen tests in prod                      | Automatic scanning in CI                | Checks in IDE
Support  | 3 tiers                                | ChatOps, automated customer feedback    | You build it, you own it, and/or swarming
Incident | War rooms, incident managers           | Healthy retrospectives                  | ChatOps in/across teams

It uses the lean improvement kata approach where we first look at the
long term vision, seek to understand the current condition and identify
the next target condition. Organizations should seek to experiment using
the Deming PDCA (Plan-Do-Check-Act) cycle.

Each of these five areas (change, release, security, support and
incident) is explored in detail in this paper. A key to achieving the
desired capabilities is the focus on breaking dependencies to allow for
loosely coupled systems and structures; Conway’s Law tells us that any
organization will design systems that mirror its communication and
organization structure.
An organization with many large silos that pass off to each other
creates monolithic ‘big balls of mud’, whereas an organization with
small, autonomous teams creates a microservices architecture where
components are loosely connected and can be changed, tested and
deployed independently of one another.
“Organizations which design systems … are
constrained to produce designs which are
copies of the communication structures of
these organizations.”
- Melvin Conway

Identifying Current Condition:
Value Stream Mapping
In DevOps the realization of value is the core focus of teams; the
definition of done moves from “I did my job” to “the customer has
received value”. When an organization is in transition from a traditional
waterfall way of working to a more adaptable, agile way of working, it
can be hard to see what changes should be made and how. Using a
lean tool, Value Stream Mapping, is a highly effective way of reaching
understanding of the current condition and consensus on the activities
needed to make improvement.
A value stream is anything that delivers a product
or a service and is made up of several activities or
processes that start when an idea presents itself and
is complete when the customer receives the value
derived from that idea.

Value Stream Mapping requires that a group of people, with
representation from each activity or process in the value stream, share
the same physical space while they visually collaborate to map how the
activities connect together, how long each takes and how long work waits
before each step starts, thus calculating the cycle time for value
delivery.
It’s often the first time a particular group of people have been in a room
together and provides a qualitative and quantitative time diagnostic
of the value stream. It’s here that people begin to fully appreciate how
and why particular processes, for example change, cause delays in
the delivery of value and work together to imagine, measure and plan
improvements.
The improvements are based on principles around queuing and batch
size: when we map a value stream we can see how large batch sizes
create queues which consequently increase our lead and cycle time. It
highlights how risk is reduced when we make our work packages smaller
as we receive faster feedback. This will directly impact architectural
decisions and how we seek to reduce the route to live.
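The timing arithmetic behind a value stream map can be sketched in a few
lines of Python. This is a minimal illustration rather than a prescribed
tool: the step names and hour figures are invented, and flow efficiency
(work time divided by total lead time) is a common lean measure that the
text above does not name explicitly.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    work_hours: float  # time spent actively working in this step
    wait_hours: float  # time the work sits in a queue before the step starts


def lead_time(steps):
    """Total elapsed time from idea to customer value: all work plus all waiting."""
    return sum(s.work_hours + s.wait_hours for s in steps)


def flow_efficiency(steps):
    """Share of the lead time in which work is actually being done."""
    total = lead_time(steps)
    return sum(s.work_hours for s in steps) / total if total else 0.0


def biggest_queue(steps):
    """The step that waits longest, usually the first candidate for improvement."""
    return max(steps, key=lambda s: s.wait_hours)


# Hypothetical figures captured during a mapping workshop.
value_stream = [
    Step("develop", work_hours=16, wait_hours=4),
    Step("change approval", work_hours=1, wait_hours=80),
    Step("release", work_hours=3, wait_hours=40),
]

print(lead_time(value_stream))           # 144
print(biggest_queue(value_stream).name)  # change approval
```

In this made-up stream only 20 of the 144 hours are spent working; the
rest is queueing, which is the pattern the mapping exercise is meant to
expose.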

1.0 The DevOps Approach to Change
“There’s a right way to handle the change approval
process, and it leads to improvements in speed and
stability and reductions in burnout. Heavyweight
change approval processes, such as change approval
boards, negatively impact speed and stability.”
- 2019 Accelerate State of DevOps Report

There are other problems with change approval boards too - they are
often seen as a ‘checklist’ exercise, performed by people who have no
real understanding of the nature or impact of the change itself. Having
a change calendar constrains teams from being able to release their
changes when they are ready or they want to and inevitably slows
them down.
1.1 Long Term Vision: Lightweight, peer-reviewed change
A lightweight change process is a peer-reviewed change process that is
owned by the team. Changes are so small and so frequent they only take
a short time to be checked, approved and released. All of the testing has
already happened (it is automated in our continuous delivery pipeline)
and we have good test data and production-like test environments to
deploy to. Our systems are sufficiently decoupled.
1.2 Current Condition: Change Advisory/Approval Boards
As part of a value stream mapping exercise, the team asks if each step or
activity in the value stream is ‘value-adding’, i.e. does work happen here
that directly creates value for the customer? For the change approval
board the answer is never yes, so we should seek to remove this step,
whilst ensuring that the purpose of the activity (the protection of our
services from failure) is not lost.

1.3 Next Target Conditions
We mustn’t lose sight of the fact that we need these controls to protect
ourselves from chaotic change. This requires that teams have clarity and
understanding of the change process, that everyone who needs visibility
of the changes has it, and that proper procedures are followed.
1.3.1 Reduce Batch Size
The first step is for individual teams to reduce the batch size of their
changes. This will have a direct effect on the length of the queues and
also the amount of risk each change carries. Whilst doing this, the team
needs to make the small changes visible, probably by using a product
backlog tool and expressing them as user stories. Now that the team has
smaller changes, they will need to meet more frequently to review their
progress, ideally using an agile framework. The work in progress is made
visible via physical or virtual boards.
1.3.2 Classify Changes and Make Them Visible
As the changes become smaller and segregated out from a batch,
it becomes possible to classify changes. Teams may use terms such
as “standard”, “small” or “emergency”. Working with the change
management people, the team agree to experiment with making
changes with their Product Owner only (not the CAB) approving them.
They make these changes visible to the change management people and
do not schedule them on the change calendar.

They agree with the change management people what checks they will
perform themselves before the change is released to customers and
record that these checks have been undertaken, ideally in a workflow in
the product backlog.
Once the team has proven themselves reliable, they gain autonomy to
increase the amount of change they perform outside of the traditional
change process.
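One way to picture the classification step is as a small routing rule.
The sketch below is an assumption-laden illustration: the `Change`
fields, the five-file threshold and the choice to send only “small” and
“standard” changes to the Product Owner are all hypothetical, and a real
team would agree its own criteria with the change management people.

```python
from dataclasses import dataclass


@dataclass
class Change:
    title: str
    files_touched: int
    is_emergency: bool = False
    is_preapproved_type: bool = False  # a routine, repeatable kind of change


def classify(change, small_limit=5):
    """Return one of the labels used in the text, or 'normal' for everything else."""
    if change.is_emergency:
        return "emergency"
    if change.is_preapproved_type:
        return "standard"
    if change.files_touched <= small_limit:  # hypothetical size threshold
        return "small"
    return "normal"


def approver(change):
    """Assumed routing: small and standard changes skip the CAB."""
    if classify(change) in {"small", "standard"}:
        return "product owner"
    return "CAB"


print(approver(Change("fix typo on login page", files_touched=1)))    # product owner
print(approver(Change("replace payment provider", files_touched=40)))  # CAB
```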

1.3.3 Automate the Change ‘Checklist’
Working with the change management people, the team can identify
the change ‘checklist’ and automate it. It’s likely to demand certain
tests are done, for example unit, integration and user acceptance tests.
Higher levels of fluency also include non-functional tests for security
(see section 3.0) and performance.
Having smaller changes and using trunk-based development (where
there are a small number of short-lived feature branches) in a
continuous integration and delivery (CICD) pipeline demands these
gates are completed before deployment. Peer-review of code and the
change can also be baked into the product backlog workflow; this is
fantastic for auditors, as is the version control that is the foundation
of the CICD pipeline, as it not only shows how the process steps are
followed but provides actual proof that they are.
Adding monitoring means that the team receive customer feedback fast
and have access to fast fault diagnosis. Using automated deployment
(see Section 2.0) will give the team the opportunity to instantly redeploy
last known good state (caveat: not all failures are a change failure
relating directly to the last change and this can get more complex when
we are delivering changes more frequently, but the alternative is to try
to identify which change in a large batch caused the problem).
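An automated change ‘checklist’ can be as simple as a gate that refuses
to approve until every agreed check has reported success. The following
is a sketch only: the check names are examples taken from the tests
listed above, and the record format is invented to show the kind of
evidence an auditor might be given.

```python
# Checks agreed with the change management people (example names only).
REQUIRED_CHECKS = ("unit_tests", "integration_tests", "acceptance_tests", "peer_review")


def evaluate_checklist(results, required=REQUIRED_CHECKS):
    """Return (ok, missing); missing lists checks that failed or never ran."""
    missing = [name for name in required if not results.get(name, False)]
    return (not missing, missing)


def audit_record(change_id, results):
    """Build a record showing which checks were satisfied for a change."""
    ok, missing = evaluate_checklist(results)
    return {"change": change_id, "approved": ok, "failed_or_missing": missing}


record = audit_record(
    "CHG-1",  # hypothetical identifier
    {"unit_tests": True, "integration_tests": True, "peer_review": True},
)
print(record["approved"])           # False
print(record["failed_or_missing"])  # ['acceptance_tests']
```

Treating a check that never ran the same as a failed check is a
deliberate choice here: the gate stays closed unless there is positive
evidence.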

1.3.4 Limit the Blast Radius: Canary Testing/Deployment
If we recognise that the point of the change controls is to protect us
from catastrophic failure, let’s define that. If the definition includes
making a change that fails for everyone, we can tackle the ‘everyone’
element by making the change for only a few: canary testing or
deployment. If it works for a few, we can then push it out to more,
many and all. If it doesn’t, we revert, learn and try again.
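The ‘few, more, many, all’ progression can be sketched as a loop that
widens exposure in stages and reverts at the first sign of trouble. The
stage percentages and the three callables (`deploy`, `healthy`,
`revert`) are placeholders for whatever deployment and monitoring
tooling a team actually uses.

```python
def canary_rollout(deploy, healthy, revert, stages=(1, 10, 50, 100)):
    """Widen exposure stage by stage; revert and stop at the first unhealthy stage.

    deploy(percent), healthy(percent) and revert() are supplied by the caller.
    Returns the percentage of users left on the new version.
    """
    for percent in stages:
        deploy(percent)
        if not healthy(percent):
            revert()
            return 0
    return stages[-1]


# Stand-in functions: the change misbehaves once half of the users see it.
log = []
result = canary_rollout(
    deploy=lambda p: log.append(f"deploy {p}%"),
    healthy=lambda p: p < 50,
    revert=lambda: log.append("revert"),
)
print(result)  # 0
print(log)     # ['deploy 1%', 'deploy 10%', 'deploy 50%', 'revert']
```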
1.3.5 Address System Dependencies
Since much of the requirement for centrally coordinated change comes
from tightly coupled systems and incidents caused by unpredictable
system dependencies, reducing these dependencies is key to protecting
the teams from system fragility. Teams take ownership of these
architectural discussions and drive cross-team conversations through
communities of practice/interest or agile at scale techniques such as
Scrum of Scrums. Additionally, organizations practice inner-source
where teams can see and change (with visibility and peer-review) each
other’s systems.
Conway’s Law tells us that we will design systems that look like our
organizational communication structures; if our teams are autonomous
and loosely coupled, so will be our systems architecture. Using a
microservices and API model leads us to a place where we can test
and deploy small pieces independently. It does give us more pieces to
manage but that’s the trade-off.

1.4 Example Experiments
“If the team completes a small change themselves next week, ensuring
that the change is visible in Jira and versioned in GitLab, that Jenkins
automates the build and runs the unit and integration tests, and that
we peer-review the code and change, there won’t be a change failure as
a result.”
“We believe that when we classify changes as ‘small’, 60% of our
changes won’t need to go through CAB and our lead time will reduce on
average by one week in the next 4 months.”
“Our architect thinks that we can uncover and break 20 dependencies
by the end of the year if they are flagged in the Scrum of Scrums and
20% of every product team’s sprint is allocated to this activity.”
“I hypothesize that if we create a workflow in Jira this week that won’t
allow a build to go green until the tests are passed and peer-review is
complete, our change fail rate will drop over the next three months
and over 10 sprints the central change team and CAB will accept this as
evidence that we have followed their procedures and allow us change
autonomy. Auditing will take no time at all when it comes around
next March.”

2.0 The DevOps Approach to Release
We start here with a couple of key principles:
1. Release weekends are bad.
2. The DevOps ‘little and often’ approach: releases should be ‘like
breathing’, not ‘hair on fire’ moments.

For many organizations, it’s important to ensure everyone understands
what is meant by ‘release’ and ‘deploy’, as usage varies widely. Often
people prepare a release, deploy it to production and then release it
to customers. These distinctions become less important as DevOps
fluency improves, but as teams and organizations evolve, they need to
know they are speaking the same language.
2.1 Long Term Goal: Teams autonomously release on demand (CD)
Here our teams can release their new features and fixes whenever they
are ready. Their continuous delivery pipeline ensures that software
is always in a releasable state and they may also have continuous
deployment - on successful completion of all the tests, the change is
automatically deployed and released into production.
2.2 Current Condition: Release weekends, calendars and
managers
Traditional ITSM processes have taught us to create release packages;
large bundles of features. This is clearly at odds with our
DevOps principle of little and often where we reduce the risk of
deploying a change by making it smaller. Because we have large, high
risk releases we then schedule them and have people to manage
these schedules. Teams have to wait until their slots in the calendar
become available in order to perform their deployment to production
and release to customers.

2.3 Next Target Conditions
We want to balance two of the four key DevOps metrics here: the
throughput metric for deployment frequency and the stability metric for
change fail rate. This will also reduce our lead time, and we don’t want
to cause incidents that will cause us to measure our Mean Time to
Recovery (see Section 5.0).
Once again, value stream mapping is likely to show in traditional
ways of working that the release and deployment process is
lengthy, particularly if teams are required to interact with a release
management team to book slots on a release calendar.
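Two of the metrics named above, deployment frequency and change fail
rate, are straightforward to calculate once deployments are recorded.
The sketch below assumes a simple record of deployment dates and a
true/false flag for whether each deployment failed; both the data shape
and the weekly unit are illustrative choices.

```python
from datetime import date


def deployment_frequency(deploy_dates, start, end):
    """Average deployments per week between start and end (both inclusive)."""
    days = (end - start).days + 1
    in_window = [d for d in deploy_dates if start <= d <= end]
    return len(in_window) / (days / 7)


def change_fail_rate(failed_flags):
    """Fraction of deployments that failed; failed_flags holds True for a failure."""
    return sum(failed_flags) / len(failed_flags) if failed_flags else 0.0


# Invented history: four deployments in a two-week window, one of which failed.
deploys = [date(2024, 1, 2), date(2024, 1, 5), date(2024, 1, 9), date(2024, 1, 12)]
print(deployment_frequency(deploys, date(2024, 1, 1), date(2024, 1, 14)))  # 2.0
print(change_fail_rate([False, True, False, False]))                       # 0.25
```

Tracking the two together matters: pushing frequency up is only an
improvement if the fail rate does not rise with it.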
2.3.1 Define Release and Deploy and Reduce Batch Size
As covered in Section 1.0, reducing the batch size reduces the risk and
the queueing time. It’s important that people in an organization know
what is meant when the words ‘release’ and ‘deploy’ are used and they
do vary from organization to organization. People talk about ‘deploying
a release’ or ‘releasing to production’.

When changes are small they can be deployed easily with reduced
risk of disruption and the distinction between the terms becomes less
important.
2.3.2 Defer the Release Management Role to the Team
When teams are working autonomously with small changes they can
release them when they are ready. But it takes time to reach that
place, and the path from one state to the other needs thought.
In a traditional way of working, there is likely to be a release
manager or a team of release managers who are coordinating the
release process. The release manager role can be transitioned to the
team, with the team giving access to systems that allow the release
manager visibility into releases that are happening.
2.3.3 Increase the Availability of Release Slots
Initially, when teams start using agile frameworks such as Scrum, they
will aim to release at the end of a two week sprint, or perhaps at the
end of several sprints. If the organization is working with a release
calendar that may have quarterly or monthly release slots, they should
look to increase the number of slots available to allow for the smaller
and more frequent changes. In time, whether teams are using sprint
or Kanban ways of working, they will evolve to releasing on demand
or continuous delivery. At this point, no type of release calendar or
management is required as the teams operate autonomously.

2.3.4 Automate the Release ‘Checklist’ and Deployment Process
As with change, many organizations operate with a release checklist to
ensure agreed policies and procedures are met. As with change, many of
these steps, such as versioning and testing can be automated in the CICD
pipeline and teams will release themselves; they build it, they own it.
As with change, release and deployment autonomy is highly dependent
on system autonomy but where systems remain tightly coupled, release
management tools are also available to track and manage these system
dependencies. These systems can also profile the risk associated with
release. Deployment automation tools as part of the CICD pipeline
further improve predictability in the process by reducing the manual effort
associated with these tasks and providing patterns or templates that
reduce configuration drift and allow for self-service in the teams.
2.3.5 Limit the Deployment Blast Radius: Blue/Green Deployments
Organizations use the canary testing/deployment scenario described
in Section 1.3.4 and also use feature toggles/flags and blue/green
deployments to mitigate deployment failure risk. Toggling features on or
off separates feature release from code deployment allowing code to be
deployed to production while restricting access (through configuration)
to a subset of users. It also allows unfinished code to undergo integration
testing whilst remaining inaccessible when live and allows for A/B testing
and canary testing/deployment.
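A feature toggle that exposes deployed code to a subset of users can be
sketched with a stable hash of the user identifier. The flag structure
and names below are hypothetical, and hashing into 100 buckets is one
common technique rather than the method of any particular product.

```python
import hashlib


def bucket(user_id, flag_name):
    """Map a user to a stable bucket from 0 to 99 for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag, user_id):
    """flag is a dict such as {"name": "new-checkout", "enabled": True, "percent": 10}."""
    if not flag["enabled"]:
        return False  # code is deployed but the feature is not released
    return bucket(user_id, flag["name"]) < flag["percent"]


flag = {"name": "new-checkout", "enabled": True, "percent": 10}
exposed = [u for u in range(1000) if is_enabled(flag, u)]
print(len(exposed))  # roughly a tenth of the 1000 users
```

Because the bucket depends only on the flag name and the user, each user
gets a consistent experience, and raising `percent` widens the audience
without a new deployment.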

Blue-green deployment is a technique that reduces downtime and risk by
running two identical production environments called Blue and Green. At
any time, only one of the environments is live, with the live environment
serving all production traffic. For this example, Blue is currently live and
Green is idle.
As a new version of software is prepared, deployment and the final stage
of testing takes place in the environment that is not live: in this example,
Green. Once the software is deployed and fully tested in Green, the
router switches incoming requests to Green instead of Blue. Green is
now live, and Blue is idle. This can also help with reducing the Route to
Live (RtL), which reduces handoffs and opportunities for problems and
improves flow.
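The switch described above can be modelled as a router that only ever
points at one of two environments. This is a toy model for illustration:
the class name, the `smoke_test` callable and the version strings are
invented, and a real router would be a load balancer or DNS change.

```python
class BlueGreenRouter:
    """Toy model of two identical environments with one live at a time."""

    def __init__(self):
        self.versions = {"blue": None, "green": None}
        self.live = "blue"

    @property
    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version):
        """New software always goes to the environment that is not live."""
        self.versions[self.idle] = version

    def switch(self, smoke_test):
        """Cut traffic over only if the idle environment passes its final test."""
        if not smoke_test(self.versions[self.idle]):
            return False
        self.live = self.idle
        return True

    def serve(self):
        return self.versions[self.live]


router = BlueGreenRouter()
router.versions["blue"] = "1.0"           # Blue is currently live
router.deploy_to_idle("1.1")              # deploy and test in Green
router.switch(lambda v: v is not None)    # router now sends traffic to Green
print(router.live, router.serve())        # green 1.1
```

Blue still holds version 1.0 after the switch, so calling `switch` again
is an immediate route back to the last known good state.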

2.3.6 Reducing the Route to Live, Leveraging Cloud
Many organizations have complex RtLs containing multiple test
environments and experience difficulties in production since these
environments are not production-like. Teams also frequently have to
share these environments and find it difficult to obtain good test data.
The factor that most commonly prevents teams from having access to
production-like test environments is cost. Using cloud technologies
(public, private, hybrid or multi) can ease the pain here, allowing
teams to easily spin up test environments as and when they are needed;
research shows that using these types of technologies correlates with
higher performing organizations.
Working in small increments, using blue/green deployments,
automating testing, embedding testing in the team and Test Driven
Development (TDD) all contribute to a reduction in the number of
steps in the RtL, reducing the risk and accelerating the flow of value.
Once more, value stream mapping uncovers how much time is spent
stepping through the RtL.
TDD is a software development process that relies on the repetition of
a very short development cycle: first the developer writes an (initially
failing) automated test case that defines a desired improvement or new
function, then produces the minimum amount of code to pass that test,
and finally refactors the new code to acceptable standards.
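The TDD cycle can be shown with a deliberately tiny example. The
`slugify` function and its tests are invented for illustration; what
matters is the order in which the pieces were written, which the
comments record.

```python
# Step 1: the tests were written first. Run before slugify existed, they failed.
def test_lowercases_and_joins_words_with_hyphens():
    assert slugify("Release On Demand") == "release-on-demand"


def test_collapses_repeated_whitespace():
    assert slugify("  blue   green ") == "blue-green"


# Step 2: the minimum code needed to make both tests pass.
# Step 3: refactor; here the first passing version was already tidy.
def slugify(title):
    return "-".join(title.lower().split())


# A test runner would normally discover these; they are called directly
# here so the example is self-contained.
for test in (test_lowercases_and_joins_words_with_hyphens,
             test_collapses_repeated_whitespace):
    test()
print("all tests pass")
```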

3.0 The DevOps Approach to Security
As well as DevOps, we have DevSecOps. Whilst not all in the industry
are comfortable with the addition of another term (it has the potential
to confuse people and create additional silos and handoffs), it
recognizes that security has been late to the party, or that their
invitation was sent late.
In many organizations security represents a severe constraint,
unsurprisingly since there are many reports of cybersecurity skills
shortages, and security teams are often significantly separated from
the rest of the technology team. It’s not uncommon, when performing
value stream mapping exercises, to find delays of several weeks while
teams wait for penetration tests.

There are many who say that security is just another test and just
another non-functional requirement, and whilst elements of this are true,
it’s also true that the extreme separation of the security team and their
often being seen as a ‘black-box’ means that incorporating them into
the pipeline earlier (shifting left) is more difficult to do than with some
other areas of testing. For example, it’s relatively easy for developers to
start incorporating unit tests as part of their automated build process.
Automated integration tests and user acceptance tests follow fast.
3.1 Long Term Goal: Checks in the IDE
Here the security tests are pushed as far left as technically possible;
into the developers’ hands, providing developers with the knowledge
they need about vulnerabilities in the components that they are
accessing from their IDE in the artifact repository and handing them
control over the software supply chain.
3.2 Current Condition: Pen tests in Prod
Most organizations perform regular or sporadic penetration tests or
vulnerability assessments in production and many are required by
regulators to do so and audited to ensure they happen. They can be
done either manually or using tools, typically a combination of the two,
and produce a report that is then passed to developers who work the
actions into their backlog. Or not.

3.3 Next Target Conditions
Ultimately the security constraint is broken so there is no wait time for
security activities to complete and the teams are confident that their
product is as uncompromisable as possible. We break the security
constraint through culture and the sharing of knowledge and through
automating checks and remediation.
3.3.1 Shifting Left and Automation
As described, in DevSecOps security testing happens much earlier than
penetration testing in production (although, in many cases this may still
need to happen, not just for auditing purposes but for configuration cases
also). Where the teams are using artifacts, the repositories can be used to
scan and flag for vulnerabilities at the point of software composition. The
developer can be informed as they access a component of its vulnerability
status and advised if another version fits the organization’s security
policies better. If the teams don’t want developers interrupted in this way,
non-compliant vulnerabilities can break the build in the CICD pipeline.
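Breaking the build on non-compliant vulnerabilities amounts to comparing
each component against known advisories and a policy threshold. In this
sketch the component names, the advisory data and the four severity
levels are all invented; a real pipeline would take them from its
repository or scanning tool.

```python
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def violations(components, advisories, max_allowed="medium"):
    """Return (name, version, severity) for every finding above the policy limit."""
    limit = SEVERITY_ORDER.index(max_allowed)
    found = []
    for name, version in components:
        severity = advisories.get((name, version))
        if severity and SEVERITY_ORDER.index(severity) > limit:
            found.append((name, version, severity))
    return found


def build_passes(components, advisories, max_allowed="medium"):
    """The pipeline step fails the build when any violation exists."""
    return not violations(components, advisories, max_allowed)


# Hypothetical components used by the application and known advisories.
components = [("web-framework", "2.3.0"), ("json-parser", "1.4.1")]
advisories = {("json-parser", "1.4.1"): "high", ("web-framework", "2.3.0"): "low"}

print(build_passes(components, advisories))  # False
print(violations(components, advisories))    # [('json-parser', '1.4.1', 'high')]
```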
Static and Dynamic Application Security Testing (SAST and DAST) are
also used to test the source code and the application when it is running.
IAST (Interactive Application Security Testing) analyzes code for security
vulnerabilities while the application is running from inside the application
and reports in real-time. As cloud and CICD proliferate machines,
automated identity management tools are also recommended.

3.3.2 DevSecOps Culture and Behaviors
The relationship between development and security is fractious in
many organizations, with security believing that developers don’t care
about security and developers feeling that security are overly zealous,
detailed and don’t understand the myriad of pressures that they
are under.
An effective pattern is to have security people work in a product team
or feature squad on a temporary basis. Whilst there may not be a lot
of security people to go around (some refer to the 100:10:1 ratio of
developers:operations:security), the payoff is worth it as there are
two key benefits: the first is the building of empathy and relationships
and the second is knowledge transfer as the 80:20 rule applies here:
80% of the security issues relate to 20% of the knowledge. This 20%
of knowledge is relatively easy for the engineers to access, retain and
share in this scenario.
Developers do care about security, since they care deeply about their
code, particularly when they are transitioned to a ‘you build it, you
own it’ way of working.
They also care about the customer experience and about the organization
that they work for - few people are ignorant of the wide ranging
impact on company performance and reputation that a breach causes.
However, they are focused first on new features that deliver value and
then on improving the way in which they deliver value.
Although we aim for multi-functional, ‘comb’-shaped people, nobody
can know everything and to expect a developer to know of, understand
and be able to remediate every possible vulnerability is unreasonable.
To ask them to be aware of and follow visible coding policies and use
tools that break the knowledge constraint is not unreasonable.
3.3.3 Customer Feedback and Bug Bounties
In DevOps ways of working the focus is on the customer and the flow of
value to them (The First Way). The Second Way teaches us to shorten and
amplify our feedback loops. Highly evolved and performant organizations
seek feedback from customers and the market on security too; they
understand that transparency leads to trust.
Having a public bug bounty programme is an effective way of
collaborating with customers and the market to receive feedback and
improve security posture.

3.3.4 Software is Like Milk, Not Wine
New vulnerabilities are found and appear constantly so software that
passed its security tests today may not tomorrow. Tools are available that
continuously assess the bill of materials in applications and offer teams
fast remediation capabilities. We can look forward to a future where
products are automatically updated with security vulnerability fixes.
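The ‘milk, not wine’ point is that the same bill of materials can pass
one day and fail the next, purely because new advisories have been
published. A minimal sketch, with invented component names and advisory
data, of re-assessing an unchanged application:

```python
def assess(bill_of_materials, advisories):
    """Return {component: fixed_version} for components with a known vulnerability.

    advisories maps (name, version) to the first version that contains the fix.
    """
    return {
        name: advisories[(name, version)]
        for name, version in bill_of_materials.items()
        if (name, version) in advisories
    }


# The application has not changed between the two assessments.
bom = {"web-framework": "2.3.0", "json-parser": "1.4.1"}
advisories_monday = {}
advisories_tuesday = {("json-parser", "1.4.1"): "1.4.2"}  # newly published

print(assess(bom, advisories_monday))   # {}
print(assess(bom, advisories_tuesday))  # {'json-parser': '1.4.2'}
```

Returning the fixed version alongside the finding is what makes fast
remediation possible: the team knows what to upgrade to as well as what
is wrong.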
3.3.5 Belt and Braces
Data breaches aren’t the only way for threat actors to cause problems
with the operation and safety of an organization’s products. They can
do other things, like distributed denial of service attacks for example. In
order to protect yourself from these sorts of attacks you’ll need support
from a cloud vendor or a specialist security vendor in this space.
Whilst shifting security left and continuously scanning products in
production for vulnerable components goes an enormous distance in
protection against breaches, it’s doubtful that human penetration testing
or vulnerability assessments on products in production will be in the past
any time soon.
Not only do regulators continue to require evidence of these activities,
but humans are also infinitely creative and will find configurations and
routes into systems that may not directly relate to a specific vulnerable
artifact.

“My hypothesis is that if we launch a bug bounty programme in January,
then by the end of the first quarter, fifteen vulnerabilities of which we
were unaware will have been brought to our attention and it will have
cost us $15,000 from the bug bounty payout budget.”
“As a developer, I believe I’ll fix 100% of security vulnerabilities on the
same day if I know about them in my IDE. At the moment, I have 35
outstanding user stories in Jira flagged as issues found in a vulnerability
assessment and they are between six and sixteen weeks old. I will be able
to close all of them within 3 months using a tool in my IDE.”
“If we introduce IAST into the CICD pipeline, we’ll be able to reduce our
spend on production penetration testing by 30% per annum.”
“If I automate the management of our machine identities, then our
penetration tests will find no vulnerabilities as a result of, and we
will suffer no data breaches traceable to, expired or misconfigured
certificates.”

4.0 The DevOps Approach to Support
Support people are typically the lowest paid and least
respected in the technology hierarchy. Strange, when
they are on the frontline, dealing with our customers,
our reason for being, on a daily basis.
The Second Way in DevOps is to amplify and shorten
feedback loops - and in Value Stream Management
we are particularly interested in customer feedback.
So whilst the function of a support role is to fix
customer problems, it’s also to sense customer
sentiment and identify value delivery opportunities.

4.1 Long Term Goal: You build it, you own it, and/or swarming
This way of working is centered around small (because of what we’ve
learned about how humans build trust and social connections),
autonomous (because we don’t want them to have to wait for decisions
to be made on their behalf and because we hired them because they
are capable of doing this themselves, and best-placed), multifunctional
(because we don’t want them having to wait for other teams to do stuff
for them) teams. They change and run their product. This isn’t about
giving developers ‘pagers’; this is about having end-to-end ownership of a
value stream.
4.2 Current Condition: 3 tiers
As with all the traditional ITSM patterns described here, there are good
reasons for why they have been widely implemented, and for some time
they worked. But the world keeps turning and right now, digital disruption
demands we all change the way that we work to optimize flow through a
value stream.
Having a support or service desk makes less sense when our users
experience few problems or are mostly able to resolve them themselves
using online documentation. If we want to shorten a feedback loop, it’s
best not to have multiple handoffs through teams - delays don’t help with
our flow or with delighting our customers.

Tiers create queues of work in progress which we seek to minimize as
queuing creates delays. Whilst the tiered approach is intended to ‘protect’
the ‘best’ (read: most expensive) staff from trivial customer issues (is there
such a thing?), when we seek to put the customer at the center of all we
do and want them to have optimized service, why would we put our best
people at the back of the process?
So instead of streaming, we move to swarming.
There are several models organizations work with, but
they all follow these broad principles:
• There should be no tiered support teams or hierarchy
• There should be no escalations from one team to another
• Issues should move directly to the person most likely to be able
to resolve them
• The person who takes the issue is the one who sees it through
to resolution
Swarming isn’t solely for Severity 1 issues or incidents (see Section 5.0 for
more); it establishes teams whose priority is to ensure that the issue gets
to the right person as fast as possible and that it receives attention as
soon as possible.
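One way to picture the "straight to the person most likely to resolve it" principle is a simple skill match with no tiers in between. This is a minimal sketch under assumed names (`Engineer`, `assign_issue`, the skill tags and the tie-break on workload are all illustrative), not a description of any particular swarming tool.

```python
from dataclasses import dataclass


@dataclass
class Engineer:
    name: str
    skills: set
    open_issues: int = 0


def assign_issue(issue_tags, engineers):
    """Pick the engineer whose skills overlap most with the issue tags.

    Ties go to whoever currently has the fewest open issues.
    """
    if not engineers:
        raise ValueError("no engineers available to swarm")
    best = max(engineers, key=lambda e: (len(e.skills & issue_tags), -e.open_issues))
    best.open_issues += 1  # the owner keeps the issue through to resolution
    return best


# Illustrative team and issue.
team = [
    Engineer("Asha", {"payments", "api"}),
    Engineer("Ben", {"ui"}, open_issues=2),
    Engineer("Chen", {"payments", "database"}, open_issues=1),
]
owner = assign_issue({"payments", "database"}, team)
print(f"Issue assigned to {owner.name}")
```

There is deliberately no escalation path in the sketch: the issue has one owner from the moment it arrives.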

4.3.1 Arrange Around Products
Having small, autonomous and multi-functional teams arranged around
products is the foundation to the ‘you build it, you own it’ mantra. Many
agile transitions start by bringing developers and testers into the same
team along with the ideation capabilities (Product Owners and business
analysis roles).
DevOps and value stream thinking bring Ops capabilities into the team
too and many teams start with support roles. This isn’t simply about
putting the developers on 24/7 call duties but about automating the front
end of support as far as possible and getting the issue in front of the right
person as soon as possible.
DevOps balances throughput and stability so as organizations improve
their posture, teams experience a reduction in the volume of issues
and a shortening of resolution time. When teams are dedicated solely
to support issue resolution, they often find Kanban a suitable way of
managing the flow of work. Where teams are working in development
sprints, they may find it helpful to record unplanned work and practice
assigning a percentage of the sprint to it. Unplanned work is an effective
proxy metric for quality and when measured is extremely useful when
teams want to assign time to invest in paying down technical debt.
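As a rough sketch of how a team might measure unplanned work and size the share of a sprint to reserve for it, the ratio can be computed from story points per sprint. The point totals below are made-up illustrative data, and using the recent average as the reserve is an assumption, not a rule from this paper.

```python
def unplanned_work_ratio(planned_points, unplanned_points):
    """Fraction of delivered work in a sprint that was unplanned."""
    total = planned_points + unplanned_points
    if total == 0:
        return 0.0
    return unplanned_points / total


# (planned points, unplanned points) for the last three sprints (illustrative).
sprints = [(30, 10), (32, 8), (36, 4)]
ratios = [unplanned_work_ratio(planned, unplanned) for planned, unplanned in sprints]

# Assumption: reserve the recent average for unplanned work next sprint.
reserve = sum(ratios) / len(ratios)
print(f"Reserve about {reserve:.0%} of the next sprint for unplanned work")
```

A falling ratio across sprints is the signal that quality is improving and that reserved capacity can be redirected to paying down technical debt.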

4.3.2 Automating Support: ChatOps and Bots
ChatOps is the use of a group messaging tool integrated with
the DevOps toolchain. Chat channels can be created as needed
(typically for an incident) or in permanent use (typically for a theme
for a particular product). Section 5.0 describes an incident
management use case for ChatOps. A swarming support use case might
allow the receiver of the customer issue to access a specific backlog
channel and request interaction from that product team, or the team
may have their own channels for support issues relating to items such
as a payment gateway.
The service desk can also encourage customers/consumers of their
service to interact via online chat once they have been guided through
available topics and support artifacts in a knowledge base. Bots
can try to resolve the issue initially and as needed the issue can be
automatically routed to the team and swarmed from there.

4.3.3 Automating Support: Knowledge and Self-Service
Many people don’t enjoy committing extended periods to writing and
documentation; however, to optimize a value stream, ‘just enough’
documentation is key. Underpinning this then is the ‘little and often’
principle; ensuring that small pieces are documented frequently at
source and held in a repository that is easily searchable and visible.
This takes the burden off the support team as people can find and resolve
common issues themselves, leaving the support swarms to work with
the edge cases.
4.3.4 Telemetry Everywhere and Viewability/Observability
Much of the waste in the support value stream is in the fault diagnosis
(after we’ve removed delays through handoffs in a tiered model) so the
team needs data to help them identify unknown and unusual issues.
Support teams are frequently poorly supported by tooling, other than
ticketing systems, so providing the product teams with tools that
radiate telemetry means everyone in the team can benefit.
Application monitoring and logging tools accelerate the identification
of the root cause(s) of an issue (and these should be used in
pre-production too). It’s then over to the team to fix it fast, and their
CICD pipeline will help validate and deploy the fix at speed. It will be
an emergency fix or a small change, so they won’t be slowed down by CAB
or the release schedule.

This type of tool also provides customer journey insights and real-time
feedback on the business value of features and changes that the
whole team can use in the sprint reviews to check the outcome of their
hypotheses and in their sprint planning to set up their next round of
experiments.
4.3.5 CICD and Intelligent Risk Management
Once a team is collaborating on a shared and visible backlog and is
proficient in performing continuous delivery, they will have reduced
their incidents and improved their MTTR. AI tools that assess the risk
of a release help teams decide when to act and whom to pre-warn.
Having this data visible to central release teams
provides evidence, builds trust and earns the right
to autonomy.

“We believe that if we set up a backlog swarm, we can resolve 50% of
backlog items over 6 months old in 4 working weeks.”
“My hypothesis is that if we have an incident swarm using ChatOps,
we’ll reduce our MTTR by 70%.”
“Implementing an application performance management tool by the
end of the month means that by the end of next month we’ll see our
fault diagnosis time drop by at least 20%.”
“Making our knowledge base publicly searchable will likely reduce the
volume of tickets by 25% within 6 months.”
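A hypothesis such as the MTTR one above can only be checked if MTTR is measured the same way before and after the experiment. The following is a minimal sketch assuming each incident is recorded as a pair of detected and restored timestamps; the incident data is invented for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """incidents: list of (detected_at, restored_at) datetimes. Returns a timedelta."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((restored - detected for detected, restored in incidents), timedelta())
    return total / len(incidents)


def percent_reduction(before, after):
    """Percentage by which 'after' is lower than 'before' (both timedeltas)."""
    return (before - after) / before * 100


start = datetime(2024, 1, 1, 9, 0)
# Illustrative incidents before and after introducing an incident swarm.
before_incidents = [
    (start, start + timedelta(minutes=100)),
    (start, start + timedelta(minutes=140)),
]
after_incidents = [
    (start, start + timedelta(minutes=30)),
    (start, start + timedelta(minutes=42)),
]

before_mttr = mean_time_to_restore(before_incidents)
after_mttr = mean_time_to_restore(after_incidents)
reduction = percent_reduction(before_mttr, after_mttr)
print(f"MTTR went from {before_mttr} to {after_mttr} ({reduction:.0f}% reduction)")
```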

5.0 The DevOps Approach to Incidents
“Incidents are unplanned investments; their costs
have already been incurred. Your org’s challenge
is to get ROI on those events. Right now, in most
companies, this ROI is left sitting in the dark because
of the “template-driven” approaches and “action
item” myopia.”
- John Allspaw

We are taught, in all cultures, from an early age, that
failure is to be avoided at all costs, and that it’s shameful
and humiliating. It’s only as we grow up and experience
more in life that we realise failures are not only
inevitable, but useful for learning and light the path
to success.
In many large enterprises there is deep-seated fear of failure
(understandably so since many organizations operate infrastructure
whose availability is critical to many). Incidents will happen; however,
DevOps practices allow us to increase the flow of work through the
value stream whilst increasing stability so more value delivered does
not equal more incidents to deal with.
5.1 Long Term Goal: ChatOps in/across teams
The goal of incident management is to restore service as soon as
possible and, arguably more importantly, learn from the incident. ChatOps
supports this goal in two key ways. Firstly, it allows teams to swarm
through a channel in real time so that everyone has everything visible
through a single pane of glass (contrast this to some people being in
a room, on a conference call, various team members logged into and
observing various systems) and records the progress and process.
Secondly, the team has access to their DevOps toolchain and can both
receive information and issue commands from the chat window.

5.2 Current Condition: War rooms, incident managers
The cultural driver for DevOps is the creation of a working space in
which people can be their best and most productive selves; removing
risk of burnout and nurturing autonomy, mastery and purpose. ‘War
rooms’ immediately set a sense of crisis and conflict.
Whilst the sense of urgency when a Severity 1 issue or incident occurs
should not be diminished, a number of steps can be taken to move
from a place where incidents are catastrophic and to be avoided at all
costs to one where impact is minimal and they are valued as a learning
opportunity.
DevOps regularly seeks to decentralize activities, especially when
they have been centralized in order to manage dependencies. Since
autonomy reduces handoffs and queueing, assigning an incident
manager from a separate team because systems are so complex is
unlikely to be the fastest way to restore service.
Ultimately the volume of incidents, or at least the time spent dealing
with them, should be as close to nil as possible since they are the main
disruptor of the delivery of planned work or value to the customer.

5.3.1 Blameless Retrospectives and Experimentation
Rather than having war-rooms, swarm an incident and once service is
restored, hold a blameless retrospective over ChatOps. Agree learnings
and write actions as experiments and save the chat log to the ticket
in the backlog. Close the ticket only once the initial experiments are
complete.
5.3.2 Reframing Failure, Safety Culture and The Andon Cord
Another tool from the kings of Lean, Toyota, the Andon Cord is used
in a manufacturing pipeline to raise an issue. But what’s important
about it is the behavior and culture it created. Workers were encouraged
and empowered to highlight potential defects with the knowledge that
their leaders wanted to know about them and fix them at the earliest
opportunity before they continued downstream. Much can be taken
from the Andon Cord: that successful leaders embrace and are grateful
for learning opportunities and encourage their teams to self-discover,
that fixing the problem immediately and preventing it from proceeding
downstream is key to building a quality product and that people are
psychologically safe when they are not afraid to point out mistakes or
try new things.
Safety culture can be broadly defined as a place where all in an
organization share a view on how best to mitigate risk in their
environment and they prioritize learning over failure and create
mechanisms to protect themselves from catastrophic failure.

In an environment where these mechanisms are discovered, perhaps
through value stream mapping, to be slowing the flow, using the
mechanisms described here for change, release, security, support and
incident management accelerates the delivery of value.
5.3.3 Automation: Site Reliability and Chaos Engineering
Several of the automation techniques we have already discussed in this
paper help either to reduce the likelihood of major incidents happening
(CICD, limited blast radius) or make them more manageable (telemetry,
ChatOps, automated deployment). Organizations also look to Site
Reliability Engineering (SRE) to improve their stability posture.
“SRE is fundamentally doing work that has historically
been done by an operations team, but using
engineers with software expertise and banking on
the fact that these engineers are inherently both
predisposed to, and have the ability to, substitute
automation for human labor. In general, an SRE team
is responsible for availability, latency, performance,
efficiency, change management, monitoring,
emergency response, and capacity planning.”
- Ben Treynor, founder of SRE at Google

Some organizations have teams of SREs, others look to embed this
role in product or feature teams/squads. Whichever model is used, the
principle is to increase the focus on antifragility and SRE has this goal
in common with Chaos Engineering. The best known example of Chaos
Engineering is Netflix’s Chaos Monkey which is essentially a fire drill.
With an actual fire.
“We believe if we use chaos engineering to practice incident recovery 4
times this year, we’ll find ways to improve that will reduce our MTTR by
50% next year.”
“I hypothesize that if I ask two members of my product team, one whose
background is in development and the other in system administration, to
extend their skillsets to include site reliability engineering skills,
they will cross-skill each other and buddy up. As a result, our change
fail rate will drop by 5% in 6 months.”
“My experiment says that if we can only close our incident tickets when
all experiments have been completed, we will be able to document
25 key learnings in our knowledgebase in the first quarter of the new
practice.”

Conclusion: Flow & Value Stream Management
Taking a value stream approach to service delivery puts
the priority on optimization of the flow of work from
the idea to the realization of the value in the hands of
the customer.
Necessarily it demands a rethink of the traditional
approaches and organizational practices, just as
becoming agile and product focused demands we
rethink an inherently waterfall and project centric
approach.

Value Stream Mapping is an extremely valuable and effective method
for quantifying the cycle time, waste and cost associated with
delivering an iteration of a product or service. It also provides a great
deal of qualitative data through the visual collaboration and human
conversation it drives.
Good value stream mapping exercises are held regularly and deliver
backlogs of improvements which are steadily and iteratively worked
through. The disadvantage of Value Stream Mapping is that it’s a
human driven and opinion driven process and whilst those opinions are
mostly accurate (and a big part of the value stream mapping process
is understanding the system and building empathy for counterparts
in the end to end lifecycle of the product or service) they struggle to
provide data as evidence.
Since improvements in value stream flow are likely to necessitate
significant and far-reaching decisions about things like the roles in
the organization, the organizational design, how work is funded and
how investments are prioritized, it’s helpful for the people making
those decisions to be as well-informed as possible and able to monitor
feedback, learnings and evolutionary progress over time.
Following our telemetry everywhere mantra, it’s best to support the
human-driven value stream mapping efforts with data-driven value
stream management evidence.

Choices can be made when building a CICD pipeline or DevOps
toolchain about the traceability of value through the value delivery
lifecycle. Teams can build integrations between the tools themselves
or use available connectors and APIs (but this might make it difficult
to swap tools out as needs inevitably change), or integration brokers
can be used to pass the feature/code from one tool to another as it
progresses.
Since we want feedback for learning, we want all of this to be visible,
so some organizations use dashboards. But when a dashboard is
effectively just screenscraping data from a number of tools and
presenting it in a single pane of glass, it’s very difficult to understand
the end to end cycle time of delivering a piece of value.
Value Stream Management tooling allows simple integration within a
toolchain, which future-proofs for ongoing evolution, and collects data
that not only shows the cycle time but also where it’s slow and risky,
providing insights for improvements.
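To make the idea of end-to-end cycle time concrete, here is a small sketch that derives the time spent in each stage from the timestamps at which a work item entered it, then reports the total and the slowest stage. The stage names and dates are hypothetical, and real Value Stream Management tooling would collect these events from the toolchain rather than from a hand-written list.

```python
from datetime import datetime, timedelta

# Hypothetical stage-entry timestamps for a single work item.
events = [
    ("idea", datetime(2024, 3, 1, 9)),
    ("development", datetime(2024, 3, 4, 9)),
    ("change approval", datetime(2024, 3, 6, 9)),
    ("release", datetime(2024, 3, 13, 9)),
    ("in production", datetime(2024, 3, 14, 9)),
]


def stage_durations(events):
    """Time in each stage = gap between entering it and entering the next stage."""
    ordered = sorted(events, key=lambda event: event[1])
    return {
        stage: next_time - entered
        for (stage, entered), (_, next_time) in zip(ordered, ordered[1:])
    }


durations = stage_durations(events)
cycle_time = sum(durations.values(), timedelta())
slowest = max(durations, key=durations.get)
print(f"Cycle time: {cycle_time.days} days; slowest stage: {slowest}")
```

The point of the per-stage breakdown is that a dashboard showing only tool-level metrics cannot tell you which handoff is responsible for most of the elapsed time.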

Further Reading
Learn how to use value streams to accelerate DevOps
transformation at your organization and become a
software juggernaut.
What is Value Stream Management?
Learn DevOps: Enterprise DevOps at Scale
CI/CD Tools Universe: 100+ Tools

About the Author
Helen Beal helps people practice DevOps principles in real
world organizations for Ranger4. She describes herself as
a DevOpsologist as her main role in her working life is to
study the inputs and outputs of the thinking systems that
make up DevOps and what value outcomes they deliver
and we can measure.
Helen is also a product owner and DevOps Ambassador
for London at the DevOps Institute, a DevOps editor for
InfoQ and writes for a number of online platforms.
Outside of DevOps she is an ecologist and novelist. She
once saw a flamingo lay an egg and has a particular
fondness for llamas.
