3. OBJECTIVE
Put working software into production as quickly as possible, while minimising the risk of load-related problems:
• Poor response times
• Lack of capacity
• Insufficient availability
• Excessive system resource use
All of this within the context of websites.
6. DECIDE WHAT TO TEST
• Focus on the busiest instant
• Model the most-hit functionality
• Extrapolate to the expected load
• Look at production traffic
• Or, failing that, make an educated guess (see the log-mining sketch below)
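A minimal sketch of that log mining, assuming an Apache-style Common Log Format access log; the file name and field positions are illustrative:

```python
# Find the busiest second and the most-hit URLs in an access log,
# assuming Common Log Format; "access.log" and the field positions
# are illustrative.
from collections import Counter

hits_per_second = Counter()
hits_per_url = Counter()

with open("access.log") as log:
    for line in log:
        parts = line.split()
        if len(parts) < 7:
            continue  # skip malformed lines
        timestamp = parts[3].lstrip("[")   # e.g. 22/May/2013:10:15:03
        hits_per_second[timestamp] += 1
        hits_per_url[parts[6]] += 1        # the request path

print("Busiest second:", hits_per_second.most_common(1))
print("Most-hit functionality:", hits_per_url.most_common(5))
```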
7. DECIDE ON SCOPE
Component test → chain test → full environment test.
Widening the scope increases:
• Test coverage
• Level of certainty
• Number of systems involved
• Amount of work
8. SET UP TEST DATA
• Usually starts as a copy of production data
• Or an educated guess at what people will enter
• Render it anonymous
• Make tests deterministic (see the sketch below)
• Synchronise it between all systems
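A minimal sketch of deterministic anonymisation, assuming a keyed hash is acceptable for your data; the secret, field names, and prefix are illustrative. Because the pseudonym is a pure function of the input and a shared key, every system derives the same values, which keeps tests deterministic and the data synchronised:

```python
# Deterministic pseudonymisation: the same input always maps to the
# same pseudonym on every system that shares SECRET. All names here
# are illustrative.
import hashlib
import hmac

SECRET = b"shared-test-data-key"  # distribute to all systems under test

def pseudonymise(value: str) -> str:
    digest = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return "anon_" + digest[:12]

customer = {"email": "jane@example.com", "name": "Jane Doe"}
anonymised = {field: pseudonymise(value) for field, value in customer.items()}
print(anonymised)  # identical output wherever SECRET is the same
```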
9. DECIDE ON STRATEGY
One or more of:
• Scalability test
• Stress test
• Endurance test
• Regression test
• Resilience test
One load script can serve most of these; see the sketch below.
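These strategies differ mainly in load profile and duration, not in the script itself. A minimal sketch using Locust (one of several load-generation tools; the endpoints and task weights are illustrative), which can be run at ramping load for a scalability test, beyond capacity for a stress test, or for hours for an endurance test:

```python
# A minimal Locust user: ramp it up for a scalability test, push it
# past capacity for a stress test, or leave it running for an
# endurance test. Endpoints and weights are illustrative.
from locust import HttpUser, task, between

class ShopVisitor(HttpUser):
    wait_time = between(1, 5)  # think time between actions, in seconds

    @task(3)  # browsing is weighted 3x heavier than checkout
    def browse(self):
        self.client.get("/products")

    @task(1)
    def checkout(self):
        self.client.post("/cart/checkout", json={"items": [42]})
```

Run it with, for example, `locust -f loadtest.py --host https://test.example.com` (the host URL is a placeholder) and choose the user count and ramp-up per strategy.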
14. PERMANENT LOAD TESTING
Daytime: constant load; teams inspect the impact of their changes.
Nighttime: endurance test.
Weekends: refresh test data.
One way to drive this rhythm is sketched below.
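A minimal sketch of a single scheduler process driving that rhythm; the office-hours window and the three job bodies are placeholders for however you start your load generator and data refresh:

```python
# Drive the daytime/nighttime/weekend rhythm. The job bodies and the
# office-hours window are illustrative placeholders.
import datetime
import time

def run_constant_load():     # daytime: steady load while teams change things
    pass

def run_endurance_test():    # nighttime: long-running soak test
    pass

def refresh_test_data():     # weekends: reload anonymised production data
    pass

while True:
    now = datetime.datetime.now()
    if now.weekday() >= 5:               # Saturday or Sunday
        refresh_test_data()
    elif 8 <= now.hour < 18:             # office hours
        run_constant_load()
    else:
        run_endurance_test()
    time.sleep(60)  # re-evaluate the schedule every minute
```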
15. RESPONSE TIME
What makes up a response time (the network phases are measured in the sketch below):
• DNS lookup (www.xebia.com)
• SSL handshake
• Time to first byte + loading HTML
• Parse times
• Time to render
• Time to document complete
• Browser CPU use
• Bandwidth
• Number of connections to a single host
• Blocking client code
Example breakdown: http://www.webpagetest.org/result/130522_FG_10SC/1/details/
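A rough sketch of measuring the network-side phases from Python; the host comes from the slide. Render time, parse times, browser CPU use and blocking client code are only visible in a real browser, which is what WebPageTest measures:

```python
# Time the network phases of one HTTPS request: DNS, TCP connect,
# TLS handshake, time to first byte, and body download. Browser-side
# phases (render, parse) are not visible at this level.
import socket
import ssl
import time

def timing_breakdown(host: str, path: str = "/") -> dict:
    t0 = time.perf_counter()
    address = socket.getaddrinfo(host, 443)[0][4][0]             # DNS lookup
    t1 = time.perf_counter()
    sock = socket.create_connection((address, 443), timeout=10)  # TCP connect
    t2 = time.perf_counter()
    context = ssl.create_default_context()
    tls = context.wrap_socket(sock, server_hostname=host)        # SSL handshake
    t3 = time.perf_counter()
    request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    tls.sendall(request.encode())
    tls.recv(1)                                                  # time to first byte
    t4 = time.perf_counter()
    while tls.recv(65536):                                       # drain the body
        pass
    t5 = time.perf_counter()
    tls.close()
    return {"dns": t1 - t0, "connect": t2 - t1, "tls": t3 - t2,
            "ttfb": t4 - t3, "download": t5 - t4}

print(timing_breakdown("www.xebia.com"))
```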
17. CLEAR REQUIREMENTS
Response time
Intention: users get a response quickly, so that they are happy and spend more money.
Stakeholder: marketing dept.
Scale: 95th percentile of “document complete” response times, in seconds, measured over one minute.
Metric: page load times as reported by our RUM tool.
Levels: Fail: 10, Now: 3.5, Goal: 1 (see the worked example below).
Inspired by Tom Gilb, Competitive Engineering
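A worked sketch of evaluating such a requirement against one minute of measurements; the sample values are made up for illustration:

```python
# Check one minute of "document complete" samples (seconds) against
# the Fail/Goal levels above. The sample values are made up.
FAIL, GOAL = 10.0, 1.0

samples = sorted([2.1, 3.4, 1.8, 5.0, 2.9, 3.7, 2.2, 4.1])
p95 = samples[int(round(0.95 * (len(samples) - 1)))]  # simple percentile

if p95 >= FAIL:
    verdict = "FAIL level reached"
elif p95 <= GOAL:
    verdict = "GOAL met"
else:
    verdict = "between Goal and Fail"
print(f"95th percentile: {p95}s -> {verdict}")
```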
18. ADJUST REQUIREMENTS DUE TO LACK OF REAL BROWSERS
• WebPageTest: first view + repeat view (median of 3)
• 95th percentile response times from access logs
19. A separate test environment:
Pros:
• A playground to test changes
• No impact on real users
• Less pressure
Cons:
• More work
• Guesswork and extrapolation
• Can take a significant amount of time
• More hardware
20. THINGS WILL BREAK...
... in spite of your best efforts
21. SO INSTEAD WE SHOULD FOCUS ON FAST RECOVERY
22. “MTTR is more important than MTBF*” — John Allspaw
(* for most types of F)
41. MONITORING
Technical metrics (see the sketch after this list):
• CPU use
• Memory use
• TPS
• Response times
• etc.
Process metrics:
• # bugs
• MTTR, MTTD
• Time from idea to live on site
• etc.
Business metrics:
• Revenue
• # unique visitors
• etc.
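A minimal sketch of exposing two of the technical metrics with the Prometheus Python client, assuming Prometheus is your monitoring stack; the metric names and the simulated work are illustrative:

```python
# Expose request count (TPS via rate()) and response times for
# scraping. Assumes the prometheus_client package; names are
# illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Handled requests")
LATENCY = Histogram("http_response_seconds", "Response times in seconds")

start_http_server(8000)  # metrics appear at http://localhost:8000/metrics
while True:
    with LATENCY.time():                       # records the elapsed time
        time.sleep(random.uniform(0.05, 0.3))  # stand-in for real work
    REQUESTS.inc()
```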
46. GO/NO-GO MEETINGS
• What are the biggest fears?
• How can we measure this?
• What can be done if it does happen?
47. RETROSPECTIVES
How can we prevent a failure from happening again?
How can we detect it earlier?
Was there only one root cause?
49. CULTURE
• Dev and Ops work together on providing information.
• Assumptions are dangerous; try to eliminate as many as possible.
• Small changes are easier to fix than large ones.
• Deploy during office hours, so everyone is available if problems occur.
• All information, including business metrics, should be accessible to everyone.
51. SIMPLE, FLEXIBLE ARCHITECTURE
• If the site goes down often, its architecture is probably at fault
• Avoid fragile systems
• Resilience is key (see the circuit-breaker sketch below)
• Scalable (redundancy is not waste)
• Prefer many small systems over a few large ones
• State is a “hot brick”
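As one concrete resilience pattern, a minimal circuit-breaker sketch: after repeated failures the breaker fails fast for a cool-down period, so one broken dependency cannot tie up the whole site. The thresholds and the protected call are illustrative:

```python
# Minimal circuit breaker: trips after max_failures consecutive
# errors, then fails fast until reset_after seconds have passed.
# Thresholds are illustrative.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # cool-down over: allow a trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```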
52. CHANGES FOR THE BUSINESS
• Accept pushing smaller changes.
• Continuous delivery vs. continuous deployment.
• Share data.
53. CONCLUSION
Work on your ability to respond to failure. Trying to prevent failure can slow you down and make you focus on the wrong things.
Keep assumptions clearly separated from facts. Make your decisions based on evidence.
Measure everything, including the business impact of changes.
Find the compromise that works for you: try permanent load testing first and learn from that.