3. OBJECTIVE
Put working software into production as quickly as possible, while minimising the risk of load-related problems:
• Poor response times
• Lack of capacity
• Insufficient availability
• Excessive system resource use
All of this within the context of websites.
6. DECIDE WHAT TO TEST
• Focus on the busiest instant
• Model the most-hit functionality
• Extrapolate to the expected load
• Look at production traffic
• Or, failing that, make an educated guess (see the log-mining sketch below)
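A minimal sketch of that log mining, assuming an Apache-style Common Log Format access log; the file name and field positions are illustrative:

```python
# Find the busiest second and the most-hit URLs in an access log,
# assuming Common Log Format; "access.log" and the field positions
# are illustrative.
from collections import Counter

hits_per_second = Counter()
hits_per_url = Counter()

with open("access.log") as log:
    for line in log:
        parts = line.split()
        if len(parts) < 7:
            continue  # skip malformed lines
        timestamp = parts[3].lstrip("[")   # e.g. 22/May/2013:10:15:03
        hits_per_second[timestamp] += 1
        hits_per_url[parts[6]] += 1        # the request path

print("Busiest second:", hits_per_second.most_common(1))
print("Most-hit functionality:", hits_per_url.most_common(5))
```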
7. DECIDE ON SCOPE
Component test → chain test → full environment test.
Widening the scope increases:
• Test coverage
• Level of certainty
• Number of systems involved
• Amount of work
8. SET UP TEST DATA
• Usually starts as a copy of production data
• Or an educated guess at what people will enter
• Render it anonymous
• Make tests deterministic (see the sketch below)
• Synchronise it between all systems
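A minimal sketch of deterministic anonymisation, assuming a keyed hash is acceptable for your data; the secret, field names, and prefix are illustrative. Because the pseudonym is a pure function of the input and a shared key, every system derives the same values, which keeps tests deterministic and the data synchronised:

```python
# Deterministic pseudonymisation: the same input always maps to the
# same pseudonym on every system that shares SECRET. All names here
# are illustrative.
import hashlib
import hmac

SECRET = b"shared-test-data-key"  # distribute to all systems under test

def pseudonymise(value: str) -> str:
    digest = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return "anon_" + digest[:12]

customer = {"email": "jane@example.com", "name": "Jane Doe"}
anonymised = {field: pseudonymise(value) for field, value in customer.items()}
print(anonymised)  # identical output wherever SECRET is the same
```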
9. DECIDE ON STRATEGY
One or more of:
• Scalability test
• Stress test
• Endurance test
• Regression test
• Resilience test
One load script can serve most of these; see the sketch below.
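These strategies differ mainly in load profile and duration, not in the script itself. A minimal sketch using Locust (one of several load-generation tools; the endpoints and task weights are illustrative), which can be run at ramping load for a scalability test, beyond capacity for a stress test, or for hours for an endurance test:

```python
# A minimal Locust user: ramp it up for a scalability test, push it
# past capacity for a stress test, or leave it running for an
# endurance test. Endpoints and weights are illustrative.
from locust import HttpUser, task, between

class ShopVisitor(HttpUser):
    wait_time = between(1, 5)  # think time between actions, in seconds

    @task(3)  # browsing is weighted 3x heavier than checkout
    def browse(self):
        self.client.get("/products")

    @task(1)
    def checkout(self):
        self.client.post("/cart/checkout", json={"items": [42]})
```

Run it with, for example, `locust -f loadtest.py --host https://test.example.com` (the host URL is a placeholder) and choose the user count and ramp-up per strategy.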
14. PERMANENT LOAD TESTING
Daytime: constant load; teams inspect the impact of their changes.
Nighttime: endurance test.
Weekends: refresh test data.
One way to drive this rhythm is sketched below.
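A minimal sketch of a single scheduler process driving that rhythm; the office-hours window and the three job bodies are placeholders for however you start your load generator and data refresh:

```python
# Drive the daytime/nighttime/weekend rhythm. The job bodies and the
# office-hours window are illustrative placeholders.
import datetime
import time

def run_constant_load():     # daytime: steady load while teams change things
    pass

def run_endurance_test():    # nighttime: long-running soak test
    pass

def refresh_test_data():     # weekends: reload anonymised production data
    pass

while True:
    now = datetime.datetime.now()
    if now.weekday() >= 5:               # Saturday or Sunday
        refresh_test_data()
    elif 8 <= now.hour < 18:             # office hours
        run_constant_load()
    else:
        run_endurance_test()
    time.sleep(60)  # re-evaluate the schedule every minute
```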
15. RESPONSE TIME
What makes up a response time (the network phases are measured in the sketch below):
• DNS lookup (www.xebia.com)
• SSL handshake
• Time to first byte + loading HTML
• Parse times
• Time to render
• Time to document complete
• Browser CPU use
• Bandwidth
• Number of connections to a single host
• Blocking client code
Example breakdown: http://www.webpagetest.org/result/130522_FG_10SC/1/details/
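A rough sketch of measuring the network-side phases from Python; the host comes from the slide. Render time, parse times, browser CPU use and blocking client code are only visible in a real browser, which is what WebPageTest measures:

```python
# Time the network phases of one HTTPS request: DNS, TCP connect,
# TLS handshake, time to first byte, and body download. Browser-side
# phases (render, parse) are not visible at this level.
import socket
import ssl
import time

def timing_breakdown(host: str, path: str = "/") -> dict:
    t0 = time.perf_counter()
    address = socket.getaddrinfo(host, 443)[0][4][0]             # DNS lookup
    t1 = time.perf_counter()
    sock = socket.create_connection((address, 443), timeout=10)  # TCP connect
    t2 = time.perf_counter()
    context = ssl.create_default_context()
    tls = context.wrap_socket(sock, server_hostname=host)        # SSL handshake
    t3 = time.perf_counter()
    request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    tls.sendall(request.encode())
    tls.recv(1)                                                  # time to first byte
    t4 = time.perf_counter()
    while tls.recv(65536):                                       # drain the body
        pass
    t5 = time.perf_counter()
    tls.close()
    return {"dns": t1 - t0, "connect": t2 - t1, "tls": t3 - t2,
            "ttfb": t4 - t3, "download": t5 - t4}

print(timing_breakdown("www.xebia.com"))
```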
17. CLEAR REQUIREMENTS
Response time
Intention: users get a response quickly, so that they are happy and spend more money.
Stakeholder: marketing dept.
Scale: 95th percentile of “document complete” response times, in seconds, measured over one minute.
Metric: page load times as reported by our RUM tool.
Levels: Fail: 10, Now: 3.5, Goal: 1 (see the worked example below).
Inspired by Tom Gilb, Competitive Engineering
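A worked sketch of evaluating such a requirement against one minute of measurements; the sample values are made up for illustration:

```python
# Check one minute of "document complete" samples (seconds) against
# the Fail/Goal levels above. The sample values are made up.
FAIL, GOAL = 10.0, 1.0

samples = sorted([2.1, 3.4, 1.8, 5.0, 2.9, 3.7, 2.2, 4.1])
p95 = samples[int(round(0.95 * (len(samples) - 1)))]  # simple percentile

if p95 >= FAIL:
    verdict = "FAIL level reached"
elif p95 <= GOAL:
    verdict = "GOAL met"
else:
    verdict = "between Goal and Fail"
print(f"95th percentile: {p95}s -> {verdict}")
```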
18. ADJUST REQUIREMENTS DUE TO LACK OF REAL BROWSERS
• WebPageTest: first view + repeat view (median of 3)
• 95th percentile response times from access logs
19. A separate test environment:
Pros:
• A playground to test changes
• No impact on real users
• Less pressure
Cons:
• More work
• Guesswork and extrapolation
• Can take a significant amount of time
• More hardware
20. THINGS WILL BREAK...
... in spite of your best efforts
21. SO INSTEAD WE SHOULD FOCUS ON FAST RECOVERY
22. “MTTR is more important than MTBF*” — John Allspaw
(* for most types of F)
41. MONITORING
Technical metrics (see the sketch after this list):
• CPU use
• Memory use
• TPS
• Response times
• etc.
Process metrics:
• # bugs
• MTTR, MTTD
• Time from idea to live on site
• etc.
Business metrics:
• Revenue
• # unique visitors
• etc.
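A minimal sketch of exposing two of the technical metrics with the Prometheus Python client, assuming Prometheus is your monitoring stack; the metric names and the simulated work are illustrative:

```python
# Expose request count (TPS via rate()) and response times for
# scraping. Assumes the prometheus_client package; names are
# illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Handled requests")
LATENCY = Histogram("http_response_seconds", "Response times in seconds")

start_http_server(8000)  # metrics appear at http://localhost:8000/metrics
while True:
    with LATENCY.time():                       # records the elapsed time
        time.sleep(random.uniform(0.05, 0.3))  # stand-in for real work
    REQUESTS.inc()
```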
46. GO/NO-GO MEETINGS
• What are the biggest fears?
• How can we measure this?
• What can be done if it does happen?
47. RETROSPECTIVES
How can we prevent a failure from happening again?
How can we detect it earlier?
Was there only one root cause?
49. CULTURE
• Dev and Ops work together on providing information.
• Assumptions are dangerous; try to eliminate as many as possible.
• Small changes are easier to fix than large ones.
• Deploy during office hours, so everyone is available if problems occur.
• All information, including business metrics, should be accessible to everyone.
51. SIMPLE, FLEXIBLE ARCHITECTURE
• If the site goes down often, its architecture is probably at fault
• Avoid fragile systems
• Resilience is key (see the circuit-breaker sketch below)
• Scalable (redundancy is not waste)
• Prefer many small systems over a few large ones
• State is a “hot brick”
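As one concrete resilience pattern, a minimal circuit-breaker sketch: after repeated failures the breaker fails fast for a cool-down period, so one broken dependency cannot tie up the whole site. The thresholds and the protected call are illustrative:

```python
# Minimal circuit breaker: trips after max_failures consecutive
# errors, then fails fast until reset_after seconds have passed.
# Thresholds are illustrative.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # cool-down over: allow a trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```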
52. CHANGES FOR THE BUSINESS
• Accept pushing smaller changes.
• Continuous delivery vs. continuous deployment.
• Share data.
53. CONCLUSION
Work on your ability to respond to failure. Trying to prevent failure can slow you down and make you focus on the wrong things.
Keep assumptions clearly separated from facts. Make your decisions based on evidence.
Measure everything, including the business impact of changes.
Find the compromise that works for you: try permanent load testing first and learn from that.