In 2013, consumers spent $1.96 billion online with U.S. retailers on Black Friday, an 18.5% increase over 2012. Discover how Target, one of the nation's largest retailers, planned for its biggest API event of the year.
Aaron Strey from Target, an Apigee customer, and Greg Brail from Apigee will take a deep dive into the technical challenges of meeting massive business demand, supporting huge traffic surges, and optimizing API load and performance testing.
Join to learn:
• How to forecast load for peak events
• Effective ways to create tests to measure predicted load
• Continuous load testing to be better prepared next time
• Best load testing and performance tools
Download video: http://youtu.be/X03M6CL-FSA
Download podcast: http://bit.ly/1rn1Jge
Download eBook: Are you where your customers are? http://bit.ly/1puABrr
Download eBook: Mobile mandate for retail http://bit.ly/1rwzbCD
10. Example metrics
[Dashboard screenshots, last 15 minutes: Total Hits and Average TPS; Perfy average TPS per minute by API (carts v1.0/v2.0, guests v3.0, lists v1.0, locations v2.0, products v2.0/v3.0, registries v1.0); distribution by region; Perfy errors per second per minute and count of errors by API (carts v2.0, products v3.0); average and standard deviation of client total response time, gateway processing time, and backend processing time; Perfy 95th-percentile response times (p95(ClientTotalTime), p95(TargetTotalTime), p95(difference)), overall and where Cache NOT True.]
14. Keep ownership of performance testing as close to the development team as possible
15. Types of tests
Stress: determines the maximum load you can handle while still meeting your SLA
Load: 80% of stress
Soak: 80% of load, sustained for an extended period
Spike: alternating back and forth between 80% and 120% of load
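As a hedged illustration of how these levels relate, here is a minimal Python sketch; the 2,000 TPS stress baseline is a hypothetical figure, not one from the talk.

def test_profiles(stress_tps: float) -> dict:
    # Derive each test profile from the measured stress baseline:
    # stress is the max load you can handle while still meeting your SLA.
    load = 0.8 * stress_tps
    return {
        "stress": stress_tps,
        "load": load,               # 80% of stress
        "soak": 0.8 * load,         # 80% of load, run for an extended period
        "spike_low": 0.8 * load,    # spike alternates between these two...
        "spike_high": 1.2 * load,   # ...80% and 120% of load, back and forth
    }

if __name__ == "__main__":
    for name, tps in test_profiles(2000).items():
        print(f"{name}: {tps:.0f} TPS")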
For customer presentations, consider putting the customer's name on the first slide.
The biggest drivers of traffic are our mobile apps (Cartwheel, iPad, iPhone, Android, etc.).
Mention response times
Grinder, JMeter, curl-loader, vendor software?
I will provide a verbal compare-and-contrast of the different toolsets we evaluated.
Use the cloud
Structure your test plans in a way that makes sense:
You should think about the maintainability of your test plans the same way you think about your code (low coupling, high cohesion); a sketch follows below.
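For example, a modular test-plan layout might factor shared concerns out of individual scenarios. This is a minimal sketch assuming a Python-based toolset; all module, endpoint, and parameter names are illustrative, not Target's actual layout.

# Illustrative layout:
#   testplans/
#     common/auth.py      -- token handling shared by every scenario
#     common/hosts.py     -- environment endpoints defined in one place
#     scenarios/carts.py  -- one cohesive scenario per API
#     scenarios/lists.py
#
# common/auth.py: the only module that knows how tokens are minted,
# so an auth change touches one file instead of every scenario.
import requests

def get_token(base_url: str, client_id: str, secret: str) -> str:
    resp = requests.post(
        f"{base_url}/oauth/token",
        data={"grant_type": "client_credentials",
              "client_id": client_id,
              "client_secret": secret},
    )
    resp.raise_for_status()
    return resp.json()["access_token"]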
In one instance we found a problem with APIs that used OAuth (story).
Use source control
Seems obvious, but surprisingly it is not.
Early on in our efforts we did not have a ton of discipline around version control, and we had test scripts walk out the door on developer laptops.
We started by going off and asking clients what they thought usage would be. It turns out this wasn't the best approach.
We log every single API request that is made to our platform and we persist it for future analysis. This puts us in a great position to estimate future load. The formula we used: take the max(avg TPS) over a 60-second window two Sundays before last Thanksgiving and the max(avg TPS) over a 60-second window on last Thanksgiving itself; the ratio of the two gives you a multiplier.
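A minimal sketch of that calculation, assuming each input is a list of epoch-second timestamps, one per logged API request (the function names are mine, not from the talk):

from collections import Counter

def max_avg_tps(timestamps, window=60):
    # Max average TPS over any rolling `window`-second span.
    per_second = Counter(int(t) for t in timestamps)
    best = 0.0
    for start in range(min(per_second), max(per_second) + 1):
        hits = sum(per_second.get(s, 0) for s in range(start, start + window))
        best = max(best, hits / window)
    return best

def thanksgiving_multiplier(baseline_sunday_ts, thanksgiving_ts):
    # Peak on last Thanksgiving divided by peak two Sundays before it.
    return max_avg_tps(thanksgiving_ts) / max_avg_tps(baseline_sunday_ts)

Multiplying this year's baseline Sunday peak by that multiplier gives the forecast for this year's event.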
In many cases replaying production traffic was our best option (sketched below).
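This is a minimal replay sketch; the access-log format ("METHOD URI ..."), host, and file name are assumptions, not Target's actual setup. Reads replay directly; writes need the extra care discussed later.

import requests

TEST_HOST = "https://api.test.example.com"  # hypothetical test gateway

def replay(log_path):
    with open(log_path) as log:
        for line in log:
            parts = line.split()             # assumes lines like "GET /carts/v2.0/123 ..."
            if len(parts) >= 2 and parts[0] == "GET":
                # Replay reads only; replaying writes would mutate data.
                requests.get(TEST_HOST + parts[1], timeout=5)

if __name__ == "__main__":
    replay("access.log")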
Our estimate turned out to be very close.
When you have a new consumer of your API, you need to work closely with them as they onboard.
In the era of big data, our ability to monitor and troubleshoot is unparalleled.
-We log everything and we persist it. We continually poll our infrastructure with tools like vmstat, top, netstat, and iostat, and we forward that information to a system that aggregates logs.
-We also log and monitor application-level details, like the request URI for every single request.
-Putting it all in the same location allows us to query data ad hoc and tie together events at different layers of our API stack (we use Splunk).
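As a hedged example of such an ad-hoc search (the index, sourcetype, and uri_path names are hypothetical; ClientTotalTime and TargetTotalTime are the timing fields visible in the dashboards above), a Splunk query charting backend latency and client-side p95 per minute might look like:

index=api sourcetype=gateway_access uri_path="/carts/v2.0/*"
| timechart span=1m avg(TargetTotalTime) AS avg_backend p95(ClientTotalTime) AS p95_client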
Think Chaos Monkey.
At our last internal hack day, a handful of engineers on my team built a PoC that does the following:
Queries the Chef server to get a list of nodes running a specific API cookbook.
Connects to a specified number of those nodes and kills the process running the API.
Runs some load.
Brings the nodes back up.
Ensures there was no performance impact while the nodes were down.
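A rough sketch of that PoC flow, shelling out to the knife CLI and ssh; the cookbook name, service name, and load script are hypothetical stand-ins, not the actual hack-day code.

import random
import subprocess

def nodes_running(cookbook):
    # Query the Chef server for nodes whose run list includes the cookbook.
    out = subprocess.check_output(
        ["knife", "search", "node", f"recipes:{cookbook}", "-i"], text=True)
    return [line.strip() for line in out.splitlines() if line.strip()]

def run(host, command):
    subprocess.check_call(["ssh", host, command])

if __name__ == "__main__":
    victims = random.sample(nodes_running("carts_api"), k=2)
    for host in victims:
        run(host, "sudo service carts-api stop")   # kill the process running the API
    subprocess.check_call(["./run_load_test.sh"])  # run some load while degraded
    for host in victims:
        run(host, "sudo service carts-api start")  # bring the node back up
    # Afterwards, compare response times to baseline to confirm no impact.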
Much of the tooling we have built around performance testing internally was built by the engineers and developers who work on the APIs. Many of our performance test scripts are written by the same developers, product owners, and analysts building the API. That is not to say we take a dogmatic approach where developers must write all the tests all the time; we have a handful of extremely talented test engineers who help out with writing performance and functional tests when it makes sense.
Load testing writes presents some additional challenges, especially in ecommerce.
We are not a unicorn company yet. We are on our way, but we can't spin up an exact production replica at the drop of a hat.
Options:
Stub out responses as far back in the stack as possible (see the sketch after this list)
Build out a lower environment that is as close to production as possible
Assume the risk of not testing
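For the stubbing option, a minimal sketch: a canned-response stub standing in for a downstream write dependency, so load tests can exercise the API tier without mutating real ecommerce data. The port and payload are illustrative.

from http.server import BaseHTTPRequestHandler, HTTPServer

class StubHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Drain the request body, then return a canned response; no real write happens.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        body = b'{"status": "accepted"}'
        self.send_response(201)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep load-test console output quiet

if __name__ == "__main__":
    HTTPServer(("", 8080), StubHandler).serve_forever()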
All of the things we have seen happen with functional testing can happen here, too.