#Measurefest : 20 Simple Ways to Fuck Up your AB tests
1. 20 simple ways to fuck up your AB testing
28th March 2014 @OptimiseOrDie
2. @OptimiseOrDie
• UX and Analytics (1999)
• User Centred Design (2001)
• Agile, Startups, No budget (2003)
• Funnel optimisation (2004)
• Multivariate & A/B (2005)
• Conversion Optimisation (2005)
• Persuasive Copywriting (2006)
• Joined Twitter (2007)
• Lean UX (2008)
• Holistic Optimisation (2009)
Was : Group eBusiness Manager, Belron
Now : Spareroom.co.uk
3. #1 : You’re doing it in the wrong place
4. #1 : You’re doing it in the wrong place
There are 4 areas a CRO expert always looks at:
1. Inbound attrition (medium, source, landing page, keyword,
intent and many more…)
2. Key conversion points (product, basket, registration)
3. Processes and steps (forms, logins, registration, checkout)
4. Layers of engagement (search, category, product, add)
1. Use visitor flow reports for attrition – very useful.
2. For key conversion points, look at loss rates & interactions
3. Processes and steps – look at funnels or make your own
4. Layers of engagement – make a ring model
9. #1 : You’re doing it in the wrong place
• Get to know the flow and loss (leaks) inbound, inside and through key
processes or conversion points.
• Once you know the key steps you’re losing people at and how much
traffic you have – make a money model.
• Let’s say 1,000 people see the page a month. Of those, 20% (200)
convert to checkout.
• Estimate the influence your test can bring. How much money or KPI
improvement would a 10% lift in the checkouts deliver?
• Congratulations – you’ve now built the world’s first IT plan with a
return on investment estimate attached!
• I’ll talk more about prioritising later – but a good real world analogy
for you to use:
10. Think like a store owner!
If you can’t refurbish the entire store, which floors or departments will you invest in optimising?
Wherever there is:
• Footfall
• Low return
• Opportunity
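A money model for one page can be sketched in a few lines. This is only an illustration of the arithmetic from the earlier slide (1,000 views a month, 20% reach checkout, a 10% lift); the function name and the value of 50 per checkout are assumptions made up for the example, not figures from the talk.

```python
def monthly_uplift(page_views, conversion_rate, relative_lift, value_per_conversion):
    """Estimate extra conversions and extra money per month from a relative lift."""
    baseline_conversions = page_views * conversion_rate
    extra_conversions = baseline_conversions * relative_lift
    return extra_conversions, extra_conversions * value_per_conversion


# Slide figures: 1,000 views a month, 20% convert to checkout, 10% lift.
# value_per_conversion=50.0 is a hypothetical number, swap in your own.
extra_checkouts, extra_money = monthly_uplift(1000, 0.20, 0.10, 50.0)
print(f"{extra_checkouts:.0f} extra checkouts, worth {extra_money:.0f} a month")
```

Run the same sum for every candidate page and you have a rough footfall-versus-return picture of which "floor" deserves the refurbishment first.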
11. Insight - Inputs #FAIL
• Competitor copying
• Guessing
• Dice rolling
• An article the CEO read
• Competitor change
• Panic
• Ego
• Opinion
• Cherished notions
• Marketing whims
• Cosmic rays
• Not ‘on brand’ enough
• IT inflexibility
• Internal company needs
• Some dumbass consultant
• Shiny feature blindness
• Knee jerk reactions
#2 : Your Hypothesis is a piece of crap
12. Insight - Inputs
Insight:
• Segmentation
• Surveys
• Sales and Call Centre
• Session Replay
• Social analytics
• Customer contact
• Eye tracking
• Usability testing
• Forms analytics
• Search analytics
• Voice of Customer
• Market research
• A/B and MVT testing
• Big & unstructured data
• Web analytics
• Competitor evals
• Customer services
#2 : These are the inputs you need…
13. #2 : Solutions
• You need multiple tool inputs
– Tool decks are here : www.slideshare.net/sullivac
• Usability testing and User facing teams
– If you’re not doing these properly, you’re hosed
• Session replay tools provide vital input
– Get vital additional customer evidence
• Simple page Analytics don’t cut it
– Invest in your analytics, especially event tracking
• Ego, Opinion, Cherished notions – fill gaps
– Fill these vacuums with insights and data
• Champion the user
14. We believe that doing [A] for People [B] will make outcome [C] happen.
We’ll know this when we observe data [D] and obtain feedback [E]. (reverse)
15. #3 : No analytics integration
• Investigating problems with tests
• Segmentation of results
• Tests that fail, flip or move around
• Tests that don’t make sense
• Broken test setups
• What drives the averages you see?
17. These Danish porn sites are so hardcore! We’re still waiting for our AB tests to finish!
• Use a test length calculator like this one:
• visualwebsiteoptimizer.com/ab-split-test-duration/
#4 : The test will finish after you die
18. • The minimum length
– 2 business cycles (cross check)
– Usually a week, 2 weeks, Month
– Always test ‘whole’ not partial cycles
– Be aware of multiple cycles
– Don’t self stop!
– PURCHASE CYCLES – KNOW THEM
• How long after that
– I aim for a minimum 250 outcomes, ideally 350+ for each ‘creative’
– If you test 4 recipes, that’s 1400 outcomes needed
– You should have worked out how long each batch of 350 needs before you start!
– 95% confidence or higher is my aim BUT BIG SECRET -> (p values are unreliable)
– If you segment, you’ll need more data
– It may need a bigger sample if the response rates are similar*
– Use a test length calculator but be aware of BARE MINIMUM TO EXPECT
– Important insider tip – watch the error bars! The +/- stuff – let’s explain
* Stats geeks know I’m glossing over something here: test time depends on how far the two experiments separate in terms of relative performance, as well as how volatile the test response is. I’ll talk about this when I record this one! This is why testing similar stuff sux.
#5 : You don’t test for long enough
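Working out the bare minimum run time before you start can be sketched like this. It is a rough planning aid, not a replacement for a proper test length calculator: the function name, the 7-day business cycle and the 60 outcomes a day are assumptions for the example; the 350 outcomes per recipe and the 2-cycle minimum come from the slide.

```python
import math


def days_needed(recipes, outcomes_per_recipe, outcomes_per_day,
                cycle_days=7, min_cycles=2):
    """Return (total outcomes needed, days to run).

    Days are rounded UP to whole business cycles and never drop below
    min_cycles cycles, so the test always covers 'whole' cycles.
    outcomes_per_day is the total across all recipes in the test.
    """
    total_outcomes = recipes * outcomes_per_recipe
    raw_days = math.ceil(total_outcomes / outcomes_per_day)
    cycles = max(min_cycles, math.ceil(raw_days / cycle_days))
    return total_outcomes, cycles * cycle_days


# 4 recipes at 350 outcomes each; 60 outcomes a day is a made-up traffic figure.
total, days = days_needed(recipes=4, outcomes_per_recipe=350, outcomes_per_day=60)
print(total, days)  # 1400 outcomes, 28 days (24 raw days rounded up to 4 weeks)
```

Treat the answer as the minimum to expect. If your purchase cycle is longer than the result, the purchase cycle wins.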
19. 95%, 99%, 99.99% Confidence – what’s that?
• It’s a stats thing
• Seriously, look at this one LAST in your testing
• Purchase Cycle, Business Cycles, Sample Size, Error bar
separation – ALL come before this one. Got it?
• Why? It’s to do with p-values. Read these articles:
• http://bit.ly/1gq9dtd
• If you rely on confidence, you are relying upon something
that’s unreliable and moves around, particularly early in
testing.
• Don’t be fooled by your testing package – watch the error
bars instead of confidence.
#5 : You put faith in the Confidence value
21. Graph is a range, not a line:
9.1 ± 0.3%   9.1 ± 0.9%   9.1 ± 1.9%
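To see why the same 9.1% can carry very different error bars, here is a minimal sketch using the textbook normal approximation for a proportion. That formula is an assumption on my part: your testing tool may calculate its intervals differently, and the visitor counts below are invented, so don't expect them to reproduce the exact figures above.

```python
import math


def conversion_interval(conversions, visitors, z=1.96):
    """Return (rate, margin) for a conversion rate.

    Uses the normal approximation; z=1.96 corresponds to roughly 95%.
    """
    rate = conversions / visitors
    margin = z * math.sqrt(rate * (1 - rate) / visitors)
    return rate, margin


# Hypothetical sample sizes, all converting at about 9.1%.
for visitors in (1_000, 10_000, 100_000):
    conversions = round(visitors * 0.091)
    rate, margin = conversion_interval(conversions, visitors)
    print(f"{visitors:>7} visitors: {rate:.1%} ± {margin:.1%}")
```

The rate stays put while the ± shrinks as the sample grows. Two creatives only really differ once their ranges stop overlapping.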
22. • The minimum length:
– 2 business cycles and > purchase cycle as a minimum, regardless of
outcomes. Test for less and you’re cutting.
– 250+, prefer 350+ outcomes in each
– Error bar separation between creatives
– 95%+ confidence (unreliable)
• Pay attention to:
– Time it will take for the number of ‘recipes’ in the test
– The actual footfall to the test – not sitewide numbers
– Test results that don’t separate – makes the test longer
– This is why you need brave tests – to drive difference
– The error bars – the numbers in your AB testing tool are not precise –
they’re fuzzy regions that depend on response and sample size.
– Sudden changes in test performance or response
– Monitor early tests like a chef!
#5 : Test Length Summary
23. • Ignore the graphs. Don’t draw conclusions. Don’t dance. Calm down.
• Get a feel for the test but don’t do anything yet!
• Remember – in A/B - 50% of returning visitors will see a new shiny website!
• Until your test has had at least 1 business cycle and 250-350 outcomes, don’t bother even
getting excited!
• Watching regularly is good though. You’re looking for anything that looks really odd – your
analytics person should be checking all the figures until you’re satisfied
• All tests move around or show big swings early in the testing cycle. Here is a very high traffic
site – it still takes 10 days to start settling. Lower traffic sites will stretch this period further.
#6 : You suffer premature test ejaculation
24. #7 : No QA testing for the AB test?
25. #7 - QA Test or Die!
• Over 40% of tests have had QA issues.
• It’s very easy to break or bias the testing
Browser testing:
– www.crossbrowsertesting.com
– www.browserstack.com
– www.spoon.net
– www.cloudtesting.com
– www.multibrowserviewer.com
– www.saucelabs.com
Mobile devices:
– www.perfectomobile.com
– www.deviceanywhere.com
– www.mobilexweb.com/emulators
– www.opendevicelab.com
26. #8 : Opportunities are not prioritised
Once you have a list of potential test areas, rank them by opportunity vs. effort.
The common ranking metrics that I use include these:
• Opportunity (revenue, impact)
• Dev resource
• Time to market
• Risk / Complexity
Make yourself a quadrant diagram
and plot them!
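If you want numbers behind the quadrant, one possible sketch is below. The 1 to 5 scales, the idea of adding the three effort metrics together, the opportunity/effort score and the example test areas are all assumptions for illustration; the talk only names the metrics, not a formula.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    name: str
    opportunity: int     # revenue / impact, 1 (low) to 5 (high)
    dev_resource: int    # 1 (little) to 5 (a lot)
    time_to_market: int  # 1 (fast) to 5 (slow)
    risk: int            # risk / complexity, 1 (low) to 5 (high)

    @property
    def effort(self):
        # Hypothetical: treat effort as the plain sum of the three cost metrics.
        return self.dev_resource + self.time_to_market + self.risk

    @property
    def score(self):
        return self.opportunity / self.effort


def rank(candidates):
    """Sort test areas so the best opportunity-per-effort comes first."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)


# Made-up test areas and ratings.
candidates = [
    Opportunity("Search results", 4, 4, 4, 3),
    Opportunity("Checkout form", 5, 3, 2, 2),
    Opportunity("Homepage banner", 2, 1, 1, 1),
]
for c in rank(candidates):
    print(f"{c.name}: opportunity {c.opportunity}, effort {c.effort}, score {c.score:.2f}")
```

The `opportunity` and `effort` values are also exactly the two axes you need for the quadrant plot.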
27. #9 : Your cycles are too slow
[Chart: Conversion plotted against Months (0, 6, 12, 18)]
28. #9 : Solutions
• Give Priority Boarding for opportunities
– The best seats reserved for metric shifters
• Release more often to close the gap
– More testing resource helps, analytics ‘hawk eye’
• Kaizen – continuous improvement
– Others call it JFDI (just f***ing do it)
• Make changes AS WELL as tests, basically!
– These small things add up
• RUSH Hair booking – Over 100 changes
– No functional changes at all – 37% improvement
• In between product lifecycles?
– The added lift for 10 days work, worth 360k
30. #10 : How do I know when it’s ready?
• The hallmarks of a cooked test are:
– It has run for at least 1, preferably 2+, business cycles and at least one, if
not two, purchase cycles
– You have at least 250-350 outcomes for each recipe
– Performance is not moving around hugely at creative or segment level
– The test results are clear – even if the precise values are not
– The intervals are not overlapping (much)
– If a test is still moving around, you need to investigate
– Always declare on a business cycle boundary – not the middle of
a period (this introduces bias)
– Don’t declare in the middle of a limited time period advertising
campaign (e.g. TV, print, online)
– Always test before and after large marketing campaigns (one
week on, one week off)
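Part of this checklist can be turned into a quick sanity check. The sketch below is one possible reading, not a rule from the talk: the function name, the data layout and the choice to compare only the top two recipes' error bars are assumptions, and it cannot judge marketing campaigns or whether results are still moving around. You still have to look.

```python
def not_ready_reasons(days_run, cycle_days, purchase_cycle_days, recipes,
                      min_cycles=2, min_outcomes=250):
    """Return the reasons a test is not cooked yet (empty list = looks cooked).

    recipes maps a recipe name to (outcomes, conversion_rate, margin),
    where margin is the +/- error bar on the rate.
    """
    reasons = []
    if days_run < min_cycles * cycle_days:
        reasons.append("fewer than the minimum number of business cycles")
    if days_run % cycle_days != 0:
        reasons.append("not on a business cycle boundary")
    if days_run < purchase_cycle_days:
        reasons.append("shorter than one purchase cycle")
    for name, (outcomes, _rate, _margin) in recipes.items():
        if outcomes < min_outcomes:
            reasons.append(f"{name} has only {outcomes} outcomes")
    # Assumption: only check that the leader has separated from the runner-up.
    ranked = sorted(recipes.values(), key=lambda r: r[1], reverse=True)
    if len(ranked) >= 2:
        (_, best_rate, best_margin), (_, next_rate, next_margin) = ranked[0], ranked[1]
        if best_rate - best_margin <= next_rate + next_margin:
            reasons.append("error bars of the top two recipes overlap")
    return reasons


# Made-up results: two full weeks, a 10-day purchase cycle, two recipes.
recipes = {"A": (360, 0.090, 0.008), "B": (410, 0.110, 0.009)}
print(not_ready_reasons(14, 7, 10, recipes))
```

An empty list means none of the mechanical checks failed. Anything in the list is a reason to keep the test running.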
32. #11: Your test fails
• Learn from the failure! If you can’t learn from the failure, you’ve
designed a crap test.
• Next time you design, imagine all your stuff failing. What would
you do? If you don’t know or you’re not sure, get it changed so
that a negative becomes insightful.
• So : failure itself at a creative or variable level should tell you
something.
• On a failed test, always analyse the segmentation and analytics
• One or more segments will be over and under
• Check for varied performance
• Now add the failure info to your Knowledge Base:
• Look at it carefully – what does the failure tell you? Which
element do you think drove the failure?
• If you know what failed (e.g. making the price bigger) then you
have very useful information
• You turned the handle the wrong way
• Now brainstorm a new test
33. #12 : The test is ‘about the same’
• Analyse the segmentation
• Check the analytics and instrumentation
• One or more segments may be over and under
• They may be cancelling out – the average is a lie
• The segment level performance will help you (beware of
small sample sizes)
• If you genuinely have a test which failed to move any
segments, it’s a crap test – be bolder
• This usually happens when it isn’t bold or brave enough in
shifting away from the original design, particularly on
lower traffic sites
• Get testing again!
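Here is a small sketch of how segments can cancel out so the average lies. The numbers are invented for the example (two segments, control A and variant B) and the function names are hypothetical; pull the real figures from your own analytics.

```python
def rate(conversions, visitors):
    return conversions / visitors


def segment_lifts(results):
    """Relative lift of B over A for each segment."""
    lifts = {}
    for segment, arms in results.items():
        a, b = rate(*arms["A"]), rate(*arms["B"])
        lifts[segment] = (b - a) / a
    return lifts


def overall_lift(results):
    """Relative lift of B over A with all segments pooled together."""
    totals = {}
    for arm in ("A", "B"):
        conversions = sum(arms[arm][0] for arms in results.values())
        visitors = sum(arms[arm][1] for arms in results.values())
        totals[arm] = rate(conversions, visitors)
    return (totals["B"] - totals["A"]) / totals["A"]


# Hypothetical (conversions, visitors) per segment and recipe.
results = {
    "desktop": {"A": (120, 1000), "B": (150, 1000)},
    "mobile": {"A": (100, 1000), "B": (70, 1000)},
}
for segment, lift in segment_lifts(results).items():
    print(f"{segment}: {lift:+.0%}")
print(f"overall: {overall_lift(results):+.0%}")
```

Desktop is up, mobile is down, and the pooled result looks 'about the same'. Remember the small sample warning before you act on any single segment.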
34. • There are three reasons it is moving around
– Your sample size (outcomes) is still too small
– The external traffic mix, customers or reaction has
suddenly changed or
– Your inbound marketing driven traffic mix is
completely volatile (very rare)
• Check the sample size
• Check all your marketing activity
• Check the instrumentation
• If no reason, check segmentation
#13 : The test keeps moving around
35. • Something like this can happen:
• Check your sample size. If it’s still small, then expect this until the test
settles.
• If the test does genuinely flip – and quite severely – then something has
changed with the traffic mix, the customer base or your advertising. Maybe
the PPC budget ran out? Seriously!
• To analyse a flipped test, you’ll need to check your segmented data. This is
why you have a split testing package AND an analytics system.
• The segmented data will help you to identify the source of the shift in
response to your test. I rarely get a flipped one and it’s always something
#14 : The test has flipped on me
36. • No – and this is why:
– It’s a waste of time
– It’s easier to test and monitor instead
– You are eating into test time
– Also applies to A/A/B/B testing
– A/B/A running at 25%/50%/25% is the best
• Read my post here :
http://bit.ly/WcI9EZ
#15 : Should I run an A/A test first?
37. #16 : Nobody feels the test
• You promised a 25% rise in checkouts - you only see 2%
• Traffic, Advertising, Marketing may have changed
• Check they’re using the same precise metrics
• Run a calibration exercise
• I often leave a 5 or 10% stub running in a test
• This tracks old creative once new one goes live
• If conversion is also down for that one, BINGO!
• Remember – the AB test is an estimate – it doesn’t
precisely record future performance
• This is why infrequent testing is bad
• Always be trying a new test instead of basking in the
glory of one you ran 6 months ago. You’re only as good
as your next test.
38. #17 : You forgot about Mobile & Tablet
• If you’re AB testing a responsive site, pay attention
• Content will break differently on many screens
• Know thy users and their devices
• Use Bango or Google Analytics to define a test list
• Make sure you test mobile devices & viewports
• What looks good on your desk may not be for the user
• Harder to design cross device tests
• You’ll need to segment mobile, tablet & desktop response
in the analytics or AB testing package
• Your personal phone is not a device mix
• Ask me about making your device list
• Buy core devices, rent the rest from deviceanywhere.com
39. • If small volumes, contact customers – reach out.
• If data volumes aren’t there, there are still customers!
• Drive design from levers you can apply – game the system
• Pick clean and simple clusters of change (hypothesis driven)
• Use a goal at an earlier ring stage or funnel step
• Beware of using clickthroughs when attrition is high on the
other side
• Try before and after testing on identical time periods
(measure in analytics model)
• Be careful about small sample sizes (<100 outcomes)
• Are you working automated emails?
• Fix JFDI, performance and UX issues too!
#18 : Oh shit – Low Traffic!
40. • Forget MVT or A/B/N tests – run your numbers
• Test things with high impact – don’t be a wuss!
• Use UX, Session Replay to aid insight
• Run a task gap survey (4Q style)
• Run a dropped basket survey (LF style)
• Run a general survey + check social + other sites
• Run sitewide tests that appear on all pages or large clusters
of pages –
• UVPs (“We are a cool brand”), USPs (“Free returns!”), UCPs
(“10% off today”).
• Headers, Footers, Nudge Bars, USP bars, footer changes,
Navigation, Product pages, Delivery info etc.
#18 : Low traffic site tips
41. • A/B testing – good for:
– A single change of content or design layout
– A group of related changes (e.g. payment security)
– Finding a new and radical shift for a template design
– Lower traffic pages or shorter test times
• Multivariate testing – good for:
– Higher traffic pages
– Groups of unrelated changes (e.g. delivery & security)
– Multiple content or design style changes
– Finding specific drivers of test lifts
– Testing multiple versions (e.g. click here, book now, go)
– Where you need to understand strong and weak cross variable
interactions
– Don’t use to settle arguments or sloppy thinking!
#19 : You chose the wrong kind of test
42. #20 – Other flavours of testing
• Micro testing (tiny change) – good for:
– Proving to the boss that testing works
– Demonstrating to IT that it works without impact
– Showing the impact of a seemingly tiny change
– Proof of concept before larger test
• Funnel testing – good for:
– Checkouts
– Lead gen
– Forms processes
– Quotations
– Any multi-step process with data entry
• Fake it and Build it – good for:
– Testing new business ideas
– Trying out promotions on a test sample
– Estimating impact before you build
– Helps you calculate ROI
– You can even split test entire server farms
43. #20 – Other flavours of testing
Congratulations!
Today you’re the lucky winner of our
random awards programme.
You get all these extra features for free,
on us. Enjoy!
44. Top F***ups for 2014
1. Testing in the wrong place
2. Your hypothesis inputs are crap
3. No analytics integration
4. Your test will finish after you die
5. You don’t test for long enough
6. You peek before it’s ready
7. No QA for your split test
8. Opportunities are not prioritised
9. Testing cycles are too slow
10. You don’t know when tests are ready
11. Your test fails
12. The test is ‘about the same’
13. Test flips behaviour
14. Test keeps moving around
15. You run an A/A test and waste time
16. Nobody ‘feels’ the test
17. You forgot you were responsive
18. You forgot you had no traffic
19. You ran the wrong test type
20. You didn’t try all the flavours of testing
45. Is there a way to fix this then?
Conversion Heroes!
And here’s a boring slide about me – and where I’ve been driving over 400M of additional revenue in the last few years. In two months this year alone, I’ve found an additional ¾ M pounds annual profit for clients. For the sharp eyed amongst you, you’ll see that Lean UX hasn’t been around since 2008. Many startups and teams were doing this stuff before it got a new name, even if the approach was slightly different. For the last 4 years, I’ve been optimising sites using the combination of techniques I’ll show you today.
Tomorrow - Go forth and kick their flabby low converting asses