1. A/B Testing Framework Design Issues
Patrick McKenzie, 2010
(This presentation is meant to be read. It is released under the Creative Commons Attribution license – feel free to spread it, use it, or send it to people who'd benefit.)
www.abingo.org
2. A/B Testing Frameworks
• Why You Should Care
• Core Use Scenarios
• A/B Test Lifecycle
• Design Decisions
• Technical Considerations
• API Considerations
3. Why You Should Care
There is a paucity of A/B testing frameworks.
"I can probably name a dozen different systems for
building high scale applications (distributed storage,
message queues, caching layers, search engines,
etc), but I can’t name a single AB testing framework
other than Google Website Optimizer. That seems
like a serious inversion of priorities for most
startups."
http://www.tomkleinpeter.com/2009/01/21/where-are-the-ab-testing-frameworks/
4. Why You Should Care
• A/B testing helps you validate your
hypotheses about customers and product.
• A/B testing is drop-dead easy if your tech
supports it.
• You won't do it otherwise, because it feels
like boring busywork.
The goal is to have split-testing be a continuous part of our
development process, so much so that it is considered a
completely routine part of developing a new feature. In fact,
I've seen this approach work so well that it would be
considered weird and kind of silly for anyone to ship a new
feature without subjecting it to a split-test. That's when this
approach can pay huge dividends.
Eric Ries in blog post
5. Why You Should Care
• There are only two decent A/B test
frameworks for Rails. Both less than 9
months old.
• There are (to best of my knowledge) no
OSS frameworks for Java, Python, etc.
• You should write one. V1.0 can be done in 10 man-hours in a modern MVC framework. It will be the best ROI you ever get.
• This presentation hopes to save you time
by telling you where the hard decisions
are.
6. Three Use Scenarios
• Customers interacting with site.
• Implementers coding A/B test.
• Somebody interpreting results.
7. User View of A/B Test (What Cindy Sees)
8. User View of A/B Test (What Bob Sees)
9. Key Points For Users
• Users get consistent behavior. Cindy
always sees her alternative. Bob always
sees his.
• A/B test doesn't break usage of site.
(Sounds obvious, can be non-trivial. Test
for interactions!)
• Ending A/B test doesn't break site.
Did you know that in Google Website Optimizer
users can bookmark individual A/B alternatives
because they have distinct URLs? And that after
the test is over they may 404? Yeah. Don't do
that.
10. What Developers See
• One line to add a test.
• One line to track it.
• No thought required beyond creating
alternatives.
11. What Internal Customers See
• Simple, clear, actionable results.
• Stats 101 not required.
Your marketing team might know math.
That doesn't mean they should have to.
12. A/B Test Lifecycle
• Come up with alternatives.
• Code alternatives.
• Test alternatives.
• Deploy to site.
• Users interact with alternatives.
• Analyze results.
• End test.
When designing your A/B testing framework,
keep in mind that you'll be doing all of the
above. Eliminate as much friction from each
step as possible – this decreases total time
through the loop.
13. Come up with alternatives.
• Not generally a technical problem.
• Inspiration can come from anywhere – a
blog post, a passing fancy, customer
comments.
• Should never have to say "We can't do
that!"
• Strong recommendation: If we pay your
salary, you are authorized to test.
Customers do not think in terms of
Model/View/Controller interfaces. They just want
to know what the app can do. You should be able
to A/B test from any point in the app.
14. Code Alternatives
• Programming is hard, but you have to do it
anyway.
• Programming A/B tests is easy – one liner
and if statement.
• Testing framework handles all
bookkeeping – programmers never care.
• Re-use conversion code. Typical
businesses have lots of tests, few defined
conversions. No need to reinvent wheel
every single time.
15. Test Alternatives
• A/B tests are live code. They can have
bugs. You should be able to unit test like
normal.
• Helpful for developers to have access to
quick "switch what test I'm seeing"
functionality. Simplest example: manually
add parameter to URL
(&exampleTest=altA). Turn off feature in
production.
• Careful of test interactions. Very easy to
do once you start testing behavior in
addition to display.
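The manual-override bullet above can be sketched as a tiny helper. The parameter name, the chooser callback, and the production flag are all illustrative, not from any real framework:

```python
# Hedged sketch: force an alternative via a query parameter like
# ?exampleTest=altA, but only outside production.

def choose_with_override(params, test_name, alternatives, chooser,
                         production=False):
    """Return a forced alternative in dev/test; defer to the normal
    chooser otherwise (or when the forced value is invalid)."""
    forced = params.get(test_name)
    if not production and forced in alternatives:
        return forced
    return chooser(test_name, alternatives)
```

In production the override is simply ignored, so a curious user pasting the parameter into a URL sees normal behavior.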
16. Deploy to site.
• Avoid pointless work here. "Push code
live, test starts automatically" is the ideal.
• Testing framework should handle its own
setup first time test is called. After that, re-
use.
• Note that this decision is going to be made thousands or hundreds of thousands of times, possibly right after you push live: consider the performance implications.
• Can make code default to old version,
control start/stop of test via dashboard.
Could be worth it, adds complexity.
17. Users interact with alternatives.
• Happily, this takes very little work for you...
• … except when it creates Heisenbugs.
• In addition to thorough testing, make sure
your "What The User Is Seeing" feature
(you have one, right?) reflects their A/B
tests.
18. Analyze results.
• Stats behind A/B tests may not be well understood. Impress on your audience that the stats are real, measured, and actionable. It doesn't matter if they think it is magic as long as they trust the magic.
• Do significance testing so it isn't magic.
• Doing significance testing is grunt work: let
the computer do it.
• Spend the extra time to make internal
dashboard pretty. People trust pretty
things.
• A/B tests not a good place to dig for data.
One glance tells you all you need.
19. End test
• Simple solution: rip code out, test stops.
• Simple solution requires redeploy. In event of bug
or strong test result ("Oh my God what were we
thinking!?!") might want immediate end button on
dashboard. Be able to specify alternative.
• Automatic end of test? Probably a misfeature, but
easy to implement.
• Ending test should switch all users to winner (or
else you get to support old tests until doomsday).
However, users have memories.
• Negatively affected users (e.g. you end test in favor
of higher price, user planning on buying later saw
lower price) may be mad. Not big problem, but be
ready.
20. Design Considerations
• Tracking and managing identity.
• How to choose alternatives by identity.
• Where to store test participation.
• Where to store alternatives.
• Stats is hard, let's go shopping.
• Presenting results.
21. Tracking Identity
• Cindy is Cindy, Bob is Bob, Cindy should
always see Cindy's tests.
• Cindy is not a cookie. Cindy is not a
session. Cindy is freaking Cindy. Even
when she is on different computer.
• You already have identity via user
authentication. Probably want to punt
identity problem there. Have it inform
framework of current user identity.
• Important edge case: new user signup
should persist “identity” from anonymous
visitor to identifiable user.
22. Tracking Identity
• Easiest identity is random number thrown
into cookie. Associate with user accounts.
Restore on login. Bam, done.
• However, you will occasionally have A/B
test conversions outside of Cindy's HTTP
cycle. (e.g. Purchase notification comes
from Paypal, not from Cindy. Cindy calls
up to place order.) Think it through – not
terribly difficult if you plan for it.
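The cookie-plus-account scheme from these two slides might look like this in outline. All names are illustrative, and the real version would hook into your framework's request and login cycle:

```python
import secrets

def new_identity():
    """Random identity for a first-time anonymous visitor."""
    return secrets.token_hex(16)

def identity_for_request(cookies, current_user=None):
    """Resolve the A/B identity for this request.

    Logged-in users always use the identity saved on their account
    (so Cindy is Cindy on any computer); at signup the anonymous
    cookie identity is carried over to the new account.
    """
    if current_user is not None:
        if current_user.get("ab_identity") is None:
            # Signup edge case: persist the anonymous identity.
            current_user["ab_identity"] = cookies.get("ab_identity") or new_identity()
        return current_user["ab_identity"]
    if "ab_identity" not in cookies:
        cookies["ab_identity"] = new_identity()
    return cookies["ab_identity"]
```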
23. How To Choose Alternatives
• If you have N alternatives, picking one randomly and persisting the choice by identity works decently.
• Another approach: MD5(identity) %
number_of_alts. Saves space
(marginally).
• Don't need to save what test Cindy is
seeing as long as you can reproduce it.
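The hash-based approach can be sketched as below. One caution: hashing the bare identity puts the same user in the same bucket for every test, so this sketch mixes the test name into the hash as well, a small extension of the slide's formula:

```python
import hashlib

def choose_alternative(identity, test_name, alternatives):
    """Deterministically map (identity, test) to an alternative.

    No storage needed: the same inputs always hash to the same
    alternative, so Cindy's assignment can be reproduced on demand.
    """
    digest = hashlib.md5(f"{test_name}:{identity}".encode()).hexdigest()
    return alternatives[int(digest, 16) % len(alternatives)]
```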
25. Where to store test participation
• Cookie/session bad idea: Cindy will log in
at work tomorrow. She should see
consistent behavior.
• Cache (memcached) possible, but if Cindy
is evicted from cache or cache resets,
tough for Cindy and tough for you.
• Persistent data store best bet. Will talk
about specific data stores later in slides.
26. Where to store alternatives
• Many approaches. Whatever works for
you.
• A/Bingo puts alternatives directly in code.
Easiest place, always right in front of
developer, no conceptual overhead.
• Vanity puts alternatives in special experiment files. Arguably cleaner code, but you have to context-switch.
• Google Website Optimizer has you define
alternatives on a web form. Great for
marketing department at insurance
company. Don't do this. Greatly limits
possibilities, increases integration work,
blows testing to heck and back.
27. Doing Stats
• If possible, call out to dedicated stats
modules/libraries to do stats.
• Many types of possible stats for A/B
testing. Pick one, stick with it. I use Z-
scores because a) I remember them and
b) implementation was drop-dead easy.
• Sadly, Ruby lacks many good stats
libraries. Oh, to be a Perl programmer...
• This subject worth its own presentation.
See Ben Tilly.
http://elem.com/~btilly/effective-ab-testing/
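For the curious, a two-proportion Z-score is short enough to sketch here (the standard pooled-variance formula, not A/Bingo's actual code):

```python
import math

def z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-score for conversion rates, pooled variance."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)            # pooled conversion rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

def significant(z, threshold=1.96):
    """|z| > 1.96 is roughly 95% two-sided confidence."""
    return abs(z) > threshold
```

Let the computer do this grunt work and surface only "significant or not" to internal customers.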
28. Presenting Results
• Text is easy! Graphs not quite.
• Google's confidence bars are sexy... and
pretty useless.
• Simple, human language to describe what
confidence intervals and statistical
significance mean.
• De-emphasize null results (A > B but not
statistically significantly so) but don't hide
them. (After all, the fact that "this test was
too close to call" tells you something
useful.)
29. Technical Considerations
• Less than 1,000 visitors per hour? Skip
these slides.
• A/B testing turns performance
assumptions on head: heavy writes in very
bursty fashion ("as soon as test goes
live"), very non-relational data, fairly
infrequent reads (~3X writes on my site),
extraordinarily infrequent use of summary
statistics.
• Practically tailor-made for key/value store,
not so much for SQL.
30. Queries You Have To Answer FAST
• Who is Cindy? (user → identity)
• Is Cindy participating in Test X?
• If so, what alternative has she seen?
• If not, what alternative should she see?
• Record fact that Cindy is participating in
Test X.
• Has Cindy converted in Test X?
• Record fact that Cindy converted for Test
X.
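One way to make each query above a single O(1) lookup is a flat key scheme like this sketch (a plain dict stands in for memcached/Redis; the key names are illustrative):

```python
store = {}  # stand-in for a key/value store

def _incr(key):
    store[key] = store.get(key, 0) + 1

def participate(identity, test, alternative):
    """Record first exposure; repeat calls return the saved choice."""
    key = f"participation:{identity}:{test}"
    if key not in store:
        store[key] = alternative
        _incr(f"participants:{test}:{alternative}")
    return store[key]

def convert(identity, test):
    """Count each identity's conversion at most once per test."""
    pkey = f"participation:{identity}:{test}"
    ckey = f"converted:{identity}:{test}"
    if pkey in store and ckey not in store:
        store[ckey] = True
        _incr(f"conversions:{test}:{store[pkey]}")
```

Note there is deliberately no key mapping an alternative back to the identities who saw it.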
31. Queries You Can Answer Leisurely
• How many people have participated in
Experiment X?
• How many saw Alternative A?
• Umm, do that stats magic for me.
32. Query You Will NEVER ASK
• Who saw Alternative A in Experiment X?
33. Possible Architectures
• Summary statistics (participant counts &
conversion counts) in MySQL table with
"fairly few" rows. Simple increment
statements for updates.
• Participation information (Cindy,
Experiment X, Alternative A) in key/value
store.
• Or whole thing in key/value store.
34. Quick Speed Improvement for SQL
• Give each of your alternatives a unique
string ID like MD5(experiment name,
alternative name). Calculate that in
application code. Index on column.
• UPDATE alternatives SET participants =
participants + 1 where lookup_code =
'CALCULATED IN APPLICATION';
• This avoids having to translate human
name in code to ID in table. (Or having to
use multi-column index for lookup.)
• Note: I am not a very good guy with DBs,
but I am informed this is fairly fast. Test
for yourself.
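A sketch of the scheme, with SQLite standing in for MySQL (table and column names are illustrative):

```python
import hashlib
import sqlite3

def lookup_code(experiment, alternative):
    """Unique string ID computed in application code."""
    return hashlib.md5(f"{experiment}|{alternative}".encode()).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE alternatives (
    lookup_code  TEXT PRIMARY KEY,   -- the indexed column
    experiment   TEXT, name TEXT,
    participants INTEGER DEFAULT 0,
    conversions  INTEGER DEFAULT 0)""")

code = lookup_code("signup_copy", "short_form")
conn.execute(
    "INSERT INTO alternatives (lookup_code, experiment, name) VALUES (?, ?, ?)",
    (code, "signup_copy", "short_form"))

# The hot-path update: one indexed lookup, one increment.
conn.execute(
    "UPDATE alternatives SET participants = participants + 1 "
    "WHERE lookup_code = ?", (code,))
```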
35. Specific Key/Value Store Recommendations
• MySQL with big string columns for key,
value: ewwwwww. I mean, ewwwwww.
• Memcached: Acceptable (and fast) but not persistent. Also tends to only go down when the server does. For A/B testing, you might just re-run all in-progress tests if it dies.
• MemcacheDB: Tried it. Has unacceptable
performance when BerkeleyDB flushes to
disk. (5 seconds+!)
• Redis: Tried it. Not in production yet. My
recommendation – very fast. Vanity also
uses it.
36. API Considerations
Only need to expose two methods:
• ab_test(name, alternatives, conversion_name)
• conversion(conversion_name)
Note lack of identity in method calls. Let the
framework worry about that.
How you specify alternatives up to you.
Array of strings is easy to understand.
37. Consuming API
ab_test(name, alternatives, conversion_name) returns
the chosen alternative, handles all bookkeeping as
side effect.
Typically:
if ab_test(...) == "something"
  # do something
else
  # do something else
end
Fun opportunity for blocks/binding if your language
supports that.
38. Got Questions?
Great A/B testing resources:
• Eric Ries (startuplessonslearned.com) – heavy on
motivation, less on stats/design decisions
• #abtests and @abtests on Twitter. Good
community, many ideas for inspiration.
• http://abtests.com – ditto
• http://www.bingocardcreator.com/abingo/resources
– links I use when I forget the math.
• http://www.kalzumeus.com – my blog
• patrick@bingocardcreator.com –
I'm always happy to chat about A/B testing, with
anybody. Potentially available for consulting.