What is the value of my AB Testing Program?


Occasionally we are asked by companies how best to assess the value of running their AB testing programs. I thought it might be useful to put down in writing some of the points to consider if you find yourself asked this question.

With respect to hypothesis tests, there are two main sources of value:
1) The Upside – reducing Type 2 error. 
This is implicitly what people tend to think about in Conversion Rate Optimization (CRO) – the gains view of testing. When looking to value testing programs, they tend to ask something along the lines of ‘What are the gains that we would miss if we didn’t have a testing program in place?’ One possible approach to measuring this is to reach back into the testing toolkit and create a test program control group. The users assigned to this control group are shielded from any of the changes made based on outcomes from the testing process. This control group is then used to estimate a global treatment effect for the bundle of changes over some reasonable time horizon (six months, a year, etc.). The calculation looks something like:

(Total Site Conversion – Control Group Conversion) – Cost of the Testing Program

You can think of this as a sort of meta AB Test.
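
To make the arithmetic concrete, here is a minimal sketch in Python. The visitor counts, conversion counts, value per conversion, and program cost are all hypothetical placeholders; the structure is just the formula above, priced in dollars over whatever time horizon you choose.

```python
# A sketch of the "meta AB Test" value calculation. All inputs are hypothetical.

holdout_visitors = 50_000        # users shielded from the shipped changes
holdout_conversions = 1_500      # their conversions over the time horizon

treated_visitors = 950_000       # users exposed to the shipped changes
treated_conversions = 33_250     # their conversions over the same horizon

value_per_conversion = 75.0      # assumed average value of a conversion
program_cost = 250_000.0         # assumed total cost of the testing program

# Lift of the shipped bundle of changes, estimated against the holdout group.
holdout_rate = holdout_conversions / holdout_visitors
treated_rate = treated_conversions / treated_visitors
lift_per_visitor = treated_rate - holdout_rate

# Incremental conversions attributable to the program, then the net value.
incremental_conversions = lift_per_visitor * treated_visitors
net_value = incremental_conversions * value_per_conversion - program_cost

print(f"Lift per visitor:        {lift_per_visitor:.4f}")
print(f"Incremental conversions: {incremental_conversions:,.0f}")
print(f"Net program value:       ${net_value:,.0f}")
```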

Of course, in reality, this isn’t going to be easy to do: forming a clean global control group will often be complicated, if not impossible, and determining how to value the lift over the various conversion measures each individual test may have used can be tricky – especially in non-commerce applications.

2) The Downside – mitigating Type 1 loss.
However, if we only consider the explicit gains from our testing program, we ignore another reason for testing – the mitigation of Type 1 errors. Type 1 errors result in shipping changes that lead to harm, or loss. To estimate the value of mitigating this possible loss, we would need to expose our control group to the changes that we WOULD have made on the site had they not been rejected by our testing. That means we would need to make changes to the meta control group’s experience that we have strong reason to think would harm, and degrade, that experience. Of course this is almost certainly a bad idea, if not potentially unethical, and it highlights why certain types of questions are not amenable to randomized controlled trials (RCTs) – the backbone of AB Testing.

(Anyone out there using instrumental variables or other counterfactual methods? Please comment if you are.)

(For a refresher on Type 1 and Type 2 errors, please see https://blog.conductrics.com/do-no-harm-or-ab-testing-without-p-values/.)

But even if we did go down this route (a bad idea), it still wouldn’t give us a proper estimate of the true value of testing, since even if we don’t encounter harmful events, we were still protected against them. For example, you may have collision car insurance but have had no accidents over the past year. Was the value of the collision insurance zero? Are you sure? The value of insurance isn’t equal to the amount that ultimately gets paid out. Insurance doesn’t work like that – it is a good that is consumed regardless of whether it pays out or not. What you are paying for is the reduction in downside risk – and that is something testing provides whether or not an adverse event occurs. The difficult part is assessing the probability (risk, or perhaps Knightian uncertainty) and the severity of the potential negative outcomes.

The main takeaway is that the value of testing lies both in optimization (finding the improvements) and in mitigating downside risk. To value the latter, we need to be able to price what is essentially a form of insurance against whatever the organization considers to be intolerable downside risk. It is like asking what is the value of insurance, or privacy policies, or security policies. You may get by in any given year without them, but as you scale up, the risks of downside events grow, making it more and more likely that a significant adverse event will occur.
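
As a back-of-the-envelope way of thinking about pricing that insurance, here is a sketch in Python. The number of changes, the harm probability, the catch rate, and the loss per harmful change are all assumptions you would need to estimate for your own organization; the point is just that both the expected loss avoided and the chance of at least one bad event scale with how much you ship.

```python
# A rough sketch of pricing the "insurance" side of testing. Every input here is
# an assumption you would need to estimate for your own organization.

n_changes_per_year = 50           # changes you would ship in a year
p_harmful = 0.05                  # chance a given change is actually harmful
p_caught_by_testing = 0.80        # chance testing screens out a harmful change
loss_per_harmful_change = 50_000  # average cost of shipping one harmful change

# Expected annual loss avoided because testing rejects most harmful changes.
harms_without_testing = n_changes_per_year * p_harmful
harms_with_testing = harms_without_testing * (1 - p_caught_by_testing)
insurance_value = (harms_without_testing - harms_with_testing) * loss_per_harmful_change

# As you scale up, the chance of at least one harmful change shipping grows.
p_any_without = 1 - (1 - p_harmful) ** n_changes_per_year
p_any_with = 1 - (1 - p_harmful * (1 - p_caught_by_testing)) ** n_changes_per_year

print(f"Expected annual loss avoided: ${insurance_value:,.0f}")
print(f"P(at least one harmful change ships), no testing:   {p_any_without:.0%}")
print(f"P(at least one harmful change ships), with testing: {p_any_with:.0%}")
```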

One last thing. Testing programs tend to jumble together the related, but separate, concepts of hypothesis testing (the binary decision to Reject or Fail to Reject) and the estimation of effect sizes (the best guess for the ‘true’ population conversion rates). I mention this because often we only think about the value of the actions taken based on the hypothesis tests, rather than also considering the value of robust estimates of the effect sizes for forecasting, ideation, and helping allocate future resources. (As an aside, one can run an experiment that has a robust hypothesis test but also yields a biased estimate of the effect size – a magnitude error. Sequential testing, I’m looking at you!)
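
To make that magnitude error concrete, here is a small Monte Carlo sketch in Python (using NumPy). The baseline rates, look schedule, and number of simulated experiments are all made up for illustration, and it is not a statement about any particular sequential testing procedure – just about what naive peeking does to the reported lift.

```python
# A small simulation of the magnitude-error aside: with naive peeking (stopping at
# the first "significant" look), the reported effect size among stopped tests is
# biased upward relative to the true lift. All parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
p_control, p_treatment = 0.050, 0.055      # true rates; true lift = 0.005
looks = range(2_000, 20_001, 2_000)        # peek every 2,000 visitors per arm
z_crit = 1.96
estimates_at_stop = []

for _ in range(2_000):                     # simulate 2,000 experiments
    control = rng.random(20_000) < p_control
    treatment = rng.random(20_000) < p_treatment
    for n in looks:
        pc, pt = control[:n].mean(), treatment[:n].mean()
        pooled = (control[:n].sum() + treatment[:n].sum()) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and (pt - pc) / se > z_crit:
            estimates_at_stop.append(pt - pc)   # lift reported at the stopping point
            break

print(f"True lift:                   {p_treatment - p_control:.4f}")
print(f"Mean lift among early stops: {np.mean(estimates_at_stop):.4f}")
print(f"Share of runs that stopped:  {len(estimates_at_stop) / 2_000:.0%}")
```

In runs like this, the average lift among the experiments that stopped early should come out noticeably above the true lift – the selection effect that produces the magnitude error.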

Ultimately, testing can be seen as both a profit center (upside discovery) AND a cost center (downside mitigation). Focusing on just one will lead to underestimating the value your testing program can provide to the organization. That said, it is a fair question to ask, and one that hopefully will help you extract even more value from your experimentation efforts.

What are your thoughts? Is there anything we are missing, or should consider? Please feel free to comment and let us know how you value your testing program.

