Some are Useful: AB Testing Programs

As AB testing becomes more commonplace, companies are moving beyond thinking about how best to run experiments to how best to set up and run experimentation programs. Unless the required time, effort, and expertise are invested in designing and running the AB Testing program, experimentation is unlikely to be useful.

Interestingly, some of the best guidance for getting the most out of experimentation can be found in a paper published almost 45 years ago by George Box. If that name rings a bell, it is because Box is credited with coining the phrase “All models are wrong, but some are useful”. In fact, from the very same paper that this phrase comes from, we can extract some guiding principles for running a successful experimentation program.

In 1976 Box published “Science and Statistics” in the Journal of the American Statistical Association. In it he discusses what he considers to be the key elements of successfully applying the scientific method. Why might this be useful for us? Because in a very real sense, experimentation and AB Testing programs are the way we apply the scientific method to business decisions. They are how companies DO science. So learning how best to employ the scientific method translates directly into how we should best set up and run our experimentation programs.

Box argues that the scientific method is made up, in part, of the following:
1) Motivated Iteration
2) Flexibility
3) Parsimony
4) Selective Worry

According to Box, the attributes of the scientific method can best be thought of as “motivated iteration in which, in succession, practice confronts theory, and theory, practice.” He goes on to say that, “Rapid progress requires sufficient flexibility to profit from such confrontations, and the ability to devise parsimonious but effective models [and] to worry selectively …”.

Let’s look at what he means in a little more detail and how it applies to experimentation programs. 

Learning and Motivated Iteration

Box argues that learning occurs through the iteration between theory and practice. Experimentation programs formalize the process for continuous learning about marketing messaging, customer journeys, product improvements, or any number of other ideas/theories.

Box: “[L]earning is achieved, not by mere theoretical speculation on the one hand, nor by the undirected accumulation of practical facts on the other, but rather by a motivated iteration between theory and practice. Matters of fact can lead to a tentative theory. Deductions from this tentative theory may be found to be discrepant with certain known or specially acquired facts. These discrepancies can then induce a modified, or in some cases a different, theory. Deductions made from the modified theory now may or may not be in conflict with fact, and so on.”

As part of the scientific method, experimentation of ideas naturally requires BOTH a theory about how things work AND the ability to collect facts/evidence that may or may not support that theory. By theory, in our case, we could mean an understanding of what motivates your customer, why they are your customer and not someone else’s, and what you might do to ensure that they stay that way. 

Many times marketers purchase technology and tools in an effort to better understand their customers. However, without a formulated experimentation program, they are missing one half of the equation. The main takeaway is that just having AB Testing and other analytics tools is not going to be sufficient for learning. It is vital for YOU to also have robust theories about customer behavior, what they care about, and what is likely to motivate them. The theory is the foundation and drives everything else. It is then through the iterative process of guided experimentation, which feeds back on the theory and so on, that we establish a robust and useful system for continuous learning.


Flexibility

Box: “On this view efficient scientific iteration evidently requires unhampered feedback. In any feedback loop it is … the discrepancy between what tentative theory suggests should be so and what practice says is so that can produce learning. The good scientist must have the flexibility and courage to seek out, recognize, and exploit such errors … . In particular, using Bacon’s analogy, he must not be like Pygmalion and fall in love with his model.”

Notice the words that Box uses here: “unhampered” and “courage”. Just as inflexible thinkers are unable to consider alternative ways of thinking, and hence never learn, so it is with inflexible experimentation programs. Just having a process for iterative learning is not enough. It must also be flexible. By flexible Box doesn’t only mean it must be efficient in terms of throughput. It must also allow for ideas and experiments to flow unhampered, where neither influential stakeholders nor the data science team holds too dearly to any pet theory. People must not be afraid of creating experiments that seek to contradict existing beliefs, nor should they fear reporting any results that do.  


Parsimony

Box: “Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary, following William of Occam [we] should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so over elaboration and overparameterization is often the mark of mediocrity.”

This is where the “All Models are Wrong” saying comes from! I take this to mean that rather than spend effort seeking the impossible, we should instead seek what is most useful and actionable: “how useful is this model or theory in helping to make effective decisions?”

In addition, we should keep analysis and experimental methods only as complex as the problem requires. Often companies get distracted, or worse, seduced by a new technology or method that adds complexity without advancing the cause. This is not to say that more complexity is always bad, but whatever the solution is, it should be the simplest one that can do the job. That said, the ‘job’ may really be one of signaling/optics rather than solving a specific task: for example, differentiating a product or service as more ‘advanced’ than the competition, regardless of whether it actually improves outcomes. It is not for me to say whether those are good enough reasons for making something more complex, but I do suggest being honest about it and going forward forthrightly and with eyes wide open.

Worry Selectively

Box: “Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.”

This is my favorite line from Box. Being “alert to what is importantly wrong” is perhaps the most fundamental, yet underappreciated, analytic skill. Not just in building an experimentation program but in any analytics project, it is vital to be able to step back and ask, “while this isn’t exactly correct, will it matter to the outcome, and if so, by how much?” Performing this type of sensitivity analysis, even informally in your own mind’s eye, is an absolutely critical part of good analysis. You don’t have to be an economist to think, and decide, at the margin.
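As a toy illustration of this kind of sensitivity check (my own sketch, not from Box, and with made-up rates), here is how you might ask whether an assumption is a mouse or a tiger in Python: perturb the assumed baseline conversion rate and see how much the planned sample size for a two-proportion test actually moves. The z-value approximations 1.96 and 0.84 correspond to the conventional 5% two-sided significance level and 80% power.

```python
import math

def sample_size_per_arm(baseline, lift, z_alpha=1.96, z_beta=0.84):
    """Approximate sample size per arm for a two-proportion z-test
    (defaults: alpha = 0.05 two-sided, power = 0.80)."""
    p1, p2 = baseline, baseline + lift
    # Sum of Bernoulli variances under the two rates
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / lift ** 2)

# We planned around a 10% baseline -- what if the truth is closer to 8%?
planned = sample_size_per_arm(0.10, 0.01)  # 14732 per arm
actual = sample_size_per_arm(0.08, 0.01)   # 12192 per arm
print(planned, actual)
```

Whether a swing of roughly 20% in required sample size matters depends on your traffic; the point is simply to ask the question before the test runs rather than after.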

Of course, whether something is a mouse or a tiger will depend on the situation and context. That said, in general, at least to me, the biggest tiger in AB Testing is fixating on solutions or tools before the problem has been properly defined. Companies can easily fall into the trap of buying, or worse, building a new testing tool or technology without having thought about: 1) exactly what they are trying to achieve; 2) the edge cases and situations where the new solution may not perform well; and 3) how the solution will operate within the larger organizational framework.

As for the mice, they are legion. They have nests in all the corners of any business, and whenever one is spotted, people rush from one approach to another in the hope of not being caught out. Here are a few of the ‘mice’ that have scampered around AB Testing:

  • One Tail vs Two Tails (eek! A two-tailed mouse – sounds horrible)
  • Bayes vs Frequentist AB Testing
  • Fixed vs Sequential designs
  • Full Factorial Designs vs Taguchi designs

There is a pattern here. All of these mice tend to be features or methods introduced by vendors or agencies as new and improved, frequently over-selling their importance and implying that some existing approach is ‘wrong’. It isn’t that there aren’t often principled reasons for preferring one approach over another. In fact, depending on the problem, all of them can be useful (except maybe Taguchi MVT – I’m not sure that was ever really useful for online testing). It is just that none of them, nor any others like them, will be what makes or breaks a program’s usefulness.
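To see why a mouse like the one-tail/two-tail choice rarely decides a program's fate, here is a small sketch (with made-up conversion counts) comparing the two p-values from the same two-proportion z-test. Since the two-sided p-value is simply twice the one-sided one when the observed effect is in the hypothesized direction, a clearly significant or clearly null result stays that way under either convention.

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def ab_test_pvalues(conv_a, n_a, conv_b, n_b):
    """One- and two-sided p-values for a pooled two-proportion z-test
    (one-sided alternative: B converts better than A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    one_sided = 1 - norm_cdf(z)
    two_sided = 2 * (1 - norm_cdf(abs(z)))
    return one_sided, two_sided

# Hypothetical test: 100/1000 conversions on A vs 130/1000 on B
one, two = ab_test_pvalues(100, 1000, 130, 1000)
print(one, two)  # both comfortably below 0.05 -- same decision either way
```

Only when a result hovers right around the significance threshold can the tail convention flip the call, and a result that fragile should prompt more data or more scrutiny, not a fight over tails.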

The real value in an experimentation program is the people involved, and the process and culture surrounding it – not some particular method or software. Don’t get me wrong, selecting the software and statistical methods that are most appropriate for your company matters, a lot, but it isn’t sufficient. I think what Box says about the value of the statistician should be top of mind for any company looking to run experimentation at scale:
 “… the statistician’s job did not begin when all the work was over – it began long before it started. … [The Statistician’s] responsibility to the scientific team was that of the architect with the crucial job of ensuring that the investigational structure of a brand new experiment was sound and economical.”

So too for companies looking to include experimentation in their workflow. It is the experimenter’s responsibility to ensure that each experiment is both sound and economical, and it is the larger team’s responsibility to provide an environment and process, in part by following Box, that will enable their success.

If you are looking to upgrade your AB Testing software and improve your experimentation program please contact us here to learn more.
