Balancing Earning with Learning: Bandits and Adaptive Optimization

While AB Testing is currently the default method for optimizing online conversions, there are frameworks other than AB testing for selecting between competing optimization options.  One of the main alternative approaches is known as the Multi-Armed Bandits problem.  Solutions to the so called bandit problems have several nice features that often make them a good choice for your optimization problems.

This post looks a little long, what am I going to learn?

You will learn how you might be able to squeeze even more from your optimization efforts by using alternatives to AB Testing.  Additionally,  you should have a basic handle on what the multi-armed bandit problems it and how it relates to your optimization goals.

Why Bandits?

A bandit is another name a slot machine, and online conversion optimization is in some ways analogous to the task of how to pick between slot machine with different payoffs.  So think about walking into a casino, and you have reason to believe that different machines have different payoffs.

If you knew beforehand which machine had the highest payoff, you would play that one exclusively to maximize your expected return from playing.  The difficulty is that you don’t know which machine is best beforehand, so you have to deduce which slot machine has the highest pay off.  To do this, you are going to have to try them all out in order to get a sense of the expected return from playing each machine.

But how to do this?  Run an AB style experiment?

You could do that, but remember; each time you play a poor performing machine, you lower your take that you walk out of the Casino with that night.  In order to maximize how much money you walk out of the casino with, you will have to be efficient with how you collect your data.

Learn and Earn

The solutions to Bandit style problems try to account for this data acquisition cost. They do it by trying to balance using the current best guess about the performance of each option and value of collecting more data to improve upon these guesses.

This tradeoff between using currently available information and collecting more data is often referred to as explore exploit problem, but some like to call it earning while learning.  You need to both learn in order to figure out what works and what doesn’t, but to earn; you take advantage of what you have learned. This is what I really like about the Bandit way of looking at the problem, it highlights that collecting data has a real cost, in terms of opportunities lost.

Unlike AB or multivariate testing, which all use basically the same methodology (apologies to Fisher, Neyman, and Jeffreys), there is no one standard approach for Bandit solutions, but they all try to incorporate this notion of balancing exploration with exploitation. Technically, these are problems that try to minimize what’s known as regret.


Regret is the difference between your actual payoff and the payoff you would have collected had you played the optimal (best) options at every opportunity.

This can get a little confusing, so to make this clearer, let’s compare AB Testing to one of the simpler bandit algorithms.  While there are many different approaches to solving Bandits, one of the simplest is the value weighted selection approach, where the frequency that each optimization option is selected is based on how much it is currently estimated to be worth. So if we have two options, ‘A’ and ‘B’, and ‘A’ has been performing better, we weight the probability of selecting ‘A’ higher than ‘B’.  We still randomly select between the two, but we  give ‘A’ a better chance of being selected.

First lets take a look at how AB Testing selects during learning compared our Bandit method.  For our example we consider a problem where we are trying to optimize with three possible options; ‘A’,’B’, and ‘C’.

In AB Testing, we see that each time we make a selection, each of the options gets the same chance to be selected.

In the Bandit approach, this isn’t the case.  The probabilities for being selected at each turn depend on the current best guess off how valuable each option is.  In the above case, Option A has the highest current estimated value, and so is getting selected more often than either options B or C, and likewise, option B has a higher estimated value than option C.

Since the selection probabilities are based on the current best estimates for each option, the Bandit selection probabilities can change over time as we update our value estimates for each option.  Let’s imagine we set up an optimization project to try the out three possible options.

Unbeknownst to us at the start, les say that the top performing option is worth $2.00, the next best, $1.00 and the worst option only $0.50. Let’s see how Bandit’s continuous Learn and Earn balancing act looks compared to AB Testing in the graph below.

With the AB Testing approach we try out each of the three options with equal proportions until we run our test at week 5, and then select the option with the highest value.  So we have effectively selected the best option 44% of the time over the 6 week period (33% of the time through week 5, and 100% at week 6), and the medium and lowest performing options 28% of time each (33% of the time through week 5, and 0% at week 6).  If we the weight these selection proportions by the actual value of each option, we get an expect yield of about $1.31  per user over the 6 weeks (44%*$2 + %28*$1 + %28*$0.50=$1.31).

Now lets look at the Bandit approach.  Remember, the bandit attempts to use what it knows about each option from the very beginning, so it continuously updates the probabilities that it will select each option throughout the optimization project.

In the above chart we can see that with each new week, the bandit reduces how often it selects the lower performing options and increases how often if selects the highest performing option.

So, over the 6 week period the bandit approach has selected the best option 72% of the time, the medium option 16% of the time, and the lowest performer on 12% of the time.  If we weight by the average value of each option we get an expected yield of about $1.66 per user.  In the case the bandit would have returned a 27% lift over the AB Testing approach – ($1.66-$1.31)/$1.31 – during the learning time of the project.

The following chart show the same idea in a slightly differently. Here we can see what the average value of each user will be during out optimization project.

For the first 5 weeks, the AB Test has an expected yield of $1.17 per user (33%*$2.00 + 33%*$1.00 + 33%*$0.50), and then an expected return of $2.00 once Option C has been selected.

The Bandit, on the other hand, starts at $1.17, like the AB Test, but then increases the average value continuously throughout the project. So you get more payoffs while learning.

In fact, the area under the bandit value curve and above the AB Testing value curve is the total extra payoff we get by using the bandit.  If we averaged around 1,000 visits per week, we would take in $4,666 while learning with AB Testing but $6,810 with Bandit learning – a lift of 46%.

Of course, the longer you extend your project out for, the smaller this incremental payout is of the total.  This means that for short lived projects, like online campaigns, are particularly well suited for adaptive bandit methods – since they are active for a limited time.

Why Should you Care?

There really are three main reasons to prefer Bandits over AB Testing

  1. Earn while you Learn.  Data collection is a cost, and we ought to try to do it as efficiently as possible.  Using a bandit style approach lets us at least consider these costs while we run our optimization projects.
  2. Automation – the bandit approach is a natural way to formulate the online optimization problems for automating the selection optimization with machine learning.  This is especially true when applying user targeting, as correct AB tests becomes much more complicated in this situation.
  3. A changing world – Bandit approaches are useful when you have an optimization problem where the best options can change over time.  By letting the bandit method always leave some chance to select poorer performing option, you give it the opportunity to ‘reconsider’ the option effectiveness.  If there is evidence that it is performing better than first thought, it will be automatically be promoted and used more often. It provides a working framework for swapping out low performing options with fresh options, in a continuous process.

Of course, there is no guarantee that using a Bandit method will perform better than running an AB Test. However, there is also no guarantee that using either a Bandit or AB Test selection approach is necessary going to be better than just guessing – but on average it should be.

To wrap up, Bandits are a continuous learning method for selecting between competing options. They weight the competing selections between what currently seems to work best and the potential benefits of collecting more information to improve future results.

Make your own adaptive or standard AB tests. Sign up for a Free access today at Conductrics

Tags: ,

2 Trackbacks

  1. By Intelligent Agents | Conductrics on March 15, 2013 at 8:42 pm

    […] learner is made accessible to the controller component. We will see a bit of this when we look at Multi-armed Bandits in an upcoming […]

  2. By The Wrong (and Right) Ways to Test Content Titles on August 11, 2017 at 3:50 pm

    […] are a great way to balance earning with learning. Essentially, they attempt to minimize regret by exploiting learnings in real time, and shifting […]

Post a Comment

Your email is never published nor shared. Required fields are marked *