Multi-Armed Bandits vs A/B Testing: A Comparison

Multi-armed bandit octopus
Multi-armed bandit octopus
Article Outline:
  1. Introduction to A/B Testing
  2. Drawbacks of A/B Testing
  3. Business Metric Optimization
  4. Multi-Armed Bandit Testing
  5. Interpretation of Test Results
  6. Overview of Bandit Benefits and Drawbacks
  7. Which Testing Method should you use?
  8. References
For those unfamiliar with bandits, please have a look at Multi-Armed Bandits: An Introduction for the necessary background.

Introduction to A/B Testing:

Suppose you’re the owner of a popular online clothing store and you’d like to increase sales. An employee suggested that a “red” buy button icon will result in more clicks than a “blue” one. To empirically settle the question of which icon color yields more clicks, you decide to run a test. The standard approach to testing of this sort is A/B testing and involves the following:

  1. Hosting two versions of the site. Version A with a “red” buy button and Version B with a “blue” buy button.
  2. Randomly routing half of the traffic to Version A and the other half to Version B.
  3. For each user that visits the site, record which version of the site he visited (A or B) and whether he clicked on the buy button or not.
  4. Ending the test once enough data has been collected or the test has been run for a sufficiently long time.
  5. At the end of the test, for each version of the site tested, computing the proportion of visitors that clicked on the buy button. We’ll call this proportion the Click-Through Rate(CTR).

To decide which version of the site performs better on the basis of the test, we simply choose the version which has the greater CTR. At more disciplined organizations, we’d run a test of statistical significance and determine whether any perceived difference in the CTR of the two versions is merely due to chance. If it is, we’ll become indifferent to either site and either collect more data or abandon the test.

Drawbacks of A/B Testing:

Notice that A/B testing involves a long fixed period of uniform exploration i.e. routing half of the traffic to one version and half to the other, followed by a period of pure exploitation i.e. deploying the more successful version of the site. This is suboptimal for the reasons we noted in our previous post about Multi-Armed Bandits. Consider a case where we decide to collect data for a week, and at the end of the first day, it’s already clear that one version of the site is vastly better than the other. According to the rules of A/B testing, we must continue to route traffic to this site for another 6 days, only to gain a higher degree of confidence in a conclusion we are already quite confident about, resulting in a loss of income. You might be inclined to patch up this problem by simply stopping the test after day one and interpreting the results. Unfortunately, this entails several statistical hazards and will contaminate your results beyond repair. The crucial drawback of the A/B testing method is not exploiting its current knowledge.

A more glaring issue for vanilla A/B tests is that they fail to address the problems posed by non-stationary distributions. People’s preferences change over time! Suppose you ran your A/B test in the Winter and decided that sweaters were preferable to t-shirts, this result certainly won’t hold in Summer. Even subtle non-seasonal changes in taste which take months to years will be absolutely ignored by A/B testing. Repeated A/B tests, scheduled at regular intervals can solve this problem to some extent but ultimately exist as a discrete and lousy approximation to the more elegant continuous solution we’ll see later. The central problem here is a lack of ongoing exploration.

Business Metric Optimization:

We’ll take a brief detour here to remind ourselves why we run tests in the first place. The ultimate goal of any organization boils down to maximizing some business metric(at successful companies this is usually profit). To that end, we run tests to determine which among a basket of processes and products are preferable with respect to that metric. Crucially, testing is an intermediate goal. The idea is that the test will reveal the best product, at which point we simply offer that product. The problem is that lazy tests like the A/B test utterly disregard the final goal of business metric maximization, leading to costly testing.

Multi-Armed Bandit Testing:

Multi-Armed Bandit(MAB) testing will give us a way to perform tests and gather information without incurring undue losses due to exploration. MAB tests focus jointly on testing and optimizing the business metric at the same time!

Let’s see how Multi-Armed Bandits can be used as a testing and deployment strategy which addresses the three glaring problems of A/B testing:

  1. Lack of ongoing exploration
  2. No exploitation of current knowledge
  3. Naive exploration strategy

MAB testing will begin in the same way as A/B testing with two versions of a site, A and B. Now we’ll use the terminology we developed in the last article to turn this testing and deployment strategy into a bandit problem. In this 2-armed bandit problem, the two actions we can perform are, routing a customer to Version A of the site or routing him to Version B. We receive a reward of 1 if the customer clicks the buy button and a reward of -1 if he doesn’t. Now we can use any of the bandit algorithms introduced above to choose which site to route traffic to.

Notice that nearly every competent bandit we discussed previously continuously explores in a sophisticated way and exploits his current knowledge from the very first time step. It’s worth pointing out that A/B testing is just a special case of MAB testing, with an agent that explores uniformly for some fixed time steps and then exploits for the remainder of the time steps.

Interpretation of Test Results

In a typical A/B test setting, the sample sizes are roughly equal, so the degree of confidence we have in the version’s CTR will not be greater than in the other. With MAB testing, this is no longer true. Because exploitation happens from the beginning, if the versions don’t have exactly the same distribution, the sampling will not be even, in fact, it will be heavily skewed. We should expect results like the following:

  1. Version A: CTR=0.9 Number of Samples: 900
  2. Version B: CTR=0.2 Number of Samples: 43

This throws us in a bit of a dilemma, in cases where the Click-Through rates are a bit closer. Should we have greater faith in a CTR of 0.7 with 100 samples than in one that is 0.8 with 10 samples? Bayesian methods will help address this, but it’s worth pointing out that it requires a subtler hand to interpret MAB results than it does to interpret an A/B test.

Luckily, as far as answering the basic question of whether Version A is preferable to Version B, we can use Welch’s t-test for statistical significance. This test, unlike other t-tests, is robust against unequal variances and unequal sample sizes. In fact, it really ought to be the default for A/B testing too, but hasn’t yet attained widespread adoption.

It must be mentioned, that while there is no loss in the interpretability of the experimentation results, MAB really isn’t the ideal choice for those exclusively seeking information and understanding. MAB is far less sample efficient than A/B tests in the sense that for the same degree of confidence in results of the form “A is better than B”, MAB tests will sometimes demand an order of magnitude more samples, though usually less.

Overview of Bandit Benefits and Drawbacks:

To review, bandit algorithms are a class of methods for balancing the trade-off between exploration and exploitation when sampling from reward distributions. We can often use bandit algorithms as an alternative to A/B testing when the problem is appropriately modeled. Bandits offer the following benefits and drawbacks:


  1. Converge earlier than A/B methods because they exploit their knowledge from the first time step and take an informed approach to exploration instead of randomly sampling
  2. Track non-stationary distributions by exploring continuously


  1. Final results with a lower degree of confidence when compared to those of A/B tests. This makes them far less sample efficient when determining statistical significance. Statistical significance and confidence intervals can still be computed, although these computations require a slightly higher level of sophistication
Which testing method should you use?
Ideal Use Case: Multi-Armed Bandit
  1. When understanding is less important than optimizing a business metric
  2. When the cost of testing is high e.g. The permanent loss of a user is a relatively high cost to bear when compared to simply losing or deferring a sale
  3. When you have a small number of users to sample from. MAB tests are more sample efficient than A/B tests with regards to convergence to the optimum
  4. In dynamic environments. MAB tests track non-stationary distributions well
  5. More than 2 alternatives to test. MAB scales well to contemporaneous testing of 10–20 variants.
Ideal Use Case: A/B
  1. When understanding is more important than optimizing a business metric or at least is an end in itself
  2. When the cost of testing is low, i.e. the financial loss associated with the less optimal alternative is acceptable
  3. In a stationary environment

Note: This section is based in part on (Lu Reference:4)


  1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). Bradford Books.
  2. White, J. M. (2012). Bandit algorithms for website optimization. O’Reilly Media.
  3. What is Multi-Armed Bandit(MAB) Testing? | VWO
  4. Beyond A/B Testing: Multi-armed Bandit Experiments | by Shaw Lu | Towards Data Science