Multi-Armed Bandits And Probability Matching

I took an interest in the multi-armed bandit problem after reading John Holland's book, Adaptation in Natural and Artificial Systems. Traditionally, the problem is described as follows: a gambler is putting coins into a slot machine (the bandit) with a number of arms, each of which has an independent probability of paying out a fixed reward. The gamber's objective is to learn, or explore, as little about the slot machine's arms as necessary in order to determine which has the highest probability of returning a reward. Then, using this knowledge, the gambler can capitalize on, or exploit, the arm that yields the most rewards by neglecting to pull all other arms. This trade-off between learning and capitalizing is known as exploration and exploitation.

The problem structure is general enough that it appears in a variety of domains, ranging from website testing, such as changing fonts or page structure to improve click through rates, to online advertising to robotics to dynamic pricing.

In what follows, I'll formalize a simple version of the problem, describe a modern solution, and then present an implementation and a visual simulation of the inner workings of the solution's algorithm.

Formalizing the problem

In the simplest case, a bandit has \(k\) arms, each with it's own success probability, or bias. We will denote these success probabilities as \({\theta_0, \theta_1, \dots, \theta_k}\). When a bandit arm is pulled, it yields a reward \(y \in \{0,1\}\). Under these conditions each bandit arm is a Bernoulli random variable, parameterized by \(\theta\), where after some number of trials, the expected number of successes follows a Binomial distribution. In the literature, this formulation is known as the binomial bandit.

A solution: probability matching

There are a rich set of solutions to the problem based on a a variety of techniques. The one I'm focusing on in this post, presented here, is a Bayesian variant of probability matching, also known as Thompson sampling, developed by Steven Scott.

To solve the binomial multi-armed bandit problem, our goal is to learn which arm has the highest success probability. Following Scott's method, we achieve this by conducting a series of trials, where, for each trial, we probablistically select an arm according to their estimated success probablities. From a Bayesian point of view, we can estimate each arm's success probability according to the following expression: \[ \Pr(\theta | \mathbf{y}) \propto \Pr(\mathbf{y} | \theta) \Pr(\theta), \] where \(\Pr(\mathbf{y} | \theta)\) is known as the likelihood function and \(\Pr(\theta)\) is known as the prior. Because each arm is a Bernoulli random variable, the likelihood function is equal to: \[ \Pr(\mathbf{y} | \theta) = \prod_{i=1}^{N_t} \theta^{y}(1-\theta)^{1-y}, \] which simplifies to \[ \Pr(\mathbf{y} | \theta) = \theta^{x_t}(1-\theta)^{N_t-x_t} \] where \(x_t\) is the total number of successful trials and \(N_t\) is the total number of trials conducted up to time \(t\). The prior distribution represents any knowledge known about the arms before each trial. The conjugate prior for the likelihood function we just derived above is the Beta distribution, defined as: \[ \Pr(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}, \] where \(B(\alpha,\beta)\) is the Beta function, \(\alpha\) represents the number of previous successful trials and \(\beta\) represents the number of previous unsuccessful trials.

When the likelihood and prior functions are multiplied together to obtain the posterior, we get another Beta distribution, parameterized as follows: \[ \Pr(\theta | \mathbf{y}) = Be(\alpha+x_t, \beta+N_t-x_t) \]

Given this problem structure, Scott's solution proceeds as follows:

  • Initialize the experiment by applying a uniform prior to each arm. This is equal to setting \(\alpha = \beta = 1\), for each arm's prior distribution.
  • For each trial
    1. Choose an arm by sampling from each arm's posterior and selecting the arm with the largest \(\theta\).
    2. Pull the selected arm and get a reward.
    3. Update the arm's parameters by incrementing \(\alpha\) or \(\beta\) according to whether or not the trial produced a success or failure.

Measuring performance

Over the course of the experiment, we would like to measure the performance of our method. As mentioned earlier, a quantitative measurement of performance is regret. Regret is the loss taken when one of the non-optimal arms is pulled during an experiment. If the truly optimal arm is known, mathematically, we can compute the total regret after each trial as follows: \[ R_t = \sum_{arms} n_t(\theta^* - \hat{\theta}_t), \] where \(n_t\) is equal to the number of trials allocated to each arm, \(\hat{\theta}\) is equal to the estimated success probability of the arm, and \(\theta^*\) is the optimal arm's success probability. Note that because we are representing each arm's success probability as a Beta distribution, we can compute \(\hat{\theta}\) by taking the distribution's expectation, which is equal to \[ \hat{\theta} = \frac{\alpha+x_t}{\alpha + \beta + N_t}. \] Optimal solutions, like the one presented here, to the multi-armed bandit problem exhibit have a total regret function that grows logarithmically with the number of trials.


If we use existing statistical libraries, the solution's implementation is trivial. In what follows I'll present a Python implementation. The complete code base for this post can be found here.

To begin, we need to create a data structure to maintain the state of each arm. As we showed earlier, each arm's posterior can be represented as a Beta distribution whose parameters are equal to the number of successes and failures. We can define a class to store these counts as follows:

class BetaDist(object):
    def __init__(self, alpha=1., beta=1.):
        #initialize the distribution to uniform
        self.alpha = alpha
        self.beta = beta

The constructor initializes the distribution such that it is uniform. Scott's algorithm states that after each trial we will need to update the distribution's counts of \(\alpha\) and \(\beta\), compute it's expectation, and draw samples from it. These methods are defined as:

def update(self, x):
    #update counts according to whether or not the trial was a success of failure
    if x == 0.:
    elif x == 1.:

def get_mean(self):
    #return the expectation of the distribution
    return self.alpha / (self.alpha + self.beta)

def sample(self, samples):
    #draw some number of samples from the posterior and return them
    return scipy.stats.beta.rvs(self.alpha, self.beta, size=samples)

A bandit can be represented by instantiating a BetaDist class for each arm and defining methods that return relevant values for each step of Scott's algorithm.

class BernoulliBandit(object):
    def __init__(self, arms=2, samples=1):
        #initialize a number of beta distributions, one for each arm
        self.arms = [BetaDist() for i in xrange(arms)]
        self.samples = samples
    def update(self, arm, reward):
        #update the state of the relevant arm's posterior
    def choose_arm(self):
        #probability sampling: estimate mean by sampling from the posterior
        #generated by trials, choose arm with the largest mean
        return np.argmax([arm.sample(samples=self.samples) for arm in self.arms])
    def best_arm(self):
        #get the current best arm by returning the index of the largest
        return np.argmax([arm.get_mean() for arm in self.arms])
    def get_mean(self, arm):
        #return the arm's expectation
        return self.arms[arm].get_mean()

To test this code, we can conduct a simple simulation where we know the values of actual arms and then measure how well the algorithm estimates which one is optimal by computing total regret. I've included this testing framework in the code repository under the file Below is a simplified testing framework that does not include code to plot the evolution of the algorithm in terms of posterior distributions or total regret.

import numpy as np
import scipy.stats

from BernoulliBandit import BernoulliBandit

trials = 100
arms = 3
regrets = []

#draw random biases for each of the arms on the true bandit
true_arms = np.random.random_sample(arms)
arm_choices = np.zeros(arms)

#initialize random variables the true bandit using random bias values
bandit = [scipy.stats.bernoulli(a) for a in true_arms]

#initialize our model
bb = BernoulliBandit(arms=arms)

for t in xrange(trials):
    #choose a bandit probablistically, based on what we know so far
    arm = bb.choose_arm()
    #record choice for regret measurement

    #get a reward from a single trial from the arm
    reward = bandit[arm].rvs()
    #estimate the regret for this trial
    regrets.append(regret(true_arms.argmax(), true_arms, arm_choices))
    #update the model

#Print out summary of arm pulls, estimated biases and actual biases
print 'arm biases ', true_arms
print 'arm allocations ', arm_choices
print 'best estimated arm ', bb.best_arm()
print 'best actual arm ', true_arms.argmax()


In addition to the Python implementation I've reviewed above, I've developed a simple web-based simulator. To see the algorithm in action, click the button and watch how it evolves over 100 trials.

The top figure plots each arm's posterior distribution after each trial. Notice how at the beginning of the simulation each arm's distribution is the same and after each trial, the arm that was chosen has it's posterior adjusted according to whether or not the trial was successful.

The figure below the posterior plot shows the experiment's total regret as a function of trials. Notice that once the algorithm has built up a strong belief that one arm is optimal, the rate at which total regret grows decreases. The table show's the experiments true arm success probabilities as well as the number of trials allocated to each arm and the arm's total number of rewards.