import random
import numpy as np

def epsilon_greedy(Q, eps):
    # With probability 1 - eps exploit the current estimates, otherwise explore.
    if random.random() < 1 - eps:
        return np.argmax(Q)
    else:
        return random.randint(0, len(Q) - 1)
3 Multi-armed Bandits
In the case of multi-armed bandits, no state is used for the selection of the next action.
There exists an “intermediate case”, called contextual bandits, where a state is used but the state itself does not depend on the previous actions.
Multi-armed bandits are a simplified version of reinforcement learning.
| Multi-armed Bandits | Contextual Bandits | Reinforcement Learning |
|---|---|---|
| No state is used. | State is used; the state does not depend on the previous actions. | State is used; the state depends on the previous actions. |
| The selected action does not affect the next state/situation. | The selected action does not affect the next state/situation. | The selected action affects the next state/situation. |
3.1 K-armed Bandit Problem
Consider the following learning problem. We are faced repeatedly with a choice among \(k\) different options, or actions. After each choice we receive a numerical reward chosen from a stationary probability distribution that depends on the action we selected. Our objective is to maximise the expected total reward over some time period, for example, over 1000 action selections, or time steps.
This is the “classic” k-armed bandit problem, so named by analogy to a slot machine, or “one-armed bandit”, except that it has \(k\) levers instead of one. Each action selection is like a play of one of the slot machine’s levers, and the rewards are the payoffs for hitting the jackpot.
Problem Definition
We can model a k-armed problem as follows:
- \(k\) different actions;
- after each choice we receive a numerical value that depends only on the action we selected;
- the goal is to maximise the expected reward over a certain number of time steps.
Like in a slot machine, we need to learn which lever provides us with the highest reward.
The problem is the same, maximising the expected reward:
\(E[G_t \vert S_t = s, A_t = a]\)
\(\qquad \;\; \doteq E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \vert S_t = s, A_t = a]\)
\(\qquad \;\; = E [\sum^\infty_{k=0} \gamma^k R_{t+k+1} \vert S_t = s, A_t = a]\)
In the case of the k-armed bandit problem, the state is always the same (or, in other words, it does not matter). We can think about having \(s=\overline{s}\) with \(\overline{s}\) constant.
Considering a constant state \(\overline{s}\), the problem is equivalent to maximising:
\(E[G_t \vert S_t = \overline{s}, A_t = a]\)
\(\qquad \;\; \doteq E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \vert S_t = \overline{s}, A_t = a]\)
\(\qquad \;\; = E [\sum^\infty_{k=0} \gamma^k R_{t+k+1} \vert S_t = \overline{s}, A_t = a]\)
with \(a\) one of the \(k\) actions.
Action Selection and Estimated Value
In a k-armed bandit problem, each of the \(k\) actions has a value, equal to the expected or mean reward given that that action is selected.
- As in the general RL case, we denote the action selected on time step \(t\) as \(A_t\) and the corresponding reward as \(R_t\).
The value of an arbitrary action \(a\), denoted as \(q_*(a)\), is the expected reward given that \(a\) is selected:
\(q_*(a) \doteq E[R_t \vert A_t = a]\)
In the trivial case where we know the value of each action, solving the k-armed bandit problem is very easy: it is sufficient to always select the action with the highest value!
- However, in general, we do not know the action values with certainty; we only have estimates.
The estimated value of action \(a\) at time step \(t\):
\(Q_t(a)\)
Ideally, we would like the value of \(Q_t(a)\) to be as close as possible to \(q_*(a)\).
3.2 Exploration vs. Exploitation
We maintain the estimates of each action value.
Exploitation:
At any step there is at least one action whose estimated value is the greatest \(\rightarrow\) we refer to such an action as a greedy action.
When we select one of these actions, we are exploiting our current knowledge of the values of the actions.
Exploration:
If we select a non-greedy action, we say that we are exploring \(\rightarrow\) trying alternative actions.
By doing so, we can improve our estimates of the values of the non-greedy actions.
Balancing exploration and exploitation in a smart way is the key aspect: several theoretical results (in terms of bounds, etc.) exist under specific assumptions.
3.3 Evaluating Action-Value Methods
Multi-armed bandits are a very good way of approaching the general problem of Reinforcement Learning.
Simplification: the state does not change, which means:
- our actions do not modify the state;
- since the state does not change, the agent’s actions do not depend on the previous actions (the two things are strictly linked).
We will later consider the “full” reinforcement learning problem, where the agent’s actions do modify the state and do depend on the previous actions.
Recall:
The true value of an action is the mean reward when that action is selected.
A possible way to estimate this is by averaging the rewards actually received:
\(Q_t(a) \doteq \frac{\text{sum of rewards when action } a \text{ is taken prior to } t}{\text{number of times action } a \text{ is taken prior to } t}\) \[= \frac{\sum_{i=1}^{t-1} R_i \, 1_{A_i=a}}{\sum_{i=1}^{t-1} 1_{A_i=a}}\]
where \(1_{A_i=a}\) denotes the random variable that is 1 if the predicate \(A_i=a\) is true and 0 otherwise.
If the denominator is 0, we set \(Q_t(a)\) to some default value (e.g., 0).
As the denominator goes to infinity, by the law of large numbers, \(Q_t(a)\) converges to \(q_*(a)\).
This is called the sample-average method because each estimate is the average of the sample of the rewards.
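The following is a small sketch (not from the original notes; the function name and example histories are illustrative) of how the sample-average estimate can be computed directly from the recorded history of actions and rewards.

```python
# A minimal sketch of the sample-average method: Q_t(a) is the average of the
# rewards received on the time steps when action a was selected.
def sample_average(actions, rewards, a, default=0.0):
    rewards_for_a = [r for act, r in zip(actions, rewards) if act == a]
    if not rewards_for_a:            # denominator is 0: fall back to a default value
        return default
    return sum(rewards_for_a) / len(rewards_for_a)

# Example: action 1 was selected twice with rewards 2.0 and 4.0 -> Q(1) = 3.0
print(sample_average(actions=[0, 1, 1, 2], rewards=[1.0, 2.0, 4.0, 0.5], a=1))
```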
3.4 Greedy Action Selection Rule
The simplest selection rule is to select one of the actions with the highest estimated value. This is usually referred to as the greedy selection rule.
More formally, a greedy selection rule is one that selects \(A_t\) so that:
\(A_t \doteq \arg\max_{a} Q_t(a)\)
i.e. the action \(a\) for which \(Q_t\) is maximised.
3.5 \(\epsilon\)-greedy Selection Rule
A simple alternative is to behave greedily most of the time but, from time to time, with probability \(\epsilon\), to select an action at random from among all the actions with equal probability. We call this method \(\epsilon\)-greedy.
One of the advantages of this method is that, in the limit, as the number of steps increases, every action is sampled an infinite number of times, ensuring that \(Q_t(a)\) converges to \(q_*(a)\) for all \(a\).
3.6 Tracking a Stationary Problem (Incremental Implementation)
Let’s now discuss how to implement these methods in practice. We consider a single action \(a\) to simplify the notation.
Let \(R_i(a)\) be the reward received after the \(i\)-th selection of action \(a\).
Let \(Q_n(a)\) denote the estimate of its action value after it has been selected \(n-1\) times.
We can write:
\(Q_n(a) \doteq \frac{R_1(a) + R_2(a) + \dots + R_{n-1}(a)}{n-1}\)
Trivial implementation
The trivial implementation would be to maintain a record of all the rewards and evaluate the formula whenever the estimate is needed. However, if this is done, the memory and computational requirements grow over time. A possible alternative is an incremental implementation.
Efficient computation of estimates
Incremental implementation:
\(Q_{n+1} = \frac{1}{n} \sum^n_{i=1} R_i\)
\(\qquad \;\; = \frac{1}{n} (R_n + \sum^{n-1}_{i=1} R_i)\)
\(\qquad \;\; = \frac{1}{n} (R_n + (n-1) \frac{1}{n-1} \sum^{n-1}_{i=1} R_i)\)
\(\qquad \;\; = \frac{1}{n} (R_n + (n-1) Q_n)\)
\(\qquad \;\; = \frac{1}{n} (R_n + n Q_n - Q_n)\)
\(\qquad \;\; = Q_n + \frac{1}{n} (R_n - Q_n)\)
This form of update rule is quite common in reinforcement learning.
The general form is:
NewEstimate \(\leftarrow\) OldEstimate + StepSize (Target - OldEstimate)
where the expression (Target - OldEstimate) is usually referred to as the error in the estimate.
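To make the update rule concrete, here is a minimal sketch (an assumed illustration, not the lecture's code) of the incremental sample-average computation, which stores only the current estimate and a counter instead of the whole reward history.

```python
# Incremental sample-average update:
#   NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate), with StepSize = 1/n.
# Only the current estimate Q and the selection count N are stored.
Q, N = 0.0, 0
for reward in [3.0, 5.0, 4.0]:       # rewards received for a single action
    N += 1
    Q += (1.0 / N) * (reward - Q)    # same result as averaging all rewards
print(Q)                             # 4.0, the average of the three rewards
```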
3.7 Tracking a Nonstationary Problem
The averaging method discussed before is appropriate for stationary bandit problems.
However, many problems are non-stationary. In these cases, it makes sense to give more weight to recent rewards than to long-past rewards.
One of the most popular ways of doing this is to use a constant step-size parameter (in the example above, the step size was \(\frac{1}{n}\)).
The incremental update rule for updating an average \(Q_n\) of the \(n-1\) past rewards is modified to be:
\(Q_{n+1} \doteq Q_n + \alpha(R_n - Q_n)\)
where the step-size parameter \(\alpha \in (0,1]\) is constant.
This results in \(Q_{n+1}\) being a weighted average of past rewards and the initial estimate.
More formally:
\(Q_{n+1} \doteq Q_n + \alpha (R_n - Q_n)\)
\(\qquad \;\; = (1-\alpha)^n Q_1 + \sum^n_{i=1} \alpha (1-\alpha)^{n-i}R_i\)
- This is called a weighted average since:
\[(1-\alpha)^n + \sum^n_{i=1} \alpha(1-\alpha)^{(n-i)} = 1\]
- The quantity \(1-\alpha\) is less than 1, so the weight given to \(R_i\) decreases as the number of subsequent rewards increases; the decrease is exponential, and for this reason this is sometimes called an exponential recency-weighted average.
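As a hedged illustration (variable names and numbers are assumed), the sketch below applies the constant step-size update \(Q_{n+1} = Q_n + \alpha(R_n - Q_n)\) to a reward stream whose mean shifts halfway through; the estimate ends up tracking the recent rewards rather than the overall average.

```python
# Constant step-size update Q_{n+1} = Q_n + alpha * (R_n - Q_n): recent rewards
# get exponentially more weight, so the estimate can track a changing target.
alpha, Q = 0.1, 0.0
rewards = [1.0] * 50 + [5.0] * 50    # the reward distribution shifts halfway through
for r in rewards:
    Q += alpha * (r - Q)
print(round(Q, 2))                   # ~4.98: dominated by the recent rewards
```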
3.8 Optimistic Initial Values
These methods are dependent on the initial action-value estimates \(Q_1(a)\).
- We say that these methods are biased by their initial estimates.
Initial values can be used as a simple way to encourage exploration.
The learner is “disappointed” by the rewards it receives compared to the optimistic initial values, and therefore explores more.
For example, if the initial values for each action \(a\) are set quite high with respect to the actual optimal values, all the actions are tried several times before the estimates converge.
Example
Suppose that we have a multi-armed bandit, where the optimal action values \(q_*(a)\) for every action \(a\) are selected from a normal distribution with mean 0 and variance 1.
In this case, an initial value \(Q_1(a)=10\) for each action is wildly optimistic: whatever reward is received will be “disappointing” compared to the estimate.
Remember the formula:
NewEstimate \(\leftarrow\) OldEstimate + StepSize (Target - OldEstimate)
This leads to a lot of exploration, even with pure greedy action selection.
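The sketch below mirrors the example above (the specific setup, seed, and variable names are assumptions): true action values drawn from a standard normal distribution, optimistic initial estimates \(Q_1(a)=10\), and purely greedy selection.

```python
import random

random.seed(0)
k = 5
q_star = [random.gauss(0, 1) for _ in range(k)]   # true action values ~ N(0, 1)
Q, N = [10.0] * k, [0] * k                        # optimistic initial estimates

for t in range(50):
    a = Q.index(max(Q))                           # purely greedy selection (no epsilon)
    reward = random.gauss(q_star[a], 1)           # "disappointing" compared to Q_1(a)
    N[a] += 1
    Q[a] += (1.0 / N[a]) * (reward - Q[a])        # sample-average update

print(N)   # every arm has been tried at least once, driven only by the optimism
```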
3.9 Upper-Confidence-Bound (UCB) Action Selection
Exploration is needed because there is always uncertainty about the accuracy of the action-value estimates.
The greedy action might be the best one, but this might not be the case.
\(\epsilon\)-greedy action selection forces the non-greedy actions to be tried, but indiscriminately, with no preference for those that are nearly greedy or particularly uncertain.
It would be better to select among the non-greedy actions according to their potential of being optimal, taking into account both how close their estimates are to being optimal and the uncertainties in these estimates.
One effective way of doing this is to select actions according to:
\(A_t \doteq \arg\max_{a} (Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}})\)
where \(t\) is the time step at which action \(A_t\) is taken, \(N_t(a)\) denotes the number of times that action \(a\) has been selected prior to time \(t\), and the number \(c > 0\) controls the degree of exploration.
The idea of this upper confidence bound (UCB) action selection is that the square-root term is a measure of the uncertainty or variance of the estimate of the actual value of action \(a\).
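A minimal sketch of UCB action selection is given below (assuming the common convention that an action with \(N_t(a)=0\) is treated as maximizing and selected first; the function name and example numbers are illustrative).

```python
import math

def ucb_select(Q, N, t, c=2.0):
    # Actions never tried (N_t(a) = 0) are treated as maximizing and selected first.
    for a in range(len(Q)):
        if N[a] == 0:
            return a
    scores = [Q[a] + c * math.sqrt(math.log(t) / N[a]) for a in range(len(Q))]
    return scores.index(max(scores))

# Example: at t = 10 the rarely tried arm 1 gets a large exploration bonus
# and is selected even though arm 0 has the highest estimate.
Q = [1.0, 0.8, 0.5]
N = [8, 1, 1]
print(ucb_select(Q, N, t=10))   # 1
```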
3.10 Multi-armed Bandits in practice
import numpy as np
import random

class MultiArmedBandit:
    '''Create a k-armed bandit.
    Each bandit arm must have its reward distribution.
    The bandit has to receive the action: to select one of the k bandits,
    with possible value from 1 to 10.
    The bandit must have k random generators of two values i,j (with i < j):
    the min and max of the reward distribution.
    e.g.:
        1 3 -> 1.5, ...
        2 6 -> 2.1, 3.3, ...
        4 8 -> 4.2, 5.3, 7.8
    '''
    def __init__(self, k):
        self.arms = [self._create_generator() for i in range(k)]
        for i, arm in enumerate(self.arms):
            print(f"arm {i} range: {arm}")

    def _create_generator(self):
        lbound, ubound = 1, 10
        a = random.randint(lbound, ubound)
        b = random.randint(a, ubound)
        return a, b

    def get_reward(self, a: int):
        assert a >= 0 and a < len(self.arms)
        return random.uniform(*self.arms[a])

class ArmChooser:
    def __init__(self, k: int, eps: float, initial_value: float):
        self.Q, self.N = [initial_value] * k, [0.] * k
        self.eps = eps

    def _epsilon_greedy(self):
        if random.random() < 1 - self.eps:
            return np.argmax(self.Q)
        else:
            return random.randint(0, len(self.Q) - 1)

    def choose(self):
        a = self._epsilon_greedy()
        R = bandit.get_reward(a)
        print(f"R: {R}")
        self.N[a] += 1
        self.Q[a] += 1/self.N[a] * (R - self.Q[a])
        print(f"Q: {self.Q}")
        print(f"N: {self.N}")

# Example usage
k, T, eps, initial_value = 4, 10, 0.1, 10.
bandit = MultiArmedBandit(k)
chooser = ArmChooser(k, eps, initial_value)

for t in range(T):
    print(f"step {t}")
    chooser.choose()
arm 0 range: (4, 6)
arm 1 range: (1, 1)
arm 2 range: (5, 8)
arm 3 range: (8, 8)
step 0
R: 5.007640757785914
Q: [5.007640757785914, 10.0, 10.0, 10.0]
N: [1.0, 0.0, 0.0, 0.0]
step 1
R: 1.0
Q: [5.007640757785914, 1.0, 10.0, 10.0]
N: [1.0, 1.0, 0.0, 0.0]
step 2
R: 5.971678348982844
Q: [5.007640757785914, 1.0, 5.971678348982844, 10.0]
N: [1.0, 1.0, 1.0, 0.0]
step 3
R: 8.0
Q: [5.007640757785914, 1.0, 5.971678348982844, 8.0]
N: [1.0, 1.0, 1.0, 1.0]
step 4
R: 8.0
Q: [5.007640757785914, 1.0, 5.971678348982844, 8.0]
N: [1.0, 1.0, 1.0, 2.0]
step 5
R: 8.0
Q: [5.007640757785914, 1.0, 5.971678348982844, 8.0]
N: [1.0, 1.0, 1.0, 3.0]
step 6
R: 8.0
Q: [5.007640757785914, 1.0, 5.971678348982844, 8.0]
N: [1.0, 1.0, 1.0, 4.0]
step 7
R: 8.0
Q: [5.007640757785914, 1.0, 5.971678348982844, 8.0]
N: [1.0, 1.0, 1.0, 5.0]
step 8
R: 8.0
Q: [5.007640757785914, 1.0, 5.971678348982844, 8.0]
N: [1.0, 1.0, 1.0, 6.0]
step 9
R: 8.0
Q: [5.007640757785914, 1.0, 5.971678348982844, 8.0]
N: [1.0, 1.0, 1.0, 7.0]
3.11 Contextual bandits
Multi-armed bandits are used when the selection of the action does not depend on the state.
But for many applications (e.g., ads on a webpage), the state is actually very important.
When the action selection depends on the state, but not on its previous history, we have the case of contextual bandits.
Since the action is associated with the state, contextual bandits are also described as a problem of associative search.
They require learning a policy, i.e. a mapping from the current state to the probabilities of each action.
As we said, contextual bandits differ from the “full” reinforcement learning problem in that the action does not affect the future state; when the action does affect the future state, we have the “full” reinforcement learning problem.
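As an illustrative sketch (assumed, not from the original notes), a contextual bandit can be implemented by keeping a separate table of \(\epsilon\)-greedy action-value estimates for each context; the class and context names below are hypothetical.

```python
import random

class ContextualBandit:
    # One table of action-value estimates per context, updated with the sample
    # average, and epsilon-greedy selection within the observed context.
    def __init__(self, contexts, k, eps=0.1):
        self.Q = {s: [0.0] * k for s in contexts}
        self.N = {s: [0] * k for s in contexts}
        self.eps = eps

    def choose(self, context):
        Q = self.Q[context]
        if random.random() < self.eps:
            return random.randrange(len(Q))   # explore
        return Q.index(max(Q))                # exploit within this context

    def update(self, context, action, reward):
        self.N[context][action] += 1
        n = self.N[context][action]
        self.Q[context][action] += (1.0 / n) * (reward - self.Q[context][action])

# Example usage with hypothetical contexts ("sports_page", "news_page") and 3 ads.
agent = ContextualBandit(["sports_page", "news_page"], k=3)
a = agent.choose("sports_page")
agent.update("sports_page", a, reward=1.0)
```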
Additional Readings
Lihong Li, Wei Chu, John Langford, Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. Proceedings of the 19th International Conference on World Wide Web (WWW 2010): 661–670.
3.12 Exercises
Website serving ads with bandits.
Design a system serving ads on a website (e.g. Google Ads) using only bandits.
import random

def shuffle_array(arr):
    random.shuffle(arr)

def get_random_int(min_val, max_val):
    return random.randint(min_val, max_val - 1)

def get_favorite_category(Q):
    return max(Q, key=Q.get)

def choose_next_ad(Q, ads, eps):
    if random.random() > eps:
        favorite_category = get_favorite_category(Q)
        category_ads = [ad for ad in ads if ad[1] == favorite_category]
    else:
        random_category = random.choice(list(Q.keys()))
        category_ads = [ad for ad in ads if ad[1] == random_category]
    return random.choice(category_ads) if category_ads else None

def display_ads(ad):
    print(f"Showing Ad: {ad[0]} (Category: {ad[1]})")

ads = [
    ("Soccer ⚽️", "Sport"), ("Basket 🏀", "Sport"), ("Volleyball 🏐", "Sport"),
    ("Banana 🍌", "Fruit"), ("Apple 🍏", "Fruit"), ("Orange 🍊", "Fruit"),
    ("Cat 🐈⬛", "Pet"), ("Dog 🐕🦺", "Pet"), ("Parrot 🦜", "Pet"),
    ("Sneakers 👟", "Fashion"), ("Sunglasses 🕶️", "Fashion"), ("Dress 👗", "Fashion"),
    ("Paris 🇫🇷", "Travel"), ("Tokyo 🇯🇵", "Travel"), ("London 🇬🇧", "Travel")
]

shuffle_array(ads)

Q = {category: 0 for _, category in ads}
eps = 0.2

display_ads(ads[0])
while True:
    user_input = input("Did you like this ad? (y/skip): ").strip().lower()
    if user_input == 'y':
        Q[ads[0][1]] += 1
    elif user_input == 'skip':
        Q[ads[0][1]] -= 1

    next_ad = choose_next_ad(Q, ads, eps)
    if next_ad:
        ads[0] = next_ad
        display_ads(next_ad)
    else:
        print("No more ads to display.")
        break
Website serving ads with contextual bandits.
Design a system serving ads on a website (e.g. Google Ads) using contextual bandits.