Safe Exploration for Optimizing Contextual Bandits

Author: Rolf Jagerman.

In this blog post I will introduce our recent work on safe exploration for optimizing contextual bandits which was published in TOIS. The full paper can be found here. The source code used to run the experiments in our paper can be found here.

What are Contextual Bandits?

Contextual bandit algorithms are a family of algorithms that optimize a policy by playing the following game:

  1. The algorithm observes a context
  2. The algorithm plays an action
  3. The algorithm observes a reward for the played action

The goal of the algorithm is then to maximize its reward over time. To accomplish this, these algorithms optimize a policy: a mapping of contexts to actions. A good contextual bandit algorithm will learn a policy that maps contexts to actions that have high reward. Contextual bandit algorithms are a natural fit for many information retrieval systems such as search engines. For example:

  1. A user issues a search query (the “context”)
  2. The search engine responds with a ranked list of items (the “action”)
  3. The user clicks on some documents in the list but not others (the “reward”)

We illustrate this process in the figure below:

A widely studied challenge of contextual bandit algorithms is the exploration-exploitation trade-off: An algorithm wants to explore new actions so as to find actions that can yield potentially high reward while simultaneously exploit its existing knowledge to accumulate a large amount of reward. In our paper we explore safety: can a contextual bandit algorithm explore new actions without harming the user experience?


Exploration is a key part in optimizing contextual bandit algorithms: not enough exploration and the algorithm may be stuck playing suboptimal actions and it can never learn; too much exploration and the user experience will be degraded as the algorithm produces seemingly random results. We ideally want the best of both worlds: explore new actions but only do so when we know those actions will not be harmful to the user experience. To accomplish this goal we introduce a new learning method called SEA: Safe Exploration Algorithm. This algorithm is a combination of two important recent developments: Counterfactual Learning and High-Confidence Off-Policy Evaluation.

Before we introduce SEA in more detail, we first need to more formally define what exactly a “safe” learning algorithm is. In our work we use the definition of safety from so-called conservative methods. This means that we measure safety relative to an existing baseline policy. More formally:

A learning algorithm is safe if its reward during the entire learning process is at least as high as that of a baseline policy.

The following figure illustrates both a safe and an unsafe contextual bandit algorithm. The gray line is the baseline policy which has an average reward of 0.62. The red line represents an unsafe algorithm: it explores agressively in the beginning leading to suboptimal performance but eventually learns a good policy. The blue line represents a safe algorithm: its performance is always at least as good as the baseline policy.

SEA: Safe Exploration Algorithm

The Safe Exploration Algorithm (SEA) is a contextual bandit learning algorithm that is safe. It functions by considering two policies: a baseline policy $\pi_b$ and a policy that is being learned $\pi_w$. The algorithm works by performing the following steps at each user interaction:

  1. The user issues a search query (the “context”)
  2. The algorithm uses the baseline policy $\pi_b$ to present a ranked list of results to the user (the “action”).
  3. The algorithm observes clicks on the results (the “reward”)
  4. Based on the reward, the algorithm improves the learned policy $\pi_w$ via Counterfactual Learning.
  5. Based on all historical rewards, the algorithm estimates the current reward of $\pi_b$ and $\pi_w$.
  6. If the estimated reward of $\pi_w$ exceeds that of $\pi_b$, then we replace $\pi_b$ with the newly learned $\pi_w$.

The fourth step is accomplished by using counterfactual learning: a technique that allows unbiased optimization of a new policy ($\pi_w$) based on historical user interactions collected from a different policy ($\pi_b$). A more in-depth explanation of counterfactual learning is beyond the scope of this blog post. A slightly more technical blog post about counterfactual learning can be found here.

The fifth step of the algorithm uses high-confidence off-policy estimators. These estimators allow us to estimate the reward of a policy by constructing high confidence bounds. This means that we can say with high probability (e.g. 99%) that the reward of the policy lies somewhere in the constructed confidence interval. We use these confidence intervals to be able to compare the performance of $\pi_b$ and $\pi_w$.


We experiment with a learning to rank scenario where the goal of the algorithm is to learn a ranking system based on user clicks. Since we do not have a platform with real users, we instead simulate user clicks according to three user behavior models: perfect, position-biased, near-random. These user models represent different degrees of click noise, ranging from perfect user behavior: users always click on relevant results and never on non-relevant ones; to near-random behavior: users click on relevant results 60% of the time and non-relevant ones 40% of the item. By using our simulation we can compare SEA to both online and counterfactual learning algorithms.

In the following figure we compare SEA against several online learning algorithms for the learning to rank scenario. For the reward we use NDCG@10: a commonly used metric for learning to rank.

These results indicate that SEA is indeed safe: its performance is always at least as good as that of the baseline policy $\pi_b$. In contrast, the online learning algorithm are not necessarily safe. Especially in the scenario with near random clicks: the performance of RankSVM for example degrades far below the baseline.

The paper contains more results than I can cover in this blog post: for a different contextual bandit task (text classification), across more datasets and a comparison against purely counterfactual methods.


In this blog post I introduced our recent work about SEA: A safe exploration algorithm for optimizing contextual bandits. Safety is an important criteria for practitioners of information retrieval systems where it is important to maintain a good user experience during learning.

SEA has several benefits over purely online or purely counterfactual learning algorithms. First, compared to purely online learning algorithms, the safe exploration algorithm provides a formal guarantee of safety: with high probability the algorithm never deploys sub-optimal policies. Second, compared to purely counterfactual learning algorithms, the safe exploration algorithm can explore new actions: allowing it to find actions with potentially high payoff which can lead to learning a better model in the end.

However, safety is not free, and the safe exploration algorithm also has drawbacks: First, compared to purely online learning algorithms, SEA is more conservative and does not explore as aggressively as it has to overcome potentially large confidence bounds. This means that online algorithms can achieve better exploration, and as a result may learn better models, than SEA. Second, compared to purely counterfactual learning algorithms, SEA requires the deployment of models to explore new actions. This adds additional logistic and engineering overhead for practitioners that may be more used to a traditional offline build-evaluate-deploy cycle.

Future work can consider more efficient ways of estimating policy performance. This can help make SEA explore earlier and can help learn better models. Additionally, in our work we consider safety at the level of policies: either a policy is safe or it is not. Future work should consider learning algorithms that can determine per context or even per action whether it is safe to explore or not.