module Obandit : sig .. end
OCaml Multi-Armed Bandits
Obandit is an OCaml module for multi-armed bandits. It supports the EXP, UCB, and Epsilon-greedy families of algorithms.
This library implements multi-armed bandits as modules. A bandit module is obtained by applying a functor to a parameter module, which usually gives the number $K$ of arms and the hyperparameters of the bandit.
module type Bandit = sig .. end
A bandit algorithm.
Bandit modules are instantiated using functors. The parameter module varies depending on the algorithm.
For instance, the UCB1 bandit module for 3 arms is obtained with:
module P = struct let k = 3 end
module UCB1 = MakeUCB1 (P)
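A bandit module is then driven by alternating rewards and actions. The following loop is only a sketch: the value names initialBandit and step, and the convention that step consumes the last reward and returns the next arm together with the new state, are assumptions about the Bandit signature made for illustration.

(* Hypothetical interaction loop: play [n] rounds against a reward
   function [reward : int -> float]. The names and types of
   [UCB1.initialBandit] and [UCB1.step] are assumptions of this sketch. *)
let play ~n ~reward =
  let rec go i state r =
    if i >= n then state
    else
      let a, state' = UCB1.step state r in  (* feed last reward, get next arm *)
      go (i + 1) state' (reward a)
  in
  go 0 UCB1.initialBandit 0.0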
The following algorithms are available:
We use the viewpoint of the survey [1].
At time $t$, the $(\alpha,\psi)$-UCB algorithm [1] takes the action:
$$ \argmax_{i=1,\cdots,K} \quad \left[ \widehat{\mu_i}+(\psi^{*})^{-1}\left(\frac{\alpha\ln t}{T_i} \right) \right] $$
where $\alpha > 0$ is a hyperparameter, $\widehat{\mu_i}$ is the estimate of the average reward of arm $i$, $T_i$ is the number of times arm $i$ has been visited so far, and $\psi^*$ denotes the Legendre-Fenchel transform of a convex function $\psi$.
The pseudo-regret $\bar{R_n}$ has the following bound at round $n$: $$ \bar{R_n} \leq \sum_{i:\Delta_i > 0} \left( \frac{\alpha \Delta_i}{\psi^*(\Delta_i/2)} \ln n + \frac{\alpha}{\alpha-2} \right) $$
where $\Delta_i = \mu^* - \mu_i$ is the suboptimality parameter of arm $i$.
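As a concrete illustration, the index above can be computed directly from per-arm statistics. This is a standalone sketch, not the library's internal code; in particular invLFPsi, standing for $(\psi^*)^{-1}$, and the list-based bookkeeping are assumptions made for the example.

(* Index of the maximum element of a non-empty float list. *)
let argmax xs =
  let pick (i, bi, bx) x = if x > bx then (i + 1, i, x) else (i + 1, bi, bx) in
  let _, best, _ = List.fold_left pick (0, 0, neg_infinity) xs in
  best

(* The (alpha,psi)-UCB choice at time [t]: [invLFPsi] is (psi*)^{-1},
   [uSums] the cumulative rewards and [visits] the visit counts. *)
let ucb_choice ~alpha ~invLFPsi ~t ~uSums ~visits =
  let index u n =
    let n = float_of_int n in
    (u /. n) +. invLFPsi (alpha *. log (float_of_int t) /. n)
  in
  argmax (List.map2 index uSums visits)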
type banditEstimates = {
  t : int;             (* The index of the step. *)
  a : int;             (* The last action taken. *)
  nVisits : int list;  (* The visit counts by arm. *)
  u : float list;      (* The cumulative arm reward observations. *)
}
The inner state of a bandit that maintains estimates of arm means.
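For instance, the mean estimate $\widehat{\mu_i}$ of each arm can be read off this state as $u_i / \mathit{nVisits}_i$; a minimal sketch, assuming the list-valued fields reconstructed above:

(* Empirical mean estimate of each arm from a banditEstimates state. *)
let meanEstimates (b : banditEstimates) =
  List.map2 (fun u n -> u /. float_of_int n) b.u b.nVisits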
module type AlphaPhiUCBParam = sig .. end
Use to instantiate a Bandit from MakeAlphaPhiUCB by giving $\alpha$ and $\phi$.
module MakeAlphaPhiUCB:
The $(\alpha,\psi)$-UCB Bandit for stochastic regret minimization described in [1].
The $\alpha$-UCB algorithm [5] uses $\psi(\lambda) = \lambda^2/8$.
It chooses the action:
$$ \argmax_{i=1,\cdots,K} \left[ \widehat{\mu_i} + \sqrt{ \frac{\alpha \ln t}{2 T_i} } \right] $$
This gives a pseudo-regret bound of:
$$ \bar{R_n} \leq \sum_{i:\Delta_i > 0} \left( \frac{2\alpha}{\Delta_i} \ln n + \frac{\alpha}{\alpha - 2} \right) $$
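To see how this specializes the general statements: for $\psi(\lambda) = \lambda^2/8$ the Legendre-Fenchel transform and its inverse work out to
$$ \psi^*(\varepsilon) = \sup_{\lambda} \left( \lambda\varepsilon - \frac{\lambda^2}{8} \right) = 2\varepsilon^2, \qquad (\psi^*)^{-1}(u) = \sqrt{u/2} $$
so the general index becomes $\widehat{\mu_i} + \sqrt{\alpha \ln t / (2T_i)}$, and since $\psi^*(\Delta_i/2) = \Delta_i^2/2$, the leading factor of the general bound becomes $2\alpha/\Delta_i$.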
module type AlphaUCBParam = sig .. end
Use to instantiate a Bandit from MakeAlphaUCB by giving $\alpha$.
module MakeAlphaUCB:
The $\alpha$-UCB Bandit for stochastic regret minimization described in [1].
The UCB1 algorithm [5] uses $\alpha = 4$.
It chooses the action:
$$ \argmax_{i=1,\cdots,K} \left[ \widehat{\mu_i} + \sqrt{ \frac{2 \ln t}{T_i} } \right] $$
At round $n$, this gives a pseudo-regret bound of:
$$ \bar{R_n} \leq \sum_{i:\Delta_i > 0} \left( \frac{8}{\Delta_i} \ln n + 2 \right) $$
which is the $\alpha$-UCB bound specialized at $\alpha = 4$: $2\alpha/\Delta_i = 8/\Delta_i$ and $\alpha/(\alpha-2) = 2$.
module type KBanditParam = sig .. end
Use to instantiate a Bandit from MakeUCB1.
module MakeUCB1:
The UCB1 Bandit for stochastic regret minimization.
At round $t$, the $\epsilon_t$-Greedy algorithm [5] takes the action $\argmax_{i=1,\cdots,K} \widehat{\mu_i}$ with probability $1-\epsilon_t$ and a uniformly random action with probability $\epsilon_t$.
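A standalone sketch of this choice rule (not the library's internal code), assuming the mean estimates are kept in a float list:

(* Epsilon-greedy arm choice over [k] arms with mean estimates [muHats]. *)
let epsGreedyChoice ~eps ~k muHats =
  if Random.float 1.0 < eps then Random.int k  (* explore uniformly *)
  else                                         (* exploit the empirical best *)
    let pick (i, bi, bx) x = if x > bx then (i + 1, i, x) else (i + 1, bi, bx) in
    let _, best, _ = List.fold_left pick (0, 0, neg_infinity) muHats in
    best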
module type RateBanditParam = sig .. end
Use to instantiate algorithms that need a parametrizable rate.
module MakeParametrizableEpsilonGreedy:
The $\epsilon$-Greedy Bandit with a parametrizable exploration rate.
This uses the exploration rate decay from [5]: $$ \epsilon_t = \min \left\{ 1, \frac{cK}{d^2 t} \right\} $$ where $d > 0$ should be taken as a tight lower bound on $\min_{i:\Delta_i > 0} \Delta_i$ and $c > 0$ is a hyperparameter.
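The schedule itself is a one-liner; a standalone helper:

(* Decaying exploration rate: eps_t = min { 1, cK / (d^2 t) }. *)
let epsilonT ~c ~d ~k t =
  min 1.0 (c *. float_of_int k /. (d *. d *. float_of_int t))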
module type DecayingEpsilonGreedyParam = sig .. end
Use to instantiate a Bandit from MakeDecayingEpsilonGreedy.
module MakeDecayingEpsilonGreedy:
The Epsilon-Greedy Bandit with the decaying exploration rate from [5].
This variant uses a fixed exploration rate $\epsilon$.
module type EpsilonGreedyParam = sig .. end
Use to instantiate a Bandit from MakeEpsilonGreedy.
module MakeEpsilonGreedy:
The Epsilon-Greedy Bandit with a fixed exploration rate.
At round $t$, the EXP3 algorithm [1] draws an arm from a probability distribution $p$ and updates this distribution with a softmax operator:
$$ p_{i,t+1} = \frac{\exp(-\eta_t \widetilde{L}_{i,t})}{\sum_{k=1}^{K} \exp(-\eta_t \widetilde{L}_{k,t})} $$
where $\widetilde{L}_{i,t}$ is the cumulative probability-normalized loss of arm $i$ at time $t$ and $\eta_t$ is the learning rate at time $t$.
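A standalone sketch of this update (not the library's internal code): given the cumulative losses and a rate, recompute the distribution.

(* Softmax distribution over arms from cumulative losses [lTildes]. *)
let exp3Distribution ~eta lTildes =
  let ws = List.map (fun l -> exp (-. eta *. l)) lTildes in
  let z = List.fold_left ( +. ) 0.0 ws in
  List.map (fun w -> w /. z) ws

For example, exp3Distribution ~eta:0.1 [0.; 2.; 1.] yields a distribution that favors the first (least-loss) arm.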
type banditPolicy = {
  t : int;         (* The index of the step $t$. *)
  a : int;         (* The last action taken. *)
  w : float list;  (* The weights of the arms that define the policy. *)
}
The internal state of an Exp3 bandit.
module MakeExp3:
The Exp3 Bandit for adversarial regret minimization with a parametrizable learning rate.
This variant uses the learning rate decay from [1]:
$$ \eta_t = \sqrt{\frac{\ln K}{tK}} $$
and enjoys the pseudo-regret bound: $$ \bar{R_n} \leq 2 \sqrt{nK \ln K} $$
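A standalone helper for this schedule:

(* Decaying Exp3 learning rate: eta_t = sqrt(ln K / (tK)). *)
let etaT ~k t =
  sqrt (log (float_of_int k) /. float_of_int (t * k))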
module MakeDecayingExp3:
The Exp3 Bandit for adversarial regret minimization with a decaying learning rate as per [1].
This variant uses a fixed learning rate $\eta$.
module type FixedExp3Param = sig .. end
Use to instantiate a Bandit from MakeFixedExp3.
module MakeFixedExp3:
The Exp3 Bandit for adversarial regret minimization with a fixed learning rate as per [1].
This variant optimizes for a known horizon $n$ and uses the learning rate [1]:
$$ \eta = \sqrt{\frac{2 \ln K}{nK}} $$
It has the pseudo-regret bound:
$$ \bar{R_n} \leq \sqrt{2nK \ln K} $$
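The corresponding standalone helper:

(* Horizon-optimized Exp3 learning rate: eta = sqrt(2 ln K / (nK)). *)
let etaHorizon ~k ~n =
  sqrt (2.0 *. log (float_of_int k) /. float_of_int (n * k))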
module type HorizonExp3Param = sig .. end
Use to instantiate a Bandit from MakeHorizonExp3.
module MakeHorizonExp3:
The Exp3 Bandit for adversarial regret minimization with a horizon-based learning rate as per [1].
Reward normalization in online stochastic and/or adversarial learning is a hard problem.
While this is well studied in online learning [2] [3] [4], there is no well-studied procedure for bandits yet. The WrapRange functor applies the heuristic solution known as the doubling trick.
The WrapRange functor wraps a bandit algorithm with the doubling trick. This heuristic makes it possible to use a bandit algorithm without knowing the reward range. All rewards are linearly rescaled to a range (initially given by a RangeParam). When a reward is observed outside the range, the bandit algorithm is restarted and the range is doubled in that direction.
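A standalone sketch of the range bookkeeping behind the heuristic (an illustration of the idea, not the library's internal code):

(* Rescale a raw reward relative to the current [l, u] range; a reward
   outside the range doubles the interval in that direction and
   signals a restart. *)
let rescale ~l ~u x = (x -. l) /. (u -. l)

let observe ~l ~u x =
  if x > u then `Restart (l, u +. (u -. l))        (* double upward *)
  else if x < l then `Restart (l -. (u -. l), u)   (* double downward *)
  else `Scaled (rescale ~l ~u x)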
A convenience WrapRange01 is provided for rewards that are initially thought to lie in $\left[0,1\right]$.
module type RangeParam = sig .. end
A reward range.
type rangedAction =
  | Reset of int
  | Action of int
A ranged action: Action a in the normal course of action, Reset a in case the bandit was just restarted.
type 'b rangedBandit = {
  bandit : 'b;  (* The original type of the bandit. *)
  u : float;    (* The upper reward bound. *)
  l : float;    (* The lower reward bound. *)
}
The type of a bandit with a range.
module type RangedBandit = sig .. end
The type of a bandit with reward scaling.
module WrapRange:
The WrapRange functor wraps a bandit algorithm with the doubling trick.
module WrapRange01:
The WrapRange01 functor is a convenience aliasing of WrapRange with an initial "standard" range of $\left[0,1\right]$.
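For instance, the 3-armed UCB1 bandit from the introduction can be wrapped as follows (a sketch assuming WrapRange01 applies directly to a Bandit module):

module RangedUCB1 = WrapRange01 (MakeUCB1 (struct let k = 3 end))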
See test/test.ml for an example of bandit use.
[1] Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Sébastien Bubeck and Nicolò Cesa-Bianchi.
[2] Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, John Duchi, Elad Hazan and Yoram Singer.
[3] Normalized Online Learning, Stéphane Ross, Paul Mineiro and John Langford.
[4] Scale-Free Online Learning, Francesco Orabona and Dávid Pál.
[5] Finite-time Analysis of the Multiarmed Bandit Problem, Peter Auer, Nicolò Cesa-Bianchi and Paul Fischer.