Functor Obandit.MakeFixedExp3

module MakeFixedExp3: functor (P : FixedExp3Param) -> Bandit with type bandit = banditPolicy

The Exp3 Bandit for adversarial regret minimization with a fixed learning rate as per [1].

Parameters:

P : FixedExp3Param

type bandit

The internal data structure of the bandit algorithm.

val initialBandit : bandit

The initial state of the bandit algorithm.

val step : bandit -> float -> int * bandit

step b r advances the bandit game one step from state b, where r is the reward for the last action. The result of this call is the next action, encoded as an integer in $ \{ 0, \cdots , K-1 \} $ where $K$ is the number of actions, and the new state of the bandit. The reward range depends on the bandit algorithm in use. The first reward provided to the algorithm is discarded.
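
The state-threading pattern implied by the signature of step can be sketched as follows. This is a minimal illustration, not code from the library: the Bandit signature is copied locally from the values documented above so that the sketch compiles on its own, and RoundRobin, Run, and play are hypothetical names introduced here. RoundRobin is a stand-in policy that ignores rewards; it is not Exp3.

```ocaml
(* Local copy of the documented signature, so this sketch compiles
   without the Obandit library installed. *)
module type Bandit = sig
  type bandit
  val initialBandit : bandit
  val step : bandit -> float -> int * bandit
end

(* Hypothetical stand-in policy: cycles through k = 3 actions and ignores
   the reward. It is NOT Exp3; it only makes the driver below runnable. *)
module RoundRobin : Bandit = struct
  type bandit = int
  let k = 3
  let initialBandit = 0
  let step b _reward = (b, (b + 1) mod k)
end

(* Hypothetical generic driver: plays [rounds] steps, feeding the reward of
   each chosen action back into the next call to [step]. *)
module Run (B : Bandit) = struct
  let play ~rounds ~(reward : int -> float) =
    let rec loop b r n acc =
      if n = 0 then List.rev acc
      else
        let action, b' = B.step b r in
        loop b' (reward action) (n - 1) (action :: acc)
    in
    (* The first reward is discarded by the algorithm, so 0.0 is arbitrary. *)
    loop B.initialBandit 0.0 rounds []
end

module R = Run (RoundRobin)

(* Reward 1.0 for action 1, 0.0 otherwise; the stand-in policy ignores it. *)
let actions = R.play ~rounds:5 ~reward:(fun a -> if a = 1 then 1.0 else 0.0)
```

With the library available, the same driver could presumably be applied to the module obtained from MakeFixedExp3 instead of RoundRobin, given a parameter module satisfying FixedExp3Param (its fields are not listed in this section, so none are assumed here). Rewards passed to step would then need to fall within the range that algorithm expects.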