module MakeHorizonExp3:
The Exp3 Bandit for adversarial regret minimization with a horizon-based learning rate as per [1].
type bandit
The internal data structure of the bandit algorithm.
val initialBandit : bandit
The initial state of the bandit algorithm.
val step : bandit -> float -> int * bandit
step b r advances the bandit game one step, where b is the current state of
the bandit and r is the reward for the last action. The result of this call is
the next action, encoded as an integer in $ \{ 0, \ldots , K-1 \} $, and the
new state of the bandit. The reward range depends on the bandit algorithm in
use, and the first reward provided to the algorithm is discarded.
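
The calling convention of step can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: SketchExp3 is a hypothetical stand-in that only mirrors the signature documented here, not the library's implementation, and the number of arms (3), the horizon (1000), the learning rate formula, and the reward range [0, 1] are all chosen arbitrarily for the example.

```ocaml
(* Hypothetical stand-in exposing the documented signature.
   This is NOT the library's MakeHorizonExp3; it is an illustrative sketch. *)
module SketchExp3 : sig
  type bandit
  val initialBandit : bandit
  val step : bandit -> float -> int * bandit
end = struct
  let k = 3           (* assumed number of arms K *)
  let horizon = 1000  (* assumed horizon T *)

  (* Assumed horizon-based learning rate: sqrt (2 ln K / (T K)). *)
  let eta = sqrt (2.0 *. log (float_of_int k) /. float_of_int (horizon * k))

  type bandit = {
    sums : float array;          (* importance-weighted reward sums per arm *)
    last : (int * float) option; (* last action and its sampling probability *)
  }

  let initialBandit = { sums = Array.make k 0.0; last = None }

  (* Exponential weights, shifted by the maximum for numerical stability. *)
  let probabilities sums =
    let m = Array.fold_left max neg_infinity sums in
    let e = Array.map (fun x -> exp (eta *. (x -. m))) sums in
    let z = Array.fold_left ( +. ) 0.0 e in
    Array.map (fun x -> x /. z) e

  (* Draw an index in 0 .. K-1 according to the distribution p. *)
  let sample p =
    let u = Random.float 1.0 in
    let rec go i acc =
      if i = Array.length p - 1 then i
      else
        let acc = acc +. p.(i) in
        if u < acc then i else go (i + 1) acc
    in
    go 0 0.0

  let step b r =
    let sums = Array.copy b.sums in
    (match b.last with
     | None -> ()  (* no previous action: the first reward is discarded *)
     | Some (a, pa) -> sums.(a) <- sums.(a) +. (r /. pa));
    let p = probabilities sums in
    let a = sample p in
    (a, { sums; last = Some (a, p.(a)) })
end

(* Play [rounds] steps against Bernoulli arms with the given means and
   return how often each arm was chosen. Rewards are assumed to be in [0, 1]. *)
let play rounds means =
  let counts = Array.make (Array.length means) 0 in
  let rec loop t b r =
    if t > 0 then begin
      let a, b' = SketchExp3.step b r in
      counts.(a) <- counts.(a) + 1;
      let r' = if Random.float 1.0 < means.(a) then 1.0 else 0.0 in
      loop (t - 1) b' r'
    end
  in
  (* The reward passed on the first call is a placeholder; it is ignored. *)
  loop rounds SketchExp3.initialBandit 0.0;
  counts

let counts =
  Random.init 42;
  play 1000 [| 0.2; 0.5; 0.8 |]

let () =
  Array.iteri (fun i c -> Printf.printf "arm %d chosen %d times\n" i c) counts
```

The driver threads the state returned by each call into the next one and feeds back the reward observed for the action just returned, which is the usage pattern the signature bandit -> float -> int * bandit suggests. With the real module, the placeholder first reward and the reward scaling would need to follow whatever range that algorithm expects.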