
\section{Model-Free Active Exploration Algorithms}\label{sec:obpi}

In this section we present MF-BPI, a model-free exploration algorithm that leverages the optimal allocations obtained from the previously derived upper bound on the sample complexity lower bound. We first present an upper bound $\tilde{U}(\omega)$ of $U(\omega)$ that admits a closed-form optimal allocation (an idea previously proposed in \cite{al2021adaptive}).

\begin{proposition}
Assume that $\phi$ has a unique optimal policy $\pi^{\star}$. For all $\omega\in\Delta(S\times A)$, we have
\begin{equation}
U(\omega)\leq\tilde{U}(\omega)\coloneqq\max_{s,a\neq\pi^{\star}(s)}\frac{H(s,a)}{\omega(s,a)}+\frac{H}{\min_{s^{\prime}}\omega(s^{\prime},\pi^{\star}(s^{\prime}))},
\end{equation}
with $H(s,a)\coloneqq\frac{2+8\varphi^{2}M_{sa}^{k}[V^{\star}]^{2^{1-k}}}{\Delta(s,a)^{2}}$ and $H\coloneqq\frac{\max_{s^{\prime}}C(s^{\prime})(1+\gamma)^{2}}{\deltamin^{2}(1-\gamma)^{2}}$. The minimizer $\tilde{\omega}^{\star}\coloneqq\arg\inf_{\omega}\tilde{U}(\omega)$ satisfies $\tilde{\omega}^{\star}(s,a)\propto H(s,a)$ for $a\neq\pi^{\star}(s)$ and $\tilde{\omega}^{\star}(s,\pi^{\star}(s))\propto\sqrt{H\sum_{s,a\neq\pi^{\star}(s)}H(s,a)/|S|}$ otherwise.
\end{proposition}

In the MF-BPI algorithm, we estimate the gaps $\Delta(s,a)$ and $M_{sa}^{k}[V^{\star}]$ for a fixed small value of $k$ (we explain later how to do this in a model-free manner) and compute the corresponding allocation $\tilde{\omega}^{\star}$. This allocation drives the exploration under MF-BPI. This design approach raises two issues:

\textbf{(1) Uniform $k$ and regularization.} It is impractical to estimate $M_{sa}^{k}[V^{\star}]$ for multiple values of $k$. Instead, we fix a small value of $k$ (e.g., $k=1$ or $k=2$) for all state-action pairs (refer to the previous section for a discussion of this choice). Then, to avoid excessively small values of the gaps in the denominator, we regularize the allocation $\tilde{\omega}^{\star}$ by replacing, in the expression of $H(s,a)$ (resp. $H$), $\Delta(s,a)$ (resp. $\Delta_{\min}$) by $(\Delta(s,a)+\lambda)$ (resp. $(\Delta_{\min}+\lambda)$) for some $\lambda>0$.

\textbf{(2) Handling parametric uncertainty via bootstrapping.} The quantities $\Delta(s,a)$ and $M_{sa}^{k}[V^{\star}]$ required to compute $\tilde{\omega}^{\star}$ remain unknown during training, so we adopt the Certainty Equivalence principle and substitute the current estimates of these quantities when computing the exploration strategy. Doing so introduces parametric uncertainty into these terms that the allocation $\tilde{\omega}^{\star}$ does not take into account. The traditional way to deal with this uncertainty (used, e.g., in \cite{al2021adaptive,marjani2021navigating}) is to use $\epsilon$-soft exploration policies, which guarantee that all state-action pairs are visited infinitely often. This ensures that the estimation errors vanish as time grows large. In practice, we find this type of forced exploration inefficient. In MF-BPI, we opt for a bootstrapping approach to manage parametric uncertainties, which can augment the traditional forced exploration step and leads to more principled exploration.
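As an illustration of the closed-form allocation and its regularization, the following Python sketch computes $\tilde{\omega}^{\star}$ from given estimates of $Q^{\star}$ and $M_{sa}^{k}[V^{\star}]$. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name and array layout are hypothetical, \texttt{varphi} stands for the constant $\varphi$ (whose value is defined outside this section), and $C(s^{\prime})$ is taken to be $4\max(1,4\gamma^{2}\varphi^{2}M_{s^{\prime}\pi^{\star}(s^{\prime})}^{k}[V^{\star}]^{2^{1-k}})$, which is inferred from the expression of $H_{t}$ used in the deep variant.

```python
import numpy as np

def closed_form_allocation(Q, M, gamma, lam, k, varphi):
    """Regularized closed-form allocation over all (s, a) pairs.

    Q, M: arrays of shape (S, A) with A >= 2, estimates of Q* and M^k_{sa}[V*].
    Returns an array of shape (S, A) that sums to 1.
    """
    S, A = Q.shape
    states = np.arange(S)
    pi_star = Q.argmax(axis=1)                    # greedy policy
    gaps = Q.max(axis=1, keepdims=True) - Q       # Delta(s, a) >= 0
    subopt = np.ones((S, A), dtype=bool)
    subopt[states, pi_star] = False               # mask of a != pi*(s)
    delta_min = gaps[subopt].min()

    # M^{2^{1-k}}; clipping at 0 is a numerical guard (assumption).
    M_pow = np.maximum(M, 0.0) ** (2.0 ** (1 - k))
    H_sa = (2.0 + 8.0 * varphi**2 * M_pow) / (gaps + lam) ** 2

    # C(s') as inferred from the deep variant (assumption, see lead-in).
    C = 4.0 * np.maximum(1.0, 4.0 * gamma**2 * varphi**2 * M_pow[states, pi_star])
    H = C.max() * (1 + gamma) ** 2 / ((delta_min + lam) ** 2 * (1 - gamma) ** 2)

    omega = np.where(subopt, H_sa, 0.0)           # proportional to H(s, a)
    omega[states, pi_star] = np.sqrt(H * omega.sum() / S)
    return omega / omega.sum()
```

With $\lambda>0$ every entry of the result is strictly positive, and among sub-optimal actions a smaller gap yields a larger share of the exploration budget.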

\subsection{Exploration in tabular MDPs}

\begin{algorithm}[t]
\caption{Bootstrapped MF-BPI (Bootstrapped Model Free Best Policy Identification)}
\begin{algorithmic}[1]
\REQUIRE Parameters $(\lambda,k,p)$; ensemble size $B$; learning rates $\{(\alpha_{t},\beta_{t})\}_{t}$.
\STATE Initialize $Q_{1,b}(s,a)\sim{\cal U}([0,1/(1-\gamma)])$ and $M_{1,b}(s,a)\sim{\cal U}([0,1/(1-\gamma)^{2^{k}}])$ for all $(s,a)\in S\times A$ and $b\in[B]$.
\FOR{$t=0,1,2,\dots$}
\STATE Bootstrap a sample $(\hat{Q}_{t},\hat{M}_{t})$ from the ensemble, and compute the allocation $\omega^{(t)}$ using \cref{corollary:upper_bound_new_bound}. Sample $a_{t}\sim\omega^{(t)}(s_{t},\cdot)$; observe $(r_{t},s_{t+1})\sim q(\cdot|s_{t},a_{t})\otimes P(\cdot|s_{t},a_{t})$.
\FOR{$b=1,\dots,B$}
\STATE With probability $p$, using the experience $(s_{t},a_{t},r_{t},s_{t+1})$, update $Q_{t,b}$ and $M_{t,b}$ using \cref{eq:stochastic_approximation_step_qvalues,eq:stochastic_approximation_step_mvalues}.
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}

The pseudo-code of MF-BPI for tabular MDPs is given in the Bootstrapped MF-BPI algorithm listing. In round $t$, MF-BPI explores the MDP using the allocation $\omega^{(t)}$, which estimates $\tilde{\omega}^{\star}$. To compute this allocation, we use \cref{corollary:upper_bound_new_bound} and need (i) the sub-optimality gaps $\Delta(s,a)$, which can be easily derived from the $Q$-function; and (ii) the $2^{k}$-th moment $M_{sa}^{k}[V^{\star}]$, which can be learnt by means of stochastic approximation. In fact, for any Markovian policy $\pi$ and pair $(s,a)$ we have
\begin{equation*}
M_{sa}^{k}[V_{\phi}^{\pi}]=\frac{1}{\gamma^{2^{k}}}\mathbb{E}_{s^{\prime}\sim P(\cdot|s,a)}\left[\delta^{\pi}(s,a,s^{\prime})^{2^{k}}\right],
\end{equation*}
where $\delta^{\pi}(s,a,s^{\prime})=r(s,a)+\gamma\mathbb{E}_{a^{\prime}\sim\pi(\cdot|s^{\prime})}[Q^{\pi}(s^{\prime},a^{\prime})]-Q^{\pi}(s,a)$ is a variant of the TD-error.
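The moment identity can be checked in two steps. The following is a sketch that assumes, as the phrase ``$2^{k}$-th moment'' suggests, that $M_{sa}^{k}[V]$ denotes the centred moment $\mathbb{E}_{s^{\prime}\sim P(\cdot|s,a)}[(V(s^{\prime})-\mathbb{E}_{s^{\prime\prime}\sim P(\cdot|s,a)}[V(s^{\prime\prime})])^{2^{k}}]$ (its definition lies outside this section), and that rewards are deterministic given $(s,a)$. Writing $V^{\pi}(s^{\prime})=\mathbb{E}_{a^{\prime}\sim\pi(\cdot|s^{\prime})}[Q^{\pi}(s^{\prime},a^{\prime})]$ and using the Bellman equation $Q^{\pi}(s,a)=r(s,a)+\gamma\mathbb{E}_{s^{\prime\prime}\sim P(\cdot|s,a)}[V^{\pi}(s^{\prime\prime})]$, we get
\begin{align*}
\delta^{\pi}(s,a,s^{\prime}) &= r(s,a)+\gamma V^{\pi}(s^{\prime})-Q^{\pi}(s,a) = \gamma\left(V^{\pi}(s^{\prime})-\mathbb{E}_{s^{\prime\prime}\sim P(\cdot|s,a)}[V^{\pi}(s^{\prime\prime})]\right),\\
\mathbb{E}_{s^{\prime}\sim P(\cdot|s,a)}\left[\delta^{\pi}(s,a,s^{\prime})^{2^{k}}\right] &= \gamma^{2^{k}}\,\mathbb{E}_{s^{\prime}\sim P(\cdot|s,a)}\left[\left(V^{\pi}(s^{\prime})-\mathbb{E}_{s^{\prime\prime}\sim P(\cdot|s,a)}[V^{\pi}(s^{\prime\prime})]\right)^{2^{k}}\right] = \gamma^{2^{k}}M_{sa}^{k}[V^{\pi}].
\end{align*}
Dividing both sides by $\gamma^{2^{k}}$ recovers the stated expression, which is why the moment can be estimated from sampled TD-errors alone, without a model of $P$.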
MF-BPI then uses an asynchronous two-timescale stochastic approximation algorithm to learn $Q^{\star}$ and $M_{sa}^{k}[V^{\star}]$:
\begin{align}
Q_{t+1}(s_{t},a_{t}) &= Q_{t}(s_{t},a_{t})+\alpha_{t}(s_{t},a_{t})\left(r_{t}+\gamma\max_{a}Q_{t}(s_{t+1},a)-Q_{t}(s_{t},a_{t})\right), \label{eq:stochastic_approximation_step_qvalues}\\
M_{t+1}(s_{t},a_{t}) &= M_{t}(s_{t},a_{t})+\beta_{t}(s_{t},a_{t})\left(\left(\delta_{t}^{\prime}/\gamma\right)^{2^{k}}-M_{t}(s_{t},a_{t})\right), \label{eq:stochastic_approximation_step_mvalues}
\end{align}
where $\delta_{t}^{\prime}=r_{t}+\gamma\max_{a}Q_{t+1}(s_{t+1},a)-Q_{t+1}(s_{t},a_{t})$, and $\{(\alpha_{t},\beta_{t})\}_{t\geq 0}$ are learning rates satisfying $\sum_{t\geq 0}\alpha_{t}(s,a)=\sum_{t\geq 0}\beta_{t}(s,a)=\infty$, $\sum_{t\geq 0}(\alpha_{t}(s,a)^{2}+\beta_{t}(s,a)^{2})<\infty$, and $\frac{\alpha_{t}(s,a)}{\beta_{t}(s,a)}\to 0$.

MF-BPI uses bootstrapping to handle parametric uncertainty. We maintain an ensemble of $B$ pairs of $(Q,M)$-values, from which we sample $(\hat{Q}_{t},\hat{M}_{t})$ at time $t$. This sample is generated by drawing a uniform random variable $\xi\sim{\cal U}([0,1])$ and setting, for each $(s,a)$, $\hat{Q}_{t}(s,a)={\rm Quantile}_{\xi}(\{Q_{t,1}(s,a),\dots,Q_{t,B}(s,a)\})$ (assuming linear interpolation). This method is akin to sampling from the parametric uncertainty distribution (we perform the same operation to compute $\hat{M}_{t}$). The sample is used to compute the allocation $\omega^{(t)}$ using \cref{corollary:upper_bound_new_bound} by setting $\Delta_{t}(s,a)=\max_{a^{\prime}}\hat{Q}_{t}(s,a^{\prime})-\hat{Q}_{t}(s,a)$, $\pi_{t}^{\star}(s)=\argmax_{a}\hat{Q}_{t}(s,a)$ and $\deltaminestimatet{t}=\min_{s,a\neq\pi_{t}^{\star}(s)}\Delta_{t}(s,a)$. Note that the allocation $\omega^{(t)}$ can be mixed with a uniform policy to guarantee asymptotic convergence of the estimates.

Upon observing an experience, MF-BPI updates each member of the ensemble with probability $p$ using this new experience. The parameter $p$ tunes the rate at which the models are updated, similar to sampling with replacement, speeding up the learning process. Selecting a high value for $p$ compromises the estimation of the parametric uncertainty, whereas choosing a low value may slow down the learning process.

\textbf{Exploration without bootstrapping?} To illustrate the need for our bootstrapping approach, we tried to use the allocation $\omega^{(t)}$ mixed with a uniform allocation.
In \cref{fig:forced_generative_performance}, we show the results on Riverswim-like environments with $5$ states. While forced exploration ensures infinite visits to all state-action pairs, this guarantee only holds asymptotically. As a result, the allocation mainly focuses on the current MDP estimate, neglecting other plausible MDPs that could produce the same data. This makes forced exploration too slow to converge in practice, and therefore inadequate for rapid policy learning. These results highlight the need to account for the uncertainty in $(Q,M)$ when computing the allocation.
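The two-timescale updates and the quantile-based ensemble sampling described earlier in this subsection can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names and the array layout $(B,S,A)$ are hypothetical, the learning rates are passed in as scalars for the visited pair, and the same $\xi$ is reused for $\hat{Q}_{t}$ and $\hat{M}_{t}$ (the text only says that the same operation is performed for both).

```python
import numpy as np

def sample_from_ensemble(Q_ens, M_ens, rng):
    """Draw (Q_hat, M_hat) as the xi-quantile across the B ensemble members.

    Q_ens, M_ens: arrays of shape (B, S, A). A single xi ~ U([0, 1]) is shared
    by all (s, a) pairs; np.quantile interpolates linearly by default.
    """
    xi = rng.uniform(0.0, 1.0)
    return np.quantile(Q_ens, xi, axis=0), np.quantile(M_ens, xi, axis=0)

def update_ensemble(Q_ens, M_ens, s, a, r, s_next, gamma, k, alpha, beta, p, rng):
    """Update each member in place with probability p using one experience."""
    for b in range(Q_ens.shape[0]):
        if rng.uniform() >= p:
            continue  # this member skips the current experience
        Q, M = Q_ens[b], M_ens[b]  # views, so the updates are in place
        # Q-learning step.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        # delta' is evaluated with the freshly updated Q-values.
        delta = r + gamma * Q[s_next].max() - Q[s, a]
        # Running estimate of the 2^k-th moment of delta' / gamma.
        M[s, a] += beta * ((delta / gamma) ** (2 ** k) - M[s, a])
```

A typical round would call \texttt{sample\_from\_ensemble}, compute the allocation from the sampled values, act, and then call \texttt{update\_ensemble} with the observed transition.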

\begin{figure}
\centering
\includegraphics{figures/riverswim/forced_generative.pdf}
\caption{Forced exploration example with $5$ states. We explore according to $\omega^{(t)}(s_{t},a)=(1-\epsilon_{t})\frac{\tilde{\omega}_{t}^{\star}(s_{t},a)}{\sum_{a^{\prime}}\tilde{\omega}_{t}^{\star}(s_{t},a^{\prime})}+\epsilon_{t}\frac{1}{|A|}$, mixing the estimate of the allocation $\tilde{\omega}^{\star}$ from \cref{corollary:upper_bound_new_bound} with a uniform policy, with $\epsilon_{t}=\max(10^{-3},1/N_{t}(s_{t}))$, where $N_{t}(s)$ indicates the number of times the agent visited state $s$ up to time $t$. Shade indicates the $95\%$ confidence interval.}
\label{fig:forced_generative_performance}
\end{figure}

\begin{algorithm}[b]
\caption{DBMF-BPI (Deep Bootstrapped Model Free BPI)}
\label{algo:dbomfbpi}
\begin{algorithmic}[1]
\REQUIRE Parameters $(\lambda,k)$; ensemble size $B$; exploration rate $\{\epsilon_{t}\}_{t}$; estimate $\deltaminestimatet{0}$; mask probability $p$.
\STATE Initialize replay buffer ${\cal D}$, networks $Q_{\theta_{b}},M_{\tau_{b}}$ and targets $Q_{\theta^{\prime}_{b}}$ for all $b\in[B]$.
\FOR{$t=0,1,2,\dots$}
\STATE Sampling step.
\STATE Compute allocation $\omega^{(t)}\leftarrow{\tt ComputeAllocation}(s_{t},\{Q_{\theta_{b}},M_{\tau_{b}}\}_{b\in[B]},\deltaminestimatet{t},\gamma,\lambda,k,\epsilon_{t})$.
\STATE Sample $a_{t}\sim\omega^{(t)}(s_{t},\cdot)$ and observe $(r_{t},s_{t+1})\sim q(\cdot|s_{t},a_{t})\otimes P(\cdot|s_{t},a_{t})$.
\STATE Add transition $z_{t}=(s_{t},a_{t},r_{t},s_{t+1})$ to the replay buffer ${\cal D}$.
\STATE Training step.
\STATE Sample a batch ${\cal B}$ from ${\cal D}$, and with probability $p$ add the $i$-th experience in ${\cal B}$ to a sub-batch ${\cal B}_{b}$, $\forall b\in[B]$. Update the $(Q,M)$-values of the $b$-th member in the ensemble using ${\cal B}_{b}$: $\{Q_{\theta_{b}},Q_{\theta_{b}^{\prime}},M_{\tau_{b}}\}_{b\in[B]}\leftarrow{\tt Training}(\{{\cal B}_{b},Q_{\theta_{b}},Q_{\theta_{b}^{\prime}},M_{\tau_{b}}\}_{b\in[B]})$.
\STATE Update estimate $\deltaminestimatet{t+1}\leftarrow{\tt EstimateMinimumGap}(\deltaminestimatet{t},{\cal B},\{Q_{\theta_{b}}\}_{b\in[B]})$.
\ENDFOR
\end{algorithmic}
\end{algorithm}
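The masking used in the training step of DBMF-BPI can be illustrated with a short Python sketch. It assumes that the inclusion of each experience is drawn independently for every ensemble member, which is how we read the pseudo-code; the function name and the boolean-mask representation are hypothetical.

```python
import numpy as np

def bootstrap_sub_batches(batch_size, ensemble_size, p, rng):
    """Return a boolean mask of shape (ensemble_size, batch_size).

    mask[b, i] is True when experience i of the batch is added to the
    sub-batch of member b, which happens independently with probability p.
    """
    return rng.uniform(size=(ensemble_size, batch_size)) < p
```

The indices of the sub-batch of member $b$ are then given by \texttt{np.flatnonzero(mask[b])}. With $p=1$ all members train on identical data, so the spread of the ensemble reflects only the differences in initialization.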

\subsection{Extension to Deep Reinforcement Learning}

To extend bootstrapped MF-BPI to continuous MDPs, we propose DBMF-BPI (see \cref{algo:dbomfbpi}, or Appendix B). DBMF-BPI uses the mechanism of prior networks from BSP \cite{osband2018randomized} (bootstrapping with additive prior) to account for uncertainty that does not originate from the observed data. As before, we keep an ensemble $\{Q_{\theta_{1}},\dots,Q_{\theta_{B}}\}$ of $Q$-values (with their target networks) and an ensemble $\{M_{\tau_{1}},\dots,M_{\tau_{B}}\}$ of $M$-values, as well as their prior networks. We use the same procedure as in the tabular case to compute $(\hat{Q}_{t},\hat{M}_{t})$ at time $t$, except that we sample $\xi\sim{\cal U}([0,1])$ only every $T_{s}\propto(1-\gamma)^{-1}$ training steps (or at the end of an episode) to make the training procedure more stable. The quantity $\hat{Q}_{t}$ is used to compute $\pi_{t}^{\star}(s_{t})$ and $\Delta_{t}(s_{t},a)$. We estimate $\deltaminestimatet{t}$ via stochastic approximation, using as a target the minimum gap from the last batch of transitions sampled from the replay buffer.

To derive the exploration strategy, we compute $H_{t}(s_{t},a)=\frac{2+8\varphi^{2}\hat{M}_{t}(s_{t},a)^{2^{1-k}}}{(\Delta_{t}(s_{t},a)+\lambda)^{2}}$ and $H_{t}=\frac{4(1+\gamma)^{2}\max(1,4\gamma^{2}\varphi^{2}\hat{M}_{t}(s_{t},\pi_{t}^{\star}(s_{t}))^{2^{1-k}})}{(\deltaminestimatet{t}+\lambda)^{2}(1-\gamma)^{2}}$. Next, we set the allocation $\omega_{o}^{(t)}$ as follows: $\omega_{o}^{(t)}(s_{t},a)=H_{t}(s_{t},a)$ if $a\neq\pi_{t}^{\star}(s_{t})$ and $\omega_{o}^{(t)}(s_{t},a)=\sqrt{H_{t}\sum_{a\neq\pi_{t}^{\star}(s_{t})}H_{t}(s_{t},a)}$ otherwise. Finally, we obtain an $\epsilon_{t}$-soft exploration policy $\omega^{(t)}(s_{t},\cdot)$ by mixing $\omega_{o}^{(t)}(s_{t},\cdot)/\sum_{a}\omega_{o}^{(t)}(s_{t},a)$ with a uniform distribution (using an exploration parameter $\epsilon_{t}$).
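The per-state computation of the exploration policy described above can be sketched in Python as follows. This is an illustrative reading of the formulas rather than the authors' {\tt ComputeAllocation} routine: the function signature is hypothetical, the inputs are the sampled values $\hat{Q}_{t}(s_{t},\cdot)$ and $\hat{M}_{t}(s_{t},\cdot)$ as plain arrays, \texttt{varphi} stands for the constant $\varphi$ defined outside this section, and clipping $\hat{M}_{t}$ at zero is an added numerical guard against negative network outputs.

```python
import numpy as np

def compute_allocation(q_hat, m_hat, delta_min, gamma, lam, k, eps, varphi):
    """Return the eps-soft exploration policy over actions at the state s_t.

    q_hat, m_hat: arrays of shape (A,) holding Q_hat_t(s_t, .) and M_hat_t(s_t, .).
    delta_min: current estimate of the minimum gap.
    """
    n_actions = q_hat.size
    a_star = int(q_hat.argmax())                  # pi_t*(s_t)
    gaps = q_hat.max() - q_hat                    # Delta_t(s_t, a)
    m_pow = np.maximum(m_hat, 0.0) ** (2.0 ** (1 - k))

    # H_t(s_t, a) with the regularized gap in the denominator.
    H_sa = (2.0 + 8.0 * varphi**2 * m_pow) / (gaps + lam) ** 2
    # H_t, based on the moment at the greedy action and the minimum gap.
    H = (4.0 * (1 + gamma) ** 2
         * max(1.0, 4.0 * gamma**2 * varphi**2 * m_pow[a_star])
         / ((delta_min + lam) ** 2 * (1 - gamma) ** 2))

    subopt = np.arange(n_actions) != a_star
    omega_o = np.where(subopt, H_sa, 0.0)         # sub-optimal actions
    omega_o[a_star] = np.sqrt(H * omega_o.sum())  # greedy action
    # Mix the normalized allocation with the uniform distribution.
    return (1.0 - eps) * omega_o / omega_o.sum() + eps / n_actions
```

Every action receives probability at least $\epsilon_{t}/|A|$, and the action is then drawn with, e.g., \texttt{rng.choice(n\_actions, p=policy)}.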