\section{Model-Free Active Exploration Algorithms}\label{sec:obpi}
In this section we present MF-BPI, a model-free exploration algorithm that leverages the optimal allocations obtained through the previously derived upper bound of the sample complexity lower bound. We first present an upper bound of , so that the optimal allocation can be derived in closed form (an idea previously proposed in \cite{al2021adaptive}).
Assume that has a unique optimal policy . For all , we have:
with and . The minimizer satisfies for and otherwise.

In the MF-BPI algorithm, we estimate the gaps and for a fixed small value of (we later explain how to do this in a model-free manner) and compute the corresponding allocation . This allocation drives the exploration under MF-BPI. This design approach raises two issues.

(1) Uniform and regularization. It is impractical to estimate for multiple values of . Instead, we fix a small value of (e.g., or ) for all state-action pairs (refer to the previous section for a discussion on this choice). Then, to avoid excessively small values of the gaps in the denominator, we regularize the allocation by replacing, in the expression of (resp. ), (resp. ) by (resp. ) for some .

(2) Handling parametric uncertainty via bootstrapping. The quantities and required to compute remain unknown during training, so we adopt the Certainty Equivalence principle and substitute the current estimates of these quantities to compute the exploration strategy. Doing so introduces parametric uncertainty into these terms that is not taken into account by the allocation . The traditional way to deal with this uncertainty, used, e.g., in \cite{al2021adaptive,marjani2021navigating}, is to use $\epsilon$-soft exploration policies, which guarantee that all state-action pairs are visited infinitely often. This ensures that the estimation errors vanish as time grows large. In practice, we find this type of forced exploration inefficient. In MF-BPI, we instead opt for a bootstrapping approach to manage parametric uncertainties, which can augment the traditional forced exploration step and leads to more principled exploration.
\subsection{Exploration in tabular MDPs}
\begin{algorithm}[t]
\begin{algorithmic}[1]
\REQUIRE Parameters ; ensemble size ; learning rates .
\STATE Initialize and for all and .
\FOR{}
\STATE Bootstrap a sample from the ensemble, and compute the allocation using \cref{corollary:upper_bound_new_bound}. Sample ; observe .
\FOR{}
\STATE With probability , using the experience , update and using \cref{eq:stochastic_approximation_step_qvalues,eq:stochastic_approximation_step_mvalues}.
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}
The pseudo-code of MF-BPI for tabular MDPs is presented in Algorithm \thesubsection. In round , MF-BPI explores the MDP using the allocation estimating . To compute this allocation, we use \cref{corollary:upper_bound_new_bound} and need (i) the sub-optimality gaps , which can be easily derived from the $Q$-function; (ii) the $2^k$-th moment , which can always be learnt by means of stochastic approximation. In fact, for any Markovian policy and pair we have
where is a variant of the TD-error.
MF-BPI then uses an asynchronous two-timescale stochastic approximation algorithm to learn and ,
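\begin{align}
Q_{t+1}(s_t,a_t) &= Q_t(s_t,a_t) + \alpha_t(s_t,a_t)\left(r_t+\gamma\max_a Q_t(s_{t+1},a)-Q_t(s_t,a_t)\right), \label{eq:stochastic_approximation_step_qvalues}\\
M_{t+1}(s_t,a_t) &= M_t(s_t,a_t) + \beta_t(s_t,a_t)\left((\delta_t'/\gamma)^{2^k} - M_t(s_t,a_t)\right), \label{eq:stochastic_approximation_step_mvalues}
\end{align}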
where , and are learning rates satisfying , and .
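The two updates above can be illustrated with a minimal tabular sketch. It is not the authors' implementation: the function name \texttt{mf\_bpi\_update}, the array layout (\texttt{Q} and \texttt{M} indexed as \texttt{[state, action]}), and in particular the choice of evaluating $\delta_t'$ as the TD-error recomputed with the freshly updated $Q$ are assumptions made for illustration. The learning rates are passed in by the caller, so any schedule satisfying the stated conditions can be used.

```python
import numpy as np

def mf_bpi_update(Q, M, s, a, r, s_next, alpha, beta, gamma, k):
    """One asynchronous update of Q and M at the visited pair (s, a).

    Q, M : float arrays of shape (num_states, num_actions), modified in place.
    alpha, beta : learning rates for the Q and M updates respectively.
    """
    # Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a').
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

    # ASSUMPTION: delta' is the TD-error re-evaluated with the updated Q.
    delta_prime = r + gamma * Q[s_next].max() - Q[s, a]

    # M step: track the running average of (delta'/gamma)^(2^k).
    M[s, a] += beta * ((delta_prime / gamma) ** (2 ** k) - M[s, a])
    return Q, M
```

Only the visited entry $(s_t,a_t)$ changes at each step, which is what makes the scheme asynchronous.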
MF-BPI uses bootstrapping to handle parametric uncertainty. We maintain an ensemble of $Q$-values, with members, from which we sample at time . This sample is generated by sampling a uniform random variable and, for each set (assuming a linear interpolation). This method is akin to sampling from the parametric uncertainty distribution (we perform the same operation to compute ). The sample is then used to compute the allocation using \cref{corollary:upper_bound_new_bound} by setting , and . Note that the allocation can be mixed with a uniform policy to guarantee asymptotic convergence of the estimates. Upon observing an experience, with probability , MF-BPI updates a member of the ensemble using this new experience. tunes the rate at which the models are updated, similar to sampling with replacement, speeding up the learning process. Selecting a high value for compromises the estimation of the parametric uncertainty, whereas choosing a low value may slow down the learning process.
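One possible reading of the sampling step is sketched below, under stated assumptions. We assume that the uniform random variable is used as a quantile level, and that the bootstrap sample at each state-action pair is the corresponding quantile of the ensemble members, with linear interpolation between them (the default behaviour of \texttt{numpy.quantile}). The helper names \texttt{sample\_from\_ensemble} and \texttt{masked\_members}, and the ensemble layout \texttt{(members, states, actions)}, are hypothetical.

```python
import numpy as np

def sample_from_ensemble(ensemble, rng):
    """Draw one bootstrap sample from an ensemble of tabular estimates.

    ensemble : array of shape (num_members, num_states, num_actions).
    ASSUMPTION: a single uniform level xi is drawn and the xi-quantile of the
    members is taken entry-wise, with linear interpolation between members.
    """
    xi = rng.uniform()
    return np.quantile(ensemble, xi, axis=0)

def masked_members(num_members, p, rng):
    """Indices of ensemble members updated with the new experience.

    Each member is selected independently with probability p.
    """
    return np.flatnonzero(rng.uniform(size=num_members) < p)
```

Each entry of the sample lies between the smallest and largest ensemble estimate for that pair, so the spread of the ensemble directly controls how far the sampled values can deviate from the ensemble median.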
Exploration without bootstrapping? To illustrate the need for our bootstrapping approach, we tried to use the allocation mixed with a uniform allocation. In \cref{fig:forced_generative_performance}, we show the results on Riverswim-like environments with states. While forced exploration ensures infinite visits to all state-action pairs, this guarantee only holds asymptotically. As a result, the allocation mainly focuses on the current MDP estimate, neglecting other plausible MDPs that could produce the same data. This makes the forced exploration approach too slow to converge in practice, and hence ill-suited to rapid policy learning. These results highlight the need to account for the uncertainty in when computing the allocation.
\begin{algorithm}[b]
\begin{algorithmic}[1]
\REQUIRE Parameters ; ensemble size ; exploration rate ; estimate ; mask probability .
\STATE Initialize replay buffer , networks and targets for all .
\FOR{}
\STATE Sampling step.
\STATE Compute allocation .
\STATE Sample and observe .
\STATE Add transition to the replay buffer .
\STATE Training step.
\STATE Sample a batch from , and with probability add the experience in to a sub-batch , . Update the -values of the member in the ensemble using : .
\STATE Update estimate .
\ENDFOR
\end{algorithmic}
\end{algorithm}
\subsection{Extension to Deep Reinforcement Learning}
To extend bootstrapped MF-BPI to continuous MDPs, we propose DBMF-BPI (see \cref{algo:dbomfbpi}, or Appendix B). DBMF-BPI uses the mechanism of prior networks from BSP (bootstrapping with additive prior) \cite{osband2018randomized} to account for uncertainty that does not originate from the observed data. As before, we keep an ensemble of $Q$-values (with their target networks) and an ensemble of $M$-values, as well as their prior networks. We use the same procedure as in the tabular case to compute at time , except that we sample every training steps (or at the end of an episode) to make the training procedure more stable. The quantity is used to compute and . We estimate via stochastic approximation, with the minimum gap from the last batch of transitions sampled from the replay buffer serving as a target. To derive the exploration strategy, we compute and . Next, we set the allocation as follows: if and otherwise. Finally, we obtain an $\epsilon$-soft exploration policy by mixing with a uniform distribution (using an exploration parameter ).
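The final mixing step can be sketched as follows. This is an illustration under assumptions rather than the exact procedure: we assume the allocation at the current state is a non-negative vector over a finite action set, that it is normalized over actions to obtain a policy, and that the mixture weight is the exploration parameter. The function name \texttt{epsilon\_soft\_policy} is hypothetical.

```python
import numpy as np

def epsilon_soft_policy(omega_s, epsilon):
    """Mix a state's allocation with the uniform distribution over actions.

    omega_s : non-negative allocation over actions at the current state,
              with at least one strictly positive entry.
    ASSUMPTION: the allocation is normalized over actions before mixing.
    """
    omega_s = np.asarray(omega_s, dtype=float)
    policy = omega_s / omega_s.sum()
    uniform = np.full_like(policy, 1.0 / policy.size)
    return (1.0 - epsilon) * policy + epsilon * uniform
```

With this mixture, every action is selected with probability at least the exploration parameter divided by the number of actions, which is the property that forced exploration relies on.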