\section{Model-Free Active Exploration Algorithms}\label{sec:obpi}

In this section we present MF-BPI, a model-free exploration algorithm that leverages the optimal allocations obtained through the previously derived upper bound of the sample complexity lower bound. We first present an upper bound $\tilde{U}(\omega)$ of $U(\omega)$, so that it is possible to derive a closed-form solution of the optimal allocation (an idea previously proposed in \cite{al2021adaptive}).

\begin{proposition}\label{corollary:upper_bound_new_bound}
Assume that $\phi$ has a unique optimal policy $\pi^{\star}$. For all $\omega\in\Delta(S\times A)$, we have:
\begin{equation*}
U(\omega)\leq\tilde{U}(\omega):=\max_{s,a\neq\pi^{\star}(s)}\frac{H(s,a)}{\omega(s,a)}+\frac{H}{\min_{s'}\omega(s',\pi^{\star}(s'))},
\end{equation*}
with $H(s,a)\coloneqq\frac{2+8\varphi^{2}M_{sa}^{k}[V^{\star}]^{2^{1-k}}}{\Delta(s,a)^{2}}$ and $H\coloneqq\frac{\max_{s'}C(s')(1+\gamma)^{2}}{\deltamin^{2}(1-\gamma)^{2}}$.
The minimizer $\tilde{\omega}^{\star}\coloneqq\arg\inf_{\omega}\tilde{U}(\omega)$ satisfies $\tilde{\omega}^{\star}(s,a)\propto H(s,a)$ for $a\neq\pi^{\star}(s)$ and $\tilde{\omega}^{\star}(s,\pi^{\star}(s))\propto\sqrt{H\sum_{s,a\neq\pi^{\star}(s)}H(s,a)/|S|}$ otherwise.
\end{proposition}

In the MF-BPI algorithm, we estimate the gaps $\Delta(s,a)$ and $M_{sa}^{k}[V^{\star}]$ for a fixed small value of $k$ (we later explain how to do this in a model-free manner) and compute the corresponding allocation $\tilde{\omega}^{\star}$. This allocation drives the exploration under MF-BPI.
Using this design approach, we face two issues:

\textbf{(1) Uniform $k$ and regularization.} It is impractical to estimate $M_{sa}^{k}[V^{\star}]$ for multiple values of $k$. Instead, we fix a small value of $k$ (e.g., $k=1$ or $k=2$) for all state-action pairs (refer to the previous section for a discussion on this choice). Then, to avoid excessively small values of the gaps in the denominator, we regularize the allocation $\tilde{\omega}^{\star}$ by replacing, in the expression of $H(s,a)$ (resp. $H_{\min}$), $\Delta(s,a)$ (resp. $\Delta_{\min}$) by $(\Delta(s,a)+\lambda)$ (resp. $(\Delta_{\min}+\lambda)$) for some $\lambda>0$.

\textbf{(2) Handling parametric uncertainty via bootstrapping.} The quantities $\Delta(s,a)$ and $M_{sa}^{k}[V^{\star}]$ required to compute $\tilde{\omega}^{\star}$ remain unknown during training, so we adopt the Certainty Equivalence principle and substitute the current estimates of these quantities to compute the exploration strategy. By doing so, we inherently introduce parametric uncertainty into these terms that is not taken into account by the allocation $\tilde{\omega}^{\star}$. The traditional way of dealing with this uncertainty, as used, e.g., in \cite{al2021adaptive,marjani2021navigating}, is to use $\epsilon$-soft exploration policies to guarantee that all state-action pairs are visited infinitely often. This ensures that the estimation errors vanish as time grows large. In practice, we find this type of forced exploration inefficient. In MF-BPI, we opt for a bootstrapping approach to manage parametric uncertainties, which can augment the traditional forced exploration step, leading to more principled exploration.
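The closed-form allocation of the proposition, together with the regularization described in (1), can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `closed_form_allocation`, the nested-list layout of `gaps` and `moments`, and the choice to pass $\varphi$ (`phi`) and $\max_{s'}C(s')$ (`c_max`) as plain arguments are all hypothetical, and the true gaps and moments are assumed to be given.

```python
import math


def closed_form_allocation(gaps, moments, gamma, phi, c_max, k=1, lam=0.1):
    """Regularized closed-form allocation (illustrative sketch).

    gaps[s][a]    : sub-optimality gap Delta(s, a); zero for the optimal action.
    moments[s][a] : value of M_sa^k[V*] for the fixed k.
    phi, c_max    : constants defined earlier in the paper, passed in here
                    as plain numbers (an assumption of this sketch).
    """
    n_states = len(gaps)
    # The optimal action in each state is the one with the smallest gap.
    pi_star = [min(range(len(row)), key=row.__getitem__) for row in gaps]
    # Minimum gap over sub-optimal state-action pairs.
    delta_min = min(
        gaps[s][a]
        for s in range(n_states)
        for a in range(len(gaps[s]))
        if a != pi_star[s]
    )
    # H with Delta_min replaced by (Delta_min + lam).
    h_opt = c_max * (1 + gamma) ** 2 / ((delta_min + lam) ** 2 * (1 - gamma) ** 2)
    power = 2 ** (1 - k)
    weights = [[0.0] * len(row) for row in gaps]
    subopt_sum = 0.0
    for s in range(n_states):
        for a in range(len(gaps[s])):
            if a != pi_star[s]:
                # H(s, a) with Delta(s, a) replaced by (Delta(s, a) + lam).
                weights[s][a] = (2 + 8 * phi ** 2 * moments[s][a] ** power) / (
                    gaps[s][a] + lam
                ) ** 2
                subopt_sum += weights[s][a]
    # Weight of the optimal action, identical in every state.
    opt_weight = math.sqrt(h_opt * subopt_sum / n_states)
    for s in range(n_states):
        weights[s][pi_star[s]] = opt_weight
    # Normalize so that the allocation sums to one over all (s, a).
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]
```

With `lam=0` the ratio of two sub-optimal weights reduces to the inverse ratio of their squared gaps whenever the moments coincide, which is a quick sanity check on the formula.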

\subsection{Exploration in tabular MDPs}

\begin{algorithm}[t]
\caption{Bootstrapped MF-BPI (Bootstrapped Model Free Best Policy Identification)}
\label{algo:bootstrapped_mfbpi}
\begin{algorithmic}[1]
\REQUIRE Parameters $(\lambda,k,p)$; ensemble size $B$; learning rates $\{(\alpha_{t},\beta_{t})\}_{t}$.
\STATE Initialize $Q_{1,b}(s,a)\sim{\cal U}([0,1/(1-\gamma)])$ and $M_{1,b}(s,a)\sim{\cal U}([0,1/(1-\gamma)^{2^{k}}])$ for all $(s,a)\in S\times A$ and $b\in[B]$.
\FOR{$t=0,1,2,\dots$}
\STATE Bootstrap a sample $(\hat{Q}_{t},\hat{M}_{t})$ from the ensemble, and compute the allocation $\omega^{(t)}$ using \cref{corollary:upper_bound_new_bound}.
\STATE Sample $a_{t}\sim\omega^{(t)}(s_{t},\cdot)$; observe $(r_{t},s_{t+1})\sim q(\cdot|s_{t},a_{t})\otimes P(\cdot|s_{t},a_{t})$.
\FOR{$b=1,\dots,B$}
\STATE With probability $p$, using the experience $(s_{t},a_{t},r_{t},s_{t+1})$, update $Q_{t,b}$ and $M_{t,b}$ using \cref{eq:stochastic_approximation_step_qvalues,eq:stochastic_approximation_step_mvalues}.
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}

The pseudo-code of MF-BPI for tabular MDPs is presented in \cref{algo:bootstrapped_mfbpi}. In round $t$, MF-BPI explores the MDP using the allocation $\omega^{(t)}$, which estimates $\tilde{\omega}^{\star}$.
To compute this allocation, we use \cref{corollary:upper_bound_new_bound} and need (i) the sub-optimality gaps $\Delta(s,a)$, which can be easily derived from the $Q$-function; (ii) the $2^{k}$-th moment $M_{sa}^{k}[V^{\star}]$, which can always be learnt by means of stochastic approximation. In fact, for any Markovian policy $\pi$ and pair $(s,a)$ we have
\begin{equation*}
M_{sa}^{k}[V_{\phi}^{\pi}]=\frac{1}{\gamma^{2^{k}}}\mathbb{E}_{s'\sim P(\cdot|s,a)}[\delta^{\pi}(s,a,s')^{2^{k}}],
\end{equation*}
where $\delta^{\pi}(s,a,s')=r(s,a)+\gamma\mathbb{E}_{a'\sim\pi(\cdot|s')}[Q^{\pi}(s',a')]-Q^{\pi}(s,a)$ is a variant of the TD-error. MF-BPI then uses an asynchronous two-timescale stochastic approximation algorithm to learn $Q^{\star}$ and $M_{sa}^{k}[V^{\star}]$,
\begin{align}
Q_{t+1}(s_t,a_t) &= Q_t(s_t,a_t) + \alpha_t(s_t,a_t)\left(r_t+\gamma\max_a Q_t(s_{t+1},a)-Q_t(s_t,a_t)\right), \label{eq:stochastic_approximation_step_qvalues}\\
M_{t+1}(s_t,a_t) &= M_t(s_t,a_t) + \beta_t(s_t,a_t)\left((\delta_t'/\gamma)^{2^{k}} - M_t(s_t,a_t)\right), \label{eq:stochastic_approximation_step_mvalues}
\end{align}
where $\delta_{t}'=r_{t}+\gamma\max_{a}Q_{t+1}(s_{t+1},a)-Q_{t+1}(s_{t},a_{t})$, and $\{(\alpha_{t},\beta_{t})\}_{t\geq 0}$ are learning rates satisfying $\sum_{t\geq 0}\alpha_{t}(s,a)=\sum_{t\geq 0}\beta_{t}(s,a)=\infty$, $\sum_{t\geq 0}(\alpha_{t}(s,a)^{2}+\beta_{t}(s,a)^{2})<\infty$, and $\frac{\alpha_{t}(s,a)}{\beta_{t}(s,a)}\to 0$.

MF-BPI uses bootstrapping to handle parametric uncertainty. We maintain an ensemble of $(Q,M)$-values, with $B$ members, from which we sample $(\hat{Q}_{t},\hat{M}_{t})$ at time $t$. This sample is generated by sampling a uniform random variable $\xi\sim{\cal U}([0,1])$ and, for each $(s,a)$, setting $\hat{Q}_{t}(s,a)={\rm Quantile}_{\xi}(Q_{t,1}(s,a),\dots,Q_{t,B}(s,a))$ (assuming a linear interpolation). This method is akin to sampling from the parametric uncertainty distribution (we perform the same operation also to compute $\hat{M}_{t}$).
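The two-timescale update and the quantile-based bootstrap sample can be illustrated with a short Python sketch. It is a sketch under assumptions rather than the authors' code: the helper names `update_member`, `quantile` and `bootstrap_sample`, the nested-list tables, and the use of scalar step sizes `alpha` and `beta` (instead of state-action dependent schedules) are hypothetical choices made for the example.

```python
def update_member(Q, M, s, a, r, s_next, alpha, beta, gamma, k):
    """One asynchronous update of a single ensemble member (in place)."""
    # Q-learning step on the visited pair (s, a).
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
    # delta'_t is computed with the already updated Q-values.
    delta = r + gamma * max(Q[s_next]) - Q[s][a]
    # M tracks the 2^k-th moment of delta'_t / gamma.
    M[s][a] += beta * ((delta / gamma) ** (2 ** k) - M[s][a])


def quantile(values, xi):
    """Empirical xi-quantile of `values` with linear interpolation."""
    v = sorted(values)
    pos = xi * (len(v) - 1)
    lo = int(pos)  # floor, since pos >= 0
    hi = min(lo + 1, len(v) - 1)
    return v[lo] + (pos - lo) * (v[hi] - v[lo])


def bootstrap_sample(ensemble, xi):
    """Apply the same quantile level xi to every (s, a) entry.

    ensemble is a list of B tables, each indexed as table[s][a].
    """
    first = ensemble[0]
    return [
        [quantile([table[s][a] for table in ensemble], xi) for a in range(len(first[s]))]
        for s in range(len(first))
    ]
```

In use, one would draw `xi = random.random()` once per round and call `bootstrap_sample` on both the $Q$ and the $M$ ensembles with that same `xi`, so that the sampled tables are consistent across state-action pairs.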
This sample is used to compute the allocation $\omega^{(t)}$ using \cref{corollary:upper_bound_new_bound} by setting $\Delta_{t}(s,a)=\max_{a'}\hat{Q}_{t}(s,a')-\hat{Q}_{t}(s,a)$, $\pi_{t}^{\star}(s)=\argmax_{a}\hat{Q}_{t}(s,a)$ and $\deltaminestimatet{t}=\min_{s,a\neq\pi_{t}^{\star}(s)}\Delta_{t}(s,a)$. Note that the allocation $\omega^{(t)}$ can be mixed with a uniform policy to guarantee asymptotic convergence of the estimates. Upon observing an experience, with probability $p$, MF-BPI updates a member of the ensemble using this new experience. The parameter $p$ tunes the rate at which the models are updated, similar to sampling with replacement, speeding up the learning process. Selecting a high value for $p$ compromises the estimation of the parametric uncertainty, whereas choosing a low value may slow down the learning process.

\textbf{Exploration without bootstrapping?} To illustrate the need for our bootstrapping approach, we tried to use the allocation $\omega^{(t)}$ mixed with a uniform allocation. In \cref{fig:forced_generative_performance}, we show the results on Riverswim-like environments with $5$ states. While forced exploration ensures infinite visits to all state-action pairs, this guarantee only holds asymptotically. As a result, the allocation mainly focuses on the current MDP estimate, neglecting other plausible MDPs that could produce the same data. This makes the forced exploration approach too sluggish for effective convergence, suggesting its inadequacy for rapid policy learning. These results highlight the need to account for the uncertainty in $Q,M$ when computing the allocation.
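For reference, the forced-exploration baseline used in this comparison mixes the estimated allocation in the current state with a uniform policy, with an exploration rate $\epsilon_t=\max(10^{-3},1/N_t(s_t))$. A minimal Python sketch of that mixing rule follows; the function name `forced_exploration_policy` and the guard against a zero visit count are assumptions added for the example.

```python
def forced_exploration_policy(alloc_row, visits):
    """Mix a state-wise allocation with a uniform policy.

    alloc_row : estimated allocation over actions in the current state
                (need not be normalized).
    visits    : number of visits N_t(s_t) to the current state.
    """
    # Guard against N_t(s_t) = 0 (an assumption of this sketch).
    eps = max(1e-3, 1.0 / max(visits, 1))
    total = sum(alloc_row)
    n_actions = len(alloc_row)
    return [(1 - eps) * w / total + eps / n_actions for w in alloc_row]
```

Because $\epsilon_t$ decays like $1/N_t(s_t)$, the uniform component quickly becomes negligible, which is consistent with the observation that this scheme concentrates on the current estimate.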

\begin{figure}
\centering
\includegraphics[width=]{figures/riverswim/forced_generative.pdf}
\caption{Forced exploration example with $5$ states. We explore according to $\omega^{(t)}(s_{t},a)=(1-\epsilon_{t})\frac{\tilde{\omega}_{t}^{\star}(s_{t},a)}{\sum_{a'}\tilde{\omega}_{t}^{\star}(s_{t},a')}+\epsilon_{t}\frac{1}{|A|}$, mixing the estimate of the allocation $\tilde{\omega}^{\star}$ from \cref{corollary:upper_bound_new_bound} with a uniform policy, with $\epsilon_{t}=\max(10^{-3},1/N_{t}(s_{t}))$ where $N_{t}(s)$ indicates the number of times the agent visited state $s$ up to time $t$. Shade indicates $95\%$ confidence interval.}
\label{fig:forced_generative_performance}
\end{figure}
\begin{algorithm}[b]
\caption{DBMF-BPI (Deep Bootstrapped Model Free BPI)}
\label{algo:dbomfbpi}
\begin{algorithmic}[1]
\REQUIRE Parameters $(\lambda,k)$; ensemble size $B$; exploration rate $\{\epsilon_{t}\}_{t}$; estimate $\deltaminestimatet{0}$; mask probability $p$.
\STATE Initialize replay buffer ${\cal D}$, networks $Q_{\theta_{b}},M_{\tau_{b}}$ and targets $Q_{\theta'_{b}}$ for all $b\in[B]$.
\FOR{$t=0,1,2,\dots$}
\STATE Sampling step.
\begin{ALC@g}
\STATE Compute allocation $\omega^{(t)}\leftarrow{\tt ComputeAllocation}(s_{t},\{Q_{\theta_{b}},M_{\tau_{b}}\}_{b\in[B]},\deltaminestimatet{t},\gamma,\lambda,k,\epsilon_{t})$.
\STATE Sample $a_{t}\sim\omega^{(t)}(s_{t},\cdot)$ and observe $(r_{t},s_{t+1})\sim q(\cdot|s_{t},a_{t})\otimes P(\cdot|s_{t},a_{t})$.
\STATE Add transition $z_{t}=(s_{t},a_{t},r_{t},s_{t+1})$ to the replay buffer ${\cal D}$.
\end{ALC@g}
\STATE Training step.
\begin{ALC@g}
\STATE Sample a batch ${\cal B}$ from ${\cal D}$, and with probability $p$ add the $i^{th}$ experience in ${\cal B}$ to a sub-batch ${\cal B}_{b}$, $\forall b\in[B]$. Update the $(Q,M)$-values of the $b^{th}$ member in the ensemble using ${\cal B}_{b}$: $\{Q_{\theta_{b}},Q_{\theta_{b}'},M_{\tau_{b}}\}_{b\in[B]}\leftarrow{\tt Training}(\{{\cal B}_{b},Q_{\theta_{b}},Q_{\theta_{b}'},M_{\tau_{b}}\}_{b\in[B]})$.
\STATE Update estimate $\deltaminestimatet{t+1}\leftarrow{\tt EstimateMinimumGap}(\deltaminestimatet{t},{\cal B},\{Q_{\theta_{b}}\}_{b\in[B]})$.
\end{ALC@g}
\ENDFOR
\end{algorithmic}
\end{algorithm}

\subsection{Extension to Deep Reinforcement Learning}

To extend bootstrapped MF-BPI to continuous MDPs, we propose DBMF-BPI (see \crefalgo:dbomfbpi, or Appendix B). DBMF-BPI uses the mechanism of prior networks from BSP [osband2018randomized](bootstrap** with additive prior) to account for uncertainty that does not originate from the observed data. As before, we keep an ensemble {Qθ1,,QθB}subscript𝑄subscript𝜃1subscript𝑄subscript𝜃𝐵\{Q_{\theta_{1}},\dots,Q_{\theta_{B}}\}{ italic_Q start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , … , italic_Q start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT } of Q𝑄Qitalic_Q-values (with their target networks) and an ensemble {Mτ1,,MτB}subscript𝑀subscript𝜏1subscript𝑀subscript𝜏𝐵\{M_{\tau_{1}},\dots,M_{\tau_{B}}\}{ italic_M start_POSTSUBSCRIPT italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , … , italic_M start_POSTSUBSCRIPT italic_τ start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT } of M𝑀Mitalic_M-values, as well as their prior networks. We use the same procedure as in the tabular case to compute (Q^t,M^t)subscript^𝑄𝑡subscript^𝑀𝑡(\hat{Q}_{t},\hat{M}_{t})( over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , over^ start_ARG italic_M end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) at time t𝑡titalic_t, except that we sample ξ𝒰([0,1])similar-to𝜉𝒰01\xi\sim{\cal U}([0,1])italic_ξ ∼ caligraphic_U ( [ 0 , 1 ] ) every Ts(1γ)1proportional-tosubscript𝑇𝑠superscript1𝛾1T_{s}\propto(1-\gamma)^{-1}italic_T start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ∝ ( 1 - italic_γ ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT training steps (or at the end of an episode) to make the training procedure more stable. 
The quantity $\hat{Q}_t$ is used to compute $\pi_t^\star(s_t)$ and $\Delta_t(s_t,a)$. We estimate $\deltaminestimatet{t}$ via stochastic approximation, with the minimum gap from the last batch of transitions sampled from the replay buffer serving as a target. To derive the exploration strategy, we compute $H_t(s_t,a)=\frac{2+8\varphi^2\hat{M}_t(s_t,a)^{2^{1-k}}}{(\Delta_t(s_t,a)+\lambda)^2}$ and
$H_t=\frac{4(1+\gamma)^2\max\left(1,4\gamma^2\varphi^2\hat{M}_t(s_t,\pi_t^\star(s_t))^{2^{1-k}}\right)}{(\deltaminestimatet{t}+\lambda)^2(1-\gamma)^2}$.
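The two quantities above can be computed as in the following sketch, which assumes a finite action set and NumPy arrays for the values of $\hat{Q}_t(s_t,\cdot)$ and $\hat{M}_t(s_t,\cdot)$. The function names, the step size \texttt{lr} of the stochastic-approximation update, and the assumption that the $M$-values are nonnegative are introduced here for illustration only.

```python
import numpy as np


def min_gap(q_batch: np.ndarray) -> float:
    """Smallest best-vs-second-best gap over a batch of Q-values, shape (N, A)."""
    top2 = np.sort(q_batch, axis=1)[:, -2:]  # columns: second best, best
    return float(np.min(top2[:, 1] - top2[:, 0]))


def update_delta_min(delta_min_est: float, q_batch: np.ndarray, lr: float = 0.05) -> float:
    """One stochastic-approximation step toward the batch minimum gap.

    lr is an assumed constant step size.
    """
    return delta_min_est + lr * (min_gap(q_batch) - delta_min_est)


def h_terms(q_s, m_s, delta_min_est, gamma, k, lam, phi):
    """Return (H_t(s_t, .), H_t, greedy action) for the current state s_t.

    q_s, m_s: arrays of shape (A,) with Q- and M-values at s_t.
    m_s is assumed nonnegative so that the fractional power is well defined.
    """
    q_s = np.asarray(q_s, dtype=float)
    m_s = np.asarray(m_s, dtype=float)
    greedy = int(np.argmax(q_s))              # pi_t^*(s_t)
    gaps = q_s[greedy] - q_s                  # Delta_t(s_t, a), zero at greedy
    m_pow = m_s ** (2.0 ** (1 - k))           # M_t(s_t, a)^(2^(1-k))
    h_sa = (2.0 + 8.0 * phi**2 * m_pow) / (gaps + lam) ** 2
    h = (
        4.0 * (1.0 + gamma) ** 2
        * max(1.0, 4.0 * gamma**2 * phi**2 * m_pow[greedy])
        / ((delta_min_est + lam) ** 2 * (1.0 - gamma) ** 2)
    )
    return h_sa, float(h), greedy
```

The regularizer $\lambda$ keeps both denominators away from zero, which matters in particular for the greedy action, whose gap is zero by construction.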
Next, we set the allocation $\omega_o^{(t)}$ as follows: $\omega_o^{(t)}(s_t,a)=H_t(s_t,a)$ if $a\neq\pi_t^\star(s_t)$ and $\omega_o^{(t)}(s_t,a)=\sqrt{H_t\sum_{a\neq\pi_t^\star(s_t)}H_t(s_t,a)}$ otherwise.
Finally, we obtain an $\epsilon_t$-soft exploration policy $\omega^{(t)}(s_t,\cdot)$ by mixing $\omega_o^{(t)}(s_t,\cdot)/\sum_a\omega_o^{(t)}(s_t,a)$ with a uniform distribution (using an exploration parameter $\epsilon_t$).
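A minimal sketch of the allocation and mixing steps is given below. It assumes that the per-action terms $H_t(s_t,\cdot)$, the scalar $H_t$ and the greedy action are already available, and that the mixing is the convex combination $(1-\epsilon_t)\,\omega_o^{(t)}/\sum_a\omega_o^{(t)} + \epsilon_t/|A|$; this particular form of the mixture and the name \texttt{exploration\_policy} are assumptions made for illustration.

```python
import numpy as np


def exploration_policy(h_sa, h, greedy, eps):
    """Return the eps-soft exploration distribution over actions at s_t.

    h_sa:   array of shape (A,), H_t(s_t, a); the greedy entry is ignored.
    h:      scalar H_t.
    greedy: index of pi_t^*(s_t).
    eps:    exploration parameter in [0, 1] (assumed convex-mixture weight).
    """
    omega_o = np.array(h_sa, dtype=float)  # copy, so the input is not modified
    subopt = np.arange(omega_o.size) != greedy
    # Greedy action: sqrt(H_t * sum of H_t(s_t, a) over suboptimal actions).
    omega_o[greedy] = np.sqrt(h * omega_o[subopt].sum())
    omega_o /= omega_o.sum()               # normalize over actions
    uniform = np.full(omega_o.size, 1.0 / omega_o.size)
    return (1.0 - eps) * omega_o + eps * uniform


# Example: three actions, action 0 greedy.
policy = exploration_policy(h_sa=[0.0, 10.0, 30.0], h=10.0, greedy=0, eps=0.1)
rng = np.random.default_rng(0)
action = rng.choice(policy.size, p=policy)  # sample the next action
```

Under this mixture every action receives probability at least $\epsilon_t/|A|$, so no action is ever excluded from exploration.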