The Curse of Diversity in Ensemble-Based Exploration

Zhixuan Lin, Pierluca D’Oro, Evgenii Nikishin & Aaron Courville
Mila - Quebec AI Institute, Université de Montréal
Correspondence to [email protected].
Abstract

We uncover a surprising phenomenon in deep reinforcement learning: training a diverse ensemble of data-sharing agents – a well-established exploration strategy – can significantly impair the performance of the individual ensemble members when compared to standard single-agent training. Through careful analysis, we attribute the degradation in performance to the low proportion of self-generated data in the shared training data for each ensemble member, as well as the inefficiency of the individual ensemble members to learn from such highly off-policy data. We thus name this phenomenon the curse of diversity. We find that several intuitive solutions – such as a larger replay buffer or a smaller ensemble size – either fail to consistently mitigate the performance loss or undermine the advantages of ensembling. Finally, we demonstrate the potential of representation learning to counteract the curse of diversity with a novel method named Cross-Ensemble Representation Learning (CERL) in both discrete and continuous control domains. Our work offers valuable insights into an unexpected pitfall in ensemble-based exploration and raises important caveats for future applications of similar approaches.

1 Introduction

Ensemble-based exploration, i.e. training a diverse ensemble of data-sharing agents, underlies many successful deep reinforcement learning (deep RL) methods (Osband et al., 2016; 2018; Liu et al., 2020; Schmitt et al., 2020; Peng et al., 2020; Hong et al., 2020; Januszewski et al., 2021). The potential benefits of a diverse ensemble are twofold. At training time, it enables concurrent exploration with multiple distinct policies without the need for additional samples. At test time, the learned policies can be aggregated into a robust ensemble policy, via aggregation methods such as majority voting (Osband et al., 2016) or averaging (Januszewski et al., 2021).

Despite the generally positive perception of ensemble-based exploration, we argue that this approach has a potentially negative aspect that has long been overlooked. As shown in Figure 1, for each member in a data-sharing ensemble, only a small proportion of its training data comes from its own interaction with the environment. The majority of its training data is generated by other members of the ensemble, whose policies might be distinct from its own policy. This type of off-policy learning has been shown to be highly challenging in previous work (Ostrovski et al., 2021). We thus hypothesize that similar learning difficulties can also occur in ensemble-based exploration.

We verify our hypothesis in the Arcade Learning Environment (Bellemare et al., 2012) with the Bootstrapped DQN (Osband et al., 2016) algorithm and the Gym MuJoCo benchmark (Towers et al., 2023) with an ensemble SAC (Haarnoja et al., 2018a) algorithm. We show that, in many environments, the individual members of a data-sharing ensemble significantly underperform their single-agent counterparts. Moreover, while aggregating the policies of all ensemble members via voting or averaging sometimes compensates for the degradation in individual members’ performance, it is not always the case. These results suggest that ensemble-based exploration has a hidden negative effect that might weaken or even completely eliminate its advantages. We perform a series of experiments to confirm the connection between the observed performance degradation and the off-policy learning challenge posed by a diverse ensemble. We thus name this phenomenon the curse of diversity.

Figure 1: Comparison between standard single-agent exploration and ensemble-based exploration. In single-agent training, one agent generates and learns from all the data. In ensemble-based exploration with $N$ ensemble members, each agent generates $1/N$ of the data but learns from all the data.

We show that several intuitive solutions – such as a larger replay buffer or a smaller ensemble size – either fail to consistently mitigate the performance loss or undermine the advantages of ensembling. Inspired by previous work’s finding that network representations play a crucial role in related settings (Ostrovski et al., 2021; Kumar et al., 2021), we investigate whether representation learning can mitigate the curse of diversity. Specifically, we propose a novel method named Cross-Ensemble Representation Learning (CERL) in which individual ensemble members learn each other’s value function as an auxiliary task. Our results show that CERL mitigates the curse of diversity in both Atari and MuJoCo environments and outperforms the single-agent and ensemble-based baselines when combined with policy aggregation.

We summarize our contributions as follows:

  1. We expose the curse of diversity phenomenon in ensemble-based exploration: individual members in a data-sharing ensemble can vastly underperform their single-agent counterparts.

  2. We pinpoint the off-policy learning challenges posed by a diverse ensemble as the main cause of the curse of diversity and provide extensive analysis.

  3. We show the potential of representation learning to mitigate the curse of diversity with a novel method named Cross-Ensemble Representation Learning (CERL) in both discrete and continuous control domains.

2 Preliminaries

We outline some important specifications that we use throughout this work.

Ensemble-based exploration strategy We focus our discussion on a simple ensemble-based exploration strategy that underlies many previous works (Osband et al., 2016; 2018; Liu et al., 2020; Schmitt et al., 2020; Peng et al., 2020; Hong et al., 2020; Januszewski et al., 2021), depicted in Figure 1. The defining characteristics of this strategy are as follows:

  1. Temporally coherent exploration: At training time, within each episode, only the policy of one ensemble member is used for selecting the actions.

  2. Relative independence between ensemble members: Each ensemble member has its own policy (may be implicit), value function, and target value function. Most importantly, the regression target in Temporal Difference (TD) updates should be computed separately for each ensemble member with their own target network.

  3. Off-policy RL algorithms with a shared replay buffer: Different ensemble members share their collected data via a central replay buffer (Lin, 1992). To allow data sharing, the underlying RL algorithm should be off-policy in nature.

Environments and algorithms We use 55 Atari games from the Arcade Learning Environment and 4 Gym MuJoCo tasks for our analysis. We train for 200M frames for Atari games and 1M steps for MuJoCo tasks. For aggregate results over 55 Atari games, we use the interquartile mean (IQM) of human-normalized scores (HNS) as recommended by Agarwal et al. (2021). In some results where HNS is less appropriate, we exploit another score normalization scheme where Double DQN and a random agent have a normalized score of 1 and 0 respectively. This will be referred to as Double DQN normalized scores. More experimental details can be found in Appendix B.
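As a rough illustration of the aggregate metric, the sketch below computes a normalized score and the interquartile mean of a set of scores with NumPy. The function names and the toy numbers are ours and do not come from the paper's code. The trimming convention (drop the lowest and highest quarter of the sorted scores, rounding the cut down) is an assumption; the reference implementation accompanying Agarwal et al. (2021) should be preferred for reported results.

```python
import numpy as np


def normalized_score(agent, random, reference):
    """Rescale a raw score so that `random` maps to 0 and `reference` maps to 1.

    With `reference` set to the human score this gives HNS; with `reference`
    set to the Double DQN score it gives the Double DQN normalized score.
    """
    return (agent - random) / (reference - random)


def interquartile_mean(scores):
    """Mean of the middle 50% of the scores (assumed trimming convention)."""
    s = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = s.size // 4  # number of scores dropped from each tail
    return float(s[cut:s.size - cut].mean())


# Toy example: one outlier game does not dominate the aggregate.
toy_scores = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0]
print(interquartile_mean(toy_scores))
```

Compared with a plain mean, the IQM ignores the extreme quarter of scores on each side, which matters for Atari where a few games have very large normalized scores.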

For our analysis, we use Bootstrapped DQN (Osband et al., 2016) with Double DQN updates (Hasselt et al., 2015) for Atari and an ensemble version of the SAC (Haarnoja et al., 2018b) algorithm (referred to as Ensemble SAC in the following) for MuJoCo tasks. Ensemble SAC follows the same recipe as Bootstrapped DQN except with SAC as the base algorithm. We provide pseudocode for these algorithms in Appendix A. Correspondingly, for the single-agent baselines, we use Double DQN and SAC. For analysis purposes, for continuous control, we use a replay buffer of size 200k by default (as opposed to the usual 1M) since we find the curse of diversity is more evident with smaller replay buffers in these tasks. Also, to avoid the confounding factor of representation learning, we do not share the networks across the ensemble in Bootstrapped DQN by default. These factors will be analyzed in Section 3.3. Following Osband et al. (2016)’s setup for Atari, we do not use data bootstrapping in our analysis. Throughout this work, we use $L$ to denote the number of shared layers across the ensemble members and $N$ to denote the ensemble size. Unless otherwise specified, we use $L=0$ and $N=10$. For each ensemble algorithm X – where X can be either Bootstrapped DQN or Ensemble SAC – we consider two different evaluation methods:

  1. X (aggregated) or X (agg.): we aggregate the policies of all ensemble members during testing. For discrete action tasks, we use majority voting (Osband et al., 2016). For continuous control, we average the actions of all policies as in Januszewski et al. (2021).

  2. X (individual) or X (indiv.): for each evaluation episode, we randomly sample one ensemble member for acting. This interaction protocol is exactly the same as the one used during training and aims to measure the performance of the individual ensemble members.

We emphasize that these two methods only differ at test-time. We only train X once and then obtain the results of X (agg.) and X (indiv.) using the above interaction protocols during evaluation. Policy aggregation is never used during training. Also, the performance of X (agg.) is often what we eventually care about. More implementation details can be found in Appendix C.
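The two test-time aggregation rules can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: the function names are ours, each member is assumed to vote for its greedy action, and ties in the vote are broken toward the lowest action index, which is an arbitrary choice the text does not specify.

```python
import numpy as np


def majority_vote(q_values):
    """Discrete actions: each member votes for its greedy action.

    q_values has shape (N, num_actions), one row of Q-values per member.
    Ties are broken toward the lowest action index (an assumption).
    """
    q_values = np.asarray(q_values, dtype=float)
    votes = np.argmax(q_values, axis=1)  # greedy action of each member
    counts = np.bincount(votes, minlength=q_values.shape[1])
    return int(np.argmax(counts))


def average_action(actions):
    """Continuous control: average the actions proposed by all members.

    actions has shape (N, action_dim).
    """
    return np.mean(np.asarray(actions, dtype=float), axis=0)


# Three members, three actions: two members prefer action 1.
print(majority_vote([[1.0, 2.0, 0.0], [0.0, 3.0, 1.0], [5.0, 0.0, 0.0]]))
# Two members, two-dimensional actions.
print(average_action([[1.0, 0.0], [3.0, 2.0]]))
```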

3 The curse of diversity

The central finding of this paper is as follows:

Individual ensemble members in a data-sharing ensemble can suffer from severe performance degradation relative to standard single-agent training due to: (1) the low proportion of self-generated data in the shared training data for each ensemble member; and (2) the inefficiency of the individual ensemble members to learn from such highly off-policy data.

We name the performance degradation in individual ensemble members due to challenging off-policy learning posed by a diverse ensemble the curse of diversity. In the following, we demonstrate this phenomenon, verify its cause, and provide extensive analysis.

3.1 The negative effect of ensemble-based exploration

Figure 2: (top-left) Comparison between Double DQN, Bootstrapped DQN (agg.), and Bootstrapped DQN (indiv.) in 55 Atari games. Shaded areas show 95% bootstrapped CIs over 5 seeds. (top-right) Per-game performance improvement of Bootstrapped DQN (indiv.) and Bootstrapped DQN (agg.) over Double DQN, measured as the difference in HNS. All methods use a replay buffer size of 1M. (bottom) Comparison between SAC, Ensemble SAC (indiv.) and Ensemble SAC (agg.) in 4 MuJoCo tasks with a replay buffer size of 200k. Shaded areas show 95% bootstrapped CIs over 30 seeds. All ensemble methods in this figure use $N=10$ and $L=0$.

We show the curse of diversity phenomenon with 55 Atari games and 4 MuJoCo tasks in Figure 2. These results show a clear underperformance of the individual ensemble members (e.g., Bootstrapped DQN (indiv.)) relative to their single agent counterparts (e.g., Double DQN). Note that even though they are trained on different data distributions and thus are expected to behave differently, the agents in these two cases have access to the same amount of data and have the same network capacity. The underperformance happens in the majority of Atari games and 3 out of the 4 MuJoCo tasks, suggesting that this is a universal phenomenon. Surprisingly, simply aggregating the learned policies at test-time provides a huge performance boost in many environments, and in many cases fully compensates for the performance loss in the individual policies (e.g., Walker2d). This partially explains why previous works – which often only report the performance of the aggregated policies (Osband et al., 2016; Chiang et al., 2020; Agarwal et al., 2020; Meng et al., 2022) – fail to notice this phenomenon.

The significance of these results is threefold. First, it challenges some previous work’s claims that the improved performance of approaches such as Bootstrapped DQN in some tasks is mainly due to better exploration. As shown in Figure 2, in most games where Bootstrapped DQN (agg.) performs better than Double DQN, Bootstrapped DQN (indiv.) significantly underperforms Double DQN. This means most benefits of Bootstrapped DQN in these games come from majority voting, which is largely orthogonal to exploration. However, this does not imply that better exploration – and hence wider state-action space coverage – brought by a diverse ensemble is not beneficial, as its effects might be overshadowed by the curse of diversity and thus not visible. Second, it raises important caveats for future applications of ensemble-based exploration, especially in certain scenarios such as hyperparameter sweep with ensembles (Schmitt et al., 2020; Liu et al., 2020) where we mainly care about individual agents’ performance. Finally, it presents an opportunity for better ensemble algorithms that mitigate the curse of diversity while preserving the advantages of using an ensemble. To this end, we perform an analysis to better understand the cause of the performance degradation.

3.2 Understanding ensemble performance degradation

We hypothesize that the observed performance degradation is due to (1) the low proportion of self-generated data in the shared training data for each ensemble member, and (2) the inefficiency of the individual ensemble members to learn from such highly off-policy data. Inspired by the tandem RL setup in Ostrovski et al. (2021), we design a simple “$p\%$-tandem” setup to verify this hypothesis. Similar to the original tandem RL setup, we train a pair of active and passive agents sharing the replay buffer and training batches. For each training episode, with probability $1-p\%$ we use the active agent for acting; otherwise, we use the passive agent for acting. In other words, the active agent generates $1-p\%$ of the data and the passive agent generates $p\%$ of the data. Note that this is different from the “$p\%$ self-generated data” experiment in Ostrovski et al. (2021) as they use two separate buffers for the two agents. In contrast, in our $p\%$-tandem setup, the two agents share the replay buffer and the training batches, and thus any performance gap between the active and passive agents can only be due to the difference in the proportions of the two agents’ self-generated data and the inefficiency of the passive agent to learn from the shared data.

To understand why $p\%$-tandem may support our hypothesis, it is useful to view Double DQN, $p\%$-tandem with two Double DQN agents, and Bootstrapped DQN as variants of the same ensemble algorithm with $N$ members, where each member generates $1/N$ of the data. Taking $N=4$ and $p\% = \frac{1}{N} = 25\%$ as an example, we have the following exact equivalence (shown in Figure 3 (left)):

  Algo 1. Double DQN $\Leftrightarrow$ 4 ensemble members, but all of them are identical;

  Algo 2. 25%-tandem $\Leftrightarrow$ 4 ensemble members, but the last 3 of them are identical;

  Algo 3. Bootstrapped DQN $\Leftrightarrow$ 4 ensemble members, and all of them are different.

Our core reasoning is as follows. If the first member in Algo 2 (i.e. the passive agent) suffers from severe inefficiency to learn from the shared data (i.e., it significantly underperforms the active agent), it should also suffer from the same learning inefficiency if we replace the other 3 identical members (i.e., the active agent) with 3 different members, which is just Algo 3/Bootstrapped DQN. Further, if the performance gap between the active and passive agents is comparable with or larger than the performance gap between Double DQN and Bootstrapped DQN (indiv.), then the observed learning inefficiency is sufficient to cause the performance degradation we see in Bootstrapped DQN (indiv.).
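The equivalence between the three algorithms can be made concrete by labeling each of the $N$ equal data blocks with the identity of the distinct agent that generated it. The sketch below is purely illustrative: the function names and algorithm labels are hypothetical, and it only counts data proportions without training anything.

```python
def generator_ids(algorithm, num_members):
    """For each of the N equal data blocks, return the id of the distinct
    agent that generated it (identical members share an id)."""
    if algorithm == "double_dqn":        # Algo 1: all members identical
        return [0] * num_members
    if algorithm == "tandem":            # Algo 2: passive agent (id 0) + N-1 copies of the active agent (id 1)
        return [0] + [1] * (num_members - 1)
    if algorithm == "bootstrapped_dqn":  # Algo 3: all members different
        return list(range(num_members))
    raise ValueError(f"unknown algorithm: {algorithm}")


def self_generated_fraction(ids, agent_id=0):
    """Fraction of the shared data generated by the agent with this id."""
    return ids.count(agent_id) / len(ids)


for name in ("double_dqn", "tandem", "bootstrapped_dqn"):
    ids = generator_ids(name, num_members=4)
    print(name, ids, self_generated_fraction(ids, agent_id=0))
```

With `num_members=4`, the first member sees 100% self-generated data under Double DQN but only 25% under both the 25%-tandem setup and Bootstrapped DQN, which is why the passive agent serves as a proxy for an individual ensemble member.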

Figure 3: (left) Different algorithms as variants of the same ensemble algorithm, using $N=4$ as an example. Each block represents 25% of the generated data. Data blocks of the same colors are generated by identical agents. (middle) Comparison between Double DQN, Bootstrapped DQN (indiv.) with $N=10$ and $L=0$, and the active and passive agents in the 10%-tandem setup. All methods use a replay buffer of size 1M. Shaded areas show 95% bootstrapped CIs. Results are aggregated over 5 seeds and 55 games. (right) Correlation between (1) the performance gap between the active and passive agents and (2) the performance gap between Double DQN and Bootstrapped DQN (indiv.) in different games. Each point corresponds to a game. We use Double DQN normalized scores instead of HNS since the scale of the latter can vary a lot across games. Eight games where Double DQN’s performance is close to random ($\mathrm{HNS} < 0.05$), and one game whose data point lies on the negative half of the $y$-axis in the plot are omitted since they trivially satisfy our hypothesis.

Verifying the hypothesis In Figure 3 (middle) we show the performance of Double DQN, Bootstrapped DQN (indiv.) with $N=10$, and the active and the passive agents in the 10%-tandem setup. As expected, we see that the passive agent significantly underperforms the active agent even though they share the training batches, indicating the inefficiency of the passive agent to learn from the shared data. Also, the performance gap between the active and passive agents is comparable to the performance gap between Double DQN and Bootstrapped DQN (indiv.). A similar analysis for MuJoCo tasks is presented in Appendix D.1 and shows similar patterns. In Figure 3 (right), we show a clear correlation between (1) the performance gap between the active and the passive agents and (2) the performance gap between Double DQN and Bootstrapped DQN (indiv.) in different games. These results offer strong evidence of a connection between the off-policy learning challenges in ensemble-based exploration and the observed performance degradation.

Remark on data coverage We comment that another important aspect of having a diverse ensemble is wider state-action space coverage in the data. Even though the degree of “off-policy-ness” and the state-action space coverage are often correlated in practice, they are different: state-action space coverage is purely a property of the data distribution, while “off-policy-ness” involves both the data distribution and the policies. Our $p\%$-tandem experiment disentangles these two aspects by having the active and passive agents trained on the same data (thus the same coverage) but experience different degrees of “off-policy-ness”. Therefore, the active/passive gap implies that off-policy-ness is detrimental, but it does not indicate whether wider state-action space coverage is beneficial or harmful. On the other hand, the fact that (1) Bootstrapped DQN (indiv.) likely suffers from greater “off-policy-ness” than the passive agent due to the presence of more policies and that (2) Bootstrapped DQN (indiv.) outperforms the passive agent in Figure 3 indicates that the wider state-action space coverage of the data generated by Bootstrapped DQN (indiv.) is likely highly beneficial. We leave a rigorous analysis of the effect of state-action space coverage for future work.

3.3 Mitigating the curse of diversity: initial attempts

Having established the main cause of the observed performance degradation, we examine in this section whether several intuitive solutions can mitigate it. More analysis can be found in Appendix D.3.

Figure 4: (left) The effects of replay buffer size in 4 Atari games. Error bars show 95% bootstrapped CIs over 5 seeds. (right) The effects of replay buffer size in 4 MuJoCo tasks. Error bars show 95% bootstrapped CIs over 30 seeds. We use $N=10$ and $L=0$ for Bootstrapped DQN.

Figure 5: (left) The effects of adjusting the ensemble size. We use $L=0$ for Bootstrapped DQN. (right) The effects of varying the number of shared layers. We use $N=10$ for Bootstrapped DQN. The top rows show Double DQN normalized scores. The bottom rows show the entropy of the normalized vote distributions. Error bars show 95% bootstrapped CIs over 5 seeds. All methods use a replay buffer of size 1M.

Larger replay buffers A larger replay buffer provides better state-action coverage and may mitigate issues due to insufficient self-generated data such as erroneous value extrapolation (Fujimoto et al., 2018a; Kumar et al., 2019; Ostrovski et al., 2021). In Figure 4, we probe the effects of replay buffer capacity in 4 Atari games and 4 MuJoCo tasks. For MuJoCo tasks, a larger replay buffer mitigates the curse of diversity (i.e., reduces the performance gap between SAC and Ensemble SAC (indiv.)) in Humanoid and Walker, though it still remains in Ant. However, in Atari games, the performance gap between Double DQN and Bootstrapped DQN (indiv.) largely remains with larger replay buffers, except for Amidar where the performance of Double DQN itself is severely impaired. The difference in results in the two domains may be due to the different numbers of samples required (1M versus 200M). It is possible that an even larger replay buffer may mitigate the curse of diversity in Atari, but this is extremely expensive and infeasible for us to test. Overall, these results suggest that though increasing the replay buffer capacity (within a reasonable memory budget) can mitigate the curse of diversity in some environments, it is not a consistent remedy.

Reducing diversity The most intuitive way to mitigate the curse of diversity is, naturally, by reducing diversity. We test two techniques to do so: (1) reducing the ensemble size and (2) increasing the number of shared layers across the ensemble. In Figure 5, we show the impact of these two techniques on performance and ensemble diversity in 4 Atari games. To quantify diversity, we measure the entropy of the distribution of votes among the ensemble members. As expected, both techniques reduce diversity and improve the performance of Bootstrapped DQN (indiv.). However, as the diversity of the ensemble decreases, the advantages of policy aggregation also diminish, which can be seen from the narrowing performance gap between Bootstrapped DQN (indiv.) and Bootstrapped DQN (agg.). These results show that even though reducing diversity can mitigate the curse of diversity, it also compromises the advantages of policy aggregation. Ideally, we would want a solution that alleviates the curse of diversity while preserving the advantages of using ensembles. In the next section, we show that representation learning offers a promising solution.
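The diversity measure can be sketched as the Shannon entropy of the normalized vote counts in a single state. This is an illustrative sketch only: the function name is ours, each member is assumed to cast one vote for its greedy action, and the natural logarithm is an assumption since the text does not state the base.

```python
import numpy as np


def vote_entropy(votes, num_actions):
    """Entropy of the normalized vote distribution in one state.

    votes: one integer action index per ensemble member.
    Returns 0 when all members agree and log(num_actions) at most.
    """
    counts = np.bincount(np.asarray(votes), minlength=num_actions)
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]  # treat 0 * log(0) as 0
    return float(-(nonzero * np.log(nonzero)).sum())


print(vote_entropy([2, 2, 2, 2], num_actions=4))  # full agreement
print(vote_entropy([0, 1, 2, 3], num_actions=4))  # maximal disagreement
```

In practice such a quantity would be averaged over many visited states. A value near zero means the members almost always agree, in which case majority voting has little to add over any single member.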

4 Mitigating the curse of diversity with representation learning

Our method is motivated by a simple hypothesis. We conjecture that the reason why sharing network layers mitigates the curse of diversity is twofold. First, as mentioned above, sharing layers reduces ensemble diversity, thus making the generated data “less off-policy” to the individual members. Second, the shared network simultaneously learns the value functions of multiple ensemble members, leading to improved representations. This may lead to more efficient off-policy learning by allowing the Q-value networks to better generalize to state-action pairs that have high probability under the current policy but are under-represented in the data. As we have shown, the diversity reduction aspect has the undesirable side effect of reducing the benefits of policy aggregation. We thus propose Cross-Ensemble Representation Learning (CERL), a novel method that benefits from the similar representation learning effect of network sharing without actually needing to share the networks, thus preserving the diversity of the ensemble.

Figure 6: (left) The Q-value networks of a standard ensemble without CERL. (right) The Q-value networks of an ensemble with CERL. $f$ and $g$ represent neural networks. An arrow from $X$ to $Y$, where $X$ and $Y$ are either $g_i$ or $g_i^j$, indicates $X$ will be used as the target network of $Y$ when performing TD updates. Auxiliary heads are in green. Dashed lines indicate the corresponding loss terms are CERL auxiliary losses. Arrows with the same color originate from the same main head.

Method CERL, shown in Figure 6 (right), is an auxiliary task that can be applied to most ensemble-based exploration methods that follow the recipe we outline in Section 2. For ease of exposition, we use Ensemble SAC as an example. Extension to other methods is trivial. In CERL, for each Q-value network, we conceptually split it into an encoder $f$ and a head $g$. Our goal in CERL is to force the Q-value network encoder of each ensemble member to learn the value functions of all the ensemble members, similar to what happens when using explicit network sharing. To this end, each ensemble member $i$ has $N$ Q-value heads $\{Q_i^j(s,a) = [g_i^j(f_i(s,a))]\}_{j=1}^{N}$. For ensemble member $i$, $Q_i^i(s,a)$ is the “main head” that defines member $i$’s value function estimation. Each ensemble member $i$ still has only one policy $\pi_i(a|s)$, and the main head $Q_i^i(s,a)$ is the only head used to update the policy. A head $Q_i^j(s,a)$ of ensemble member $i$ where $j \neq i$ is used to learn the value function of ensemble member $j$ as an auxiliary task. These heads are referred to as “auxiliary heads” because their sole purpose is to provide better representations for the main heads. Specifically, given a transition $(s,a,r,s')$, we perform the following TD update for all $N \times N$ heads in parallel:

$$Q_i^j(s,a) \leftarrow r + \gamma \bar{Q}_j^j(s', a_j'), \quad \text{for } i = 1, \ldots, N, \quad \text{for } j = 1, \ldots, N \tag{1}$$

where $\bar{Q}_j^j$ is the target network for $Q_j^j$ and $a_j' \sim \pi_j(\cdot|s')$. As usual, the update rule is implemented as a one-step stochastic gradient descent.
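To make the indexing in Equation (1) concrete, the following NumPy sketch computes the $N$ targets and the squared TD errors for all $N \times N$ heads on a single transition. It is a simplified illustration under stated assumptions: the function names are ours, the targets follow Equation (1) literally (so any SAC entropy term and terminal-state handling are omitted), and the main and auxiliary losses are returned separately because their relative weighting is not specified in this section.

```python
import numpy as np


def cerl_td_targets(reward, gamma, next_q_target):
    """Targets y_j = r + gamma * Qbar_j^j(s', a'_j), one per member j.

    next_q_target[j] is member j's main target head evaluated at s' with
    an action sampled from member j's own policy.
    """
    return reward + gamma * np.asarray(next_q_target, dtype=float)


def cerl_losses(q_heads, targets):
    """Squared TD errors for all N x N heads on one transition.

    q_heads[i, j] = Q_i^j(s, a): head j of member i.
    Head j of every member i regresses onto the same target y_j.
    """
    q_heads = np.asarray(q_heads, dtype=float)
    targets = np.asarray(targets, dtype=float)
    sq_err = (q_heads - targets[None, :]) ** 2       # broadcast y_j over members i
    is_main = np.eye(q_heads.shape[0], dtype=bool)   # i == j: main heads
    main_loss = float(sq_err[is_main].mean())
    aux_loss = float(sq_err[~is_main].mean())        # i != j: auxiliary heads
    return main_loss, aux_loss


targets = cerl_td_targets(reward=1.0, gamma=0.5, next_q_target=[2.0, 4.0])
print(targets)
print(cerl_losses([[2.0, 3.0], [0.0, 3.0]], targets))
```

Note that the target for column $j$ depends only on member $j$'s own target network and policy, so the auxiliary heads never influence any member's bootstrapping target; they only shape the encoder they are attached to.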

Besides being conceptually simple, CERL is easy to implement. In our experiments, we find it sufficient to duplicate the last linear layers of the networks as the auxiliary heads, which can be implemented by increasing the networks’ final output dimensions. For the same reason, CERL is computationally efficient. For example, for the Nature DQN network (Mnih et al., 2015) used in this work, applying CERL to Bootstrapped DQN with $N=10$ and $L=0$ increases the number of parameters by no more than 5%, and the increase in wall clock time is barely noticeable. We provide pseudocode for CERL with Ensemble SAC and Bootstrapped DQN in Appendix A.
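The parameter overhead can be sanity-checked with a short back-of-the-envelope count. The sketch below assumes the standard Nature DQN layer sizes (three convolutions with 32, 64, and 64 filters on a 4-frame input, a 512-unit hidden layer fed by 3136 features) and the full 18-action Atari action set; these sizes are not stated in this section, so treat them as assumptions. Games with fewer actions would show a smaller relative increase.

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights plus biases of a square 2D convolution."""
    return in_channels * out_channels * kernel * kernel + out_channels


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def nature_dqn_params(num_actions, num_heads=1):
    """Parameter count of one member's network (assumed layer sizes).

    num_heads > 1 widens the final linear layer to num_actions * num_heads
    outputs, which is how the auxiliary heads are described in the text.
    """
    encoder = (
        conv_params(4, 32, 8)
        + conv_params(32, 64, 4)
        + conv_params(64, 64, 3)
        + linear_params(3136, 512)
    )
    head = linear_params(512, num_actions * num_heads)
    return encoder + head


base = nature_dqn_params(num_actions=18)                 # one member without CERL
cerl = nature_dqn_params(num_actions=18, num_heads=10)   # one member with N = 10 heads
print(base, cerl, (cerl - base) / base)
```

Under these assumptions the relative increase per member is just under 5%, and because every member grows by the same factor with $L=0$, the same ratio holds for the whole ensemble.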

Figure 7: (top) Comparison between Double DQN, Bootstrapped DQN, and CERL in Atari. Results are aggregated over 55 games and 5 seeds. We show the performance of the agg. and indiv. versions of each ensemble algorithm in the top left and top middle plots respectively. Shaded areas show 95% bootstrapped CIs. All methods use a replay buffer of size 1M. (bottom) Comparison between SAC, Ensemble SAC, and CERL across different replay buffer sizes in MuJoCo tasks. Error bars show 95% bootstrapped CIs over 30 seeds. All ensemble methods in this figure use $N=10$.

Experiments We focus on two questions: (1) Can CERL mitigate the curse of diversity, i.e. the performance gap between individual ensemble members and their single-agent counterparts? (2) Do the improvements in individual ensemble members translate into a better aggregate policy? To answer these questions, we test CERL on Bootstrapped DQN in 55 Atari games (Figure 7 (top)) and on Ensemble SAC in 4 MuJoCo tasks (Figure 7 (bottom)). We compare with Bootstrapped DQN and Ensemble SAC without CERL as well as the single-agent Double DQN and SAC. To show the advantage of CERL over explicit network sharing, we also include Bootstrapped DQN with network sharing and report ensemble diversity as we did in Section 3.3. We use $L$ to denote the number of shared layers across the ensemble. As the curse of diversity is sensitive to replay buffer size in MuJoCo tasks, we show results with different replay buffer sizes for these tasks. Additional results, including ensemble size ablations and an alternative design of CERL, can be found in Appendix D.4.

As shown in these results, CERL consistently mitigates the curse of diversity across the tested environments. For example, applying CERL to Ensemble SAC in Humanoid with a 0.2M-sized replay buffer reduces the performance gap between SAC and Ensemble SAC (indiv.) from roughly 3000 to around 500. More importantly, the improvements in individual policies do translate into improvements in aggregate policies, which enables CERL to achieve the best performance with policy aggregation in both domains. In contrast, even though Bootstrapped DQN ($L=3$) also performs better than Bootstrapped DQN ($L=0$) when comparing individual policies, it does not provide any gain over Bootstrapped DQN ($L=0$) when comparing aggregate policies, likely due to significantly lower ensemble diversity than Bootstrapped DQN ($L=0$) as shown in Figure 7 (top right).

5 Related work

Ensemble-based exploration The idea of training an ensemble of data-sharing agents that concurrently explore has been employed in many deep RL algorithms (Osband et al., 2016; 2018; Liu et al., 2020; Schmitt et al., 2020; Peng et al., 2020; Hong et al., 2020; Januszewski et al., 2021). Most of these works focus on algorithmic design and the discussion of the potential negative effects of the increased “off-policy-ness” compared to single-agent training has largely been missing. To the best of our knowledge, the only work that explicitly discusses the potential difficulties of learning from the off-policy data generated by other ensemble members is Schmitt et al. (2020). However, since Schmitt et al. (2020) does not maintain explicit Q-values and relies on V-trace (Espeholt et al., 2018) for off-policy correction, algorithmic changes that allow stable off-policy learning are a requirement in their work. In contrast, we show that even for Q-learning-based methods that do not require explicit off-policy correction, ensemble-based exploration can still lead to performance degradation. There also exist methods that use multiple ensemble members on a per-step basis (Chen et al., 2018; Lee et al., 2021; Ishfaq et al., 2021; Li et al., 2023) as opposed to one ensemble member for each episode. The discussion of these methods is more subtle and is left for future work. Sun et al. (2022) uses an ensemble of Q-value networks to trade off exploration and exploitation. However, their method only uses one policy so our discussion does not apply to it.

Ensemble RL methods for other purposes Ensemble methods have also been employed in RL for purposes other than exploration, for example to produce robust value estimations  (Anschel et al., 2016; Lan et al., 2020; Agarwal et al., 2020; Peer et al., 2021; Chen et al., 2021; An et al., 2021; Wu et al., 2021; Liang et al., 2022) or model predictions (Chua et al., 2018; Kurutach et al., 2018). Despite the use of ensembles, most of these methods are still single-agent in nature (i.e., there is only one policy interacting with the environment). Thus, our discussion does not apply to these methods.

Mutual distillation Mutual/collaborative learning in supervised learning (Zhang et al., 2017; Anil et al., 2018; Guo et al., 2020; Wu & Gong, 2020) aims to train a cohort of networks and share knowledge between them via mutual distillation. Similar ideas have also been employed in RL (Czarnecki et al., 2018; Xue et al., 2020; Zhao & Hospedales, 2020; Reid & Mukhopadhyay, 2021). CERL is distinct from these works in that it does not try to distill different ensemble members’ predictions (i.e., the value functions) into each other. Instead, CERL is only an auxiliary task, and different ensemble members in CERL only affect each other’s representations via auxiliary losses.

Auxiliary tasks in RL Facilitating representation learning with auxiliary tasks has been shown to be effective in RL (Jaderberg et al., 2016; Mirowski et al., 2016; Fedus et al., 2019; Kartal et al., 2019; Dabney et al., 2020; Schwarzer et al., 2020). In the context of multi-agent RL, He & Boyd-Graber (2016), Hong et al. (2017), Hernandez-Leal et al. (2019), and Hernandez et al. (2022) model the policies of external agents as an auxiliary task, and Barde et al. (2019) promote coordination between several trainable agents by maximizing their mutual action predictability. Besides the clear differences in the problem domains and motivations, these multi-agent works only predict the actions of other agents, which typically contain less information than the $Q$-function used in CERL’s auxiliary task.

6 Discussion and conclusion

In line with recent efforts to advance the understanding of deep RL through purposeful experiments (Ostrovski et al., 2021; Schaul et al., 2022; Nikishin et al., 2022; Sokar et al., 2023), our work builds on extensive, carefully designed empirical analyses. It offers valuable insights into a previously overlooked pitfall of the well-established approach of ensemble-based exploration and presents opportunities for future work. As with most empirical works, an important avenue for future research lies in developing a theoretical understanding of the phenomenon we reveal in this work.

A limitation of CERL is its reliance on separate networks for high ensemble diversity, which may become infeasible with very large networks. A simple improvement to CERL is thus to combine CERL with network sharing and encourage diversity with other mechanisms, such as randomized prior functions (Osband et al., 2018) and explicit diversity regularization (Peng et al., 2020).

Reproducibility statement

Detailed pseudocode of the ensemble algorithms used in this work is provided in Appendix A. Experimental and implementation details are given in Appendix B and Appendix C respectively. The source code is available at the following repositories:

Acknowledgments

This research was enabled in part by support and compute resources provided by Mila (mila.quebec), Calcul Québec (www.calculquebec.ca), and the Digital Research Alliance of Canada (alliancecan.ca). We thank Sony for their financial support of Zhixuan Lin throughout this work.

References

  • Agarwal et al. (2020) Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In ICML, 2020.
  • Agarwal et al. (2021) Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C. Courville, and Marc G. Bellemare. Deep reinforcement learning at the edge of the statistical precipice. In NeurIPS, 2021.
  • An et al. (2021) Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified q-ensemble. ArXiv, abs/2110.01548, 2021. URL https://api.semanticscholar.org/CorpusID:238259863.
  • Anil et al. (2018) Rohan Anil, Gabriel Pereyra, Alexandre Passos, Róbert Ormándi, George E. Dahl, and Geoffrey E. Hinton. Large scale distributed neural network training through online distillation. ArXiv, abs/1804.03235, 2018.
  • Anschel et al. (2016) Oron Anschel, Nir Baram, and Nahum Shimkin. Deep reinforcement learning with averaged target dqn. ArXiv, abs/1611.01929, 2016.
  • Barde et al. (2019) Paul Barde, Julien Roy, Félix G. Harvey, Derek Nowrouzezahrai, and Christopher Joseph Pal. Promoting coordination through policy regularization in multi-agent reinforcement learning. ArXiv, abs/1908.02269, 2019.
  • Bellemare et al. (2012) Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents (extended abstract). In International Joint Conference on Artificial Intelligence, 2012.
  • Bradbury et al. (2018) James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
  • Castro et al. (2018) Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A research framework for deep reinforcement learning. ArXiv, abs/1812.06110, 2018.
  • Castro et al. (2021) Pablo Samuel Castro, Tyler Kastner, P. Panangaden, and Mark Rowland. Mico: Improved representations via sampling-based state similarity for markov decision processes. In Neural Information Processing Systems, 2021. URL https://api.semanticscholar.org/CorpusID:239050564.
  • Chen et al. (2018) Richard Y. Chen, Szymon Sidor, P. Abbeel, and John Schulman. Ucb exploration via q-ensembles. arXiv: Learning, 2018.
  • Chen et al. (2021) Xinyue Chen, Che Wang, Zijian Zhou, and Keith W. Ross. Randomized ensembled double q-learning: Learning fast without a model. ArXiv, abs/2101.05982, 2021.
  • Chiang et al. (2020) Po-Han Chiang, Hsuan-Kung Yang, Zhang-Wei Hong, and Chun-Yi Lee. Mixture of step returns in bootstrapped dqn. ArXiv, abs/2007.08229, 2020.
  • Chua et al. (2018) Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NeurIPS, 2018.
  • Czarnecki et al. (2018) Wojciech M. Czarnecki, Siddhant M. Jayakumar, Max Jaderberg, Leonard Hasenclever, Yee Whye Teh, Nicolas Manfred Otto Heess, Simon Osindero, and Razvan Pascanu. Mix&match - agent curricula for reinforcement learning. ArXiv, abs/1806.01780, 2018.
  • Dabney et al. (2017) Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In AAAI Conference on Artificial Intelligence, 2017. URL https://api.semanticscholar.org/CorpusID:139930.
  • Dabney et al. (2020) Will Dabney, André Barreto, Mark Rowland, Robert Dadashi, John Quan, Marc G. Bellemare, and David Silver. The value-improvement path: Towards better representations for reinforcement learning. ArXiv, abs/2006.02243, 2020.
  • Espeholt et al. (2018) Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. ArXiv, abs/1802.01561, 2018.
  • Fedus et al. (2019) William Fedus, Carles Gelada, Yoshua Bengio, Marc G. Bellemare, and H. Larochelle. Hyperbolic discounting and learning over multiple horizons. ArXiv, abs/1902.06865, 2019.
  • Fujimoto et al. (2018a) Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, 2018a.
  • Fujimoto et al. (2018b) Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, 2018b. URL https://api.semanticscholar.org/CorpusID:3544558.
  • Guo et al. (2020) Qiushan Guo, Xinjiang Wang, Yichao Wu, Zhipeng Yu, Ding Liang, Xiaolin Hu, and Ping Luo. Online knowledge distillation via collaborative learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11017–11026, 2020.
  • Haarnoja et al. (2018a) Tuomas Haarnoja, Aurick Zhou, P. Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018a.
  • Haarnoja et al. (2018b) Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, G. Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, P. Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. ArXiv, abs/1812.05905, 2018b.
  • Hasselt et al. (2015) H. V. Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI Conference on Artificial Intelligence, 2015.
  • He & Boyd-Graber (2016) He He and Jordan L. Boyd-Graber. Opponent modeling in deep reinforcement learning. ArXiv, abs/1609.05559, 2016.
  • Hernandez et al. (2022) Daniel Hernandez, Hendrik Baier, and Michael Kaisers. Brexit: On opponent modelling in expert iteration. ArXiv, abs/2206.00113, 2022.
  • Hernandez-Leal et al. (2019) Pablo Hernandez-Leal, Bilal Kartal, and Matthew E. Taylor. Agent modeling as auxiliary task for deep reinforcement learning. In Artificial Intelligence and Interactive Digital Entertainment Conference, 2019.
  • Hong et al. (2017) Zhang-Wei Hong, Shih-Yang Su, Tzu-Yun Shann, Yi-Hsiang Chang, and Chun-Yi Lee. A deep policy inference q-network for multi-agent systems. ArXiv, abs/1712.07893, 2017.
  • Hong et al. (2020) Zhang-Wei Hong, Prabhat Nagarajan, and Guilherme J. Maeda. Periodic intra-ensemble knowledge distillation for reinforcement learning. ArXiv, abs/2002.00149, 2020.
  • Ishfaq et al. (2021) Haque Ishfaq, Qiwen Cui, Viet Huy Nguyen, Alex Ayoub, Zhuoran Yang, Zhaoran Wang, Doina Precup, and Lin F. Yang. Randomized exploration for reinforcement learning with general value function approximation. ArXiv, abs/2106.07841, 2021. URL https://api.semanticscholar.org/CorpusID:235435953.
  • Jaderberg et al. (2016) Max Jaderberg, Volodymyr Mnih, Wojciech M. Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. ArXiv, abs/1611.05397, 2016.
  • Januszewski et al. (2021) Piotr Januszewski, Mateusz Olko, Michal Królikowski, Jakub Bartlomiej Swiatkowski, Marcin Andrychowicz, Lukasz Kuciński, and Piotr Miłoś. Continuous control with ensemble deep deterministic policy gradients. ArXiv, abs/2111.15382, 2021.
  • Kartal et al. (2019) Bilal Kartal, Pablo Hernandez-Leal, and Matthew E. Taylor. Terminal prediction as an auxiliary task for deep reinforcement learning. In Artificial Intelligence and Interactive Digital Entertainment Conference, 2019.
  • Kostrikov (2021) Ilya Kostrikov. JAXRL: Implementations of Reinforcement Learning algorithms in JAX, 10 2021. URL https://github.com/ikostrikov/jaxrl.
  • Kumar et al. (2019) Aviral Kumar, Justin Fu, G. Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. In NeurIPS, 2019.
  • Kumar et al. (2021) Aviral Kumar, Rishabh Agarwal, Tengyu Ma, Aaron C. Courville, G. Tucker, and Sergey Levine. Dr3: Value-based deep reinforcement learning requires explicit regularization. ArXiv, abs/2112.04716, 2021. URL https://api.semanticscholar.org/CorpusID:245005650.
  • Kurutach et al. (2018) Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and P. Abbeel. Model-ensemble trust-region policy optimization. ArXiv, abs/1802.10592, 2018.
  • Lan et al. (2020) Qingfeng Lan, Yangchen Pan, Alona Fyshe, and Martha White. Maxmin q-learning: Controlling the estimation bias of q-learning. ArXiv, abs/2002.06487, 2020.
  • Lee et al. (2021) Kimin Lee, Michael Laskin, A. Srinivas, and P. Abbeel. Sunrise: A simple unified framework for ensemble learning in deep reinforcement learning. In ICML, 2021.
  • Li et al. (2023) Qiyang Li, Aviral Kumar, Ilya Kostrikov, and Sergey Levine. Efficient deep reinforcement learning requires regulating overfitting. ArXiv, abs/2304.10466, 2023.
  • Liang et al. (2022) Litian Liang, Yaosheng Xu, Stephen McAleer, Dailin Hu, Alexander T. Ihler, P. Abbeel, and Roy Fox. Reducing variance in temporal-difference value estimation via ensemble of deep networks. ArXiv, abs/2209.07670, 2022.
  • Lin (1992) Longxin Lin. Reinforcement learning for robots using neural networks. 1992.
  • Liu et al. (2020) Ge Liu, Rui Wu, Heng-Tze Cheng, Jing Wang, Jayden Ooi, Lihong Li, Ang Li, Wai Lok Sibon Li, Craig Boutilier, and Ed H. Chi. Data efficient training for reinforcement learning with adaptive behavior policy sharing. ArXiv, abs/2002.05229, 2020.
  • Machado et al. (2017) Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew J. Hausknecht, and Michael H. Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. ArXiv, abs/1709.06009, 2017.
  • Meng et al. (2022) Li Meng, Morten Goodwin, Anis Yazidi, and Paal Einar Engelstad. Improving the diversity of bootstrapped dqn by replacing priors with noise. IEEE Transactions on Games, 2022.
  • Mirowski et al. (2016) Piotr Wojciech Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andy Ballard, Andrea Banino, Misha Denil, Ross Goroshin, L. Sifre, Koray Kavukcuoglu, Dharshan Kumaran, and Raia Hadsell. Learning to navigate in complex environments. ArXiv, abs/1611.03673, 2016.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charlie Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
  • Nikishin et al. (2022) Evgenii Nikishin, Max Schwarzer, Pierluca D’Oro, Pierre-Luc Bacon, and Aaron C. Courville. The primacy bias in deep reinforcement learning. In ICML, 2022.
  • Osband et al. (2016) Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped dqn. In NIPS, 2016.
  • Osband et al. (2018) Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. ArXiv, abs/1806.03335, 2018.
  • Ostrovski et al. (2021) Georg Ostrovski, Pablo Samuel Castro, and Will Dabney. The difficulty of passive learning in deep reinforcement learning. In NeurIPS, 2021.
  • Peer et al. (2021) Oren Peer, Chen Tessler, Nadav Merlis, and Ron Meir. Ensemble bootstrapping for q-learning. In ICML, 2021.
  • Peng et al. (2020) Zhenghao Peng, Hao Sun, and Bolei Zhou. Non-local policy optimization via diversity-regularized collaborative exploration. ArXiv, abs/2006.07781, 2020.
  • Quan & Ostrovski (2020) John Quan and Georg Ostrovski. DQN Zoo: Reference implementations of DQN-based agents, 2020. URL http://github.com/deepmind/dqn_zoo.
  • Reid & Mukhopadhyay (2021) Cameron Reid and Snehasis Mukhopadhyay. Mutual reinforcement learning with heterogenous agents. 2021 IEEE International Conference on Smart Computing (SMARTCOMP), pp.  395–397, 2021.
  • Schaul et al. (2015) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. CoRR, abs/1511.05952, 2015. URL https://api.semanticscholar.org/CorpusID:13022595.
  • Schaul et al. (2022) Tom Schaul, André Barreto, John Quan, and Georg Ostrovski. The phenomenon of policy churn. ArXiv, abs/2206.00730, 2022. URL https://api.semanticscholar.org/CorpusID:249282416.
  • Schmitt et al. (2020) Simon Schmitt, Matteo Hessel, and Karen Simonyan. Off-policy actor-critic with shared experience replay. In ICML, 2020.
  • Schwarzer et al. (2020) Max Schwarzer, Ankesh Anand, Rishab Goel, R. Devon Hjelm, Aaron C. Courville, and Philip Bachman. Data-efficient reinforcement learning with self-predictive representations. In International Conference on Learning Representations, 2020.
  • Sokar et al. (2023) Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, and Utku Evci. The dormant neuron phenomenon in deep reinforcement learning. In International Conference on Machine Learning, 2023. URL https://api.semanticscholar.org/CorpusID:257219318.
  • Sun et al. (2022) Hao Sun, Lei Han, Rui Yang, Xiaoteng Ma, Jian Guo, and Bolei Zhou. Optimistic curiosity exploration and conservative exploitation with linear reward shaping. 2022. URL https://api.semanticscholar.org/CorpusID:252917631.
  • Towers et al. (2023) Mark Towers, Jordan K. Terry, Ariel Kwiatkowski, John U. Balis, Gianluca de Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Arjun KG, Markus Krimmel, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Andrew Tan Jin Shen, and Omar G. Younis. Gymnasium, March 2023. URL https://zenodo.org/record/8127025.
  • Wu & Gong (2020) Guile Wu and Shaogang Gong. Peer collaborative learning for online knowledge distillation. ArXiv, abs/2006.04147, 2020.
  • Wu et al. (2021) Yanqiu Wu, Xinyue Chen, Che Wang, Yiming Zhang, Zijian Zhou, and Keith W. Ross. Aggressive q-learning with ensembles: Achieving both high sample efficiency and high asymptotic performance. ArXiv, abs/2111.09159, 2021. URL https://api.semanticscholar.org/CorpusID:244270282.
  • Xue et al. (2020) Zeyue Xue, Shuang Luo, Chao Wu, Pan Zhou, Kaigui Bian, and Wei Du. Transfer heterogeneous knowledge among peer-to-peer teammates: A model distillation approach. ArXiv, abs/2002.02202, 2020.
  • Zhang et al. (2017) Ying Zhang, Tao Xiang, Timothy M. Hospedales, and Huchuan Lu. Deep mutual learning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  4320–4328, 2017.
  • Zhao & Hospedales (2020) Chenyang Zhao and Timothy M. Hospedales. Robust domain randomised reinforcement learning through peer-to-peer distillation. ArXiv, abs/2012.04839, 2020.

Appendix

\parttoc

Appendix A Algorithms

In this section, we provide the pseudocode for the ensemble algorithms used in this work. For simplicity, we only describe the case of batch size $B=1$. For $B>1$ we simply average the loss across the batch. Note that we use $\theta$ and $\phi$ to denote the parameters of the entire ensemble. In other words, $Q_i(s,a;\theta)$ only depends on a subset of $\theta$. None of the ensemble algorithms use data bootstrapping, as Osband et al. (2016) find it barely affects performance in complex domains. All ensemble gradient updates are performed in parallel in practice, though in the pseudocode they are written as for loops.

The provided algorithms are:

  • Algorithm 1 and Algorithm 2: Bootstrapped DQN and Bootstrapped DQN + CERL. The changes needed for CERL are highlighted in yellow.

  • Algorithm 3 and Algorithm 4: Ensemble SAC and Ensemble SAC + CERL. The changes needed for CERL are highlighted in yellow. For Ensemble SAC + CERL, we find it helpful to use the Huber loss with a threshold of 10 for the CERL auxiliary loss to prevent certain diverging ensemble members from affecting all other members. This is shown in the algorithm description. Also note that Clipped Double Q-learning (Fujimoto et al., 2018b) is used in practice but omitted in the algorithm description.
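To illustrate why a Huber loss limits the influence of a diverging member, the following minimal sketch implements the textbook Huber loss with a configurable threshold, together with its derivative. The exact scaling used in the authors' implementation is not stated here, so the standard definition is an assumption, and the function names are illustrative.

```python
def huber_loss(error, threshold=10.0):
    """Quadratic for |error| <= threshold, linear beyond it (standard definition)."""
    abs_error = abs(error)
    if abs_error <= threshold:
        return 0.5 * error * error
    return threshold * (abs_error - 0.5 * threshold)


def huber_grad(error, threshold=10.0):
    """Derivative of huber_loss with respect to error: the error clipped to the threshold."""
    return max(-threshold, min(threshold, error))


# A small error is treated like a squared error; a huge error from a
# diverging member contributes a gradient of magnitude at most `threshold`.
small, huge = 3.0, 1000.0
print(huber_loss(small), huber_grad(small))
print(huber_loss(huge), huber_grad(huge))
```

Because the gradient magnitude never exceeds the threshold, an auxiliary target produced by a diverging member cannot dominate the update of the other members' encoders.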

Note that the above only describes the behavior of these algorithms during training. When evaluating the individual ensemble members’ performance, these algorithms behave exactly as they do at training time, except that there are no learning updates. When evaluating the aggregate policies, we aggregate the policies of different ensemble members. For Bootstrapped DQN (+CERL), the policies are the greedy policies with respect to $\{Q_i\}_{i=1}^{N}$ (or the main heads $\{Q_i^i\}_{i=1}^{N}$ for CERL) and are aggregated via majority voting; specifically, for each visited state during evaluation, for each action in the action space, we count the number of ensemble members that select that action (i.e., votes) and we select the action with the most votes. Ties are broken randomly. For Ensemble SAC (+CERL), we simply use the average of actions sampled from different policies. For example, if we visit state $s$ during evaluation, we sample $a_i \sim \pi_i(\cdot|s)$ from each ensemble member, and then take the action $\frac{1}{N}\sum_{i=1}^{N} a_i$ in the environment.
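The two aggregation rules described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: Q-values and actions are represented as nested lists, the function names are hypothetical, and ties within a single member's greedy action are resolved by lowest index, which is an assumption.

```python
import random
from collections import Counter


def majority_vote(q_values_per_member, rng=random):
    """Pick the action chosen greedily by the most members; break ties randomly.

    q_values_per_member[i][a] is member i's Q-value for action a.
    """
    votes = Counter(
        max(range(len(q)), key=q.__getitem__) for q in q_values_per_member
    )
    top = max(votes.values())
    winners = [action for action, count in votes.items() if count == top]
    return rng.choice(winners)


def average_action(sampled_actions):
    """Element-wise mean of continuous actions, one sampled action per member."""
    n = len(sampled_actions)
    return [sum(dim) / n for dim in zip(*sampled_actions)]


# Three members, three discrete actions: greedy choices are 1, 0, 1.
q_values = [[1.0, 2.0, 0.0], [0.5, 0.1, 0.2], [0.0, 3.0, 1.0]]
print(majority_vote(q_values))

# Two members, two-dimensional continuous actions.
print(average_action([[1.0, -1.0], [0.0, 1.0]]))
```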

Require: total interaction steps $M$, ensemble size $N$, gradient update period $P$, target update period $T$, $N$ value functions $\{Q_i(s,a;\theta)\}_{i=1}^{N}$, $N$ target value functions $\{\bar{Q}_i(s,a;\bar{\theta})\}_{i=1}^{N}$, replay buffer $\mathcal{D}$
$t \leftarrow \mathrm{True}$ $\triangleright$ Terminal state indicator/whether to start a new episode
for $m \leftarrow 1$ to $M$ do
     if $t = \mathrm{True}$ then $\triangleright$ New episode
          $s_m \leftarrow \mathrm{reset}(\mathrm{env})$
          $k \sim \mathrm{Uniform}(\{1,\ldots,N\})$ $\triangleright$ Randomly sample a member for acting
     else
          $s_m \leftarrow s_{m-1}'$
     end if
     $a_m \leftarrow \arg\max_a Q_k(s_m, a; \theta)$
     $a_m \sim \epsilon\text{-greedy}(a_m, m)$ $\triangleright$ Epsilon greedy policy with a decay schedule
     $s_m', r_m, t_m \leftarrow \mathrm{step}(\mathrm{env}, a_m)$
     Add $(s_m, a_m, r_m, s_m', t_m)$ to the shared replay buffer $\mathcal{D}$
     if $m \bmod P = 0$ then $\triangleright$ Gradient update
          Sample a transition $(s, a, r, s', t)$ from $\mathcal{D}$
          for $j \in \{1,\ldots,N\}$ do
               $a' \leftarrow \arg\max_{a'} Q_j(s', a'; \theta)$
               $y_j \leftarrow \begin{cases} r + \gamma \bar{Q}_j(s', a'; \bar{\theta}) & \text{if } t = \mathrm{False} \\ r & \text{if } t = \mathrm{True} \end{cases}$ $\triangleright$ Double DQN update
          end for
          $L(\theta) \leftarrow \sum_{i=1}^{N} \operatorname{huber\_loss}(Q_i(s, a; \theta) - y_i)$
          $\delta\theta \leftarrow \nabla_\theta L(\theta)$
          $\delta\theta \leftarrow \operatorname{scale\_grad}(\delta\theta)$ $\triangleright$ Scale the gradients of the encoder(s). See Section C.2.
          $\theta \leftarrow \operatorname{optimizer}(\theta, \delta\theta)$
     end if
     if $m \bmod T = 0$ then
          $\bar{\theta} \leftarrow \theta$ $\triangleright$ Update target network
     end if
end for
Algorithm 1 Bootstrapped DQN (training)
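The gradient-update step of Algorithm 1 can be made concrete with a small, dependency-free sketch that computes the Double DQN target for each ensemble member and sums the Huber losses for a single sampled transition. Q-values are passed in as nested lists instead of being produced by networks, and the Huber threshold of 1.0, the discount value, and all function names are illustrative assumptions rather than details taken from the authors' code.

```python
def huber(error, threshold=1.0):
    """Standard Huber loss; the threshold of 1.0 is an assumed default."""
    abs_error = abs(error)
    if abs_error <= threshold:
        return 0.5 * error * error
    return threshold * (abs_error - 0.5 * threshold)


def argmax(values):
    """Index of the largest value (lowest index on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def ensemble_loss(q_s, q_next, q_next_target, action, reward, terminal, gamma=0.99):
    """Summed Huber TD loss over members for one transition (s, a, r, s', t).

    q_s[i][a]           : Q_i(s, a; theta)
    q_next[i][a]        : Q_i(s', a; theta), online network, selects a'
    q_next_target[i][a] : target network value at s', evaluates a'
    """
    loss = 0.0
    for i in range(len(q_s)):
        if terminal:
            y = reward
        else:
            a_next = argmax(q_next[i])                        # action selection
            y = reward + gamma * q_next_target[i][a_next]     # action evaluation
        loss += huber(q_s[i][action] - y)
    return loss


# Two members, two actions, one sampled transition with a = 1 and r = 1.
q_s = [[1.0, 2.0], [0.0, 0.5]]
q_next = [[0.0, 1.0], [2.0, 1.0]]
q_next_target = [[0.5, 1.5], [1.0, 3.0]]
print(ensemble_loss(q_s, q_next, q_next_target, action=1, reward=1.0,
                    terminal=False, gamma=0.5))
```

Every member is trained on the same shared transition, but each one bootstraps from its own online and target values, which is what keeps the members' targets distinct.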
total interaction steps $M$, ensemble size $N$, gradient update period $P$, target update period $T$, $N \times N$ value functions $\{Q_i^j(s,a;\theta)\}_{i,j=1}^{N}$, $N \times N$ target value functions $\{\bar{Q}_i^j(s,a;\bar{\theta})\}_{i,j=1}^{N}$, replay buffer $\mathcal{D}$
$t \leftarrow \mathrm{True}$ $\triangleright$ Terminal state indicator/whether to start a new episode
for $m \leftarrow 1$ to $M$ do
     if $t = \mathrm{True}$ then $\triangleright$ New episode
         $s_m \leftarrow \mathrm{reset}(\mathrm{env})$
         $k \sim \mathrm{Uniform}(\{1,\ldots,N\})$ $\triangleright$ Randomly sample a member for acting
     else
         $s_m \leftarrow s_{m-1}'$
     end if
     $a_m \leftarrow \arg\max_{a} Q_k^k(s_m, a; \theta)$
     $a_m \sim \operatorname{\epsilon-\mathrm{greedy}}(a_m, m)$ $\triangleright$ Epsilon-greedy policy with a decay schedule
     $s_m', r_m, t_m \leftarrow \mathrm{step}(\mathrm{env}, a_m)$
     Add $(s_m, a_m, r_m, s_m', t_m)$ to the shared replay buffer $\mathcal{D}$
     if $m \bmod P = 0$ then $\triangleright$ Gradient update
         Sample a transition $(s, a, r, s', t)$ from $\mathcal{D}$
         for $j \in \{1,\ldots,N\}$ do
              $a' \leftarrow \arg\max_{a'} Q_j^j(s', a'; \theta)$
              $y_j \leftarrow \begin{cases} r + \gamma \bar{Q}_j^j(s', a'; \bar{\theta}) & \text{if } t = \mathrm{False} \\ r & \text{if } t = \mathrm{True} \end{cases}$ $\triangleright$ Double DQN update
         end for
         $L(\theta) \leftarrow \sum_{i=1}^{N} \sum_{j=1}^{N} \operatorname{huber\_loss}(Q_i^j(s,a;\theta) - y_j)$
         $\delta\theta \leftarrow \nabla_{\theta} L(\theta)$
         $\delta\theta \leftarrow \operatorname{scale\_grad}(\delta\theta)$ $\triangleright$ Scale the gradients of the encoder(s). See Section C.2.
         $\theta \leftarrow \operatorname{optimizer}(\theta, \delta\theta)$
     end if
     if $m \bmod T = 0$ then
         $\bar{\theta} \leftarrow \theta$ $\triangleright$ Update target network
     end if
end for
Algorithm 2 Bootstrapped DQN + CERL (training)
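The gradient-update step of Algorithm 2 can be sketched in plain Python as follows. This is a minimal illustration, not the Dopamine/JAX implementation used in the paper: the function names (`huber_loss`, `double_dqn_target`, `cerl_loss`), the nested-list layout `q[i][j][a]` for $Q_i^j(s,a;\theta)$, the Huber threshold of 1, and the toy numbers are all assumptions made for this example. Only the discount factor 0.99 is taken from Table 1.

```python
def huber_loss(x, delta=1.0):
    """Huber loss of a scalar residual: quadratic within delta, linear outside."""
    ax = abs(x)
    return 0.5 * x * x if ax <= delta else delta * (ax - 0.5 * delta)


def double_dqn_target(r, terminal, q_next_online, q_next_target, gamma=0.99):
    """y_j: choose a' with the online head Q_j^j, evaluate it with the target head."""
    if terminal:
        return r
    a_star = max(range(len(q_next_online)), key=lambda a: q_next_online[a])
    return r + gamma * q_next_target[a_star]


def cerl_loss(q, a, y):
    """Sum over all N x N heads of huber_loss(Q_i^j(s, a) - y_j).

    q[i][j] holds the action values of head j on encoder i at state s,
    a is the action taken, and y[j] is the target of member j.
    """
    n = len(y)
    return sum(huber_loss(q[i][j][a] - y[j]) for i in range(n) for j in range(n))


# Toy example with N = 2 members and 2 actions (numbers are made up).
y = [
    double_dqn_target(1.0, False, [0.2, 0.5], [0.3, 0.4]),  # 1 + 0.99 * 0.4
    double_dqn_target(1.0, True, [0.2, 0.5], [0.3, 0.4]),   # terminal: y = r
]
q = [[[0.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
loss = cerl_loss(q, a=1, y=y)
```

Every head $Q_i^j$ regresses onto the target $y_j$ of member $j$, so the diagonal heads ($i = j$) recover the usual Bootstrapped DQN loss while the off-diagonal heads form the auxiliary task.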
total interaction steps $M$, ensemble size $N$, gradient update period $P$, target update temperature $\tau$, $N$ value functions $\{Q_i(s,a;\theta)\}_{i=1}^{N}$, $N$ target value functions $\{\bar{Q}_i(s,a;\bar{\theta})\}_{i=1}^{N}$, $N$ policies $\{\pi_i(a|s;\phi)\}_{i=1}^{N}$, $N$ entropy temperatures $\{\alpha_i\}_{i=1}^{N}$, replay buffer $\mathcal{D}$
$t \leftarrow \mathrm{True}$ $\triangleright$ Terminal state indicator/whether to start a new episode
for $m \leftarrow 1$ to $M$ do
     if $t = \mathrm{True}$ then $\triangleright$ New episode
         $s_m \leftarrow \mathrm{reset}(\mathrm{env})$
         $k \sim \mathrm{Uniform}(\{1,\ldots,N\})$ $\triangleright$ Randomly sample a member for acting
     else
         $s_m \leftarrow s_{m-1}'$
     end if
     $a_m \sim \pi_k(\cdot|s_m;\phi)$
     $s_m', r_m, t_m \leftarrow \mathrm{step}(\mathrm{env}, a_m)$
     Add $(s_m, a_m, r_m, s_m', t_m)$ to the shared replay buffer $\mathcal{D}$
     if $m \bmod P = 0$ then $\triangleright$ Gradient update
         Sample a transition $(s, a, r, s', t)$ from $\mathcal{D}$
         for $j \in \{1,\ldots,N\}$ do
              $a' \sim \pi_j(\cdot|s';\phi)$
              $y_j \leftarrow \begin{cases} r + \gamma \bar{Q}_j(s', a'; \bar{\theta}) & \text{if } t = \mathrm{False} \\ r & \text{if } t = \mathrm{True} \end{cases}$
              $a_j(\phi) \sim \pi_j(\cdot|s;\phi)$ $\triangleright$ Note that with the reparameterization trick, $a_j$ depends on $\phi$
         end for
         $L_{\text{critic}}(\theta) \leftarrow \sum_{i=1}^{N} (Q_i(s,a;\theta) - y_i)^2$
         $\delta\theta \leftarrow \nabla_{\theta} L_{\text{critic}}(\theta)$ and $\theta \leftarrow \operatorname{optimizer}(\theta, \delta\theta)$ $\triangleright$ Update critic
         $L_{\text{actor}}(\phi) \leftarrow -\sum_{i=1}^{N} (Q_i(s, a_i(\phi); \theta) - \alpha_i \log \pi_i(a_i(\phi)|s;\phi))$
         $\delta\phi \leftarrow \nabla_{\phi} L_{\text{actor}}(\phi)$ and $\phi \leftarrow \operatorname{optimizer}(\phi, \delta\phi)$ $\triangleright$ Update actor
         Update the entropy temperatures $\{\alpha_i\}$ for each member as in standard SAC
     end if
     $\bar{\theta} \leftarrow (1-\tau)\bar{\theta} + \tau\theta$ $\triangleright$ Update target network
end for
Algorithm 3 Ensemble SAC (training)
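The target-network update on the last line of the loop in Algorithm 3 is an exponential moving average of the online critic parameters. A minimal sketch, assuming the parameters are stored as flat lists of floats; the function name `soft_update` and the values of $\tau$ used here are illustrative and not taken from the paper:

```python
def soft_update(target_params, online_params, tau=0.005):
    """Return (1 - tau) * target + tau * online, element-wise."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


# With a deliberately large tau, the target approaches the online value quickly.
target = [0.0, 1.0]
online = [1.0, 1.0]
for _ in range(3):
    target = soft_update(target, online, tau=0.5)
```

Unlike the periodic hard copy $\bar{\theta} \leftarrow \theta$ used in the DQN-based algorithms, this update is applied at every interaction step and moves the target only a fraction $\tau$ of the way toward the online parameters.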
interaction steps $M$, ensemble size $N$, gradient update period $P$, target update temperature $\tau$, $N \times N$ value functions $\{Q_i^j(s,a;\theta)\}_{i,j=1}^{N}$, $N \times N$ target value functions $\{\bar{Q}_i^j(s,a;\bar{\theta})\}_{i,j=1}^{N}$, $N$ entropy temperatures $\{\alpha_i\}_{i=1}^{N}$, replay buffer $\mathcal{D}$
$t \leftarrow \mathrm{True}$ $\triangleright$ Terminal state indicator/whether to start a new episode
for $m \leftarrow 1$ to $M$ do
     if $t = \mathrm{True}$ then $\triangleright$ New episode
         $s_m \leftarrow \mathrm{reset}(\mathrm{env})$
         $k \sim \mathrm{Uniform}(\{1,\ldots,N\})$ $\triangleright$ Randomly sample a member for acting
     else
         $s_m \leftarrow s_{m-1}'$
     end if
     $a_m \sim \pi_k(\cdot|s_m;\phi)$
     $s_m', r_m, t_m \leftarrow \mathrm{step}(\mathrm{env}, a_m)$
     Add $(s_m, a_m, r_m, s_m', t_m)$ to the shared replay buffer $\mathcal{D}$
     if $m \bmod P = 0$ then $\triangleright$ Gradient update
         Sample a transition $(s, a, r, s', t)$ from $\mathcal{D}$
         for $j \in \{1,\ldots,N\}$ do
              $a' \sim \pi_j(\cdot|s';\phi)$
              $y_j \leftarrow \begin{cases} r + \gamma \bar{Q}_j^j(s', a'; \bar{\theta}) & \text{if } t = \mathrm{False} \\ r & \text{if } t = \mathrm{True} \end{cases}$
              $a_j(\phi) \sim \pi_j(\cdot|s;\phi)$ $\triangleright$ Note that with the reparameterization trick, $a_j$ depends on $\phi$
         end for
         $L_{\text{critic}}(\theta) \leftarrow \sum_{i=1}^{N} \sum_{j=1, j\neq i}^{N} \operatorname{huber\_loss}(Q_i^j(s,a;\theta) - y_j) + \sum_{i=1}^{N} (Q_i^i(s,a;\theta) - y_i)^2$
         $\delta\theta \leftarrow \nabla_{\theta} L_{\text{critic}}(\theta)$ and $\theta \leftarrow \operatorname{optimizer}(\theta, \delta\theta)$ $\triangleright$ Update critic
         $L_{\text{actor}}(\phi) \leftarrow -\sum_{i=1}^{N} (Q_i^i(s, a_i(\phi); \theta) - \alpha_i \log \pi_i(a_i(\phi)|s;\phi))$
         $\delta\phi \leftarrow \nabla_{\phi} L_{\text{actor}}(\phi)$ and $\phi \leftarrow \operatorname{optimizer}(\phi, \delta\phi)$ $\triangleright$ Update actor
         Update the entropy temperatures $\{\alpha_i\}$ for each member as in standard SAC
     end if
     $\bar{\theta} \leftarrow (1-\tau)\bar{\theta} + \tau\theta$ $\triangleright$ Update target network
end for
Algorithm 4 Ensemble SAC + CERL (training)

Appendix B Experimental details

B.1 Atari

For Atari, we use the same set of 55 games as Agarwal et al. (2021). Following Castro et al. (2018), the training process is divided into 200 iterations, each of which contains 1M frames. At the end of each iteration, the networks are frozen and evaluated for at least 500k frames. All Atari experiments use 5 seeds.

For each game, the human-normalized score (HNS) is computed as follows:

$$\mathrm{Score}_{\mathrm{normalized}} = \frac{\mathrm{Score}_{\mathrm{Agent}} - \mathrm{Score}_{\mathrm{Random}}}{\mathrm{Score}_{\mathrm{Human}} - \mathrm{Score}_{\mathrm{Random}}}$$

where $\mathrm{Score}_{\mathrm{Agent}}$ is the raw score of the considered agent, and $\mathrm{Score}_{\mathrm{Human}}$ and $\mathrm{Score}_{\mathrm{Random}}$ are the raw scores of the human and the random agent respectively. The raw score of an agent is calculated as the undiscounted evaluation return averaged over the last 10 training iterations and 5 seeds. The scores of the human and random agents are taken from DQN Zoo (Quan & Ostrovski, 2020). To obtain results in Double DQN normalized scores, simply replace $\mathrm{Score}_{\mathrm{Human}}$ with $\mathrm{Score}_{\mathrm{DDQN}}$ in the above equation. $\mathrm{Score}_{\mathrm{DDQN}}$ is the raw score obtained by the Double DQN agent, averaged over the last 10 training iterations and 5 seeds.
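As a concrete illustration of the normalization above, the following sketch computes a raw score and its normalized value. The function names and all numbers are invented for this example and are not real Atari scores; `reference` stands for either the human score or the Double DQN score, depending on which normalization is wanted.

```python
def raw_score(returns_per_seed, last_k=10):
    """Mean evaluation return over the last `last_k` iterations of every seed."""
    tail = [r for seed in returns_per_seed for r in seed[-last_k:]]
    return sum(tail) / len(tail)


def normalized_score(agent, random_score, reference):
    """(agent - random) / (reference - random)."""
    return (agent - random_score) / (reference - random_score)


# Two seeds with three iterations each; average over the last two iterations.
agent = raw_score([[0.0, 100.0, 200.0], [100.0, 200.0, 300.0]], last_k=2)
hns = normalized_score(agent, random_score=50.0, reference=350.0)
```

A normalized score of 0 corresponds to the random agent and a score of 1 to the reference agent.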

When showing per-game improvements, we simply compute the difference in human-normalized scores. Per-game improvement bar plots (e.g., Figure 2 (top-right)) are shown in log scale with a linear threshold at 0.1.

B.2 MuJoCo

Each agent is trained for a total of 1M steps. Evaluation is performed every 20k steps with 30 evaluation episodes. When reporting the final performance in an environment, we average the last 5 evaluation results at the end of training. All MuJoCo experiments use 30 seeds except for the $10\%$-tandem experiments, where we use 10 seeds.

B.3 $p\%$-tandem experiments

Taking Bootstrapped DQN as an example, the easiest way to understand the experimental setup is to notice that setting $p = 50\%$ recovers the standard Bootstrapped DQN with 2 ensemble members. The only differences between $p\%$-tandem and Bootstrapped DQN with $N = 2$ are that (1) for each training episode, instead of sampling each agent for acting with the same $50\%$ probability, we sample the active agent with probability $1 - p\%$ and the passive agent with probability $p\%$, and (2) we record the performances of the active and passive agents separately during evaluation.
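The per-episode choice of acting agent in the tandem setup can be sketched as below. The function name `sample_acting_agent`, the string labels, and the fixed seed are assumptions made for this illustration; here `p` is a fraction in $[0, 1]$, so the $10\%$-tandem setting corresponds to `p = 0.10`.

```python
import random


def sample_acting_agent(p, rng):
    """Return "passive" with probability p and "active" with probability 1 - p."""
    return "passive" if rng.random() < p else "active"


rng = random.Random(0)  # fixed seed so the sketch is reproducible
episodes = [sample_acting_agent(0.10, rng) for _ in range(10_000)]
passive_fraction = episodes.count("passive") / len(episodes)
```

Setting `p = 0.5` gives both agents the same chance of acting, which matches the uniform member sampling of Bootstrapped DQN with $N = 2$.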

Appendix C Implementation details

C.1 Implementation and hyperparameters

The SAC and Ensemble SAC algorithms are built on JAXRL (Kostrikov, 2021) and use the default hyperparameters. In the actor implementation, we use tanh instead of hard clipping for the log_std parameter for better stability. Double DQN and Bootstrapped DQN are implemented using the Dopamine (Castro et al., 2018) framework in JAX (Bradbury et al., 2018). As our work heavily refers to Osband et al. (2016) and Ostrovski et al. (2021), we mostly follow the hyperparameter setup in DQN Zoo (Quan & Ostrovski, 2020), because it is closer to the setups used in Osband et al. (2016) and Ostrovski et al. (2021). We do not directly build on DQN Zoo mainly because the implementation in Dopamine is more efficient. The hyperparameters for Double DQN are listed in Table 1. Note that one “agent step” corresponds to 4 environment frames due to the use of action repetitions/frame-skip.
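To illustrate the difference between hard clipping and a tanh bound for log_std, the sketch below shows one common way of mapping an unbounded network output into a fixed interval. The exact form used in the paper's implementation is not specified, so the formula, the function names, and the bounds `LOG_STD_MIN` and `LOG_STD_MAX` are all assumptions.

```python
import math

LOG_STD_MIN, LOG_STD_MAX = -10.0, 2.0  # assumed bounds, not taken from the paper


def clip_log_std(x):
    """Hard clipping: constant (zero gradient) outside the bounds."""
    return min(max(x, LOG_STD_MIN), LOG_STD_MAX)


def tanh_log_std(x):
    """Smooth alternative: tanh maps x to (-1, 1), then rescale to the bounds."""
    return LOG_STD_MIN + 0.5 * (LOG_STD_MAX - LOG_STD_MIN) * (math.tanh(x) + 1.0)
```

With hard clipping, any output beyond the bounds receives no gradient signal, whereas the tanh mapping is smooth and strictly increasing in `x`.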

Bootstrapped DQN reuses all the hyperparameters of Double DQN, with two additional hyperparameters: ensemble size $N$ and the number of shared bottom layers $L$. We set $N = 10$ and $L = 0$ by default.

As mentioned in Appendix A, for Ensemble SAC + CERL, we find it helpful to use the Huber loss with a threshold of 10 for the CERL auxiliary loss to prevent certain diverging ensemble members from affecting all other members. The main head still uses the regular MSE loss, so the use of the Huber loss only affects the auxiliary task.
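The effect of this choice can be seen in a small sketch of the critic loss of Ensemble SAC + CERL, with MSE on the main heads and Huber loss (threshold 10) on the auxiliary heads. The function names, the layout `q[i][j]` for $Q_i^j(s,a;\theta)$, the standard Huber definition with a factor of $1/2$, and the toy numbers are assumptions made for this example.

```python
def huber_loss(x, delta=10.0):
    """Quadratic for |x| <= delta, linear beyond it."""
    ax = abs(x)
    return 0.5 * x * x if ax <= delta else delta * (ax - 0.5 * delta)


def cerl_critic_loss(q, y, delta=10.0):
    """MSE on main heads (i == j) plus Huber loss on auxiliary heads (i != j)."""
    n = len(y)
    main = sum((q[i][i] - y[i]) ** 2 for i in range(n))
    aux = sum(
        huber_loss(q[i][j] - y[j], delta)
        for i in range(n)
        for j in range(n)
        if j != i
    )
    return main + aux


# Toy example: the auxiliary head Q_0^1 has a very large residual of 100.
y = [0.0, 0.0]
q = [[1.0, 100.0], [2.0, 3.0]]
loss = cerl_critic_loss(q, y)
```

In this toy case the residual of 100 contributes 950 through the Huber loss, whereas a squared error would contribute 10000, so a single diverging target has a bounded-slope influence on the shared encoders.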

Table 1: Hyperparameters for Double DQN
Parameter | Value
Gray-scaling | True
Observation down-sampling | $84 \times 84$
Frames stacked | 4
Action repetitions | 4
Sticky actions | False
Reward clipping | $[-1, 1]$
Terminal on loss of life | False
Max frames per episode | 108k
$Q$ update rule | Double DQN
Discount factor | 0.99
Minibatch size | 32
Replay buffer size | $10^{6}$
Optimizer | RMSProp
Optimizer: learning rate | 0.00025
Optimizer: RMSProp decay | 0.95
Optimizer: RMSProp centered | True
Optimizer: $\epsilon$ | $1/32^{2}$
Huber loss | True
Evaluation frames | 500k
Evaluation period in frames | 1M
Min replay size for sampling | 50000
Gradient update period in agent steps | 4
Target network update period in agent steps | 30000
Exploration: $\epsilon$ during training | $1.0 \to 0.01$
Exploration: $\epsilon$ decay period in agent steps | 16M
Exploration: $\epsilon$ during evaluation | 0.01
Q network: channels | 32, 64, 64
Q network: filter size | $8 \times 8$, $4 \times 4$, $3 \times 3$
Q network: stride | 4, 2, 1
Q network: hidden units | 512
Q network: padding type | valid convolution
Q network: share bias across action heads | True
Q network: initialization | See Quan & Ostrovski (2020)

C.2 Gradient scaling

Following Osband et al. (2016), we scale the gradients of the encoder(s) to ensure that the magnitude of the gradients does not increase with the number of heads. For example, in CERL, the gradients of the encoders $f_i$ will be divided by $N$, which is the number of heads on top of this encoder.
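A minimal sketch of this scaling step, assuming gradients are stored in a dictionary keyed by parameter-group name. The function signature and the convention that encoder groups have names starting with `"encoder"` are hypothetical and chosen only for this illustration:

```python
def scale_grad(grads, num_heads):
    """Divide encoder gradients by the number of heads; keep head gradients as-is."""
    return {
        name: [g / num_heads for g in values]
        if name.startswith("encoder")
        else list(values)
        for name, values in grads.items()
    }


# One encoder with N = 10 heads on top of it, and one of those heads.
grads = {"encoder_0": [10.0, -20.0], "head_0_0": [1.0]}
scaled = scale_grad(grads, num_heads=10)
```

Because each encoder receives the summed gradient of all heads attached to it, dividing by the number of heads keeps its update magnitude comparable to the single-head case.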

C.3 Computational costs

On a server with an NVIDIA RTX8000 GPU and an AMD EPYC 7502 CPU, each seed of our implementation of Double DQN, Bootstrapped DQN ($N = 10$, sharing 0 layers), and Bootstrapped DQN ($N = 10$, sharing 3 layers) takes roughly 2, 3.5, and 3 days respectively for 200M frames. Applying CERL does not noticeably increase the wall-clock time in our experiments. Running 10 seeds of SAC, Ensemble SAC, and Ensemble SAC + CERL with $N = 10$ on Humanoid-v4 takes roughly 5, 12.5, and 13 hours respectively.

Appendix D Additional results

D.1 $10\%$-tandem experiments for continuous control tasks

In Figure 8 we show the results of the $10\%$-tandem experiment in MuJoCo tasks, which show the same patterns as those in Atari. In MuJoCo tasks, the performance gap between the active and the passive agent is larger than that between Ensemble SAC (indiv.) and SAC. This might be because Ensemble SAC (indiv.) provides better exploration and partially compensates for the performance loss due to challenging off-policy learning.

Figure 8: $10\%$-tandem experiments in MuJoCo tasks with a replay buffer size of 200k. Results are aggregated over 10 seeds for the active and passive agents and 30 seeds for others. Shaded areas show $95\%$ CIs.

D.2 Per-game results

D.2.1 Per-game learning curves

Figure 9 shows per-game comparisons between Double DQN, vanilla Bootstrapped DQN, and CERL with policy aggregation. Figure 10 shows similar comparisons without policy aggregation. Figure 11 shows the per-game results for the $10\%$-tandem experiment.

Figure 9: Per-game comparisons between Double DQN, vanilla Bootstrapped DQN, and CERL with voting. All results are aggregated over 5 seeds. Shaded areas show 95% bootstrapped CIs. The learning curves are smoothed with a sliding window of 5 iterations.
Figure 10: Per-game comparisons between Double DQN, vanilla Bootstrapped DQN, and CERL without voting. All results are aggregated over 5 seeds. Shaded areas show 95% bootstrapped CIs. The learning curves are smoothed with a sliding window of 5 iterations.
Figure 11: Per-game comparisons of the 10%-tandem experiment. All results are aggregated over 5 seeds. Shaded areas show 95% bootstrapped CIs. The learning curves are smoothed with a sliding window of 5 iterations.
D.2.2 Per-game analysis of the benefits of majority voting

In Figure 12 we show the performance gap between Bootstrapped DQN (agg.) and Bootstrapped DQN (indiv.) in each game. As shown in the results, majority voting provides significant performance gains in almost all games.

Figure 12: Per-game performance gap between Bootstrapped DQN (agg.) and Bootstrapped DQN (indiv.) in HNS.
D.2.3 Per-game performance of CERL

Per-game performance of CERL is shown in Figure 9 (with voting) and Figure 10 (without voting). We also show per-game improvements of CERL over Bootstrapped DQN in Figure 13.

Figure 13: Per-game improvements of Bootstrapped DQN + CERL over Bootstrapped DQN in HNS, with and without policy aggregation.

D.3 Additional analysis of the curse of diversity

D.3.1 Over-sampling self-generated data in the training batches

Even though we cannot control the proportion of self-generated data in the replay buffer for each ensemble member, it is possible to control this proportion in the training batches. In this experiment, for each ensemble member, with 50% probability we sample the training batch only from self-generated transitions; otherwise, we sample uniformly from all the data as usual. This ensures that the proportion of self-generated data in the training batches for each ensemble member is at least 50%.
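The sampling scheme described above can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (a flat buffer with a hypothetical `owner_ids` array recording which member generated each transition); it is not the code used in our experiments.

```python
import numpy as np

def sample_batch_indices(owner_ids, member, batch_size, rng, self_prob=0.5):
    """Return replay-buffer indices for one training batch of `member`.

    owner_ids[k] is the ensemble member that generated transition k
    (hypothetical bookkeeping array, assumed for this sketch).
    """
    owner_ids = np.asarray(owner_ids)
    own = np.flatnonzero(owner_ids == member)
    if own.size > 0 and rng.random() < self_prob:
        # With probability self_prob, draw the whole batch from
        # self-generated transitions only.
        return rng.choice(own, size=batch_size, replace=True)
    # Otherwise sample uniformly from all data, as usual.
    return rng.integers(0, owner_ids.size, size=batch_size)

rng = np.random.default_rng(0)
owner_ids = rng.integers(0, 10, size=10_000)  # 10 members, roughly equal shares
batches = [sample_batch_indices(owner_ids, 3, 32, rng) for _ in range(2_000)]
# Average share of member 3's own data across its training batches.
self_fraction = np.mean([np.mean(owner_ids[b] == 3) for b in batches])
```

In this toy setting the average share of self-generated data is the 50% from the dedicated batches plus the member's share of the uniformly sampled batches.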

Figure 14 shows the effects of over-sampling self-generated data in the training batches for each ensemble member. As can be seen, this technique does not mitigate the performance loss relative to single-agent Double DQN.

Figure 14: Effects of oversampling self-generated data in the training batches. (top) Performance with voting. (bottom) Performance without voting. All results are aggregated over 5 seeds. Shaded areas show 95% bootstrapped CIs. The learning curves are smoothed with a sliding window of 5 iterations.
D.3.2 Episode termination condition

In Atari, we have the option to terminate an episode when a life is lost, or only when the game is over. The former option may help the agent quickly learn the significance of death. In the context of ensemble-based exploration, shorter episodes make it less likely that a single ensemble member will dominate the environment interaction for a long stretch of time, during which the other ensemble members perform no interaction at all.

Our work follows the recommendation of Machado et al. (2017) and only terminates an episode when the game is over. In Figure 15 we show the performance of Bootstrapped DQN and Double DQN when we set the termination condition to life loss. As can be seen, the curse of diversity still remains.

Figure 15: Results with the episode termination condition set to life loss. (top) Performance with voting. (bottom) Performance without voting. All results are aggregated over 5 seeds. Shaded areas show 95% bootstrapped CIs. The learning curves are smoothed with a sliding window of 5 iterations.
D.3.3 Distributional RL

Recent work (Agarwal et al., 2020) has shown that the QR-DQN algorithm (Dabney et al., 2017) can be more effective than DQN in offline RL. In Figure 16 we test a variant of Bootstrapped DQN where we train an ensemble of QR-DQN agents instead of Double DQN agents (named Bootstrapped QR-DQN). We use the implementation and default hyperparameters of QR-DQN in Dopamine (Castro et al., 2018) except that we do not use prioritized experience replay (Schaul et al., 2015) as it is unclear whether it is appropriate to prioritize transitions based on the TD error of the entire ensemble.

As shown in Figure 16, there exists a significant performance gap between QR-DQN and Bootstrapped QR-DQN. This suggests that the curse of diversity is also present in distributional RL.

Figure 16: Performance of Bootstrapped DQN when using QR-DQN as the base algorithm. All results are aggregated over 5 seeds. Shaded areas show 95% bootstrapped CIs. The learning curves are smoothed with a sliding window of 5 iterations.
D.3.4 Switching ensemble members on a per-step basis

The original Bootstrapped DQN switches the ensemble member used for acting at the start of each episode. In Figure 17, we test a variant of Bootstrapped DQN where we switch the ensemble member for acting on a per-step basis. Note that this no longer allows “temporally extended exploration” (Osband et al., 2016) which is the original motivation of Bootstrapped DQN, but it might mitigate the off-policy learning issue as it allows all ensemble members (as opposed to just one of them) to generate data within the period of an episode.

As shown in Figure 17, switching ensemble members more frequently does not mitigate the curse of diversity. This is not surprising as it does not change the proportion of self-generated data for each ensemble member in the replay buffer.
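The point that switching frequency does not change the proportion of self-generated data can be checked with a small simulation. The sketch below assumes, purely for illustration, that the acting member is drawn uniformly at random and that episode lengths do not depend on which member acts; the function and variable names are hypothetical.

```python
import numpy as np

def self_generated_fraction(num_members, episode_lengths, per_step, rng):
    """Fraction of buffer transitions generated by each ensemble member."""
    owners = []
    for length in episode_lengths:
        if per_step:
            # A new acting member is drawn at every step.
            owners.append(rng.integers(0, num_members, size=length))
        else:
            # One acting member is drawn for the whole episode.
            owners.append(np.full(length, rng.integers(0, num_members)))
    owners = np.concatenate(owners)
    return np.bincount(owners, minlength=num_members) / owners.size

rng = np.random.default_rng(0)
lengths = rng.integers(50, 500, size=5_000)
per_episode = self_generated_fraction(10, lengths, per_step=False, rng=rng)
per_step = self_generated_fraction(10, lengths, per_step=True, rng=rng)
```

Under these assumptions both switching schemes give each member a share of roughly $1/N$ of the buffer; only the temporal arrangement of the data differs.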

Figure 17: Effects of switching ensemble members on a per-step basis instead of per-episode. (top) Performance with voting. (bottom) Performance without voting. All results are aggregated over 5 seeds. Shaded areas show 95% bootstrapped CIs. The learning curves are smoothed with a sliding window of 5 iterations.
D.3.5 Layer sharing experiment with a larger replay buffer

In Figure 18 we repeat the layer-sharing experiment in Section 3.3 but with a replay buffer of 4M transitions. This allows Bootstrapped DQN (agg.) to outperform Double DQN in three games. However, the trade-off between the advantages and disadvantages of diversity remains: as we increase the number of shared layers and hence reduce diversity,

  • The curse of diversity, i.e., the gap between Bootstrapped DQN (indiv.) and Double DQN, reduces;

  • The performance gain we get from majority voting, i.e., the gap between Bootstrapped DQN (agg.) and Bootstrapped DQN (indiv.), also reduces. An exception is Space Invaders, where the gap seems to slightly increase when $L$ increases from 0 to 3, which might be related to certain properties of this game.
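To make the quantities in this trade-off concrete, the following sketch computes a majority-vote action and the entropy of the normalized vote distribution at a single state. It is an illustrative NumPy version with hypothetical names, assuming each member votes for its greedy action; the exact definition used for the plots is the one given in the main text.

```python
import numpy as np

def majority_vote(q_values):
    """Majority-vote action and vote entropy at one state.

    q_values[i, a] is the Q-value of ensemble member i for action a.
    """
    num_actions = q_values.shape[1]
    # Each member votes for its own greedy action.
    votes = np.bincount(q_values.argmax(axis=1), minlength=num_actions)
    vote_dist = votes / votes.sum()
    nonzero = vote_dist[vote_dist > 0]
    entropy = float(-(nonzero * np.log(nonzero)).sum())
    return int(votes.argmax()), entropy

# Members that fully agree give zero entropy; disagreement raises it.
agree = np.tile(np.array([0.1, 0.9, 0.3]), (10, 1))
action_agree, entropy_agree = majority_vote(agree)
split = np.vstack([agree[:5], np.tile(np.array([0.8, 0.2, 0.1]), (5, 1))])
action_split, entropy_split = majority_vote(split)
```

Low vote entropy indicates that the members mostly agree, in which case voting can add little over an individual member.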

Figure 18: The effects of varying the number of shared layers when using a replay buffer of 4M transitions. The top row shows Double DQN normalized scores. The bottom row shows the entropy of the normalized vote distributions. Error bars show 95% bootstrapped CIs over 5 seeds.
D.3.6 Different levels of passivity for the $p\%$-tandem experiment

In Figure 19 we show the effect of varying $p$ in the $p\%$-tandem experiment. As expected, increasing $p$ (i.e., reducing the degree of passivity) reduces the performance gap between the active and passive agents.

Figure 19: The effects of varying $p$ in the $p\%$-tandem experiment. Error bars show 95% bootstrapped CIs over 5 seeds.
D.3.7 Data bootstrapping

As mentioned in Section 2 and Appendix A, we do not use data bootstrapping in Bootstrapped DQN (i.e., we set the masking probability $p$ to 1.0), as Osband et al. (2016) do not find it to be useful in Atari games. In Figure 20 we probe the effect of data bootstrapping on performance, with bootstrap masking probability $p=0.75$ (see Osband et al. (2016) for how $p$ is used to approximate bootstrapping). As shown in the results, data bootstrapping damages performance on the environments we tested. This is not surprising because masking out samples essentially reduces the number of transitions each member has access to. It also effectively reduces the batch size. Besides, because bootstrapping promotes diversity, it can exacerbate the curse of diversity. However, it is difficult to pinpoint the precise cause of the damaged performance.
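The masking mechanism can be sketched as follows, assuming the convention in which each (transition, member) pair receives an independent Bernoulli($p$) mask and a value of 1 means the member may train on that transition, so that $p=1.0$ recovers full data sharing. The function name and array layout are hypothetical.

```python
import numpy as np

def bootstrap_masks(num_transitions, num_members, keep_prob, rng):
    """mask[k, i] = 1 if member i may train on transition k, else 0."""
    return rng.binomial(1, keep_prob, size=(num_transitions, num_members))

rng = np.random.default_rng(0)
masks = bootstrap_masks(100_000, 10, keep_prob=0.75, rng=rng)
# Share of the buffer that each member effectively has access to.
effective_fraction = masks.mean(axis=0)
# Number of usable samples per member in one batch of 32 transitions.
effective_batch = masks[:32].sum(axis=0)
# With keep_prob = 1.0, every member sees every transition.
no_bootstrap = bootstrap_masks(1_000, 10, keep_prob=1.0, rng=rng)
```

With $p=0.75$, each member sees about three quarters of the buffer and, on average, about three quarters of each batch, which illustrates the reduced data and reduced effective batch size mentioned above.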

Figure 20: Effects of data bootstrapping. $p$ refers to the probability that a given sample is kept (i.e., not masked out) for each ensemble member. (top) Performance with voting. (bottom) Performance without voting. All results are aggregated over 5 seeds. Shaded areas show 95% bootstrapped CIs. The learning curves are smoothed with a sliding window of 5 iterations.

D.4 Additional analysis of CERL

D.4.1 Ensemble size ablation

In Figure 21 we test different ensemble sizes for CERL. With CERL, we see a clear trend that increasing the ensemble size gives better performance, though it saturates at $N=10$ for two environments.

Figure 21: Impact of ensemble size on Bootstrapped DQN and CERL, with or without majority voting. Shaded areas show 95% bootstrapped CIs over 5 seeds.
D.4.2 An alternative design of CERL

We consider the following alternative update rule for CERL:

$$Q_i^j(s,a) \leftarrow r + \gamma \bar{Q}_i^j(s', a_j'), \quad \text{for } i=1,\ldots,N, \quad \text{for } j=1,\ldots,N \tag{2}$$

where $\bar{Q}_i^j$ is the target network for $Q_i^j$ and $a_j' \sim \pi_j(\cdot \mid s')$. The only difference from the original CERL is that the original uses $\bar{Q}_j^j$ on the right-hand side, whereas here we use $\bar{Q}_i^j$. Since the next action still comes from $\pi_j$, this update rule still tries to make the $j$-th head of ensemble member $i$ learn the value function of ensemble member $j$, but in a different way. Specifically, each auxiliary head uses itself to compute the TD target instead of other ensemble members' main heads. We refer to this variant as CERL (self-target).
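The difference between the two update rules can be sketched for a single transition in the discrete-action case. The code below is a simplified NumPy illustration with hypothetical names and array layout; it ignores terminal states and takes the next actions $a_j'$ as given.

```python
import numpy as np

def cerl_targets(q_bar_next, next_actions, reward, gamma, self_target):
    """TD targets for all heads on one transition.

    q_bar_next[i, j, a] holds the target value Qbar_i^j(s', a), i.e. head j
    on top of encoder i. next_actions[j] is a'_j, the next action of
    ensemble member j. Returns targets[i, j] for head j of member i.
    """
    n = q_bar_next.shape[0]
    j = np.arange(n)
    if self_target:
        # Each head bootstraps from its own target head: Qbar_i^j(s', a'_j).
        next_values = q_bar_next[:, j, next_actions]      # shape (N, N)
    else:
        # Every head j bootstraps from member j's main head: Qbar_j^j(s', a'_j).
        main = q_bar_next[j, j, next_actions]             # shape (N,)
        next_values = np.broadcast_to(main, (n, n))
    return reward + gamma * next_values

rng = np.random.default_rng(0)
q_bar = rng.normal(size=(3, 3, 4))     # N = 3 members, 4 actions
actions = np.array([0, 2, 1])
original = cerl_targets(q_bar, actions, 1.0, 0.99, self_target=False)
variant = cerl_targets(q_bar, actions, 1.0, 0.99, self_target=True)
```

The two rules coincide on the main heads ($i=j$) and differ only in how the auxiliary heads ($i \neq j$) compute their targets.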

In Figure 22 we show the performance of this variant on 4 games. As can be seen, it performs very well. However, in our preliminary experiments in MuJoCo tasks with Ensemble SAC, this variant leads to instability in learning, and we do not have a good explanation for this at the moment. Also, this variant requires additional forward passes for SAC (note that we need to evaluate $\bar{Q}_i^j(s', a_j')$ for each $i$ and $j$).

Figure 22: Performance of CERL (self-target) in 4 Atari games. (top) Performance with voting. (bottom) Performance without voting. Shaded areas show 95% bootstrapped CIs over 5 seeds. The learning curves are smoothed with a sliding window of 5 iterations.
D.4.3 Combining CERL with encoder sharing

Even though our original motivation is to obtain the representation learning effect of encoder sharing without actually sharing the encoders, we note that it is possible to use CERL with encoder sharing. This results in a hierarchical architecture where the network “branches” twice near the output. In Figure 23 we apply CERL to Bootstrapped DQN ($L=3$). Unfortunately, this provides almost no improvement over Bootstrapped DQN ($L=3$). We conjecture that there are two reasons for this result. First, both methods shape the representations via jointly learning multiple value functions, so their effects likely overlap. Second, sharing layers reduces diversity, and thus the learning signals from CERL are less informative.

Figure 23: Effect of combining CERL and encoder sharing. Shaded areas show 95% bootstrapped CIs over 5 seeds.
D.4.4 Number of parameters

As mentioned in the main text, applying CERL to Bootstrapped DQN ($N=10$) without network sharing increases the number of parameters by no more than 5%. Further, these parameters are used for auxiliary tasks and thus do not affect the capacity of the part of the network that actually predicts the value functions. Though it is extremely unlikely that CERL’s improvements come from increased parameters, for completeness we also test Bootstrapped DQN without CERL but with more parameters. We do so by increasing the output size of the penultimate layer such that the number of added parameters is roughly the same as that introduced by CERL. As shown in Figure 24, this has almost no impact on performance.

Figure 24: Bootstrapped DQN with more parameters. (top) Performance with voting. (bottom) Performance without voting. Shaded areas show 95% bootstrapped CIs over 5 seeds. The learning curves are smoothed with a sliding window of 5 iterations.
D.4.5 Other representation learning methods: a preliminary investigation

The success of CERL suggests that representation learning in general may be promising for mitigating the curse of diversity. We perform a preliminary investigation with MICo (Castro et al., 2021) and the multi-horizon auxiliary task (MH) (Fedus et al., 2019) in 55 Atari games. MICo is a metric-based method that explicitly shapes the representations, while MH does so implicitly by learning value functions of different horizons as an auxiliary task.

For the multi-horizon auxiliary task implementation, we jointly learn $K=10$ value functions with discount factors $\{\gamma_i\}_{i=1}^{K}$, where $\gamma_i = 1 - \frac{1}{i \cdot (H_{\max}/K)}$ and $H_{\max}=100$, leading to $\gamma_K = 0.99$. The architecture change involved is the same as that for CERL. Only the heads that correspond to the longest horizon $\gamma_K$ are used for acting. For MICo, we follow the author implementation (https://github.com/google-research/google-research/tree/master/mico) with $\beta=0.1$. We search the MICo weight coefficient $\alpha$ in $\{0.01, 0.1, 0.5\}$ on the four Atari games we used in the main text, based on the performance of Bootstrapped DQN + MICo. This results in the final selection of $\alpha=0.01$.
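As a quick check of the discount schedule above, the following sketch evaluates $\gamma_i = 1 - \frac{1}{i \cdot (H_{\max}/K)}$ for $K=10$ and $H_{\max}=100$. The function name is hypothetical and the snippet is only meant to illustrate the formula.

```python
import numpy as np

def multi_horizon_gammas(num_horizons=10, max_horizon=100):
    """Discount factors gamma_i = 1 - 1 / (i * (H_max / K)) for i = 1..K."""
    i = np.arange(1, num_horizons + 1)
    return 1.0 - 1.0 / (i * (max_horizon / num_horizons))

gammas = multi_horizon_gammas()
# gammas[0] is 0.9 (horizon 10) and gammas[-1] is 0.99 (horizon 100).
```

The effective horizons $1/(1-\gamma_i)$ are therefore evenly spaced from $H_{\max}/K$ to $H_{\max}$.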

We show the aggregate performance of Bootstrapped DQN + MH and Bootstrapped DQN + MICo in Figure 25 and Figure 26 respectively. Per-game results are summarized in Figure 27 and Figure 28 respectively. As these methods are also applicable to single-agent methods, we also show the performance of Double DQN + MH and Double DQN + MICo for completeness. As shown in these results, these methods do not provide a clear improvement to Bootstrapped DQN (indiv.) and Bootstrapped DQN (agg.) as CERL does. Note that we find that MICo damages the performance of Double DQN. This is likely because our hyperparameter setup (largely based on DQN Zoo (Quan & Ostrovski, 2020), as mentioned in Appendix C) is very different from the one used in the original MICo paper, which is based on the setup used in Dopamine (Castro et al., 2018).

We emphasize that these are preliminary investigations, and the results may vary a lot based on the base algorithm, hyperparameters, and environments. A thorough analysis of what types of representations are most suited for addressing the curse of diversity is left for future work.

Figure 25: Comparison between Double DQN, Double DQN + MH, Bootstrapped DQN, and Bootstrapped DQN + MH in Atari. Results are aggregated over 55 games and 5 seeds. We show the performance of the agg. and indiv. versions of each ensemble algorithm in the top left and top middle plots respectively. Shaded areas show 95% bootstrapped CIs.
Figure 26: Comparison between Double DQN, Double DQN + MICo, Bootstrapped DQN, and Bootstrapped DQN + MICo in Atari. Results are aggregated over 55 games and 5 seeds. We show the performance of the agg. and indiv. versions of each ensemble algorithm in the top left and top middle plots respectively. Shaded areas show 95% bootstrapped CIs.
Figure 27: Per-game improvements of Bootstrapped DQN + MH over Bootstrapped DQN in HNS, with and without policy aggregation.
Figure 28: Per-game improvements of Bootstrapped DQN + MICo over Bootstrapped DQN in HNS, with and without policy aggregation.