iQRL – Implicitly Quantized Representations for Sample-efficient Reinforcement Learning

Aidan Scannell
Aalto University
Kalle Kujanpää
Aalto University
Yi Zhao
Aalto University
Mohammadreza Nakhaei
Aalto University
Arno Solin
Aalto University
Joni Pajarinen
Aalto University
Abstract

Learning representations for reinforcement learning (RL) has shown much promise for continuous control. We propose an efficient representation learning method using only a self-supervised latent-state consistency loss. Our approach employs an encoder and a dynamics model to map observations to latent states and predict future latent states, respectively. We achieve high performance and prevent representation collapse by quantizing the latent representation such that the rank of the representation is empirically preserved. Our method, named iQRL: implicitly Quantized Reinforcement Learning, is straightforward, compatible with any model-free RL algorithm, and demonstrates excellent performance by outperforming other recently proposed representation learning methods in continuous control benchmarks from DeepMind Control Suite.

1 Introduction

Reinforcement learning (RL, e.g., [1]) has shown much promise for solving complex continuous control tasks. However, applying RL in real-world environments is challenging as it typically requires millions of data points, which can be impractical; that is, RL is sample inefficient. On the other hand, representation learning has become a widely adopted solution for improving sample efficiency in deep learning. The core idea is to learn features which capture the underlying structure and patterns of the data. In the context of RL, such features can be learned independently from the downstream task. Whilst representation learning has had successes in RL, these have mainly been restricted to image-based observations (e.g., CURL [2], DrQ [3], DrQ-v2 [4], and TACO [5]).

The investigation of representation learning for state-based RL is much less common. This is likely because learning a compact representation of an already compact state vector seems unnecessary. However, recent work by Fujimoto et al. [6] and Zhao et al. [7] suggests that the difficulty of a task is due to the complexity of the underlying transition dynamics, as opposed to the size of the observation space. As such, investigating representation learning for state-based RL is a promising research direction.

Recently, TCRL [7] and SPR [8] have obtained state-of-the-art performance on continuous control benchmarks by learning representations with self-supervised losses. Self-supervised learning (SSL) approaches (which do not reconstruct observations) attempt to learn good features without labels [9]. Whilst they can learn robust representations, self-supervised losses are susceptible to a problem known as representation collapse (see Definition 3.1), where the encoder learns to map all observations to a constant latent representation [10]. As such, when leveraging SSL approaches to learn representations for RL, it is common to combine the self-supervised latent-state consistency loss with other loss terms, such as minimizing the reward prediction error in the latent space [11, 7, 12, 13, 14]. This helps to prevent representation collapse at the cost of learning a task-specific representation.

[Figure 1 (diagram). Left, representation learning: the encoder maps $o_{t}$ to $\bm{z}_{t}$, FSQ turns the latent into discrete latent codes, the dynamics model rolls $\bm{z}_{t}$ forward with actions $\bm{a}_{t}$ and $\bm{a}_{t+1}$ to predict $\hat{\bm{z}}_{t+1}$ and $\hat{\bm{z}}_{t+2}$, and EMA encoders map $o_{t+1}$ and $o_{t+2}$ to the targets $\bar{\bm{z}}_{t+1}$ and $\bar{\bm{z}}_{t+2}$. Right, latent transition: the encoder maps $o_{t}$ and $o_{t+1}$ to $\bm{z}_{t}$ and $\bm{z}_{t+1}$, which together with $\bm{a}_{t}$ and $r_{t+1}$ feed the actor $\pi(\bm{z}_{t})$ and critic $Q(\bm{z}_{t},\bm{a}_{t})$ of any model-free RL algorithm.]
Figure 1: Overview. iQRL is a stand-alone representation learning technique that is compatible with any model-free RL algorithm (we use TD3 [15]). Importantly, iQRL quantizes the latent representation with Finite Scalar Quantization (FSQ), using only a self-supervised latent-state consistency loss, i.e. no decoder (see Eq. 5). Making the latent representation discrete with an implicit codebook contributes to the very high sample efficiency of iQRL and empirically prevents representation collapse. Thanks to the FSQ-based quantization, iQRL does not need a reward prediction head to prevent representation collapse, a well-known issue with self-supervised learning, making the representation task-agnostic.

In this paper, we propose a simple representation learning technique which learns a task-agnostic representation using only a self-supervised loss. It is based solely on the latent-state consistency loss, i.e. a commonly used self-supervised loss for continuous RL. Importantly, our method empirically prevents representation collapse as it preserves the rank of the representation. We accomplish this by quantizing our latent representation with Finite Scalar Quantization [16], without using any reconstruction loss. As a result, our latent space is bounded and associated with an implicit codebook, whose size we can control. Our method can be combined with any model-free RL method (we use TD3, [15]). See Fig. 1 for an overview of our representation learning method. Importantly, our method (i) alleviates representation collapse, (ii) demonstrates excellent sample efficiency outperforming TCRL and TD7 on a wide range of different continuous control tasks, (iii) is simple to implement, and (iv) learns a task-agnostic representation that could be helpful in downstream tasks.

2 Related Work

In this section, we recap methods for representation learning in RL. In particular, we motivate why researchers are moving towards learning representations using self-supervised learning. Then, as our method builds upon self-supervised representation learning, which is susceptible to representation and dimensional collapse (see Definitions 3.1 and 3.2), we review contrastive self-supervised representation learning approaches, which are an alternative way of preventing representation collapse.

Representation learning

Learning representations for RL has been investigated for decades [17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. However, these approaches are usually limited to simple environments. More recently, Fujimoto et al. [6] proposed TD7, an extension of TD3 which learns state and action embeddings and then performs TD3 with this representation, making it highly similar to our method, which also uses TD3 as the base algorithm. However, their method uses a self-supervised loss with no explicit mechanism to prevent representation collapse. In contrast to TD7 and motivated by representation collapse, we quantize our latent space, which we show empirically prevents representation collapse.

Observation reconstruction

A prominent idea in both model-based and model-free RL has been to learn latent representations with reconstruction objectives (e.g. VAE, [27]) [28, 29, 30, 25, 31, 32]. However, as these approaches use observation reconstruction to learn a representation, their latent representation contains information about the observation which cannot be controlled by the agent and is not relevant for solving the task, which distorts the optimization landscape [33, 34]. In our experiments, we show that learning representations with observation reconstruction for model-free RL not only harms sample efficiency, but in the complex DMC Dog Run task, it can prevent the agent from solving the task.

Latent-state consistency

How can effective representations be learned efficiently without resorting to reconstruction? A common solution has been to attach an auxiliary loss to the RL objective and perform representation learning [35, 36]. A promising approach for learning suitable representations is the use of self-predictive abstractions, where the model is trained to predict future latent states through an auxiliary loss [37]. Ye et al. [38] introduce a self-supervised consistency loss on the learned latent representation. Instead of relying on a reconstruction-based loss function, Schwarzer et al. [39] propose a cosine similarity loss between predicted future latent states and the true future latent states and then perform Q-learning in the learned latent space. Our approach shares similarities with SPR [39]; however, we focus on state-based observations, which leads us to quantize our representation to prevent representation collapse, instead of using a projection head [40].

Latent-state consistency for model-based RL

Similarly to model-free RL, using the reconstruction loss for learning representations is also unreliable in model-based RL [41] and can have a detrimental effect on the performance of model-based methods in various benchmarks [42, 43]. Therefore, in the context of model-based RL, TD-MPC/TD-MPC2 [12, 44] use a consistency loss to learn representations for planning with Model Predictive Path Integral control together with reward and value functions learned through temporal difference methods [45]. Zhao et al. [7] show that the planning component of TD-MPC is not strictly necessary for high performance and that applying model-free RL on top of the self-consistent representations is sufficient for performance competitive with the state of the art. We build on top of TCRL to show that we can combat representation collapse through latent-space quantization. As a result, we can drop the reward prediction head to learn a task-agnostic representation.

Contrastive learning

An alternative approach to preventing representation collapse in self-supervised learning is to use contrastive losses. In the context of RL, this was done by CURL [2] and TACO [5]. Whilst CURL and TACO are designed for image-based observations, their contrastive learning approaches could still hold value in state-based RL. The main idea in contrastive learning is to prevent representation collapse (see Definition 3.1) by pushing the latent vectors associated with different observations away from each other. Nevertheless, contrastive methods still experience dimensional collapse [10]. In contrast to TACO, we do not use a contrastive loss and instead leverage quantization to help prevent both representation and dimensional collapse. To offer a fair comparison between contrastive learning and our quantization scheme, we compare our method against a version of TACO tuned for state-based RL.

3 Preliminaries

In this section, we introduce and formally define representation collapse.

Representation collapse

Self-supervised learning methods learn representations by minimizing distances between two embedding vectors. As such, there is a trivial solution where the encoder outputs a constant for all inputs. Formally, this can be defined as follows:

Definition 3.1 (Complete representation collapse).

Given an encoder $e_{\theta}:\mathcal{O}\rightarrow\mathcal{Z}$ which maps observations $\bm{o}\in\mathcal{O}$ to latent states $\bm{z}\in\mathcal{Z}$, the representation is said to be completely collapsed when the latent representation is constant for all observations, i.e., $e_{\theta}(\bm{o})=c,\ \forall\bm{o}\in\mathcal{O}$.

In the context of SSL, Jing et al. [10] investigated another type of representation collapse known as dimensional collapse, which is defined as follows:

Definition 3.2 (Dimensional collapse).

Given an encoder $e_{\theta}:\mathcal{O}\rightarrow\mathcal{Z}$ which maps observations $\bm{o}\in\mathcal{O}$ to latent states $\bm{z}_{t}\in\mathcal{Z}=\mathbb{R}^{d}$, of dimension $d$, the representation is said to be dimensionally collapsed when the latent vectors only span a lower dimensional space.

Whilst complete representation collapse is a clear issue when learning representations for RL, it is not immediately obvious if dimensional collapse is an issue because the goal of representation learning is often considered to be learning a lower-dimensional representation of the observations. In other words, there is a trade-off: We want to learn a lower-dimensional representation, but at the same time, we want to ensure that the representation contains all the information required for predicting future states and, thus, state values [36]. Our experiments show that whilst dimensional collapse is not always an issue, in some more complex environments, it can prevent agents from learning to solve a task (see Fig. 3).
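Definitions 3.1 and 3.2 can be checked numerically on a batch of latent vectors. The following is a minimal sketch under stated assumptions, not the diagnostic used in the experiments: it centres the batch and counts the linearly independent directions that remain, so a value of 0 indicates complete collapse and a value below $d$ indicates dimensional collapse. The helper names `matrix_rank` and `centered_rank` and the tolerance `tol` are hypothetical.

```python
from typing import List


def matrix_rank(rows: List[List[float]], tol: float = 1e-9) -> int:
    """Rank of a small dense matrix via Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]  # work on a copy
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the row (at or below `rank`) with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue  # no usable pivot: this column adds no new direction
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


def centered_rank(latents: List[List[float]], tol: float = 1e-9) -> int:
    """Number of directions in which a batch of latent vectors actually varies."""
    n, d = len(latents), len(latents[0])
    mean = [sum(z[j] for z in latents) / n for j in range(d)]
    centered = [[z[j] - mean[j] for j in range(d)] for z in latents]
    return matrix_rank(centered, tol)


# Three illustrative batches of d = 3 latents (made-up numbers).
healthy = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
planar = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 2.0], [0.0, 0.0, 0.0]]
constant = [[0.5, -1.0, 2.0]] * 3
```

Here `healthy` varies in all three directions, `planar` lies in a two-dimensional subspace (its third coordinate is the sum of the first two), and `constant` is the fully collapsed case.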

Algorithm 1 iQRL
Input: Encoder $e_{\theta}$, dynamics $d_{\phi}$, critics $\{q_{\psi_{1}},q_{\psi_{2}}\}$, policy $\pi_{\eta}$, learning rate $\alpha$, target network update rate $\tau$
for $n=1$ to $N_{\text{episodes}}$ do
   $\mathcal{D}\leftarrow\mathcal{D}\cup\{\bm{o}_{t},\bm{a}_{t},\bm{o}_{t+1},r_{t+1}\}_{t=0}^{T}$ ▷ Collect data in environment
   for $i=1$ to $T$ do
      $[\theta,\phi]\leftarrow[\theta,\phi]+\alpha\nabla\left(\mathcal{L}_{\text{rep}}(\theta,\phi;\mathcal{D})\right)$ ▷ Update representation, Eq. 5
      $\psi\leftarrow\psi+\alpha\nabla\left(\mathcal{L}_{q}(\psi;\mathcal{D})\right)$ ▷ Update critic, Eq. 7
      if $i$ % 2 == 0 then
         $\eta\leftarrow\eta+\alpha\nabla\left(\mathcal{L}_{\pi}(\eta;\mathcal{D})\right)$ ▷ Update actor less frequently than critic, Eq. 8
      end if
      $[\bar{\theta},\bar{\psi},\bar{\eta}]\leftarrow(1-\tau)[\bar{\theta},\bar{\psi},\bar{\eta}]+\tau[\theta,\psi,\eta]$ ▷ Update target networks
   end for
end for

4 Method

In this section, we detail our method, named implicitly Quantized Reinforcement Learning (iQRL). iQRL is conceptually simple: it (i) learns a representation of the observation space and then (ii) performs model-free RL (e.g., TD3) on this representation. See Fig. 1 and Algorithm 1.
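Two bookkeeping steps of Algorithm 1 are easy to state in code: the soft update of the target parameters and the delayed actor update. The sketch below is illustrative only and treats parameters as flat lists of floats; the function names `ema_update` and `actor_update_iterations` are hypothetical and not taken from the iQRL code base.

```python
from typing import List


def ema_update(target: List[float], online: List[float], tau: float) -> List[float]:
    """Soft target update: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


def actor_update_iterations(num_iterations: int) -> List[int]:
    """Inner-loop iterations i in 1..T at which the actor is updated (i % 2 == 0)."""
    return [i for i in range(1, num_iterations + 1) if i % 2 == 0]


# With a small tau the target parameters move only slightly towards the online ones.
new_target = ema_update(target=[0.0, 1.0], online=[1.0, 1.0], tau=0.01)
```

With a small $\tau$, the target parameters trail the online parameters slowly, and the actor is touched on every second inner-loop iteration only.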

We consider Markov Decision Processes (MDPs, [46]) $\mathcal{M}=(\mathcal{O},\mathcal{A},\mathcal{P},\mathcal{R},\gamma)$, where an agent receives an observation $\bm{o}_{t}\in\mathcal{O}$ at time step $t$, performs an action $\bm{a}_{t}\in\mathcal{A}$, and then obtains the next observation $\bm{o}_{t+1}\sim\mathcal{P}(\cdot\mid\bm{o}_{t},\bm{a}_{t})$ and reward $r_{t}=\mathcal{R}(\bm{o}_{t},\bm{a}_{t})$. The discount factor is denoted $\gamma\in[0,1)$.

Method components

iQRL has four main components which we wish to learn:

Encoder: $\bm{z}_{t}=f(e_{\theta}(\bm{o}_{t}))$ (1)
Dynamics: $\hat{\bm{z}}_{t+1}=f(\bm{z}_{t}+d_{\phi}(\bm{z}_{t},\bm{a}_{t}))$ (2)
Value: $\bm{q}_{t}=\mathbf{q}_{\psi}(\bm{z}_{t},\bm{a}_{t})$ (3)
Policy: $\bm{a}_{t}\sim\pi_{\eta}(\bm{z}_{t})$ (4)

The encoder $e_{\theta}$ and latent-space dynamics model $d_{\phi}$ are responsible for representation learning. $f(\cdot)$ denotes our quantization scheme, which implicitly quantizes our latent representation (more details to follow). The encoder (with quantization) $f\circ e_{\theta}(\cdot)$ maps observations $\bm{o}_{t}$ to latent states $\bm{z}_{t}$ and is responsible for learning a representation which can aid RL. The latent-space dynamics model (with quantization) $d_{\phi}(\cdot)$ predicts the next latent states $\hat{\bm{z}}_{t+1}$ given a latent state $\bm{z}_{t}$ and an action $\bm{a}_{t}$. Its sole purpose is to aid representation learning by making the latent states temporally consistent. Note that we do not use it for model-based RL. Once we have the representation learned by our encoder, we map all observations to the latent space and perform model-free RL in this latent space. Throughout this paper, we use Twin Delayed Deep Deterministic Policy Gradient (TD3, [15]) as the base algorithm. It consists of two state-action value functions $\{q_{\psi_{1}},q_{\psi_{2}}\}$, known as critics, and a deterministic actor $\pi_{\eta}$. Following prior works [4], we use a linear exploration noise schedule which decays from $1$ to $0.1$ during training.
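A linear exploration noise schedule of this kind can be sketched as below. The length of the decay is not stated in this section, so `decay_steps` is a hypothetical parameter, as is the function name `linear_noise_schedule`; only the end points $1$ and $0.1$ come from the text.

```python
def linear_noise_schedule(step: int, decay_steps: int,
                          start: float = 1.0, end: float = 0.1) -> float:
    """Linearly interpolate the exploration noise scale from `start` to `end`.

    `decay_steps` is an assumed hyperparameter; after that many steps the
    scale stays at `end`.
    """
    if decay_steps <= 0:
        raise ValueError("decay_steps must be positive")
    fraction = min(max(step / decay_steps, 0.0), 1.0)  # clamp to [0, 1]
    return start + fraction * (end - start)


# Noise scale at the start, halfway through, and after the decay has finished.
scales = [linear_noise_schedule(s, decay_steps=1000) for s in (0, 500, 2000)]
```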

Representation learning

Our representation learning uses the latent-state consistency loss,

$$\mathcal{L}_{\text{rep}}(\theta,\phi;\tau)=\sum_{h=0}^{H-1}\gamma_{\text{rep}}^{h}\left(\frac{f(\hat{\bm{z}}_{t+h}+d_{\phi}(\hat{\bm{z}}_{t+h},\bm{a}_{t+h}))}{\|f(\hat{\bm{z}}_{t+h}+d_{\phi}(\hat{\bm{z}}_{t+h},\bm{a}_{t+h}))\|_{2}}\right)^{\top}\left(\frac{f(e_{\bar{\theta}}(\bm{o}_{t+h+1}))}{\|f(e_{\bar{\theta}}(\bm{o}_{t+h+1}))\|_{2}}\right),\qquad(5)$$

which measures the cosine similarity between the next state predicted by the dynamics model $\hat{\bm{z}}_{t+1}=f(\hat{\bm{z}}_{t}+d_{\phi}(\hat{\bm{z}}_{t},\bm{a}_{t}))$ and the next state given by the momentum encoder $\bar{\bm{z}}_{t+1}=f(e_{\bar{\theta}}(\bm{o}_{t+1}))$; training drives the prediction to agree with this target. The latent states are obtained with multi-step predictions in the latent space $\hat{\bm{z}}_{t+1}=f(\hat{\bm{z}}_{t}+d_{\phi}(\hat{\bm{z}}_{t},\bm{a}_{t}))$. The initial mapping to the latent space $\hat{\bm{z}}_{0}=f(e_{\theta}(\bm{o}_{0}))$ uses the online encoder, which is trained jointly with the dynamics model $d_{\phi}(\hat{\bm{z}}_{t},\bm{a}_{t})$. The target $e_{\bar{\theta}}(\bm{o}_{t+1})$ is calculated with the momentum encoder, which uses an exponential moving average (EMA) of the encoder's weights $\bar{\theta}\leftarrow(1-\tau)\bar{\theta}+\tau\theta$. The target network update rate is denoted $\tau$. Note that we do not use reward or value prediction for learning our representation and, as a result, our representation is task-agnostic.
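The sum in Eq. 5 can be sketched numerically once the predicted latents and the momentum-encoder targets are available. The code below is a minimal illustration under stated assumptions: it takes both sequences as plain lists of floats, returns the discounted sum of cosine similarities exactly as Eq. 5 is written (no sign flip is applied, and how the value enters an optimizer is left open), and guards against all-zero quantized vectors with a small `eps`, which is an addition of this sketch. The names `cosine_similarity` and `consistency_objective` are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float], eps: float = 1e-8) -> float:
    """Inner product of the L2-normalised vectors; `eps` avoids division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def consistency_objective(predicted: Sequence[Sequence[float]],
                          targets: Sequence[Sequence[float]],
                          gamma_rep: float) -> float:
    """Discounted sum over the horizon h = 0..H-1 of cosine similarities.

    predicted[h] plays the role of the dynamics prediction for step t+h+1 and
    targets[h] the momentum-encoder latent of o_{t+h+1}.
    """
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must cover the same horizon")
    return sum((gamma_rep ** h) * cosine_similarity(p, t)
               for h, (p, t) in enumerate(zip(predicted, targets)))


# Horizon H = 2: a perfect match at h = 0 and an opposite prediction at h = 1.
value = consistency_objective(
    predicted=[[1.0, 0.0], [0.0, 2.0]],
    targets=[[2.0, 0.0], [0.0, -1.0]],
    gamma_rep=0.5,
)
```

In this toy case the first term contributes $1$ and the second $0.5\times(-1)$, so larger values correspond to better agreement between predictions and targets.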

Quantization

Motivated by preventing dimensional collapse we quantize our latent space following the approach from Finite Scalar Quantization (FSQ, [16]). Their important observation is that carefully bounding each dimension gives rise to an implicit codebook $\mathcal{C}$ of a chosen size $|\mathcal{C}|$. Having requested a $d$-dimensional latent space, iQRL configures the encoder to output $c$ channels per dimension such that the representation from the encoder $\bm{x}=e_{\theta}(\bm{o})\in\mathbb{R}^{d\times c}$ and the dynamics model $\hat{\bm{x}}=\bm{z}+d_{\phi}(\bm{z},\bm{a})\in\mathbb{R}^{d\times c}$ are in $\mathbb{R}^{d\times c}$. To quantize $\bm{x}$ (and $\hat{\bm{x}}$) into a finite set of codewords, we first apply a bounding function $f(\cdot)$ and then we round to integers. Let us consider a single dimension $j$ of the encoder’s output $\bm{v}=[\bm{x}]_{j,:}\in\mathbb{R}^{c}$ which consists of $c$ channels, and demonstrate how it is quantized. We follow FSQ and choose $f(\cdot)$ such that each entry in $\tilde{\bm{v}}=\mathrm{round}(f(\bm{v}))$ takes one of $L_{i}$ unique values,

$$f:\bm{v}\rightarrow\lfloor L_{i}/2\rfloor\tanh(\bm{v}),\qquad(6)$$

where Lisubscript𝐿𝑖L_{i}italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is a hyperparameter for channel i𝑖iitalic_i, specified as FSQ levels ={L1,,Lc}subscript𝐿1subscript𝐿𝑐\mathcal{L}=\{L_{1},\ldots,L_{c}\}caligraphic_L = { italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_L start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT }. This gives an entry in our codebook 𝒗~𝒞~𝒗𝒞\tilde{\bm{v}}\in\mathcal{C}over~ start_ARG bold_italic_v end_ARG ∈ caligraphic_C, where the implied codebook is given by the product of these per-channel codebook sets. The vectors in 𝒞𝒞\mathcal{C}caligraphic_C can be enumerated giving a bijection from any 𝒗~~𝒗\tilde{{\bm{v}}}over~ start_ARG bold_italic_v end_ARG to an integer in {1,2,,Lc}12superscript𝐿𝑐\{1,2,\ldots,L^{c}\}{ 1 , 2 , … , italic_L start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT }. As an example, in some of our experiments, we used d=512𝑑512d=512italic_d = 512 latent dimensions each with c=2𝑐2c=2italic_c = 2 channels consisting of 8 levels, i.e. we used FSQ levels ={L1=8,L2=8}formulae-sequencesubscript𝐿18subscript𝐿28\mathcal{L}=\{L_{1}=8,L_{2}=8\}caligraphic_L = { italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 8 , italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 8 }. This corresponds to a codebook of size |𝒞|=i=1cLi=8×8=64=26𝒞superscriptsubscriptproduct𝑖1𝑐subscript𝐿𝑖8864superscript26|\mathcal{C}|=\prod_{i=1}^{c}L_{i}=8\times 8=64=2^{6}| caligraphic_C | = ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 8 × 8 = 64 = 2 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT for each dimension.
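As a rough illustration of the steps above, the following sketch quantizes a single latent dimension with Eq. 6 and enumerates the resulting codeword. The function names `fsq_quantize` and `codeword_index` and the input values are hypothetical and not taken from the iQRL code. Taken literally, Eq. 6 followed by rounding yields exactly $L_{i}$ values only when $L_{i}$ is odd (for even $L_{i}$ the rounded range $\{-L_{i}/2,\ldots,L_{i}/2\}$ has $L_{i}+1$ entries), so the sketch restricts itself to the odd levels $\{5,3\}$ from Table 1 and leaves the handling of even levels to the FSQ reference [16].

```python
import math


def fsq_quantize(v, levels):
    """Quantize one latent dimension v (one value per channel) following Eq. 6.

    Assumption of this sketch: every entry of `levels` is odd, so that
    rounding floor(L/2) * tanh(x) yields exactly L distinct integers.
    """
    if len(v) != len(levels):
        raise ValueError("need one level count per channel")
    if any(L % 2 == 0 for L in levels):
        raise ValueError("this sketch only handles odd level counts")
    # Bound each channel to (-floor(L/2), floor(L/2)), then round to an integer.
    return [round((L // 2) * math.tanh(x)) for x, L in zip(v, levels)]


def codeword_index(v_tilde, levels):
    """Map a quantized vector to an integer in {1, ..., prod(levels)}."""
    index = 0
    for q, L in zip(v_tilde, levels):
        # Shift q from [-floor(L/2), floor(L/2)] to [0, L - 1] (mixed-radix digit).
        index = index * L + (q + L // 2)
    return index + 1


levels = [5, 3]   # FSQ levels for a target codebook size of 2^4 (Table 1)
v = [0.9, -2.0]   # hypothetical encoder output for one latent dimension
v_tilde = fsq_quantize(v, levels)
print(v_tilde, codeword_index(v_tilde, levels))
```

Because the codeword index is a mixed-radix number over the per-channel levels, every quantized vector receives a distinct integer, which is the bijection mentioned in the text.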

Note that this quantization requires a round operation. As such, to propagate gradients through the round operation we use straight-through gradient estimation (STE). This is easily accomplished in deep learning libraries using stop gradient $\mathrm{sg}$ as $\mathrm{round\_ste}(x):x\rightarrow x+\mathrm{sg}(\mathrm{round}(x)-x)$. FSQ has the following hyperparameters: we must specify the number of channels $c$ and the number of levels per channel $\mathcal{L}=\{L_{1},\ldots,L_{c}\}$. Table 1 shows the recommended number of channels and number of levels per channel to obtain codebooks of different sizes [16].
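To see why the straight-through estimator passes gradients through the round operation unchanged, it may help to write out both passes. The derivation below assumes the usual stop-gradient convention, namely that $\mathrm{sg}$ acts as the identity in the forward pass and as a constant in the backward pass:

```latex
\begin{align*}
\text{forward:}\quad
  & \mathrm{round\_ste}(x) = x + (\mathrm{round}(x) - x) = \mathrm{round}(x), \\
\text{backward:}\quad
  & \frac{\partial\,\mathrm{round\_ste}(x)}{\partial x}
    = \frac{\partial x}{\partial x}
    + \underbrace{\frac{\partial\,\mathrm{sg}(\mathrm{round}(x) - x)}{\partial x}}_{=\,0}
    = 1 .
\end{align*}
```

The output is therefore quantized, while the encoder receives the gradient it would have received had no rounding been applied.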

Table 1: FSQ levels $\mathcal{L}$ to approximate different codebook sizes $|\mathcal{C}|$.

| Target size $\lvert\mathcal{C}\rvert$ | $2^{4}$ | $2^{6}$ | $2^{8}$ | $2^{9}$ | $2^{10}$ |
|---|---|---|---|---|---|
| Proposed $\mathcal{L}$ | $\{5,3\}$ | $\{8,8\}$ | $\{8,6,5\}$ | $\{8,8,8\}$ | $\{8,5,5,5\}$ |
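The implied codebook sizes in Table 1 can be checked with a few lines of arithmetic, since $|\mathcal{C}|=\prod_{i}L_{i}$. The helper name `codebook_size` is ours and only serves this illustration; note that some level choices only approximate the power-of-two target.

```python
import math


def codebook_size(levels):
    """Size of the implicit FSQ codebook, |C| = prod_i L_i."""
    return math.prod(levels)


# FSQ levels from Table 1, keyed by the exponent k of the target size 2^k.
table_1 = {4: [5, 3], 6: [8, 8], 8: [8, 6, 5], 9: [8, 8, 8], 10: [8, 5, 5, 5]}

for k, levels in table_1.items():
    print(f"target 2^{k} = {2 ** k:4d}   levels {levels} -> |C| = {codebook_size(levels)}")
```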

In practice, we found codebooks of size $|\mathcal{C}|=2^{6}$ sufficient for all environments in the DeepMind Control suite. However, for more complex environments we hypothesize that larger codebooks will be required.



Figure 2: DeepMind Control Suite results. iQRL (red) is significantly more sample efficient than other model-free baselines TCRL (green), TD7 (purple), TACO (blue) and TD3 (orange). iQRL performs particularly well in the high-dimensional locomotion tasks and outperforms TCRL, which is the most similar baseline. Results are for 20 DMC tasks with UTD=1. We plot the mean (solid line) and the $95\%$ confidence intervals (shaded) across 5 random seeds, where each seed averages over 10 evaluation episodes. See Fig. 11 for results in other DMC tasks.

Model-free reinforcement learning

We learn the policy (actor) and action-value function (critic) using TD3 [15]. However, we follow Yarats et al. [4], Zhao et al. [7] and augment the loss with $n$-step returns. The only difference to TD3 is that instead of using the original observations ${\bm{o}}_{t}$, we map them through the online encoder ${\bm{z}}_{t}=f(e_{\theta}({\bm{o}}_{t}))$ and learn the actor/critic in the quantized latent space ${\bm{z}}_{t}$. The critic is then updated by minimizing the following objective:

$$\begin{aligned}
\mathcal{L}_{q}(\psi;\tau) &= \mathbb{E}_{\tau\sim\mathcal{D}}\left[\textstyle\sum_{k=1}^{2}(q_{\psi_{k}}(f(e_{\theta}({\bm{o}}_{t})),{\bm{a}}_{t})-y)^{2}\right],\quad\forall k\in 1,2 \\
y &= \sum_{n=0}^{N-1}r_{t+n}+\gamma^{n}\min_{k\in\{1,2\}}q_{\bar{\psi}_{k}}(e_{\theta}({\bm{o}}_{t+n+1}),{\bm{a}}_{t+n+1}),\quad\text{with}\ {\bm{a}}_{t+n}=\pi_{\bar{\eta}}({\bm{z}}_{t+n})+\epsilon_{t+n},
\end{aligned} \tag{7}$$

where we use policy smoothing by adding clipped Gaussian noise $\epsilon_{t+n}\sim\text{clip}\left(\mathcal{N}(0,\sigma^{2}),-c,c\right)$ to the action ${\bm{a}}_{t+n}=\pi_{\bar{\eta}}({\bm{z}}_{t+n})+\epsilon_{t+n}$. Note that we use the online encoder to get the latent states in both the prediction and the target. We then use the target action-value functions $\mathbf{q}_{\bar{\psi}}$ and the target policy $\pi_{\bar{\eta}}$ to calculate the TD target. Following TD3, we learn the actor's parameters by minimizing

$$\mathcal{L}_{\pi}(\eta;\tau)=-\mathbb{E}_{{\bm{o}}_{t}\sim\mathcal{D}}\bigg[\min_{k\in\{1,2\}}q_{\psi_{k}}(\underbrace{f(e_{\theta}({\bm{o}}_{t}))}_{{\bm{z}}_{t}},\pi_{\eta}(f(e_{\theta}({\bm{o}}_{t}))))\bigg]. \tag{8}$$

That is, we maximize the Q-value using the clipped double Q-learning trick to combat overestimation in Q-learning. Note that we do not use the momentum encoder in the actor/critic objectives. In our experiments, using the momentum encoder resulted in worse performance.
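The ingredients of the TD target $y$ in Eq. 7 (clipped smoothing noise, clipped double Q-learning, and $n$-step returns) can be sketched with scalars as follows. This is an illustration under stated assumptions rather than the authors' implementation: it uses the standard $n$-step form, in which reward $r_{t+n}$ is discounted by $\gamma^{n}$ and the bootstrap value by $\gamma^{N}$, and the function names as well as the values of $\sigma$, $c$ and $\gamma$ are hypothetical.

```python
import random


def clipped_gaussian_noise(sigma, c, rng):
    """Sample eps ~ clip(N(0, sigma^2), -c, c) for target policy smoothing."""
    return max(-c, min(c, rng.gauss(0.0, sigma)))


def n_step_td_target(rewards, gamma, q1_next, q2_next):
    """Standard n-step TD target with clipped double Q-learning.

    rewards: [r_t, ..., r_{t+N-1}]
    q1_next, q2_next: the two target critics evaluated at the latent state
        and the smoothed target action N steps ahead.
    """
    n = len(rewards)
    discounted_rewards = sum(gamma ** i * r for i, r in enumerate(rewards))
    bootstrap = min(q1_next, q2_next)  # take the more pessimistic critic
    return discounted_rewards + gamma ** n * bootstrap


rng = random.Random(0)
eps = clipped_gaussian_noise(sigma=0.2, c=0.3, rng=rng)  # illustrative values
y = n_step_td_target([1.0, 0.5, 0.25], gamma=0.99, q1_next=10.0, q2_next=9.0)
print(eps, y)
```

Taking the minimum over the two target critics is what counteracts overestimation: a single optimistic critic cannot inflate the target on its own.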

Whilst our method shares similarities with TCRL [7], it is important to note that our transition model does not predict the reward. Instead, iQRL leverages quantization to help alleviate representation collapse, and, as a result, learns a task-agnostic representation.



Figure 3: Ablation of quantization. We show how our quantization scheme prevents dimensional collapse. In all tasks, our FSQ scheme prevents dimensional collapse (red) as the rank of the representation remains high. In contrast, when our quantization is not used (blue) the representation undergoes dimensional collapse, indicated by the rank reducing. In the Dog Run task, this results in the agent not learning to solve the task. We plot the mean (solid line) and the $95\%$ confidence intervals (shaded) across 5 random seeds, where each seed averages over 10 evaluation episodes.

5 Experiments

In this section, we evaluate iQRL in a variety of continuous control tasks from the DeepMind Control (DMC) Suite [47]. We aim to answer the following questions:

  1. How does iQRL compare to state-of-the-art model-free RL algorithms, especially in the hard DMC tasks?
  2. Does our FSQ-based quantization help combat representation and dimensional collapse?
  3. Is learning a representation with only latent-state consistency really better than including reward predictions?
  4. What impact does reconstruction loss have on the performance of iQRL?

iQRL is simple, fast, and performant

We compare iQRL to the model-free baseline Twin Delayed DDPG (TD3, [15]), and the representation learning-based RL methods Temporal Consistency Reinforcement Learning (TCRL, [7]), TD7 ([6]), and Temporal Action-driven Contrastive Learning (TACO, [5]). In Fig. 2, we evaluate sample efficiency by plotting the average performance of the algorithms across 20 DeepMind Control Suite tasks as a function of environment steps. We see that, on average, iQRL outperforms the baselines and shows significant advantages in many environments. We outperform TCRL, which is the most similar baseline to our work. Furthermore, TD3 is noncompetitive with iQRL, highlighting the importance of representation learning in state-based reinforcement learning. For complete results on all 20 tasks, see Fig. 11 in Appendix G. For more details of the tasks on which the algorithms were evaluated, see Appendix C. For more details of the baselines used in our work and how we implemented them, see Appendix B.

High-dimensional control

Many tasks in DeepMind Control Suite are particularly high-dimensional. For instance, the observation space of the Dog tasks is $\mathcal{O}\in\mathbb{R}^{223}$ and the action space is $\mathcal{A}\in\mathbb{R}^{38}$, and for Humanoid, the observation space is $\mathcal{O}\in\mathbb{R}^{67}$ and the action space is $\mathcal{A}\in\mathbb{R}^{24}$. Fig. 2 and Fig. 11 show that iQRL excels in the high-dimensional Dog and Humanoid environments when compared to the baselines. We hypothesize that our discretized representations are particularly beneficial for simplifying learning the transition dynamics in high-dimensional spaces, making iQRL highly sample efficient in these tasks.

iQRL does not suffer from rank collapse

We examine the behaviour of adding quantization to our MLP encoder during training. Following Ni et al. [36], we estimate the rank of the linear operator associated with the MLP encoder by calculating the matrix rank of the latent states for a batch of inputs. (The rank of an $m\times n$ matrix ${\bm{A}}$ is the dimension of the image of the mapping $g:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}$, with $g({\bm{x}})={\bm{A}}{\bm{x}}$.) We ensure full rank at the start of training by orthogonally initializing the MLP encoders. Fig. 3 shows the orthogonality-preserving effect of our quantization scheme as the matrix rank stays close to the maximum. Without quantization, a dimensional collapse occurs, which can have significant harmful effects as the representational power of the latent state diminishes [10]. Correspondingly, in three of the four environments, removing the quantization has a deteriorating impact on the sample efficiency of iQRL, and in Dog Run, the algorithm completely fails to learn to solve the task without the quantization.
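To make the rank diagnostic concrete, the sketch below computes the rank of a small matrix whose rows stand in for the latent states of a batch. It is a minimal pure-Python Gaussian elimination written for illustration; the example matrices and the tolerance are hypothetical, and in practice one would more likely call a library routine such as `numpy.linalg.matrix_rank`.

```python
def matrix_rank(rows, tol=1e-8):
    """Rank of a matrix (list of rows) via Gaussian elimination with partial pivoting."""
    a = [list(map(float, r)) for r in rows]
    n_rows = len(a)
    n_cols = len(a[0]) if a else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the remaining row with the largest entry in this column as the pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) <= tol:
            continue  # no usable pivot: this column adds no new direction
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, n_rows):
            factor = a[r][col] / a[rank][col]
            for k in range(col, n_cols):
                a[r][k] -= factor * a[rank][k]
        rank += 1
    return rank


# Hypothetical batches of 3 latent states with 3 entries each.
healthy = [[1, 0, -1], [0, 1, 1], [-1, 1, 1]]
collapsed = [[1, 2, 3], [2, 4, 6], [-1, -2, -3]]  # every row is a multiple of the first
print(matrix_rank(healthy), matrix_rank(collapsed))
```

A batch whose latent states all lie along one direction has rank one regardless of the nominal latent dimension, which is the signature of dimensional collapse.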



Figure 4: Reward prediction is not necessary for representation learning. We compare iQRL to a variant of our method with a reward prediction head trained to predict the reward from the current latent state. Adding a reward prediction head to iQRL leads to a slight increase in performance in Dog Run, but has a slightly harmful impact on sample efficiency in Humanoid Walk and Quadruped Run. We plot the mean (solid line) and the $95\%$ confidence intervals (shaded) across 5 random seeds, where each seed averages over 10 evaluation episodes.

Reward prediction is not necessary for representation learning

Unlike prior methods such as TCRL [7], TD-MPC [12], Dreamer-V2 [48] and TACO [49], our representation learning loss (Eq. 5) does not include a term for learning to predict the reward or the value of the latent state. Instead, we rely solely on the self-predictive temporal consistency loss. To analyze the impact of not including the reward prediction term, we compare our method to a variant of our method, where we have included a reward prediction head similar to that of Zhao et al. [7]. Formally, we define a reward head as $\hat{r}_{t}=g_{\xi}({\bm{z}}_{t},{\bm{a}}_{t})$ (see also Eq. 2), and include a reward prediction term (discounted MSE loss) in the representation loss:

$$\mathcal{L}_{\text{rew}}=\sum_{h=0}^{H-1}\gamma_{\text{rep}}^{h}\|\hat{r}_{t+h}-r_{t+h}\|_{2}^{2},$$

where $r_{t+h}$ is the ground-truth H-step reward and $\hat{r}_{t+h}$ is the predicted H-step reward.
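For scalar rewards, the reward prediction term above reduces to a discounted sum of squared errors over the horizon. The following sketch evaluates it on hypothetical numbers; the function name `reward_prediction_loss` and the value of $\gamma_{\text{rep}}$ are assumptions made for illustration only.

```python
def reward_prediction_loss(predicted, actual, gamma_rep):
    """Discounted squared error between predicted and ground-truth rewards.

    predicted[h] and actual[h] are the scalar rewards at step t + h,
    for h = 0, ..., H - 1.
    """
    if len(predicted) != len(actual):
        raise ValueError("reward sequences must have the same horizon")
    return sum(
        gamma_rep ** h * (r_hat - r) ** 2
        for h, (r_hat, r) in enumerate(zip(predicted, actual))
    )


# Hypothetical horizon H = 3: errors of 0.5, 0.0 and 1.0 at steps h = 0, 1, 2.
loss = reward_prediction_loss([1.0, 0.0, 2.0], [0.5, 0.0, 1.0], gamma_rep=0.9)
print(loss)
```

The discount $\gamma_{\text{rep}}^{h}$ down-weights errors further along the horizon, where the latent predictions feeding the reward head are themselves less reliable.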

The results for this ablation study are shown in Fig. 4. The plots show that our method, iQRL, without a reward prediction term in the loss, has equal or superior performance to the variant with a reward prediction term except in Dog Run. Our results imply that learning to predict the reward is not necessary for learning a suitable latent representation. The upside of not including the reward prediction head is that it makes the representation task-agnostic, which we believe to be important for downstream applications such as speeding up learning in an incremental multi-task setting in the same domain.

We also evaluated whether including a reward head alone without our FSQ-based normalization scheme is sufficient for preserving the rank of the latent representation and found that iQRL with a reward prediction head but without FSQ suffers from poor performance and dimensional collapse. Therefore, the reward prediction head is not a substitute for our quantization. For more details of this experiment, see Appendix F.

Reconstruction loss has a detrimental impact

Learning to minimize the observation reconstruction error has been widely applied in model-based RL [1, 50, 51], and an observation decoder has been a component of many of the most successful RL algorithms to date [52]. However, recent work in representation learning for RL [7] and model-based RL [12] has shown that incorporating a reconstruction term into the representation loss can hurt the performance, as learning to reconstruct the observations is inefficient due to the observations containing irrelevant details and visuals like shading that are uncontrollable by the agent and do not affect the tasks.

To provide a thorough analysis of iQRL, we include results where we add a reconstruction term to our representation loss in Eq. 5:

$$\mathcal{L}_{{\bm{o}}}=\mathbb{E}_{{\bm{o}}_{t}\sim\mathcal{D}}[\|\hat{{\bm{o}}}_{t}-{\bm{o}}_{t}\|_{2}^{2}],\quad\hat{{\bm{o}}}_{t}=h_{\kappa}({\bm{z}}_{t}), \tag{9}$$

where $h_{\kappa}$ is a learned observation decoder that takes the latent state as the input and outputs the reconstructed observation. The decoder $h_{\kappa}$ is a standard MLP. We perform reconstruction at each time step in the horizon. The results in Fig. 5 show that in no environments does reconstruction aid learning, and in some tasks, such as the difficult Dog Run and Humanoid Walk tasks, including the reconstruction term has a significant detrimental effect on the performance, and can even prevent learning completely. Our results support the observations of Zhao et al. [7] and Hansen et al. [12] about the lack of need for a reconstruction target in continuous control tasks.



Figure 5: Reconstruction loss has a detrimental impact. Unlike many methods, such as SAC-AE [53], iQRL has neither an observation decoder nor a reconstruction term in the loss function. We show that adding a reconstruction loss harms the performance of iQRL across a mixture of easy and hard evaluation environments. We plot the mean (solid line) and the $95\%$ confidence intervals (shaded) across 5 random seeds, where each seed averages over 10 evaluation episodes.

Projection head

Wen and Li [40] and Schwarzer et al. [8] investigated the role of a learnable projection head in non-contrastive self-supervised learning and found that it helps RL algorithms learn more diversified and, therefore, superior representations. Whilst iQRL shares similarities with SPR [8], in particular a temporal consistency loss using cosine similarity, it differs in that it does not use a learnable projection head and quantizes the representation instead. In Fig. 10, we show that adding a projection head to iQRL decreases its sample efficiency. Whilst projection heads are effective for learning representations from images, our results suggest that they have a significant negative impact on sample efficiency when learning representations of state-based observations, reaffirming that state-based RL has a different set of challenges to image-based RL and that techniques designed to combat representation collapse are not always transferable between the settings.

Stop gradient

Ni et al. [36] proved that using stop gradients should suffice for preventing representation collapse. However, their experiments suggested that using an EMA encoder improves performance over simply using stop gradients. In Fig. 9, we show how replacing iQRL's EMA encoder with a stop gradient operation can have a negative impact on performance. For example, using stop gradient in the Acrobot Swingup task results in the agent struggling to solve the task.
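For readers unfamiliar with EMA (momentum) encoders, the sketch below shows one common form of the update, $\bar{\theta}\leftarrow(1-\tau)\bar{\theta}+\tau\theta$. This section does not spell out the update used by iQRL, so the form, the name `ema_update` and the value of $\tau$ are assumptions; flat lists of floats stand in for network weights.

```python
def ema_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau towards its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


online = [1.0, -2.0]   # hypothetical online encoder weights
target = [0.0, 0.0]    # hypothetical momentum encoder weights
for _ in range(3):
    target = ema_update(target, online, tau=0.5)  # tau is exaggerated for illustration
print(target)
```

With a plain stop gradient, the target branch instead reuses the online weights directly and merely blocks gradients through them, so the target moves as fast as the online encoder does.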

Codebook size

In Appendix D, we evaluate how the size of the codebook $|\mathcal{C}|$ influences the performance of the agent. It shows that the relationship between the size of the codebook and its activeness is intuitive: the smaller the codebook, the larger the active proportion. The best codebook size varies between environments, but the rank of the representation appears to be preserved for all codebook sizes.

Latent dimension

As a final experiment, we evaluate how the dimension of the latent space $d$ impacts iQRL's performance. We find that iQRL is fairly robust to different latent dimensions. We find that a latent dimension of $d=1024$ with FSQ levels $\mathcal{L}=[8,8]$, which corresponds to a codebook size $|\mathcal{C}|=2^{6}$, performs best in the harder DMC tasks. See Appendix E for more details.

6 Conclusion

We have presented iQRL, a technique for learning representations using only a self-supervised temporal consistency loss, which demonstrates strong performance in continuous control tasks, including the complex DMC Humanoid and Dog tasks. Our quantization of the latent space empirically preserves the representation’s matrix rank, indicating that it alleviates representation and dimensional collapse. Our experiments further demonstrate that iQRL is extremely sample efficient whilst being fast to train, which we believe is a strong selling point. Importantly, our method is (i) straightforward, (ii) compatible with any model-free RL algorithm, and (iii) learns a task-agnostic representation.

Limitations and future work

Given that iQRL learns a task-agnostic representation, exploring its use for multi-task RL is an exciting direction for future work. Can iQRL learn a single representation which is shared across a wide variety of tasks? In this paper, we have only evaluated iQRL in deterministic environments, so extending iQRL to stochastic environments is another important direction for future work.

Acknowledgments and Disclosure of Funding

AJS and KK were supported by the Research Council of Finland from the Flagship program: Finnish Center for Artificial Intelligence (FCAI). YZ is funded by Research Council of Finland (grant id 345521) and MN is funded by Business Finland (BIOND4.0 - Data Driven Control for Bioprocesses). AHS acknowledges funding from the Research Council of Finland (grant id 339730) and JP acknowledges funding from Research Council of Finland (grant ids 345521 and 353198). We acknowledge CSC – IT Center for Science, Finland, for awarding this project access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CSC (Finland) and the LUMI consortium through CSC. We acknowledge the computational resources provided by the Aalto Science-IT project.

References

  • Sutton and Barto [2018] R.S. Sutton and A.G. Barto. Reinforcement Learning, Second Edition: An Introduction. Adaptive Computation and Machine Learning Series. MIT Press, 2018.
  • Laskin et al. [2020] Michael Laskin, Aravind Srinivas, and Pieter Abbeel. CURL: Contrastive Unsupervised Representations for Reinforcement Learning. In Proceedings of the 37th International Conference on Machine Learning, pages 5639–5650. PMLR, November 2020.
  • Yarats et al. [2020] Denis Yarats, Ilya Kostrikov, and Rob Fergus. Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels. In International Conference on Learning Representations, October 2020.
  • Yarats et al. [2021a] Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning. In International Conference on Learning Representations, October 2021a.
  • Zheng et al. [2023] Ruijie Zheng, Xiyao Wang, Yanchao Sun, Shuang Ma, Jieyu Zhao, Huazhe Xu, Hal Daumé III, and Furong Huang. TACO: Temporal Latent Action-Driven Contrastive Loss for Visual Reinforcement Learning. In Advances in Neural Information Processing Systems, volume 36, pages 48203–48225, December 2023.
  • Fujimoto et al. [2023] Scott Fujimoto, Wei-Di Chang, Edward Smith, Shixiang (Shane) Gu, Doina Precup, and David Meger. For SALE: State-Action Representation Learning for Deep Reinforcement Learning. Advances in Neural Information Processing Systems, 36:61573–61624, December 2023.
  • Zhao et al. [2023] Yi Zhao, Wenshuai Zhao, Rinu Boney, Juho Kannala, and Joni Pajarinen. Simplified Temporal Consistency Reinforcement Learning. In Proceedings of the 40th International Conference on Machine Learning, pages 42227–42246. PMLR, July 2023.
  • Schwarzer et al. [2020a] Max Schwarzer, Ankesh Anand, Rishab Goel, R. Devon Hjelm, Aaron Courville, and Philip Bachman. Data-Efficient Reinforcement Learning with Self-Predictive Representations. In International Conference on Learning Representations, October 2020a.
  • Anand et al. [2019] Ankesh Anand, Evan Racah, Sherjil Ozair, Yoshua Bengio, Marc-Alexandre Côté, and R Devon Hjelm. Unsupervised State Representation Learning in Atari. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • Jing et al. [2021] Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding Dimensional Collapse in Contrastive Self-supervised Learning. In International Conference on Learning Representations, October 2021.
  • Zhang et al. [2020] Amy Zhang, Rowan Thomas McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning Invariant Representations for Reinforcement Learning without Reconstruction. In International Conference on Learning Representations, October 2020.
  • Hansen et al. [2022] Nicklas A. Hansen, Hao Su, and Xiaolong Wang. Temporal Difference Learning for Model Predictive Control. In Proceedings of the 39th International Conference on Machine Learning, pages 8387–8406. PMLR, June 2022.
  • Gelada et al. [2019] Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G. Bellemare. DeepMDP: Learning Continuous Latent Space Models for Representation Learning. In Proceedings of the 36th International Conference on Machine Learning, pages 2170–2179. PMLR, May 2019.
  • Rezaei-Shoshtari et al. [2022] Sahand Rezaei-Shoshtari, Rosie Zhao, Prakash Panangaden, David Meger, and Doina Precup. Continuous MDP Homomorphisms and Homomorphic Policy Gradient. In Advances in Neural Information Processing Systems, volume 35, pages 20189–20204, December 2022.
  • Fujimoto et al. [2018] Scott Fujimoto, Herke Hoof, and David Meger. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning, pages 1587–1596. PMLR, July 2018.
  • Mentzer et al. [2023] Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite Scalar Quantization: VQ-VAE Made Simple, September 2023.
  • Abel et al. [2016] David Abel, David Hershkowitz, and Michael Littman. Near Optimal Behavior via Approximate State Abstraction. In Proceedings of The 33rd International Conference on Machine Learning, pages 2915–2923. PMLR, June 2016.
  • Mannor et al. [2004] Shie Mannor, Ishai Menache, Amit Hoze, and Uri Klein. Dynamic abstraction in reinforcement learning via clustering. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML ’04, page 71, New York, NY, USA, July 2004. Association for Computing Machinery.
  • Li et al. [2006] Lihong Li, Thomas Walsh, and Michael Littman. Towards a Unified Theory of State Abstraction for MDPs. In Proceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics, January 2006.
  • Andre and Russell [2002] David Andre and Stuart J. Russell. State abstraction for programmable reinforcement learning agents. In Eighteenth National Conference on Artificial Intelligence, pages 119–125, USA, July 2002. American Association for Artificial Intelligence.
  • Dearden and Boutilier [1997] Richard Dearden and Craig Boutilier. Abstraction and approximate decision-theoretic planning. Artificial Intelligence, 89(1):219–283, January 1997.
  • Singh et al. [1994] Satinder Singh, Tommi Jaakkola, and Michael Jordan. Reinforcement Learning with Soft State Aggregation. In Advances in Neural Information Processing Systems, volume 7. MIT Press, 1994.
  • Higgins et al. [2018] Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a Definition of Disentangled Representations, December 2018.
  • van Hoof et al. [2016] Herke van Hoof, Nutan Chen, Maximilian Karl, Patrick van der Smagt, and Jan Peters. Stable reinforcement learning with autoencoders for tactile and visual data. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3928–3934, October 2016.
  • Watter et al. [2015] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
  • Ghosh and Bellemare [2020] Dibya Ghosh and Marc G. Bellemare. Representations for Stable Off-Policy Reinforcement Learning. In Proceedings of the 37th International Conference on Machine Learning, pages 3556–3565. PMLR, November 2020.
  • Kingma and Welling [2014] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations, 2014.
  • Finn et al. [2016] Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 512–519, May 2016.
  • Higgins et al. [2017] Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving Zero-Shot Transfer in Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning, pages 1480–1490. PMLR, July 2017.
  • Lange et al. [2012] Sascha Lange, Martin Riedmiller, and Arne Voigtländer. Autonomous reinforcement learning on raw visual input data in a real world application. In The 2012 International Joint Conference on Neural Networks (IJCNN), pages 1–8, June 2012.
  • Hafner et al. [2019a] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019a.
  • Rubinstein [1997] Reuven Y Rubinstein. Optimization of computer simulation models with rare events. European Journal of Operational Research, 99(1):89–112, 1997.
  • Zhang et al. [2018] Amy Zhang, Yuxin Wu, and Joelle Pineau. Natural environment benchmarks for reinforcement learning. arXiv preprint arXiv:1811.06032, 2018.
  • Zintgraf et al. [2021] Luisa Zintgraf, Sebastian Schulze, Cong Lu, Leo Feng, Maximilian Igl, Kyriacos Shiarlis, Yarin Gal, Katja Hofmann, and Shimon Whiteson. VariBAD: Variational Bayes-adaptive deep RL via meta-learning. The Journal of Machine Learning Research, 22(1):13198–13236, 2021.
  • Tomar et al. [2021] Manan Tomar, Utkarsh A Mishra, Amy Zhang, and Matthew E Taylor. Learning representations for pixel-based control: What matters and why? arXiv preprint arXiv:2111.07775, 2021.
  • Ni et al. [2023] Tianwei Ni, Benjamin Eysenbach, Erfan SeyedSalehi, Michel Ma, Clement Gehring, Aditya Mahajan, and Pierre-Luc Bacon. Bridging State and History Representations: Understanding Self-Predictive RL. In The Twelfth International Conference on Learning Representations, October 2023.
  • Subramanian et al. [2022] Jayakumar Subramanian, Amit Sinha, Raihan Seraj, and Aditya Mahajan. Approximate information state for approximate planning and reinforcement learning in partially observed systems. The Journal of Machine Learning Research, 23(1):483–565, 2022.
  • Ye et al. [2021] Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, and Yang Gao. Mastering atari games with limited data. Advances in Neural Information Processing Systems, 34:25476–25488, 2021.
  • Schwarzer et al. [2020b] Max Schwarzer, Ankesh Anand, Rishab Goel, R Devon Hjelm, Aaron Courville, and Philip Bachman. Data-efficient reinforcement learning with self-predictive representations. arXiv preprint arXiv:2007.05929, 2020b.
  • Wen and Li [2022] Zixin Wen and Yuanzhi Li. The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning. Advances in Neural Information Processing Systems, 35:24794–24809, December 2022.
  • Lutter et al. [2021] Michael Lutter, Leonard Hasenclever, Arunkumar Byravan, Gabriel Dulac-Arnold, Piotr Trochim, Nicolas Heess, Josh Merel, and Yuval Tassa. Learning dynamics models for model predictive agents. arXiv preprint arXiv:2109.14311, 2021.
  • Kostrikov et al. [2020] Ilya Kostrikov, Denis Yarats, and Rob Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649, 2020.
  • Yarats et al. [2021b] Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021b.
  • Hansen et al. [2023] Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, Robust World Models for Continuous Control. In The Twelfth International Conference on Learning Representations, October 2023.
  • Williams et al. [2015] Grady Williams, Andrew Aldrich, and Evangelos Theodorou. Model predictive path integral control using covariance variable importance sampling. arXiv preprint arXiv:1509.01149, 2015.
  • Bellman [1957] Richard Bellman. A Markovian Decision Process. Journal of Mathematics and Mechanics, 6(5):679–684, 1957.
  • Tassa et al. [2018] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.
  • Hafner et al. [2022] Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with Discrete World Models. In International Conference on Learning Representations, February 2022.
  • Zheng et al. [2024] Ruijie Zheng, Xiyao Wang, Yanchao Sun, Shuang Ma, Jieyu Zhao, Huazhe Xu, Hal Daumé III, and Furong Huang. TACO: Temporal latent action-driven contrastive loss for visual reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
  • Ha and Schmidhuber [2018] David Ha and Jürgen Schmidhuber. Recurrent World Models Facilitate Policy Evolution. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • Hafner et al. [2019b] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning Latent Dynamics for Planning from Pixels. In International Conference on Machine Learning, pages 2555–2565. PMLR, May 2019b.
  • Hafner et al. [2023] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
  • Yarats et al. [2021c] Denis Yarats, Amy Zhang, Ilya Kostrikov, Brandon Amos, Joelle Pineau, and Rob Fergus. Improving Sample Efficiency in Model-Free Reinforcement Learning from Images. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10674–10681, May 2021c.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • Kingma and Ba [2017] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], January 2017. Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015.
  • Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization, July 2016.

Appendices

Appendix A Implementation Details

Architecture

We implemented iQRL in PyTorch [54] and trained all models with the AdamW optimizer [55]. All components (encoder, dynamics, actor, and critic) are MLPs. Following Hansen et al. [44], every intermediate layer is a linear layer followed by LayerNorm [56], and we use Mish activation functions throughout. We found that LayerNorm is largely responsible for the strong performance of our base TD3 implementation. The iQRL architecture of our base model is summarized below.

iQRL(
  (fsq): FSQ(
    (project_in): Identity()
    (project_out): Identity()
  )
  (encoder): ModuleDict(
    (state): Sequential(
      (0): NormedLinear(in_features=O, out_features=256, act=Mish)
      (1): Linear(in_features=256, out_features=512)
    )
  )
  (encoder_tar): ModuleDict(
    (state): Sequential(
      (0): NormedLinear(in_features=O, out_features=256, act=Mish)
      (1): Linear(in_features=256, out_features=512)
    )
  )
  (dynamics): Sequential(
    (0): NormedLinear(in_features=512+A, out_features=512, act=Mish)
    (1): NormedLinear(in_features=512, out_features=512, act=Mish)
    (2): Linear(in_features=512, out_features=512)
  )
  (pi): Actor(
    (_pi): Sequential(
      (0): NormedLinear(in_features=512, out_features=512, act=Mish)
      (1): NormedLinear(in_features=512, out_features=512, act=Mish)
      (2): Linear(in_features=512, out_features=A)
    )
  )
  (pi_tar): Actor(
    (_pi): Sequential(
      (0): NormedLinear(in_features=512, out_features=512, act=Mish)
      (1): NormedLinear(in_features=512, out_features=512, act=Mish)
      (2): Linear(in_features=512, out_features=A)
    )
  )
  (critic): Critic(
    (_q1): Sequential(
      (0): NormedLinear(in_features=512+A, out_features=512, act=Mish)
      (1): NormedLinear(in_features=512, out_features=512, act=Mish)
      (2): Linear(in_features=512, out_features=1)
    )
    (_q2): Sequential(
      (0): NormedLinear(in_features=512+A, out_features=512, act=Mish)
      (1): NormedLinear(in_features=512, out_features=512, act=Mish)
      (2): Linear(in_features=512, out_features=1)
    )
  )
  (critic_tar): Critic(
    (_q1): Sequential(
      (0): NormedLinear(in_features=512+A, out_features=512, act=Mish)
      (1): NormedLinear(in_features=512, out_features=512, act=Mish)
      (2): Linear(in_features=512, out_features=1)
    )
    (_q2): Sequential(
      (0): NormedLinear(in_features=512+A, out_features=512, act=Mish)
      (1): NormedLinear(in_features=512, out_features=512, act=Mish)
      (2): Linear(in_features=512, out_features=1)
    )
  )
)

where $O$ is the dimensionality of the observation space and $A$ is the dimensionality of the action space.
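As a rough sanity check on the summary above, the following sketch counts the learnable parameters of the encoder for a given observation dimensionality. It assumes that NormedLinear is a linear layer with bias followed by a LayerNorm with learnable scale and shift, which the printed summary does not show explicitly. The helper names are ours and are not taken from the iQRL code.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def normed_linear_params(n_in: int, n_out: int) -> int:
    """Linear layer plus LayerNorm scale and shift (assumed layout)."""
    return linear_params(n_in, n_out) + 2 * n_out


def encoder_params(obs_dim: int, hidden: int = 256, latent: int = 512) -> int:
    """Encoder from the summary: NormedLinear(O, 256) -> Linear(256, 512)."""
    return normed_linear_params(obs_dim, hidden) + linear_params(hidden, latent)


if __name__ == "__main__":
    # The Walker tasks have a 24-dimensional observation space.
    print(encoder_params(24))
```

Under these assumptions, the encoder size grows only linearly with $O$, since the observation enters through a single 256-unit layer.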

Hyperparameters

Table 2 lists all of the hyperparameters for training iQRL which were used for the main experiments and the ablations.

Table 2: iQRL hyperparameters. We kept most hyperparameters fixed across all tasks.
| Hyperparameter | Value | Description |
|---|---|---|
| Training | | |
| Action repeat | 2 | |
| Max episode length | 500 | Action repeat makes this 1000 |
| Num. eval episodes | 10 | |
| Random episodes | 10 | Num. random episodes at start |
| TD3 | | |
| Actor update freq. | 2 | Update actor less than critic |
| Batch size | 256 | |
| Buffer size | $10^{6}$ | |
| Discount factor $\gamma$ | 0.99 | |
| Exploration noise | $\mathrm{Linear}(1.0, 0.1, 50)$ (easy), $\mathrm{Linear}(1.0, 0.1, 150)$ (medium), $\mathrm{Linear}(1.0, 0.1, 500)$ (hard) | |
| Learning rate | $3 \times 10^{-4}$ | |
| MLP dims | $[512, 512]$ | For actor/critic/dynamics |
| Momentum coef. ($\tau$) | 0.005 | |
| Noise clip | 0.3 | |
| N-step TD | 1 or 3 | |
| Policy noise | 0.2 | |
| Update-to-data (UTD) ratio | 1 | |
| Encoder | | |
| Discount factor $\gamma_{\text{rep}}$ | 0.9 | |
| Encoder learning rate | $10^{-4}$ | |
| Encoder MLP dims | $[256]$ | |
| Encoder momentum coef. ($\tau$) | 0.005 | |
| FSQ levels | $[8, 8]$ | |
| Horizon ($H$) | 5 | For representation learning |
| Latent dimension ($d$) | 512, or 1024 (Humanoid/Dog) | |
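The exploration-noise entries of the form Linear(1.0, 0.1, T) in Table 2 describe a noise scale that decays linearly from 1.0 to 0.1. A minimal sketch of such a schedule is given below. The function name is ours, and the unit of the duration T is left generic, since the table does not state it.

```python
def linear_schedule(init: float, final: float, duration: float, t: float) -> float:
    """Interpolate linearly from `init` to `final` over `duration`, then hold."""
    frac = min(max(t / duration, 0.0), 1.0)  # progress clipped to [0, 1]
    return (1.0 - frac) * init + frac * final


if __name__ == "__main__":
    # Linear(1.0, 0.1, 50): the setting listed for the easy tasks.
    for t in (0, 25, 50, 100):
        print(t, linear_schedule(1.0, 0.1, 50, t))
```

Harder tasks use a longer duration, so the agent keeps exploring with large noise for more of the training run.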

Statistical significance

We used five seeds for the main figures and at least three seeds for all ablations. The shaded areas show the 95% confidence intervals, which correspond to approximately two standard errors of the mean.
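For concreteness, an interval of this kind can be computed as sketched below. This is a normal-approximation interval (mean plus or minus 1.96 standard errors), which is our reading of "approximately two standard errors". The plotting code may differ, and the per-seed returns in the example are hypothetical values, not results from the paper.

```python
import math
import statistics


def mean_and_ci95(returns):
    """Mean and half-width of a normal-approximation 95% interval."""
    mean = statistics.fmean(returns)
    sem = statistics.stdev(returns) / math.sqrt(len(returns))  # standard error
    return mean, 1.96 * sem


if __name__ == "__main__":
    # Hypothetical per-seed returns at one evaluation point (not from the paper).
    seeds = [612.0, 640.0, 598.0, 655.0, 620.0]
    mean, half_width = mean_and_ci95(seeds)
    print(f"{mean:.1f} +/- {half_width:.1f}")
```

With only three to five seeds, a Student-t interval would be somewhat wider than this approximation.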

Hardware

We used Nvidia A100 and AMD Instinct MI250X GPUs to run our experiments. Each experiment was run on a single GPU with a single-digit number of CPU workers.

Open-source code

For full details of the implementation, model architectures, and training, please check the code, which is available in the submitted supplementary material and will be made public upon acceptance to guarantee seamless reproducibility.


Appendix B Baselines

In this section, we provide further details of the baselines we compare against. In particular, we provide details of how we modified the original codebases and tuned the hyperparameters in an effort to offer a fair comparison.

  • Temporal Consistency Reinforcement Learning (TCRL, [7]) is a reinforcement learning algorithm that, like iQRL, consists of four components: an encoder, a transition function, a policy, and a value function. TCRL uses a temporal consistency loss, similar to those used in model-based reinforcement learning, to learn a representation that is then used for model-free policy and value function training. The most crucial difference between TCRL and iQRL is that we replace the reward prediction head in the transition function with the FSQ-based normalization scheme. We ran the TCRL experiments in our paper with the official PyTorch implementation (https://github.com/zhaoyi11/tcrl). For the DeepMind Control Suite (DMC) tasks, we used the tuned hyperparameters from the original paper.

  • Temporal Action-driven Contrastive Learning (TACO, [5]) is a temporal contrastive learning framework that learns a latent representation of states and actions with a contrastive loss that optimizes the mutual information between the representations of current states and the following action sequences, and those of the corresponding future states. TACO was primarily designed for vision-based tasks, whereas our benchmarks are state-based. We adapted TACO to the state-based setting by increasing the learning rate and update-to-data ratios to match those of iQRL. We also replaced their CNN-based encoder with the MLP-based encoder of iQRL. Then, we performed a grid search over feature dimensions of 50 and 128, hidden dimensions of 256, 512, and 1024, and frame stacking and no frame stacking. We found the combination of a feature dimension of 50 and a hidden dimension of 1024 without frame stacking to perform the best.

  • Twin Delayed DDPG (TD3, [15]) is a model-free RL algorithm for continuous control that extends deep deterministic policy gradient (DDPG) to deal with value overestimation bias. Compared to DDPG, TD3 uses two critics and takes the minimum over the two for training, adds clipped noise to the actions selected for bootstrapping (policy smoothing), and updates the actor less frequently than the critics. iQRL is based on TD3: we simply replace the observations with their corresponding latent representations by mapping them through our encoder. This baseline uses our own TD3 implementation, which obtains very strong results. Comparing against it allows us to investigate the impact of representation learning on sample efficiency.

  • TD7 [6] is a model-free reinforcement learning algorithm for continuous control that builds on TD3. It adds a representation learning method, state-action learned embeddings (SALE), in which the embeddings are learned using a temporal consistency term in the latent state. TD7's other improvements over TD3 are prioritized experience replay and checkpointing. TD7 was initially evaluated on MuJoCo. To adapt it to the DeepMind Control Suite, we added action repeats, which are essential for good performance on DMC. We then compared the original hyperparameters of TD7 to those of iQRL and found that iQRL's hyperparameters performed best, so we used those for the final evaluation. In particular, the exploration noise decay of iQRL was crucial for high performance in the DMC environments, and without it, TD7 struggled. Note that both TD7 and iQRL use TD3 as the underlying algorithm, allowing us to reliably compare the impact of SALE and our FSQ-based representations. We used the official PyTorch implementation of TD7 (https://github.com/sfujim/TD7).
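The three TD3 ingredients described in the list above (two critics with a minimum, clipped target noise, and bootstrapping from target networks) can be sketched for a scalar action as follows. This is an illustrative one-step target only, not the iQRL implementation. The function and argument names are ours, the action range of $[-1, 1]$ is an assumption, and the default values are taken from the hyperparameter table. In iQRL, the state passed in would be the latent representation produced by the encoder.

```python
import random


def clip(x, lo, hi):
    """Clamp a scalar to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def td3_target(reward, next_state, actor_tar, q1_tar, q2_tar, noise,
               gamma=0.99, noise_clip=0.3, done=False):
    """One-step TD3 target for a scalar action (illustrative only)."""
    # Target policy smoothing: perturb the target action with clipped noise.
    perturbation = clip(noise, -noise_clip, noise_clip)
    action = clip(actor_tar(next_state) + perturbation, -1.0, 1.0)
    # Clipped double Q-learning: bootstrap from the smaller target critic.
    q_next = min(q1_tar(next_state, action), q2_tar(next_state, action))
    return reward + (0.0 if done else gamma * q_next)


if __name__ == "__main__":
    # Toy target networks, purely for demonstration.
    actor_tar = lambda s: 0.5
    q1_tar = lambda s, a: 10.0 * a
    q2_tar = lambda s, a: 8.0 * a + 1.0
    noise = random.gauss(0.0, 0.2)  # policy noise with standard deviation 0.2
    print(td3_target(1.0, None, actor_tar, q1_tar, q2_tar, noise))
```

The delayed actor update is not shown here; it only affects how often the policy is optimized relative to the critics.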


Appendix C Tasks

We evaluate our method in 20 tasks from the DeepMind Control suite [47]. Table 3 provides details of the environments we used, including the dimensionality of the observation and action spaces.

Table 3: DMControl. We consider a total of 20 continuous control tasks from the DeepMind Control suite.
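| Task | Observation dim | Action dim | Sparse? |
|---|---|---|---|
| Acrobot Swingup | 6 | 1 | N |
| Cheetah Run | 17 | 6 | N |
| Cup Catch | 8 | 2 | Y |
| Dog Run | 223 | 38 | N |
| Dog Trot | 223 | 38 | N |
| Dog Stand | 223 | 38 | N |
| Dog Walk | 223 | 38 | N |
| Fish Swim | 24 | 5 | N |
| Hopper Hop | 15 | 4 | N |
| Hopper Stand | 15 | 4 | N |
| Humanoid Run | 67 | 24 | N |
| Humanoid Stand | 67 | 24 | N |
| Humanoid Walk | 67 | 24 | N |
| Quadruped Run | 78 | 12 | N |
| Quadruped Walk | 78 | 12 | N |
| Reacher Easy | 6 | 2 | Y |
| Reacher Hard | 6 | 2 | Y |
| Walker Run | 24 | 6 | N |
| Walker Stand | 24 | 6 | N |
| Walker Walk | 24 | 6 | N |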


Appendix D Ablation of Codebook Size

In this section, we evaluate how the size of the codebook $|\mathcal{C}|$ influences training. We configure the codebook size indirectly via the FSQ levels hyperparameter $\mathcal{L} = \{L_1, \ldots, L_c\}$, because the codebook size is given by $|\mathcal{C}| = \prod_{i=1}^{c} L_i$. The top row of Fig. 6 compares the training curves for different codebook sizes. The algorithm's performance is not particularly sensitive to the codebook size. A codebook that is too large can result in slower learning. The best codebook size varies between environments. The most difficult environment, Humanoid Run, benefits from the largest codebook.
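To make the relation between the FSQ levels and the codebook size concrete, the sketch below computes the codebook size as the product of the levels and quantizes a single scalar channel by bounding it with tanh and rounding. It is a simplified reading of finite scalar quantization [Mentzer et al., 2023]: the centering shift, the rescaling of the rounded values, and the straight-through gradient estimator are omitted, and the function names are ours.

```python
import math


def codebook_size(levels):
    """Number of distinct codes: the product of the per-channel levels."""
    return math.prod(levels)


def fsq_round(z: float, num_levels: int, eps: float = 1e-3) -> int:
    """Bound a scalar with tanh, then round it to one of `num_levels` integers."""
    half_l = (num_levels - 1) * (1.0 - eps) / 2.0
    # For an even number of levels, shift by 0.5 so that exactly
    # `num_levels` integers are reachable after rounding.
    offset = 0.0 if num_levels % 2 == 1 else 0.5
    return round(math.tanh(z) * half_l - offset)


if __name__ == "__main__":
    levels = [8, 8]  # default FSQ levels in the hyperparameter table
    print(codebook_size(levels))
    # Sweep one channel to list the integer values it can take.
    values = {fsq_round(i / 100.0, 8) for i in range(-500, 501)}
    print(sorted(values))
```

Because the codes are defined implicitly by rounding, no codebook has to be stored or learned, and changing the codebook size only requires changing the list of levels.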

Given that a codebook has a particular size, we can gain insights into how quickly iQRL’s encoder starts to activate all of the codebook. The connection between the codebook size and the activeness of the codebook is intuitive: the middle row of Fig. 6 shows that the smaller the codebook, the larger the active proportion.
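One simple way to measure the active proportion is to count how many distinct codes appear in a batch of quantized latents and divide by the codebook size. The sketch below does this for codes given as tuples of per-channel level indices. The exact measurement used for Fig. 6 is not specified in this appendix, so this definition and the helper names are assumptions.

```python
def code_index(codes, levels):
    """Mixed-radix index of a code whose i-th entry lies in [0, levels[i])."""
    index, base = 0, 1
    for c, num_levels in zip(codes, levels):
        index += c * base
        base *= num_levels
    return index


def active_fraction(batch_codes, levels):
    """Fraction of the codebook hit at least once by a batch of codes."""
    size = 1
    for num_levels in levels:
        size *= num_levels
    used = {code_index(codes, levels) for codes in batch_codes}
    return len(used) / size


if __name__ == "__main__":
    levels = [8, 8]
    # Hypothetical batch of per-channel level indices (not from the paper).
    batch = [(0, 0), (1, 0), (0, 0), (7, 7)]
    print(active_fraction(batch, levels))
```

With a fixed batch size, a smaller codebook is easier to cover, which is consistent with the trend described above.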

In the bottom row of Fig. 6, we evaluate how different codebook sizes affect the encoder’s ability to preserve the rank of the representation. We see that the rank of the representation is maintained no matter the codebook size.


Figure 6: Codebook size ablation. We compare how the codebook size affects the performance of iQRL (top), the percentage of the codebook that is active during training (middle), and how the different codebook sizes affect the encoder's ability to preserve the rank of the representation (bottom). In general, smaller codebooks become fully active faster than larger codebooks, and the rank of the representation is maintained for all codebook sizes. We plot the mean and the 95% confidence intervals (shaded) across 3 random seeds for all environments.


Appendix E Ablation of Latent Dimension $d$

This section investigates how the latent dimension $d$ affects the behavior and performance of iQRL in four different environments. The latent dimension $d$ is the dimension of the representation associated with each FSQ level, both before and after quantization is applied. In the top row of Fig. 7, we see that the performance of our algorithm is robust to the latent dimension $d$, although a latent dimension that is too small can result in inferior performance, especially in the more difficult environments. The bottom row of Fig. 7 demonstrates that iQRL learns to use the complete codebook irrespective of the latent dimension. However, with a larger $d$, the codebook can take slightly longer to become fully active.


Figure 7: Latent dim $d$ ablation. We compare how the latent dimension $d$ affects the performance of iQRL (top) and the percentage of the codebook that is active during training (bottom). In general, our algorithm is robust to the latent dimension of the representation, although in more difficult environments, such as Humanoid Walk, a $d$ that is too small can harm the agent's performance. We plot the mean and the 95% confidence intervals (shaded) across 3 random seeds for all environments.


Appendix F Further Ablations


Figure 8: Adding a reward head is not enough to prevent loss of rank. We show that removing our quantization scheme leads to dimensional collapse, measured in terms of the rank of the representation, and that adding a reward prediction head to iQRL without quantization is insufficient to counteract this and maintain full rank. We plot the mean and the 95% confidence intervals (shaded) across 3 random seeds for all environments.


Figure 9: Replacing EMA encoder with stop gradient. We show that removing iQRL's EMA encoder and replacing it with only stop gradient hurts performance in DMC tasks. This is particularly apparent in the Acrobot Swingup task. We plot the mean and the 95% confidence intervals (shaded) across 3 random seeds for all environments.
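For reference, an exponential moving average (EMA) update of target parameters can be sketched as below, with parameters represented as plain lists of floats. The function name is ours, and the default $\tau = 0.005$ is the encoder momentum coefficient from the hyperparameter table. The actual implementation operates on network parameters rather than lists.

```python
def ema_update(target, online, tau=0.005):
    """Move each target parameter a fraction `tau` toward its online value."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


if __name__ == "__main__":
    target = [0.0, 1.0]
    online = [1.0, 1.0]
    for _ in range(3):
        target = ema_update(target, online)
    print(target)
```

In this sketch, setting tau to 1 makes the target equal to the current online parameters, which corresponds to the stop-gradient-only variant ablated in Fig. 9.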


Figure 10: Adding a projection head decreases sample efficiency. We show that adding a projection head to iQRL, similar to what is done in SPR [8], decreases iQRL's sample efficiency. We plot the mean and the 95% confidence intervals (shaded) across 3 random seeds for all environments.


Appendix G Further DMC Results

Fig. 11 compares iQRL to the baselines in the 20 DMC tasks. iQRL’s representation learning significantly improves sample efficiency when compared to TD3. Note that iQRL uses the same TD3 implementation with the same hyperparameters, so the only difference is our representation learning. iQRL also outperforms TCRL in terms of sample efficiency, even without the reward prediction head. Our experiments indicate that this improvement is due to our quantization and the inclusion of LayerNorm in our encoder. We compare iQRL to TACO (which uses a contrastive loss) and observe that iQRL outperforms TACO in most environments. TACO seems to particularly struggle in the Dog tasks. Finally, iQRL outperforms TD7, a state-of-the-art representation learning method for state-based RL.


Figure 11: DeepMind Control results. iQRL performs well across a variety of DMC tasks. We plot the mean (solid line) and the 95% confidence intervals (shaded) across 5 random seeds, where each seed averages over 10 evaluation episodes.