iQRL – Implicitly Quantized Representations for Sample-efficient Reinforcement Learning

Aidan Scannell
Aalto University
[email protected]
Kalle Kujanpää
Aalto University
[email protected]
Yi Zhao
Aalto University
[email protected]
Mohammadreza Nakhaei
Aalto University
[email protected]
Arno Solin
Aalto University
[email protected]
Joni Pajarinen
Aalto University
[email protected]

Abstract

Learning representations for reinforcement learning (RL) has shown much promise for continuous control. We propose an efficient representation learning method using only a self-supervised latent-state consistency loss. Our approach employs an encoder and a dynamics model to map observations to latent states and predict future latent states, respectively. We achieve high performance and prevent representation collapse by quantizing the latent representation such that the rank of the representation is empirically preserved. Our method, named iQRL: implicitly Quantized Reinforcement Learning, is straightforward, compatible with any model-free RL algorithm, and demonstrates excellent performance by outperforming other recently proposed representation learning methods in continuous control benchmarks from DeepMind Control Suite.

1 Introduction

Reinforcement learning (RL, e.g., [1]) has shown much promise for solving complex continuous control tasks. However, applying RL in real-world environments is challenging as it typically requires millions of data points, which can be impractical, i.e. RL is sample inefficient. On the other hand, representation learning has become a widely adopted solution for improving sample efficiency in deep learning. The core idea is to learn features which capture the underlying structure and patterns of the data. In the context of RL, such features can be learned independently from the downstream task. Whilst representation learning has had successes in RL, these have mainly been restricted to image-based observations (e.g., CURL [2], DrQ [3], DrQ-v2 [4], and TACO [5]).

The investigation of representation learning for state-based RL is much less common. This is likely due to the fact that learning a compact representation of an already compact state vector seems unnecessary. However, recent work by Fujimoto et al. [6] and Zhao et al. [7] suggests that the difficulty of a task is due to the complexity of the underlying transition dynamics, as opposed to the size of the observation space. As such, investigating representation learning for state-based RL is a promising research direction.

Recently, TCRL [7] and SPR [8] have obtained state-of-the-art performance on continuous control benchmarks by learning representations with self-supervised losses. Self-supervised learning (SSL) approaches (which do not reconstruct observations) attempt to learn good features without labels [9]. Whilst they can learn robust representations, self-supervised losses are susceptible to a problem known as representation collapse (see Definition 3.1), where the encoder learns to map all observations to a constant latent representation [10]. As such, when leveraging SSL approaches to learn representations for RL, it is common to combine the self-supervised latent-state consistency loss with other loss terms, such as minimizing the reward prediction error in the latent space [11, 7, 12, 13, 14]. This helps to prevent representation collapse at the cost of learning a task-specific representation.

Figure 1: Overview. iQRL is a stand-alone representation learning technique that is compatible with any model-free RL algorithm (we use TD3 [15]). Importantly, iQRL quantizes the latent representation with Finite Scalar Quantization (FSQ), using only a self-supervised latent-state consistency loss, i.e. no decoder (see Eq. 5). Making the latent representation discrete with an implicit codebook contributes to the very high sample efficiency of iQRL and empirically prevents representation collapse. Thanks to the FSQ-based quantization, iQRL does not need a reward prediction head to prevent representation collapse, a well-known issue with self-supervised learning, making the representation task-agnostic.

In this paper, we propose a simple representation learning technique which learns a task-agnostic representation using only a self-supervised loss. It is based solely on the latent-state consistency loss, i.e. a commonly used self-supervised loss for continuous RL. Importantly, our method empirically prevents representation collapse as it preserves the rank of the representation. We accomplish this by quantizing our latent representation with Finite Scalar Quantization [16], without using any reconstruction loss. As a result, our latent space is bounded and associated with an implicit codebook, whose size we can control. Our method can be combined with any model-free RL method (we use TD3, [15]). See Fig. 1 for an overview of our representation learning method. Importantly, our method (i) alleviates representation collapse, (ii) demonstrates excellent sample efficiency outperforming TCRL and TD7 on a wide range of different continuous control tasks, (iii) is simple to implement, and (iv) learns a task-agnostic representation that could be helpful in downstream tasks.

2 Related Work

In this section, we recap methods for representation learning in RL. In particular, we motivate why researchers are moving towards learning representations using self-supervised learning. Then, as our method builds upon self-supervised representation learning, which is susceptible to representation and dimensional collapse (see Definitions 3.1 and 3.2), we review contrastive self-supervised representation learning approaches, which offer an alternative way of preventing representation collapse.

Representation learning

Learning representations for RL has been investigated for decades [17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. However, these approaches are usually limited to simple environments. More recently, Fujimoto et al. [6] proposed TD7, an extension of TD3 which learns state and action embeddings and then performs TD3 with this representation, making it highly similar to our method, which also uses TD3 as the base algorithm. However, their method uses a self-supervised loss with no explicit mechanism to prevent representation collapse. In contrast to TD7 and motivated by representation collapse, we quantize our latent space, which we show empirically prevents representation collapse.

Observation reconstruction

A prominent idea in both model-based and model-free RL has been to learn latent representations with reconstruction objectives (e.g. VAE, [27]) [28, 29, 30, 25, 31, 32]. However, as these approaches use observation reconstruction to learn a representation, their latent representation contains information about the observation which cannot be controlled by the agent and is not relevant for solving the task, which distorts the optimization landscape [33, 34]. In our experiments, we show that learning representations with observation reconstruction for model-free RL not only harms sample efficiency, but in the complex DMC Dog Run task, it can prevent the agent from solving the task.

Latent-state consistency

How can effective representations be learned efficiently without resorting to reconstruction? A common solution has been to attach an auxiliary loss to the RL objective and perform representation learning [35, 36]. A promising approach for learning suitable representations is the use of self-predictive abstractions, where the model is trained to predict future latent states through an auxiliary loss [37]. Ye et al. [38] introduce a self-supervised consistency loss on the learned latent representation. Instead of relying on a reconstruction-based loss function, Schwarzer et al. [39] propose a cosine similarity loss between predicted future latent states and the true future latent states and then perform Q-learning in the learned latent space. Our approach shares similarities with SPR [39]; however, we focus on state-based observations, which leads us to quantize our representation to prevent representation collapse, instead of using a projection head [40].

Latent-state consistency for model-based RL

Similarly to model-free RL, using the reconstruction loss for learning representation is also unreliable in model-based RL [41] and can have a detrimental effect on the performance of model-based methods in various benchmarks [42, 43]. Therefore, in the context of model-based RL, TD-MPC/TD-MPC2 [12, 44] use a consistency loss to learn representations for planning with Model Predictive Path Integral control together with reward and value functions learned through temporal difference methods [45]. Zhao et al. [7] show that the planning component of TD-MPC is not strictly necessary for high performance and applying model-free RL on top of the self-consistent representations is sufficient for performance competitive with state-of-the-art. We build on top of TCRL to show that we can combat representation collapse through latent-space quantization. As a result, we can drop the reward prediction head to learn a task-agnostic representation.

Contrastive learning

An alternative approach to preventing representation collapse in self-supervised learning is to use contrastive losses. In the context of RL, this was done by CURL [2] and TACO [5]. Whilst CURL and TACO are designed for image-based observations, their contrastive learning approaches could still hold value in state-based RL. The main idea in contrastive learning is to prevent representation collapse (see Definition 3.1) by pushing the latent vectors associated with different observations away from each other. Nevertheless, contrastive methods still experience dimensional collapse [10]. In contrast to TACO, we do not use a contrastive loss and instead leverage quantization to help prevent both representation and dimensional collapse. To offer a fair comparison between contrastive learning and our quantization scheme, we compare our method against a version of TACO tuned for state-based RL.

3 Preliminaries

In this section, we introduce and formally define representation collapse.

Representation collapse

Self-supervised learning methods learn representations by minimizing distances between two embedding vectors. As such, there is a trivial solution where the encoder outputs a constant for all inputs. Formally, this can be defined as follows:

Definition 3.1 (Complete representation collapse).

Given an encoder $e_{\theta}:\mathcal{O}\rightarrow\mathcal{Z}$ which maps observations ${\bm{o}}\in\mathcal{O}$ to latent states ${\bm{z}}\in\mathcal{Z}$ , the representation is said to be completely collapsed when the latent representation is constant for all observations, i.e., $e_{\theta}({\bm{o}})=c,\forall{\bm{o}}\in\mathcal{O}$ .

In the context of SSL, Jing et al. [10] investigated another type of representation collapse known as dimensional collapse, which is defined as follows:

Definition 3.2 (Dimensional collapse).

Given an encoder $e_{\theta}:\mathcal{O}\rightarrow\mathcal{Z}$ which maps observations ${\bm{o}}\in\mathcal{O}$ to latent states ${\bm{z}}\in\mathcal{Z}=\mathbb{R}^{d}$ of dimension $d$, the representation is said to be dimensionally collapsed when the latent vectors span only a lower-dimensional subspace of $\mathbb{R}^{d}$.

Whilst complete representation collapse is a clear issue when learning representations for RL, it is not immediately obvious if dimensional collapse is an issue because the goal of representation learning is often considered to be learning a lower-dimensional representation of the observations. In other words, there is a trade-off: We want to learn a lower-dimensional representation, but at the same time, we want to ensure that the representation contains all the information required for predicting future states and, thus, state values [36]. Our experiments show that whilst dimensional collapse is not always an issue, in some more complex environments, it can prevent agents from learning to solve a task (see Fig. 3).

Algorithm 1 iQRL

Input: Encoder $e_{\theta}$, dynamics $d_{\phi}$, critics $\{q_{\psi_{1}},q_{\psi_{2}}\}$, policy $\pi_{\eta}$, learning rate $\alpha$, target network update rate $\tau$
for $i=1$ to $N_{\text{episodes}}$ do
    $\mathcal{D}\leftarrow\mathcal{D}\cup\{{\bm{o}}_{t},{\bm{a}}_{t},{\bm{o}}_{t+1},r_{t+1}\}^{T}_{t=0}$    $\triangleright$ Collect data in environment
    for $i=1$ to $T$ do
        $[\theta,\phi]\leftarrow[\theta,\phi]+\alpha\nabla\left(\mathcal{L}_{\text{rep}}(\theta,\phi;\mathcal{D})\right)$    $\triangleright$ Update representation, Eq. 5
        $\psi\leftarrow\psi+\alpha\nabla\left(\mathcal{L}_{q}(\psi;\mathcal{D})\right)$    $\triangleright$ Update critic, Eq. 7
        if $i$ % 2 == 0 then
            $\eta\leftarrow\eta+\alpha\nabla\left(\mathcal{L}_{\pi}(\eta;\mathcal{D})\right)$    $\triangleright$ Update actor less frequently than critic, Eq. 8
        end if
        $[\bar{\theta},\bar{\psi},\bar{\eta}]\leftarrow(1-\tau)[\bar{\theta},\bar{\psi},\bar{\eta}]+\tau[{\theta},{\psi},\eta]$    $\triangleright$ Update target networks
    end for
end for
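The update schedule of Algorithm 1 can be illustrated with a small sketch. It is not the authors' implementation: gradient steps are replaced by counters, parameters are plain lists of floats, and the names `polyak_update` and `run_inner_loop` as well as the value `tau=0.005` are assumptions chosen for illustration. It only shows that the representation and critic are updated at every step, the actor at every second step, and the target parameters by a soft (exponential moving average) update.

```python
def polyak_update(target, online, tau):
    """Soft update: target <- (1 - tau) * target + tau * online, element-wise."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


def run_inner_loop(num_steps, tau=0.005):
    """Mimic the inner loop of Algorithm 1 with counters instead of gradient steps.

    tau is a hypothetical value; the online parameters are held fixed here.
    """
    counts = {"representation": 0, "critic": 0, "actor": 0}
    online = [1.0, -2.0]  # stand-in for the online parameters
    target = [0.0, 0.0]   # stand-in for the target parameters
    for i in range(1, num_steps + 1):
        counts["representation"] += 1  # the Eq. 5 update would happen here
        counts["critic"] += 1          # the Eq. 7 update would happen here
        if i % 2 == 0:
            counts["actor"] += 1       # the Eq. 8 update, every second step
        target = polyak_update(target, online, tau)
    return counts, target


counts, target = run_inner_loop(10)
print(counts)
print(target)
```

With the online parameters held fixed, the target parameters move a fraction `tau` of the remaining distance towards them at each step, which is why they lag behind the online network.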

4 Method

In this section, we detail our method, named implicitly Quantized Reinforcement Learning (iQRL). iQRL is conceptually simple: it (i) learns a representation of the observation space and then (ii) performs model-free RL (e.g., TD3) on this representation. See Fig. 1 and Algorithm 1.

We consider Markov Decision Processes (MDPs, [46]) $\mathcal{M}=(\mathcal{O},\mathcal{A},\mathcal{P},\mathcal{R},\gamma)$ , where an agent receives an observation ${\bm{o}}_{t}\in\mathcal{O}$ at time step $t$ , performs an action ${\bm{a}}_{t}\in\mathcal{A}$ , and then obtains the next observation ${\bm{o}}_{t+1}=\mathcal{P}(\cdot\mid{\bm{o}}_{t},{\bm{a}}_{t})$ and reward $r_{t}=\mathcal{R}({\bm{o}}_{t},{\bm{a}}_{t})$ . The discount factor is denoted $\gamma\in[0,1)$ .

Method components

iQRL has four main components which we wish to learn:

Encoder:	$\displaystyle{\bm{z}}_{t}$	$\displaystyle=f(e_{\theta}({\bm{o}}_{t}))$	(1)
Dynamics:	$\displaystyle\hat{{\bm{z}}}_{t+1}$	$\displaystyle={f}({{\bm{z}}}_{t}+d_{\phi}({\bm{z}}_{t},{\bm{a}}_{t}))$	(2)
Value:	$\displaystyle{\bm{q}}_{t}$	$\displaystyle=\mathbf{q}_{\psi}({\bm{z}}_{t},{\bm{a}}_{t})$	(3)
Policy:	$\displaystyle{\bm{a}}_{t}$	$\displaystyle\sim\pi_{\eta}({\bm{z}}_{t})$	(4)

The encoder $e_{\theta}$ and latent-space dynamics model $d_{\phi}$ are responsible for representation learning. $f(\cdot)$ denotes our quantization scheme, which implicitly quantizes our latent representation (more details to follow). The encoder (with quantization) $f\circ e_{\theta}(\cdot)$ maps observations ${\bm{o}}_{t}$ to latent states ${\bm{z}}_{t}$ and is responsible for learning a representation which can aid RL. The latent-space dynamics model (with quantization) $d_{\phi}(\cdot)$ predicts the next latent states $\hat{\bm{z}}_{t+1}$ given a latent state ${\bm{z}}_{t}$ and an action ${\bm{a}}_{t}$ . Its sole purpose is to aid representation learning by making the latent states temporally consistent. Note that we do not use it for model-based RL. Once we have the representation learned by our encoder, we map all observations to the latent space and perform model-free RL in this latent space. Throughout this paper, we use Twin Delayed Deep Deterministic Policy Gradient (TD3, [15]) as the base algorithm. It consists of two state-action value functions $\{q_{\psi_{1}},q_{\psi_{2}}\}$ , known as critics, and a deterministic actor $\pi_{\eta}$ . Following prior works [4], we use a linear exploration noise schedule which decays from $1$ to $0.1$ during training.
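The linear exploration noise schedule mentioned above can be sketched as follows. This is a minimal illustration, assuming the decay is spread over a fixed number of environment steps and then held constant; the function name `exploration_std` and the default `decay_steps` are hypothetical and are not taken from the paper.

```python
def exploration_std(step, start=1.0, end=0.1, decay_steps=50_000):
    """Linearly interpolate the noise scale from `start` to `end`.

    decay_steps is a hypothetical value; after it, the scale stays at `end`.
    """
    fraction = min(max(step / decay_steps, 0.0), 1.0)
    return start + fraction * (end - start)


for step in (0, 25_000, 50_000, 100_000):
    print(step, exploration_std(step))
```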

Representation learning

Our representation learning uses the latent-state consistency loss,

$\displaystyle\mathcal{L}_{\text{rep}}(\theta,\phi;\tau)=\sum_{h=0}^{H-1}\gamma_{\text{rep}}^{h}\left(\frac{f(\hat{{\bm{z}}}_{t+h}+d_{\phi}(\hat{{\bm{z}}}_{t+h},{\bm{a}}_{t+h}))}{\|f(\hat{{\bm{z}}}_{t+h}+d_{\phi}(\hat{{\bm{z}}}_{t+h},{\bm{a}}_{t+h}))\|_{2}}\right)^{\top}\left(\frac{f(e_{\bar{\theta}}({\bm{o}}_{t+h+1}))}{\|f(e_{\bar{\theta}}({\bm{o}}_{t+h+1}))\|_{2}}\right),$	(5)

which is trained so that the next latent state predicted by the dynamics model $\hat{{\bm{z}}}_{t+1}=f(\hat{{\bm{z}}}_{t}+d_{\phi}(\hat{{\bm{z}}}_{t},{\bm{a}}_{t}))$ aligns, in terms of cosine similarity, with the target next latent state given by the momentum encoder $\bar{{\bm{z}}}_{t+1}=f(e_{\bar{\theta}}({\bm{o}}_{t+1}))$. The predicted latent states are obtained by applying the dynamics model recursively, i.e. with multi-step predictions in the latent space. The initial mapping to the latent space $\hat{{\bm{z}}}_{0}=f(e_{\theta}({\bm{o}}_{0}))$ uses the online encoder, which is trained jointly with the dynamics model $d_{\phi}(\hat{{\bm{z}}}_{t},{\bm{a}}_{t})$. The target $e_{\bar{\theta}}({\bm{o}}_{t+1})$ is calculated with the momentum encoder, which uses an exponential moving average (EMA) of the encoder’s weights $\bar{\theta}\leftarrow(1-\tau)\bar{\theta}+\tau\theta$, where $\tau$ denotes the target network update rate. Note that we do not use reward or value prediction for learning our representation and, as a result, our representation is task-agnostic.
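To make the multi-step structure of Eq. 5 concrete, the following sketch evaluates the discounted sum of cosine similarities along a latent rollout. It is an illustration under stated assumptions rather than the authors' code: latent states are plain Python lists, `dynamics` and `quantize` are caller-supplied stand-ins for $d_{\phi}$ and $f$, the small constant `eps` is added here only to avoid division by zero, and all function names are hypothetical. In training, the optimizer would be set up so that predictions become aligned with the targets.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity of two vectors; eps guards against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def consistency_objective(z0, actions, targets, dynamics, quantize, gamma_rep=0.9):
    """Discounted sum of cosine similarities along an H-step latent rollout.

    z0:      initial latent from the online encoder, f(e_theta(o_t))
    actions: a_t, ..., a_{t+H-1}
    targets: momentum-encoder latents for o_{t+1}, ..., o_{t+H}
    """
    z_hat, total = z0, 0.0
    for h, (action, z_bar) in enumerate(zip(actions, targets)):
        residual = dynamics(z_hat, action)
        # Residual prediction followed by quantization, as in Eq. 2.
        z_hat = quantize([z + r for z, r in zip(z_hat, residual)])
        total += (gamma_rep ** h) * cosine_similarity(z_hat, z_bar)
    return total


# Toy stand-ins: the dynamics residual equals the action, and rounding plays the role of f.
toy_dynamics = lambda z, a: a
toy_quantize = lambda z: [float(round(x)) for x in z]
value = consistency_objective(
    z0=[1.0, 0.0],
    actions=[[0.0, 1.0], [1.0, 0.0]],
    targets=[[1.0, 1.0], [2.0, 1.0]],
    dynamics=toy_dynamics,
    quantize=toy_quantize,
)
print(value)
```

In this toy rollout the predictions coincide with the targets at both steps, so the value is close to $1+\gamma_{\text{rep}}$.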

Quantization

Motivated by preventing dimensional collapse, we quantize our latent space following the approach from Finite Scalar Quantization (FSQ, [16]). Their important observation is that carefully bounding each dimension gives rise to an implicit codebook $\mathcal{C}$ of a chosen size $|\mathcal{C}|$. Having requested a $d\text{-dimensional}$ latent space, iQRL configures the encoder to output $c$ channels per dimension such that the representation from the encoder ${\bm{x}}=e_{\theta}({\bm{o}})$ and the output of the dynamics model $\hat{{\bm{x}}}={{\bm{z}}}+d_{\phi}({\bm{z}},{\bm{a}})$ are both in $\mathbb{R}^{d\times c}$. To quantize ${\bm{x}}$ (and $\hat{{\bm{x}}}$) into a finite set of codewords, we first apply a bounding function $f(\cdot)$ and then we round to integers. Let us consider a single dimension $j$ of the encoder’s output ${\bm{v}}=[{\bm{x}}]_{j,:}\in\mathbb{R}^{c}$, which consists of $c$ channels, and demonstrate how it is quantized. We follow FSQ and choose $f(\cdot)$ such that each entry in $\tilde{{\bm{v}}}=\mathrm{round}(f({\bm{v}}))$ takes one of $L_{i}$ unique values,

$\displaystyle f:{\bm{v}}\rightarrow\lfloor L_{i}/2\rfloor\tanh({\bm{v}}),$	(6)

where $L_{i}$ is a hyperparameter for channel $i$, specified as FSQ levels $\mathcal{L}=\{L_{1},\ldots,L_{c}\}$. This gives an entry in our codebook $\tilde{\bm{v}}\in\mathcal{C}$, where the implied codebook is given by the product of these per-channel codebook sets. The vectors in $\mathcal{C}$ can be enumerated, giving a bijection from any $\tilde{{\bm{v}}}$ to an integer in $\{1,2,\ldots,\prod_{i=1}^{c}L_{i}\}$ (that is, $\{1,2,\ldots,L^{c}\}$ when every channel uses the same number of levels $L$). As an example, in some of our experiments, we used $d=512$ latent dimensions each with $c=2$ channels consisting of 8 levels, i.e. we used FSQ levels $\mathcal{L}=\{L_{1}=8,L_{2}=8\}$. This corresponds to a codebook of size $|\mathcal{C}|=\prod_{i=1}^{c}L_{i}=8\times 8=64=2^{6}$ for each dimension.
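The implicit codebook and its enumeration can be sketched in a few lines. The sketch below is an illustration under assumptions, not the authors' or the FSQ reference implementation: instead of the symmetric integer values of Eq. 6, it rescales $\tanh$ so that each channel is mapped directly to a level index in $\{0,\ldots,L_{i}-1\}$, a choice made here so that even and odd $L_{i}$ both give exactly $L_{i}$ values. The enumeration is 0-based and the function names are hypothetical.

```python
import math


def quantize_channels(v, levels):
    """Map each channel value to an integer level index in {0, ..., L_i - 1}."""
    return [
        round((math.tanh(x) + 1.0) / 2.0 * (num_levels - 1))
        for x, num_levels in zip(v, levels)
    ]


def codeword_index(level_ids, levels):
    """Mixed-radix enumeration of level tuples: a bijection onto {0, ..., prod(levels) - 1}."""
    index, base = 0, 1
    for level, num_levels in zip(level_ids, levels):
        index += level * base
        base *= num_levels
    return index


levels = [8, 8]                       # FSQ levels for c = 2 channels
codebook_size = math.prod(levels)     # 8 * 8 = 64 codewords per latent dimension
level_ids = quantize_channels([0.3, -2.0], levels)
print(codebook_size, level_ids, codeword_index(level_ids, levels))
```

No codebook is ever stored: the set of codewords is implied by the per-channel levels, which is what makes the codebook implicit.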

Note that this quantization requires a round operation. As such, to propagate gradients through the round operation we use straight-through gradient estimation (STE). This is easily accomplished in deep learning libraries using stop gradient $\mathrm{sg}$ as $\mathrm{round\_ste}(x):x\rightarrow x+\mathrm{sg}(\mathrm{round}(x)-x)$ . FSQ has the following hyperparameters: we must specify the number of channels $c$ and the number of levels per channel $\mathcal{L}=\{L_{1},\ldots,L_{c}\}$ . Table 1 shows the recommended number of channels and number of levels per channel to obtain codebooks of different sizes [16].
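The effect of the straight-through estimator can be checked with a short worked derivation. It assumes the usual semantics of the stop-gradient operator, namely that $\mathrm{sg}$ is the identity in the forward pass and has zero derivative in the backward pass.

```latex
\begin{align*}
\text{forward:}\quad
  & \mathrm{round\_ste}(x) = x + \mathrm{sg}\big(\mathrm{round}(x) - x\big) = \mathrm{round}(x), \\
\text{backward:}\quad
  & \frac{\partial\,\mathrm{round\_ste}(x)}{\partial x}
    = 1 + \underbrace{\frac{\partial\,\mathrm{sg}\big(\mathrm{round}(x) - x\big)}{\partial x}}_{=\,0}
    = 1 .
\end{align*}
```

The forward pass therefore returns the rounded value, while gradients flow through as if rounding were the identity.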

Table 1: FSQ levels $\mathcal{L}$ to approximate different codebook sizes $|\mathcal{C}|$

Target size $|\mathcal{C}|$	$2^{4}$	$2^{6}$	$2^{8}$	$2^{9}$	$2^{10}$
Proposed $\mathcal{L}$	$\{5,3\}$	$\{8,8\}$	$\{8,6,5\}$	$\{8,8,8\}$	$\{8,5,5,5\}$

In practice, we found codebooks of size $|\mathcal{C}|=2^{6}$ sufficient for all environments in the DeepMind Control suite. However, for more complex environments we hypothesize that larger codebooks will be required.
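The relation between the proposed levels in Table 1 and the target codebook sizes is simple arithmetic: the actual size is the product of the levels, which matches the target power of two exactly in some cases and approximately in others. The snippet below only reproduces this arithmetic; the names `PROPOSED_LEVELS` and `codebook_sizes` are hypothetical.

```python
import math

# Keys are the exponents of the target sizes 2^k from Table 1.
PROPOSED_LEVELS = {
    4: [5, 3],
    6: [8, 8],
    8: [8, 6, 5],
    9: [8, 8, 8],
    10: [8, 5, 5, 5],
}


def codebook_sizes(proposed):
    """Return the actual codebook size (product of levels) for each target exponent."""
    return {exponent: math.prod(levels) for exponent, levels in proposed.items()}


for exponent, size in codebook_sizes(PROPOSED_LEVELS).items():
    print(f"target 2^{exponent} = {2 ** exponent}, actual = {size}")
```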

Model-free reinforcement learning

We learn the policy (actor) and action-value function (critic) using TD3 [15]. However, we follow Yarats et al. [4], Zhao et al. [7] and augment the loss with $n\text{-step}$ returns. The only difference to TD3 is that instead of using the original observations ${\bm{o}}_{t}$ , we map them through the online encoder ${\bm{z}}_{t}=f(e_{{\theta}}({\bm{o}}_{t}))$ and learn the actor/critic in the quantized latent space ${\bm{z}}_{t}$ . The critic is then updated by minimizing the following objective:

	$\displaystyle\mathcal{L}_{q}(\psi;\tau)$	$\displaystyle=\mathbb{E}_{\tau\sim\mathcal{D}}\left[\textstyle\sum_{k=1}^{2}(q_{\psi_{k}}(f(e_{\theta}({\bm{o}}_{t})),{\bm{a}}_{t})-y)^{2}\right],\quad\forall k\in\{1,2\}$		(7)
	$\displaystyle y$	$\displaystyle=\sum_{n=0}^{N-1}r_{t+n}+\gamma^{n}\min_{k\in\{1,2\}}q_{\bar{\psi}_{k}}(e_{\theta}({\bm{o}}_{t+n+1}),{\bm{a}}_{t+n+1}),\quad\text{with}\ {\bm{a}}_{t+n}=\pi_{\bar{\eta}}({\bm{z}}_{t+n})+\epsilon_{t+n},$

where we use policy smoothing by adding clipped Gaussian noise $\epsilon_{t+n}\sim\text{clip}\left(\mathcal{N}(0,\sigma^{2}),-c,c\right)$ to the action ${\bm{a}}_{t+n}=\pi_{\bar{\eta}}({\bm{z}}_{t+n})+\epsilon_{t+n}$ . Note that we use the online encoder to get the latent states in both the prediction and the target. We then use the target action-value functions $\mathbf{q}_{\bar{\psi}}$ and the target policy $\pi_{\bar{\eta}}$ to calculate the TD target. Following TD3, we learn the actor’s parameters by minimizing
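The ingredients of the TD target (clipped smoothing noise, the minimum over the two target critics, and multi-step rewards) can be sketched as follows. This is a hedged illustration that assumes the conventional form of an $n$-step return, i.e. discounted rewards over $N$ steps plus a bootstrap term discounted by $\gamma^{N}$; it may differ in detail from the authors' implementation, and the function names and example values are hypothetical.

```python
import random


def clipped_noise(sigma, clip, rng):
    """Sample Gaussian noise with std sigma and clip it to [-clip, clip]."""
    return max(-clip, min(clip, rng.gauss(0.0, sigma)))


def n_step_target(rewards, q1_next, q2_next, gamma):
    """Conventional n-step TD target with clipped double Q-learning.

    rewards:          r_t, ..., r_{t+N-1}
    q1_next, q2_next: target-critic values at the latent state N steps ahead,
                      evaluated at the smoothed target action
    """
    n = len(rewards)
    discounted_rewards = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted_rewards + (gamma ** n) * min(q1_next, q2_next)


rng = random.Random(0)
epsilon = clipped_noise(sigma=0.2, clip=0.3, rng=rng)  # policy-smoothing noise
y = n_step_target([1.0, 1.0, 1.0], q1_next=10.0, q2_next=12.0, gamma=0.5)
print(epsilon, y)
```

Taking the minimum of the two target critics is what counteracts overestimation: the more pessimistic estimate is always used for bootstrapping.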

$\displaystyle\mathcal{L}_{\pi}(\eta;\tau)=-\mathbb{E}_{{\bm{o}}_{t}\sim\mathcal{D}}\bigg[\min_{k\in\{1,2\}}q_{\psi_{k}}(\underbrace{f(e_{\theta}({\bm{o}}_{t}))}_{{\bm{z}}_{t}},\pi_{\eta}(f(e_{\theta}({\bm{o}}_{t}))))\bigg].$	(8)

That is, we maximize the Q-value using the clipped double Q-learning trick to combat overestimation in Q-learning. Note that we do not use the momentum encoder in the actor/critic objectives. In our experiments, using the momentum encoder resulted in worse performance.

Whilst our method shares similarities with TCRL [7], it is important to note that our transition model does not predict the reward. Instead, iQRL leverages quantization to help alleviate representation collapse, and, as a result, learns a task-agnostic representation.

5 Experiments

In this section, we evaluate iQRL in a variety of continuous control tasks from the DeepMind Control (DMC) Suite [47]. We aim to answer the following questions:

1. How does iQRL compare to state-of-the-art model-free RL algorithms, especially in the hard DMC tasks?
2. Does our FSQ-based quantization help combat representation and dimensional collapse?
3. Is learning a representation with only latent-state consistency really better than including reward predictions?
4. What impact does reconstruction loss have on the performance of iQRL?

iQRL is simple, fast, and performant

We compare iQRL to the model-free baseline Twin Delayed DDPG (TD3, [15]), and the representation learning-based RL methods Temporal Consistency Reinforcement Learning (TCRL, [7]), TD7 ([6]), and Temporal Action-driven Contrastive Learning (TACO, [5]). In Fig. 2, we evaluate sample efficiency by plotting the average performance of the algorithms across 20 DeepMind Control Suite tasks as a function of environment steps. We see that, on average, iQRL outperforms the baselines and shows significant advantages in many environments. We outperform TCRL, which is the most similar baseline to our work. Furthermore, TD3 is noncompetitive with iQRL, highlighting the importance of representation learning in state-based reinforcement learning. For complete results on all 20 tasks, see Fig. 11 in Appendix G. For more details of the tasks on which the algorithms were evaluated, see Appendix C. For more details of the baselines used in our work and how we implemented them, see Appendix B.

High-dimensional control

Many tasks in the DeepMind Control Suite are particularly high-dimensional. For instance, the observation space of the Dog tasks is $\mathcal{O}\in\mathbb{R}^{223}$ and the action space is $\mathcal{A}\in\mathbb{R}^{38}$, and for Humanoid, the observation space is $\mathcal{O}\in\mathbb{R}^{67}$ and the action space is $\mathcal{A}\in\mathbb{R}^{24}$. Fig. 2 and Fig. 11 show that iQRL excels in the high-dimensional Dog and Humanoid environments when compared to the baselines. We hypothesize that our discretized representations are particularly beneficial for simplifying learning the transition dynamics in high-dimensional spaces, making iQRL highly sample efficient in these tasks.

iQRL does not suffer from rank collapse

We examine the behaviour of adding quantization to our MLP encoder during training. Following Ni et al. [36], we estimate the rank of the linear operator associated with the MLP encoder by calculating the matrix rank of the latent states for a batch of inputs. (The rank of an $m\times n$ matrix ${\bm{A}}$ is the dimension of the image of the mapping $g:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}$ with $g({\bm{x}})={\bm{A}}{\bm{x}}$.) We ensure full rank at the start of training by orthogonally initializing the MLP encoders. Fig. 3 shows the orthogonality-preserving effect of our quantization scheme as the matrix rank stays close to the maximum. Without quantization, dimensional collapse occurs, which can have significant harmful effects as the representational power of the latent state diminishes [10]. Correspondingly, in three of the four environments, removing the quantization degrades the sample efficiency of iQRL, and in Dog Run, the algorithm completely fails to learn to solve the task without the quantization.
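The rank diagnostic described above can be illustrated on toy batches of latent vectors. The sketch below uses plain Gaussian elimination with a tolerance as a stand-in for whatever rank routine is used in practice (a numerically robust SVD-based routine from a linear algebra library would normally be preferred); the function name `matrix_rank` and the tolerance are assumptions made for illustration.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a matrix (list of rows) via Gaussian elimination with partial pivoting."""
    m = [[float(x) for x in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot: this column adds no new direction
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


# Each row is the latent vector of one observation in the batch.
complete_collapse = [[1.0, 2.0, 3.0]] * 3                   # constant encoder output
dimensional_collapse = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]    # latents lie in a plane
full_rank = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(matrix_rank(complete_collapse), matrix_rank(dimensional_collapse), matrix_rank(full_rank))
```

A completely collapsed representation (Definition 3.1) yields a batch of identical rows, while dimensional collapse (Definition 3.2) yields a rank strictly below the latent dimension.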

Reward prediction is not necessary for representation learning

Unlike prior methods such as TCRL [7], TD-MPC [12], Dreamer-V2 [48] and TACO [49], our representation learning loss (Eq. 5) does not include a term for learning to predict the reward or the value of the latent state. Instead, we rely solely on the self-predictive temporal consistency loss. To analyze the impact of not including the reward prediction term, we compare our method to a variant of our method, where we have included a reward prediction head similar to that of Zhao et al. [7]. Formally, we define a reward head as $\hat{r}_{t}=g_{\xi}({\bm{z}}_{t},{\bm{a}}_{t})$ (see also Eq. 2), and include a reward prediction term (discounted MSE loss) in the representation loss:

$\displaystyle\mathcal{L}_{\text{rew}}=\sum_{h=0}^{H-1}\gamma_{\text{rep}}^{h}\|\hat{r}_{t+h}-r_{t+h}\|_{2}^{2},$

where $r_{t+h}$ is the ground-truth H-step reward and $\hat{r}_{t+h}$ is the predicted H-step reward.

The results for this ablation study are shown in Fig. 4. The plots show that our method, iQRL, without a reward prediction term in the loss, has equal or superior performance to the variant with a reward prediction term except in Dog Run. Our results imply that learning to predict the reward is not necessary for learning a suitable latent representation. The upside of not including the reward prediction head is that it makes the representation task-agnostic, which we believe to be important for downstream applications such as speeding up learning in an incremental multi-task setting in the same domain.

We also evaluated whether including a reward head alone without our FSQ-based normalization scheme is sufficient for preserving the rank of the latent representation and found that iQRL with a reward prediction head but without FSQ suffers from poor performance and dimensional collapse. Therefore, the reward prediction head is not a substitute for our quantization. For more details of this experiment, see Appendix F.

Reconstruction loss has a detrimental impact

Learning to minimize the observation reconstruction error has been widely applied in model-based RL [1, 50, 51], and an observation decoder has been a component of many of the most successful RL algorithms to date [52]. However, recent work in representation learning for RL [7] and model-based RL [12] has shown that incorporating a reconstruction term into the representation loss can hurt the performance, as learning to reconstruct the observations is inefficient due to the observations containing irrelevant details and visuals like shading that are uncontrollable by the agent and do not affect the tasks.

To provide a thorough analysis of iQRL, we include results where we add a reconstruction term to our representation loss in Eq. 5:

$\displaystyle\mathcal{L}_{{\bm{o}}}=\mathbb{E}_{{\bm{o}}_{t}\sim\mathcal{D}}[\|\hat{{\bm{o}}}_{t}-{\bm{o}}_{t}\|_{2}^{2}],\quad\hat{{\bm{o}}}_{t}=h_{\kappa}({\bm{z}}_{t}),$	(9)

where $h_{\kappa}$ is a learned observation decoder that takes the latent state as the input and outputs the reconstructed observation. The decoder $h_{\kappa}$ is a standard MLP. We perform reconstruction at each time step in the horizon. The results in Fig. 5 show that in no environments does reconstruction aid learning, and in some tasks, such as the difficult Dog Run and Humanoid Walk tasks, including the reconstruction term has a significant detrimental effect on the performance, and can even prevent learning completely. Our results support the observations of Zhao et al. [7] and Hansen et al. [12] about the lack of need for a reconstruction target in continuous control tasks.

Projection head

Wen and Li [40] and Schwarzer et al. [8] investigated the role of a learnable projection head in non-contrastive self-supervised learning and found that it helps RL algorithms learn more diversified and therefore, superior representations. Whilst iQRL shares similarities with SPR [8], in particular, a temporal consistency loss using cosine similarity, it differs in that it does not use a learnable projection head and quantizes the representation instead. In Fig. 10, we show the impact of adding a projection head to iQRL. It shows that the projection head decreases the sample efficiency of iQRL. Whilst projection heads are effective for learning representations from images, our results suggest that they have a significant negative impact on sample efficiency when learning representations of state-based observations, reaffirming that state-based RL has a different set of challenges to image-based RL and techniques designed to combat representation collapse are not always transferable between the settings.

Stop gradient

Ni et al. [36] proved that using stop gradients should suffice for preventing representation collapse. However, their experiments suggested that using an EMA encoder improves performance over simply using stop gradients. In Fig. 9, we show how replacing iQRL’s EMA encoder with a stop gradient operation can have a negative impact on performance. For example, using stop gradient in the Acrobot Swingup task results in the agent struggling to solve the task.

Codebook size

In Appendix D we evaluate how the size of the codebook $|\mathcal{C}|$ influences the performance of the agent. The relationship between the size of the codebook and the proportion of active codes is intuitive: the smaller the codebook, the larger the active proportion. The best codebook size varies between environments, but the rank of the representation appears to be preserved for all codebook sizes.

Latent dimension

As a final experiment, we evaluate how the dimension of the latent space $d$ impacts iQRL’s performance. We find that iQRL is fairly robust to different latent dimensions, and that a latent dimension of $d=1024$ with FSQ levels $\mathcal{L}=\{8,8\}$, which corresponds to a codebook size $|\mathcal{C}|=2^{6}$, performs best in the harder DMC tasks. See Appendix E for more details.

6 Conclusion

We have presented iQRL, a technique for learning representations using only a self-supervised temporal consistency loss, which demonstrates strong performance in continuous control tasks, including the complex DMC Humanoid and Dog tasks. Our quantization of the latent space empirically preserves the representation’s matrix rank, indicating that it alleviates representation and dimensional collapse. Our experiments further demonstrate that iQRL is extremely sample efficient whilst being fast to train, which we believe is a strong selling point. Importantly, our method is (i) straightforward, (ii) compatible with any model-free RL algorithm, and (iii) learns a task-agnostic representation.

Limitations and future work

Given that iQRL learns a task-agnostic representation, exploring its use for multi-task RL is an exciting direction for future work. Can iQRL learn a single representation which is shared across a wide variety of tasks? In this paper, we have only evaluated iQRL in deterministic environments, so extending iQRL to stochastic environments is another important direction for future work.

Acknowledgments and Disclosure of Funding

AJS and KK were supported by the Research Council of Finland from the Flagship program: Finnish Center for Artificial Intelligence (FCAI). YZ is funded by the Research Council of Finland (grant id 345521) and MN is funded by Business Finland (BIOND4.0 - Data Driven Control for Bioprocesses). AHS acknowledges funding from the Research Council of Finland (grant id 339730) and JP acknowledges funding from the Research Council of Finland (grant ids 345521 and 353198). We acknowledge CSC – IT Center for Science, Finland, for awarding this project access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CSC (Finland) and the LUMI consortium through CSC. We acknowledge the computational resources provided by the Aalto Science-IT project.

References

Sutton and Barto [2018] R.S. Sutton and A.G. Barto. Reinforcement Learning, Second Edition: An Introduction. Adaptive Computation and Machine Learning Series. MIT Press, 2018.
Laskin et al. [2020] Michael Laskin, Aravind Srinivas, and Pieter Abbeel. CURL: Contrastive Unsupervised Representations for Reinforcement Learning. In Proceedings of the 37th International Conference on Machine Learning, pages 5639–5650. PMLR, November 2020.
Yarats et al. [2020] Denis Yarats, Ilya Kostrikov, and Rob Fergus. Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels. In International Conference on Learning Representations, October 2020.
Yarats et al. [2021a] Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning. In International Conference on Learning Representations, October 2021a.
Zheng et al. [2023] Ruijie Zheng, Xiyao Wang, Yanchao Sun, Shuang Ma, Jieyu Zhao, Huazhe Xu, Hal Daumé III, and Furong Huang. TACO: Temporal Latent Action-Driven Contrastive Loss for Visual Reinforcement Learning. In Advances in Neural Information Processing Systems, volume 36, pages 48203–48225, December 2023.
Fujimoto et al. [2023] Scott Fujimoto, Wei-Di Chang, Edward Smith, Shixiang (Shane) Gu, Doina Precup, and David Meger. For SALE: State-Action Representation Learning for Deep Reinforcement Learning. Advances in Neural Information Processing Systems, 36:61573–61624, December 2023.
Zhao et al. [2023] Yi Zhao, Wenshuai Zhao, Rinu Boney, Juho Kannala, and Joni Pajarinen. Simplified Temporal Consistency Reinforcement Learning. In Proceedings of the 40th International Conference on Machine Learning, pages 42227–42246. PMLR, July 2023.
Schwarzer et al. [2020a] Max Schwarzer, Ankesh Anand, Rishab Goel, R. Devon Hjelm, Aaron Courville, and Philip Bachman. Data-Efficient Reinforcement Learning with Self-Predictive Representations. In International Conference on Learning Representations, October 2020a.
Anand et al. [2019] Ankesh Anand, Evan Racah, Sherjil Ozair, Yoshua Bengio, Marc-Alexandre Côté, and R Devon Hjelm. Unsupervised State Representation Learning in Atari. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
Jing et al. [2021] Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding Dimensional Collapse in Contrastive Self-supervised Learning. In International Conference on Learning Representations, October 2021.
Zhang et al. [2020] Amy Zhang, Rowan Thomas McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning Invariant Representations for Reinforcement Learning without Reconstruction. In International Conference on Learning Representations, October 2020.
Hansen et al. [2022] Nicklas A. Hansen, Hao Su, and Xiaolong Wang. Temporal Difference Learning for Model Predictive Control. In Proceedings of the 39th International Conference on Machine Learning, pages 8387–8406. PMLR, June 2022.
Gelada et al. [2019] Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G. Bellemare. DeepMDP: Learning Continuous Latent Space Models for Representation Learning. In Proceedings of the 36th International Conference on Machine Learning, pages 2170–2179. PMLR, May 2019.
Rezaei-Shoshtari et al. [2022] Sahand Rezaei-Shoshtari, Rosie Zhao, Prakash Panangaden, David Meger, and Doina Precup. Continuous MDP Homomorphisms and Homomorphic Policy Gradient. In Advances in Neural Information Processing Systems, volume 35, pages 20189–20204, December 2022.
Fujimoto et al. [2018] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning, pages 1587–1596. PMLR, July 2018.
Mentzer et al. [2023] Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite Scalar Quantization: VQ-VAE Made Simple, September 2023.
Abel et al. [2016] David Abel, David Hershkowitz, and Michael Littman. Near Optimal Behavior via Approximate State Abstraction. In Proceedings of The 33rd International Conference on Machine Learning, pages 2915–2923. PMLR, June 2016.
Mannor et al. [2004] Shie Mannor, Ishai Menache, Amit Hoze, and Uri Klein. Dynamic abstraction in reinforcement learning via clustering. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML ’04, page 71, New York, NY, USA, July 2004. Association for Computing Machinery.
Li et al. [2006] Lihong Li, Thomas Walsh, and Michael Littman. Towards a Unified Theory of State Abstraction for MDPs. In Proceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics, January 2006.
Andre and Russell [2002] David Andre and Stuart J. Russell. State abstraction for programmable reinforcement learning agents. In Eighteenth National Conference on Artificial Intelligence, pages 119–125, USA, July 2002. American Association for Artificial Intelligence.
Dearden and Boutilier [1997] Richard Dearden and Craig Boutilier. Abstraction and approximate decision-theoretic planning. Artificial Intelligence, 89(1):219–283, January 1997.
Singh et al. [1994] Satinder Singh, Tommi Jaakkola, and Michael Jordan. Reinforcement Learning with Soft State Aggregation. In Advances in Neural Information Processing Systems, volume 7. MIT Press, 1994.
Higgins et al. [2018] Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a Definition of Disentangled Representations, December 2018.
van Hoof et al. [2016] Herke van Hoof, Nutan Chen, Maximilian Karl, Patrick van der Smagt, and Jan Peters. Stable reinforcement learning with autoencoders for tactile and visual data. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3928–3934, October 2016.
Watter et al. [2015] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
Ghosh and Bellemare [2020] Dibya Ghosh and Marc G. Bellemare. Representations for Stable Off-Policy Reinforcement Learning. In Proceedings of the 37th International Conference on Machine Learning, pages 3556–3565. PMLR, November 2020.
Kingma and Welling [2014] Diederik P. Kingma and M. Welling. Auto-Encoding Variational Bayes. ICLR, 2014.
Finn et al. [2016] Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 512–519, May 2016.
Higgins et al. [2017] Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving Zero-Shot Transfer in Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning, pages 1480–1490. PMLR, July 2017.
Lange et al. [2012] Sascha Lange, Martin Riedmiller, and Arne Voigtländer. Autonomous reinforcement learning on raw visual input data in a real world application. In The 2012 International Joint Conference on Neural Networks (IJCNN), pages 1–8, June 2012.
Hafner et al. [2019a] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019a.
Rubinstein [1997] Reuven Y Rubinstein. Optimization of computer simulation models with rare events. European Journal of Operational Research, 99(1):89–112, 1997.
Zhang et al. [2018] Amy Zhang, Yuxin Wu, and Joelle Pineau. Natural environment benchmarks for reinforcement learning. arXiv preprint arXiv:1811.06032, 2018.
Zintgraf et al. [2021] Luisa Zintgraf, Sebastian Schulze, Cong Lu, Leo Feng, Maximilian Igl, Kyriacos Shiarlis, Yarin Gal, Katja Hofmann, and Shimon Whiteson. Varibad: Variational bayes-adaptive deep rl via meta-learning. The Journal of Machine Learning Research, 22(1):13198–13236, 2021.
Tomar et al. [2021] Manan Tomar, Utkarsh A Mishra, Amy Zhang, and Matthew E Taylor. Learning representations for pixel-based control: What matters and why? arXiv preprint arXiv:2111.07775, 2021.
Ni et al. [2023] Tianwei Ni, Benjamin Eysenbach, Erfan SeyedSalehi, Michel Ma, Clement Gehring, Aditya Mahajan, and Pierre-Luc Bacon. Bridging State and History Representations: Understanding Self-Predictive RL. In The Twelfth International Conference on Learning Representations, October 2023.
Subramanian et al. [2022] Jayakumar Subramanian, Amit Sinha, Raihan Seraj, and Aditya Mahajan. Approximate information state for approximate planning and reinforcement learning in partially observed systems. The Journal of Machine Learning Research, 23(1):483–565, 2022.
Ye et al. [2021] Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, and Yang Gao. Mastering atari games with limited data. Advances in Neural Information Processing Systems, 34:25476–25488, 2021.
Schwarzer et al. [2020b] Max Schwarzer, Ankesh Anand, Rishab Goel, R Devon Hjelm, Aaron Courville, and Philip Bachman. Data-efficient reinforcement learning with self-predictive representations. arXiv preprint arXiv:2007.05929, 2020b.
Wen and Li [2022] Zixin Wen and Yuanzhi Li. The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning. Advances in Neural Information Processing Systems, 35:24794–24809, December 2022.
Lutter et al. [2021] Michael Lutter, Leonard Hasenclever, Arunkumar Byravan, Gabriel Dulac-Arnold, Piotr Trochim, Nicolas Heess, Josh Merel, and Yuval Tassa. Learning dynamics models for model predictive agents. arXiv preprint arXiv:2109.14311, 2021.
Kostrikov et al. [2020] Ilya Kostrikov, Denis Yarats, and Rob Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649, 2020.
Yarats et al. [2021b] Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021b.
Hansen et al. [2023] Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, Robust World Models for Continuous Control. In The Twelfth International Conference on Learning Representations, October 2023.
Williams et al. [2015] Grady Williams, Andrew Aldrich, and Evangelos Theodorou. Model predictive path integral control using covariance variable importance sampling. arXiv preprint arXiv:1509.01149, 2015.
Bellman [1957] Richard Bellman. A Markovian Decision Process. Journal of Mathematics and Mechanics, 6(5):679–684, 1957.
Tassa et al. [2018] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
Hafner et al. [2022] Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with Discrete World Models. In International Conference on Learning Representations, February 2022.
Zheng et al. [2024] Ruijie Zheng, Xiyao Wang, Yanchao Sun, Shuang Ma, Jieyu Zhao, Huazhe Xu, Hal Daumé III, and Furong Huang. Taco: Temporal latent action-driven contrastive loss for visual reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
Ha and Schmidhuber [2018] David Ha and Jürgen Schmidhuber. Recurrent World Models Facilitate Policy Evolution. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
Hafner et al. [2019b] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning Latent Dynamics for Planning from Pixels. In International Conference on Machine Learning, pages 2555–2565. PMLR, May 2019b.
Hafner et al. [2023] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.
Yarats et al. [2021c] Denis Yarats, Amy Zhang, Ilya Kostrikov, Brandon Amos, Joelle Pineau, and Rob Fergus. Improving Sample Efficiency in Model-Free Reinforcement Learning from Images. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10674–10681, May 2021c.
Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
Kingma and Ba [2017] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], January 2017. Comment: Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015.
Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization, July 2016.

Appendices

Appendix A Implementation Details

Architecture

We implemented iQRL with PyTorch [54] and used the AdamW optimizer [55] for training the models. All components (encoder, dynamics, actor and critic) are implemented as MLPs. Following Hansen et al. [44], all intermediate layers are linear layers followed by LayerNorm [56]. The use of LayerNorm is what makes our base TD3 implementation perform so well. We use Mish activation functions throughout. Below we summarize the iQRL architecture for our base model.

iQRL(
  (fsq): FSQ(
    (project_in): Identity()
    (project_out): Identity()
  )
  (encoder): ModuleDict(
    (state): Sequential(
      (0): NormedLinear(in_features=O, out_features=256, act=Mish)
      (1): Linear(in_features=256, out_features=512)
    )
  )
  (encoder_tar): ModuleDict(
    (state): Sequential(
      (0): NormedLinear(in_features=O, out_features=256, act=Mish)
      (1): Linear(in_features=256, out_features=512)
    )
  )
  (dynamics): Sequential(
    (0): NormedLinear(in_features=512+A, out_features=512, act=Mish)