\coltauthor{\Name{Emma Cramer} \Email{[email protected]}\\
\Name{Jonas Reiher} \Email{[email protected]}\\
\Name{Sebastian Trimpe} \Email{[email protected]}\\
\addr Institute for Data Science in Mechanical Engineering (DSME), RWTH Aachen University\\
Dennewartstraße 27, 52068 Aachen, Germany}

Tracking Object Positions in Reinforcement Learning:
A Metric for Keypoint Detection

Abstract

Reinforcement learning (RL) for robot control typically requires a detailed representation of the environment state, including information about task-relevant objects not directly measurable. Keypoint detectors, such as spatial autoencoders (SAEs), are a common approach to extracting a low-dimensional representation from high-dimensional image data. SAEs aim at spatial features such as object positions, which are often useful representations in robotic RL. However, whether an SAE is actually able to track objects in the scene and thus yields a spatial state representation well suited for RL tasks has rarely been examined due to a lack of established metrics. In this paper, we propose to assess the performance of an SAE instance by measuring how well keypoints track ground truth objects in images. We present a computationally lightweight metric and use it to evaluate common baseline SAE architectures on image data from a simulated robot task. We find that common SAEs differ substantially in their spatial extraction capability. Furthermore, we validate that SAEs that perform well in our metric achieve superior performance when used in downstream RL. Thus, our metric is an effective and lightweight indicator of RL performance before executing expensive RL training. Building on these insights, we identify three key modifications of SAE architectures to improve tracking performance.

keywords:
reinforcement learning, representation learning, autoencoder, keypoint detection

1 Introduction

In real-world control tasks like robotics, successful reinforcement learning (RL) often hinges on a thorough state representation. This necessitates including all task-relevant objects in the scene. This issue is particularly prominent in tasks involving unstructured environments or interactions with numerous objects, where defining the state space without significant prior knowledge is difficult. Image data provides a potential solution, either through direct end-to-end learning of the control signal or by first learning a low-dimensional representation of the high-dimensional data (Bleher et al., 2022; Levine et al., 2016). For practical applications, interpretability in terms of physical quantities is usually advantageous. Spatial autoencoders (SAEs) have been effective in learning low-dimensional representations, expressed as 2D points on the image plane, referred to as keypoints. This latent representation can be used, e.g., as part of the state representation of the RL agent. Figure 1 shows the complete learning pipeline.

While keypoints have led to well-performing RL algorithms (Kulkarni et al., 2019; Ghadirzadeh et al., 2017; Chen et al., 2023), limited research has been conducted on whether SAEs effectively extract positional information of objects in the scene and, if so, how well they do this. Prior work often evaluates SAE architectures indirectly by training a downstream RL agent and evaluating the performance of the full RL pipeline (Qin et al., 2020; Boney et al., 2021; Chen et al., 2023). This approach requires substantial resources for SAE evaluation since RL training is computationally expensive. Further, if the RL performance is weak, it is unclear whether the RL training or the SAE training is at fault. Other works assess SAE performance by its training loss, which lacks insight into the physical meaningfulness of the keypoints. Some works propose a qualitative assessment by examining single keypoints on the image plane (Zhang et al., 2018; Puang et al., 2020). This approach disregards the importance of consistency over trajectories. We argue that keypoints essentially serve as sensor readings and thus how well task-relevant objects are tracked over time needs to be assessed in terms of accuracy and reliability.

Figure 1: SAE extracts 2D positions from images via a spatial bottleneck. The SAE encoder is then integrated into an RL framework to obtain a state representation for immeasurable objects. Panels: (a) SAE trained on reconstruction loss; (b) policy trained to maximize expected return.

This paper proposes a straightforward metric for quantitatively evaluating the extraction of positional information of task-relevant objects in the latent space of SAEs. The proposed metric is applied to (i) train multiple base SAE architectures and compare their tracking performance, (ii) explore various improvements of these architectures, and (iii) learn a suitable RL task using keypoints from different SAEs as the state. The evaluation reveals significant variations in the spatial extraction capability of common SAEs, emphasizing the importance of a thorough evaluation before incorporating them into RL states. Building on these insights, we propose three key modifications to substantially improve the tracking performance of common SAE architectures, resulting in, e.g., a 30 % increase in tracking capability for the commonly used KeyNet architecture (Jakab et al., 2018). We demonstrate that the metric allows us to judge SAEs with regard to capturing physically interpretable positional features and that this metric is a good indicator of downstream RL performance. For the considered robotic manipulation task, SAE training takes approximately an order of magnitude less computational resources than RL training, making our metric an effective and lightweight indicator of RL performance before expensive RL training.

2 Related Work

Various approaches utilize deep neural networks (NNs) to extract state representations from images or videos (Dwibedi et al., 2018; Seo et al., 2022); we review the ones most related to this work.

Unsupervised state representation learning for RL. Applications span from general continuous control (Dwibedi et al., 2018; Hafner et al., 2019) to robotic manipulation (Lesort et al., 2019; Rafailov et al., 2021). Many models follow an autoencoder structure with a low-dimensional bottleneck, optimizing for input reconstruction (Finn et al., 2016; Yarats et al., 2021). Some of these learn world models and recurrently capture environment dynamics in the latent representation (Ha and Schmidhuber, 2018; Seo et al., 2022; Hafner et al., 2023). Generally, autoencoders constrain the dimension, but not what is captured in the latent space (Yarats et al., 2021; Rafailov et al., 2021). In contrast, SAEs are constrained to capture 2D keypoint positions (Finn et al., 2016).

Spatial autoencoder architectures. SAE keypoints have successfully been used as RL state representations in robotic control (Puang et al., 2020; Chen et al., 2023), to play Atari games (Kulkarni et al., 2019), or to provide a goal description (Qin et al., 2020). Central to SAEs is the spatial soft-argmax layer first proposed by Levine et al. (2016) to train an end-to-end deep visuomotor policy. Finn et al. (2016) modified this approach to obtain a standalone deep SAE architecture, consisting of a convolutional encoder and fully connected decoder. Jakab et al. (2018) propose the KeyNet architecture, incorporating a convolutional decoder. These two elements, the encoder-decoder structure and the spatial soft-argmax layer, are essential to all SAEs. Many architectures build upon these blocks, incorporating feature transport mechanisms (Kulkarni et al., 2019), working on error maps (Gopalakrishnan et al., 2021), and reconstructing segmentation masks (Puang et al., 2020) or frame differences (Sun et al., 2022). Recently, SAEs have been extended to learn 3D points (Li et al., 2022; Sun et al., 2023). While these approaches differ in the way they are trained, all aim to represent positional information. We focus our investigation on two of the most common base architectures (Finn et al., 2016; Jakab et al., 2018) as (i) they form the basis for many more complex architectures and (ii) we found that if trained correctly, they can serve as reliable feature extractors. In principle, our evaluation procedure can be applied to all of the above architectures.

Evaluation of SAEs. Typically, SAEs are evaluated indirectly through compute-intensive RL or control performance (Qin et al., 2020; Wang et al., 2022; Boney et al., 2021) or qualitative visual assessments (Zhang et al., 2018; Puang et al., 2020). In general, latent representations can be evaluated via reconstruction loss (Finn et al., 2016), disentanglement measurements (Carbonneau et al., 2022) or mutual information estimates (Rezaabad and Vishwanath, 2020), all of which neglect the spatial 2D keypoint structure and thus do not assess the physical meaningfulness of the features. In the computer vision domain, keypoints for image matching are evaluated by reprojecting from different views with known camera transformation (Zhao et al., 2023), which is not applicable to SAEs. Jakab et al. (2018) approximate labeled ground truth points as linear combinations of all keypoints. Their KeyNet SAE is evaluated with the percentage of these predicted points within a fixed distance from the labels. The same linear combination has been used by others to compute mean errors to ground truth points (Zhang et al., 2018; Lorenz et al., 2019; Sun et al., 2022). Kulkarni et al. (2019) match keypoints to ground truth points via a min-cost assignment and compute precision and recall over trajectories. Although quantitative, the above approaches cannot assess the quality of keypoints over trajectories and allow no statement about whether all task-relevant objects are represented. We find that both aspects are critical for use in control or RL, and our method, described in Section 4, addresses these key limitations of existing evaluation approaches.

3 Problem Setting

We consider the general structure of an autoencoder $I\xrightarrow{h_{\mathrm{enc},\phi}}z\xrightarrow{h_{\mathrm{dec},\psi}}\hat{I}$, operating on an input image $I$, which is mapped to a latent representation $z$ via an NN encoder $h_{\mathrm{enc},\phi}$ and then back to a reconstructed image $\hat{I}$ via an NN decoder $h_{\mathrm{dec},\psi}$ (cf. Figure 1). Typically, the autoencoder is trained in an unsupervised fashion to minimize reconstruction loss $L(I,\hat{I})=\|I-\hat{I}\|_{2}^{2}$ while restricting the dimension of the latent space with a low dimensional bottleneck. Here, we are particularly interested in spatial autoencoders (SAEs) (Finn et al., 2016), which aim to represent 2D positions of objects in an image as latent variables $z$. For this, the last layer of the encoder $h_{\mathrm{enc},\phi}$ with $N$ outputs is chosen as a soft-argmax layer according to Finn et al. (2016).
This layer ensures that the latent space can be interpreted as $N$ keypoints in the image plane with $z=(z_{1},z_{2},\dots,z_{N})\in\mathbb{R}^{2\times N}$. For this, the feature maps $M\in\mathbb{R}^{H\times W\times N}$ of the last convolutional encoder layer are passed through a channel-wise softmax layer $s_{hwn}=\exp(m_{hwn}/\alpha)/\sum_{h',w'}\exp(m_{h'w'n}/\alpha)$, where $\alpha$ is a learned temperature parameter and $h$, $w$, and $n$ are indices along the height, width, and depth dimensions of $M$.
Then the $n$-th 2D point of maximum activation is computed as $z_{n}=(\sum_{h,w}h\cdot s_{hwn},\sum_{h,w}w\cdot s_{hwn})$.
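As an illustration of the soft-argmax layer described above, the following minimal sketch computes one keypoint from a single feature map in plain Python. The function name `spatial_soft_argmax`, the zero-based pixel indices, and the fixed (rather than learned) temperature `alpha` are assumptions made for this example and are not taken from the paper's implementation.

```python
import math


def spatial_soft_argmax(feature_map, alpha=1.0):
    """Return the expected (h, w) position of one H x W feature map.

    feature_map: list of H rows, each a list of W activations m_{hw}.
    alpha: softmax temperature (learned in an SAE, fixed here by assumption).
    """
    # Subtracting the maximum before exponentiating improves numerical
    # stability and leaves the softmax unchanged.
    m_max = max(max(row) for row in feature_map)
    weights = [[math.exp((m - m_max) / alpha) for m in row] for row in feature_map]
    total = sum(sum(row) for row in weights)

    # Expected row and column index under the softmax distribution s_{hw}.
    z_h = sum(h * s for h, row in enumerate(weights) for s in row) / total
    z_w = sum(w * s for row in weights for w, s in enumerate(row)) / total
    return z_h, z_w
```

For a feature map with a single dominant activation, the returned point lies close to the location of that activation; for a flat map, it falls at the image center, which is one reason a keypoint near the center may carry little information.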

We consider a setup with $K$ rigid objects that shall be tracked. Let the ground truth position of the $k$-th object (e.g., its center of mass) in the 2D image space be given by $x_{k}\in\mathbb{R}^{2}$, and the positions of all objects collectively by $x=(x_{1},\dots,x_{K})\in\mathbb{R}^{2\times K}$. An ideal SAE should track $x$ with its latent representation $z$ in some sense. However, how to evaluate the tracking performance is unclear, and proposing a method that quantifies this is our main objective:

Problem 1

We seek to quantify how well the keypoints $z$ represent the ground truth objects $x$.

SAEs are often used as feature extractors for RL tasks, where keypoints $z$ are then part of the state representation. In RL, an agent learns to optimize an objective through interaction with an environment (Sutton and Barto, 1998). The environment is represented as a discounted Markov decision process (MDP) defined by the tuple $(\mathcal{S},\mathcal{A},p,r,\rho_{0},\gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, the typically unknown transition probability distribution $p:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$, the reward function $r:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, the distribution of the initial state $\rho_{0}(s_{0}):\mathcal{S}\rightarrow\mathbb{R}$, and the discount rate $\gamma\in(0,1)$. A policy $\pi:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ selects an action with a certain probability for a given state.
The agent interacts with the MDP to collect episodes $\tau=(s^{0},a^{0},r^{1},s^{1},\dots,s^{T})$, which are sequences of states, actions, and rewards over time steps $t=0,\dots,T$. The usual objective in RL is to find the policy $\pi$ that maximizes the expected return $J(\pi)=\mathbb{E}_{\tau}[\,\sum_{t=1}^{T}\gamma^{t}r^{t}\,]$, where the expectation is over trajectories $\tau$ under the policy $\pi$. The general understanding in literature (Finn et al., 2016; Ghadirzadeh et al., 2017; Kulkarni et al., 2019; Wang et al., 2022) is that well-tracking SAEs will yield better RL performance, such as higher expected return $J(\pi)$ or episode success rates. We investigate whether this holds true for the metric proposed for Problem 1:

Problem 2

Is SAE performance (according to Problem 1) an indicator for RL performance?

If this hypothesis holds true, SAE performance can be evaluated before actual RL training, usually at significantly lower computational cost.
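To make the RL objective from this section concrete, the sketch below estimates the expected return $J(\pi)$ by averaging discounted returns over sampled episodes. The helper names `discounted_return` and `estimate_objective` are hypothetical, and the Monte Carlo average is a standard estimator chosen for illustration, not a description of the training procedure used in this work.

```python
def discounted_return(rewards, gamma):
    """Sum gamma^t * r^t for t = 1..T, where rewards = [r^1, ..., r^T]."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))


def estimate_objective(episodes, gamma):
    """Monte Carlo estimate of J(pi): mean discounted return over episodes.

    episodes: non-empty list of reward sequences collected under one policy.
    """
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)
```

Obtaining such an estimate requires rolling out a trained policy, which is the expensive step that an SAE-level metric aims to avoid.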

4 A Metric to Evaluate Keypoints

In this section, we propose a metric to quantify the tracking performance of an SAE, addressing Problem 1. In Section 5.1, we then use the metric for RL to evaluate Problem 2.

As RL makes decisions sequentially over time, we are interested in tracking performance over multiple time steps. Therefore, we denote by $x_{k}^{t}\in\mathbb{R}^{2}$ the ground truth position of the $k$-th object, and by $x^{t}\in\mathbb{R}^{2\times K}$ the positions of all $K$ objects collectively at time $t$. Furthermore, we denote the trajectory of these objects over time steps $t=0,\dots,T-1$ by $x^{\tau}=(x^{0},x^{1},\dots,x^{T-1})\in\mathbb{R}^{2\times K\times T}$. We use analogous notation for the $N$ latent keypoints; that is, $z^{\tau}=(z^{0},z^{1},\dots,z^{T-1})\in\mathbb{R}^{2\times N\times T}$ denotes the trajectory of keypoints.

Given this notion of trajectories, Problem 1 translates to measuring how well the keypoint trajectory $z^{\tau}$ follows the ground truth trajectory $x^{\tau}$ for a given instance of an SAE. A naive approach would be to directly compute the Euclidean distance between point-pairs along these trajectories. However, this will not yield satisfactory results. The keypoints are learned in an unsupervised fashion, which provides no guarantee about which part of an object is tracked. As points on a rigid object have a fixed relation to each other, it is reasonable to assume that, for downstream RL training, any point on the object is an equally suitable representation. For example, if a ground truth point and a keypoint are on the same object at a constant offset, this offset would accumulate to a tracking error when naively taking the difference between the two points. Thus, we need to account for offsets by an appropriate transformation. Finally, the SAE extracts many keypoints (usually $N>K$) and the association of keypoints to ground truth points is unknown. Taking these together, an evaluation protocol of keypoints will thus require (i) accounting for the offset between any point on the object and ground truth, (ii) associating keypoints with ground truth points, and (iii) developing a quantitative measure to evaluate the capability of tracking all relevant ground truth points.

Figure 2: We consider keypoints (red) to be equally informative about object positions as the ground truth CM (black). Motion of both points results in a varying offset in the image plane. We evaluate with transformed keypoints (blue) minimizing the offset.

Transformation. Keypoints are coordinates in the 2D image space, which are supposed to track objects in 3D space. Often, the center of mass (CM) is taken as the ideal point to represent the 3D position of an object in the world frame. However, for the downstream RL task, the keypoints do not have to track the CM, but any fixed point on the object, i.e., the point's offset from the CM should be constant in the object's 3D frame of reference (cf. Figure 2). If the keypoints were to track the CM, keypoints and ground truth points would coincide in image space. Due to the 3D offset, we also observe an offset in 2D-image space (cf. Figure 2). This 2D offset is generally unknown; it depends on the unknown 3D offset, object position, orientation, and camera view. Even if the keypoints were to track a point on an object perfectly, this offset would falsely suggest a tracking error in 2D. Instead of capturing the full geometry of the problem, which requires additional problem insight, we propose a lightweight approach that eliminates the main offsets between keypoints and ground truth. We consider a time-invariant affine transformation of keypoints $\hat{z}=Az+b$, where $A\in\mathbb{R}^{2\times 2}$ and $b\in\mathbb{R}^{2}$ are fit via ordinary least squares on a held-out test set, containing random time steps from trajectories unseen in SAE training. This transformation can account for the scaling and translation of a keypoint trajectory. We note that even with a time-invariant 3D offset, the 2D offset can be time-variant due to the object's motion; the time invariance thus represents an approximation.
Still, we find that this transformation is easy to compute, requires no additional information about the ground truth objects, and works well in practice (cf. Section 5).
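The affine fit described above can be sketched with the normal equations of ordinary least squares. The example below is a self-contained illustration under the assumption that one map $(A, b)$ is fit for a single keypoint against a single ground truth point; the function names `solve_3x3`, `fit_affine`, and `apply_affine` are hypothetical and do not refer to the authors' code.

```python
def solve_3x3(mat, rhs):
    """Solve mat @ p = rhs by Gaussian elimination with partial pivoting."""
    aug = [list(row) + [r] for row, r in zip(mat, rhs)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("degenerate keypoint data: cannot fit affine map")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in reversed(range(n)):
        acc = aug[r][n] - sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = acc / aug[r][r]
    return sol


def fit_affine(keypoints, targets):
    """Least-squares fit of targets ~= A @ keypoint + b.

    keypoints, targets: equal-length lists of (u, v) pairs.
    Returns (A, b) with A as a 2x2 nested list and b as a length-2 list.
    """
    # Each design row is (u, v, 1); the trailing 1 carries the offset b.
    design = [(u, v, 1.0) for u, v in keypoints]
    gram = [[sum(d[i] * d[j] for d in design) for j in range(3)] for i in range(3)]
    A, b = [], []
    for dim in range(2):  # fit each output coordinate separately
        moment = [sum(d[i] * t[dim] for d, t in zip(design, targets)) for i in range(3)]
        a0, a1, offset = solve_3x3(gram, moment)
        A.append([a0, a1])
        b.append(offset)
    return A, b


def apply_affine(A, b, point):
    """Return the transformed keypoint A @ point + b."""
    u, v = point
    return (A[0][0] * u + A[0][1] * v + b[0],
            A[1][0] * u + A[1][1] * v + b[1])
```

The fit fails with an explicit error when the keypoint samples are collinear or constant, since the affine map is then not identifiable from the data.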

Association and tracking error. Consider the trajectories $x^{\tau}$, $\hat{z}^{\tau}$ of $K$ ground truth objects and $N$ transformed keypoints. We propose an error metric between the trajectory of one ground truth object $x^{\tau}_{k}$ and the transformed trajectory of one keypoint $\hat{z}^{\tau}_{n}$. We define the tracking error $e_{n,k}$ between any two trajectories $n$ and $k$ as

\begin{equation}
e_{n,k}=\sum_{t=0}^{T-1}\lVert\hat{z}^{t}_{n}-x^{t}_{k}\rVert_{2}^{2}. \tag{1}
\end{equation}

The error $e_{n,k}$ is a measure of how well a specific keypoint tracks a ground truth object over time. Using the tracking error, we determine the index of the keypoint $z_{n^{*}_{k}}$ that best tracks object $x_{k}$ as $n^{*}_{k}=\operatorname*{arg\,min}_{n}e_{n,k}$. Once we assigned the most suitable keypoint for each ground truth object, we give the tracking error of the associated keypoint as $e_{n^{*}_{k},k}$. For our evaluation, we always consider the tracking error of the best keypoint. The lower this tracking error, the better the ground truth point $x_{k}$ is represented by the keypoint $z_{n^{*}_{k}}$. The error measure enables a comparison of different SAE architectures and individual training runs of the same architecture broken down into objects.
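The tracking error and the argmin association can be sketched as follows. The code assumes that the transformed keypoint trajectories are already available as lists of 2D points; if the affine map is fit separately for every keypoint-object pair, the pair-specific transformed trajectory would have to be passed instead. The function names `tracking_error` and `associate` are hypothetical.

```python
def tracking_error(z_hat_traj, x_traj):
    """Sum of squared Euclidean distances between two 2D trajectories."""
    return sum((zu - xu) ** 2 + (zv - xv) ** 2
               for (zu, zv), (xu, xv) in zip(z_hat_traj, x_traj))


def associate(z_hat_trajs, x_trajs):
    """For each object k, return (best keypoint index, its tracking error).

    z_hat_trajs: list of N transformed keypoint trajectories.
    x_trajs: list of K ground truth trajectories of the same length T.
    """
    result = []
    for x_traj in x_trajs:
        errors = [tracking_error(z_traj, x_traj) for z_traj in z_hat_trajs]
        n_star = min(range(len(errors)), key=errors.__getitem__)
        result.append((n_star, errors[n_star]))
    return result
```

Note that this association allows one keypoint to be the best match for several objects, which follows from taking the argmin independently per object.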

For the later evaluation of SAEs, we now define indicators for an SAE's overall tracking performance. We classify an object $x_{k}$ as correctly tracked if the tracking error of the most suitable keypoint $z_{n^{*}_{k}}$ is below an application-specific threshold $\mu_{k}>0$. The index set $\mathcal{X}_{c}$ of all correctly tracked objects is given by $\mathcal{X}_{c}=\{k:e_{n^{*}_{k},k}\leq\mu_{k}\}$. We then define the tracking capability $\mathrm{TC}$ of one trained SAE as the percentage of tracked ground truth objects, i.e.,

\begin{equation}
\mathrm{TC}=\left|\mathcal{X}_{c}\right|/K. \tag{2}
\end{equation}

An ideal tracking capability of $\mathrm{TC}=1$ means that for this SAE, the position of all ground truth objects is correctly encoded in the latent space.
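A minimal sketch of the tracking capability follows, taking the best tracking error of each object and one threshold per object as inputs. The function name `tracking_capability` and the example thresholds in the usage note are illustrative assumptions; the thresholds are application specific and not prescribed here.

```python
def tracking_capability(best_errors, thresholds):
    """Fraction of objects whose best tracking error is within its threshold.

    best_errors[k]: tracking error of the best keypoint for object k.
    thresholds[k]: application-specific threshold mu_k > 0.
    """
    if len(best_errors) != len(thresholds):
        raise ValueError("need one threshold per object")
    tracked = [k for k, (e, mu) in enumerate(zip(best_errors, thresholds)) if e <= mu]
    return len(tracked) / len(best_errors)
```

For instance, with best errors `[0.5, 3.0, 1.0]` and a common threshold of `1.0`, two of the three objects count as tracked.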

A quantitative evaluation should consider the distribution of the tracking error and the tracking capability over multiple training runs. We look at the mean, median, and the variance of the tracking error over runs. Similarly, we evaluate the mean tracking capability $\overline{\mathrm{TC}}$ over multiple runs. Intuitively, $\overline{\mathrm{TC}}$ gives the mean percentage of all ground truth objects captured by keypoints. An SAE with a high mean tracking capability is a reliable feature extractor for RL scenarios. For individual ground truth objects, we denote as $\overline{\mathrm{TC}}_{k}$ the mean tracking capability for object $k$.
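The aggregation over training runs can be sketched with Python's `statistics` module. The function name `summarize_runs`, the dictionary layout, and the use of the population variance (`statistics.pvariance`) are assumptions for this example; the text does not specify which variance estimator is used.

```python
import statistics


def summarize_runs(errors_per_run, thresholds):
    """Aggregate best tracking errors over several SAE training runs.

    errors_per_run[r][k]: best tracking error of object k in run r.
    thresholds[k]: threshold mu_k for object k.
    """
    num_objects = len(thresholds)
    # Tracking capability of each run: fraction of objects within threshold.
    tc_per_run = [
        sum(e <= mu for e, mu in zip(run, thresholds)) / num_objects
        for run in errors_per_run
    ]
    summary = {"mean_tc": statistics.mean(tc_per_run), "per_object": []}
    for k in range(num_objects):
        errs = [run[k] for run in errors_per_run]
        hits = [1.0 if e <= thresholds[k] else 0.0 for e in errs]
        summary["per_object"].append({
            "mean_error": statistics.mean(errs),
            "median_error": statistics.median(errs),
            "variance": statistics.pvariance(errs),  # population variance (assumed)
            "mean_tc_k": statistics.mean(hits),
        })
    return summary
```

Reporting the median alongside the mean is useful here because a single failed run can produce a very large tracking error that dominates the mean.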

5 Evaluation

Figure 3: The PandaPush-v3 task with three object positions $x_{k}$ marked (left). Selected ground truth, keypoint, and transformed keypoint trajectories are shown in white, red, and blue for the end effector (middle) and cube (right).

We first use our proposed metrics (1) and (2) to evaluate the tracking performance of base SAE architectures commonly used in RL and propose architecture modifications to improve tracking. We then investigate how tracking performance links to performance in a downstream RL task. The empirical results reveal the following main insights:

  1. The proposed metric is able to quantify the tracking performance of SAEs.

  2. The combination of the best baseline SAE with our proposed modification yields $\overline{\mathrm{TC}}=0.99$, and it can thus be considered a reliable and precise spatial feature extractor.

  3. Our proposed metric for SAE tracking performance is indicative of the performance of RL; that is, the architecture with best SAE metric also achieves best asymptotic return.

  4. The best-found architecture in terms of SAE tracking achieves an RL return comparable to training with ground truth points.

5.1 SAE Evaluation

We demonstrate the suitability of the tracking error and tracking capability introduced in Section 4 to evaluate the performance of SAEs. We provide visualizations of our quantitative results at \href{https://youtu.be/8KqFXQiWa9w}{youtu.be/8KqFXQiWa9w}.

SAE experiment setup. We use the PandaPush-v3 environment from the panda-gym simulation (Gallouédec et al., 2021). The robot’s task is to push a cube toward a target (cf. Figure 3). We identify three task-relevant objects in this environment: (i) the green cube to be moved, (ii) the blue square indicating the target, and (iii) the tip of the end effector. The different sizes and motion behavior of the objects make them a suitable selection to evaluate the tracking performance. We consider three standard SAE architectures and our own combination of modifications:

  1. Basic:

    We design the Basic architecture to be a simple and efficient SAE baseline incorporating the key components that all SAEs typically share. The CNN encoder has six convolutional layers with max-pooling operations in between. The decoder uses KeyNet’s Gaussian kernel maps, followed by three convolutional layers. We look at two versions of this SAE with $N=16$ (Basic) and $N=32$ keypoints (Basic-kp32).

  2. DSAE (Finn et al., 2016):

    DSAE introduced the spatial soft-argmax bottleneck, which is still used in many other architectures (Zhang et al., 2018; Cabi et al., 2019; Gopalakrishnan et al., 2021; Puang et al., 2020; Boney et al., 2021). It was the first SAE to be successfully used for RL training. It captures $N=16$ keypoints between a CNN encoder and a fully connected decoder.

  3. KeyNet (Jakab et al., 2018):

    KeyNet is a widely used SAE architecture that many works build upon (Kulkarni et al., 2019; Minderer et al., 2019; Gopalakrishnan et al., 2021). It consists of a CNN encoder and decoder with $N=30$ keypoints. The input to the decoder consists of $N$ feature maps with isotropic Gaussian kernels at the corresponding keypoint locations.

  4. Vel-std-bg modifications:

    We propose a set of modifications to the above architectures, combining ideas from existing works with new ones. Analogously to DSAE, we add a velocity loss term to the reconstruction loss with a weighting factor $\beta$. By penalizing a change of keypoint velocities in subsequent frame pairs, the velocity loss encourages temporal consistency. KeyNet uses Gaussian heatmaps as input to the first CNN decoder layer. We propose making the standard deviation $\sigma$ of these heatmaps trainable. This enables the decoder to control the radius of influence of a keypoint. Finally, we add a bias with the dimensions of the target image to the decoder’s output, giving the decoder a straightforward way to reconstruct a stationary background and allowing time-varying keypoints to focus on moving objects. We call the KeyNet and Basic architectures combined with our proposed modifications KeyNet-vel-std-bg and Basic-vel-std-bg, respectively.

While many more architectures exist in the literature (cf. Section 2), we deliberately choose baseline architectures that maintain the usual autoencoder setup without auxiliary networks such as adversaries or feature transport mechanisms. We choose modifications which we believe to be beneficial for the main goal of SAEs, spatial tracking of keypoints over time. For SAE tracking evaluation, we conduct 24 training runs with different random seeds. The tracking thresholds need to be chosen heuristically. Here we choose $\mu_{\mathrm{cube}}=\mu_{\mathrm{target}}=0.015$ and $\mu_{\mathrm{eef}}=0.1$. Intuitively, larger objects result in a larger tracking error due to the possible offset to the center of mass. We find a good heuristic to be related to the SAE reconstruction: objects appear in the reconstruction when the tracking error falls below $\mu_k$. Additional modifications, an ablation study, and all experimental details such as hyperparameters can be found in the appendix.

Evaluating accuracy via the tracking error.

Figure 4: Basic-kp32 and KeyNet-vel-std-bg tracking errors $e_{n^*_k,k}$ for $K=3$ objects over epochs.

First, we study the tracking error of individual runs during training. We observe a sudden drop in tracking error whenever the SAE has learned to track an object. To understand this behavior, we look at the tracking error over epochs for an SAE with medium performance, the Basic-kp32, in Figure 4. All runs for Basic-kp32 show the drop in tracking error for the cube, which is the easiest to track. For the target, which is slightly harder to track due to its smaller size and rare movement, only a few runs show the expected drop below our threshold, resulting in a correctly tracked target. Instead of a sharp drop, the tracking error for the end-effector shows a shallow decrease over training epochs. We interpret this observation as follows: The end-effector occupies considerably more pixels in the image than cube and target. Thus, the reconstruction first focuses on these areas, resulting in early vague tracking and reconstruction. However, only a few runs achieve consistent tracking of a point on the end-effector. Looking at the tracking error of KeyNet-vel-std-bg, Figure 4 shows a distinct drop below the threshold for the cube and target. Even for the end-effector, the tracking error consistently falls below the threshold, indicating successful tracking. We find that the tracking error is useful for examining exactly how accurately a trained SAE architecture instance can track individual objects.

Evaluating reliability via the tracking error.

Figure 5: Box plots of the tracking error $e_{n^*_k,k}$ for $K=3$ ground truth objects.

Our results indicate differing tracking performance for random seeds within the same architecture, showing that SAE architectures need to be evaluated over multiple training runs. The tracking error’s distribution over 24 training runs is illustrated in Figure 5. We remark that the tracking error varies among (i) architectures, (ii) random seeds, and (iii) objects. Among the standard architectures, KeyNet attains the lowest mean tracking error and smallest variance, indicative of good overall tracking performance. For DSAE and Basic, we observe larger tracking errors with greater variance, making them less reliable. The KeyNet-vel-std-bg architecture shows the lowest mean tracking error and variance for all three objects. We identify the criteria for well-performing architectures as low mean tracking error and small variance over runs.

Evaluating overall performance via the tracking capability.

Figure 6: Tracking capabilities $\overline{\mathrm{TC}}_k$ for $K=3$ ground truth objects of the baseline architectures.

Figure 6 shows the sum of mean object tracking capabilities $\overline{\mathrm{TC}}_k$ over architectures, further demonstrating the varying tracking performance across SAE architectures. Examining the tracking capability with regard to the individual ground truth objects, we further substantiate our hypothesis that the target and end-effector are more difficult to track than the cube. The tracking performance of all base architectures has potential for improvement, as none is close to the theoretical maximum of 3.0. The combination vel-std-bg yields consistent improvement in tracking capability. KeyNet already tracks cube and target well, with $\overline{\mathrm{TC}}=0.681$. KeyNet-vel-std-bg has a near-perfect mean tracking capability of $\overline{\mathrm{TC}}=0.986$. The biggest change can be seen in end effector tracking, which improved from $\overline{\mathrm{TC}}_{\mathrm{eef}}=0.167$ to $0.958$. We see that the tracking capability is a compact description of how well task-relevant objects are tracked. This information is critical for downstream control and RL tasks.
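The reported values can be related by a short calculation. The sketch below assumes that the overall mean tracking capability is the average of the $K=3$ per-object values, so that the sum of per-object capabilities equals $3\,\overline{\mathrm{TC}}$; this relation is an assumption based on the description of the metric, and the variable names are illustrative.

```python
K = 3  # cube, target, end effector

def tc_sum(tc_mean, k=K):
    """Sum of per-object tracking capabilities implied by the overall mean."""
    return k * tc_mean

keynet_sum = tc_sum(0.681)    # about 2.04 of the maximum 3.0
modified_sum = tc_sum(0.986)  # about 2.96 of the maximum 3.0

# Capability left for cube and target together once the reported
# end-effector values are subtracted (inputs are rounded to three digits).
keynet_cube_target = keynet_sum - 0.167
modified_cube_target = modified_sum - 0.958
```

Under this assumption, cube and target together contribute about 2.0 for KeyNet-vel-std-bg, which would mean that both are tracked in essentially every run, while the gain over KeyNet comes mainly from the end effector.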

Combining the insights from the tracking error and tracking capability answers Problem 1.

5.2 RL Evaluation

We run RL experiments with SAE architectures selected by their tracking performance and find that this is a good indicator of downstream RL performance.

RL experiment setup. For RL experiments with SAEs as state, we randomly sample 5 trained SAEs per architecture and conduct 2 randomly seeded RL training runs with each of them, yielding a total of 10 runs per SAE architecture. We use the SAC (Haarnoja et al., 2018) implementation from stable-baselines3 (Raffin et al., 2021). Hyperparameters are listed in Appendix A.

We consider two types of state representation for RL with SAE-encoded keypoints: (i) latent keypoints as state $s^t=z^t$, (ii) latent keypoints combined with robot 3D position $o_{\mathrm{eef}}$ and velocity $\dot{o}_{\mathrm{eef}}$, giving $s^t_{\mathrm{ext}}=(z^t,o_{\mathrm{eef}},\dot{o}_{\mathrm{eef}})$. The second scenario is relevant since end-effector position and velocity are often available as robot state measurements.
As additional benchmarks, we include state representations with ground truth points $x^t$, which are usually not available in practice, obtaining $s^t=x^t$ and $s^t_{\mathrm{ext}}=(x^t,o_{\mathrm{eef}},\dot{o}_{\mathrm{eef}})$. Finally, we compare to RL runs with the full 3D simulation state, including positions, velocity, and orientation of cube and target. Actions consist of 3D displacements $a^t=(\Delta o_{\mathrm{eef}})^t$ of the end effector at every time step. We use a sparse reward with $r_t=-1$ and $r_T=0$ on episode success. Following Agarwal et al. (2021), we report interquartile mean (IQM) success rates with bootstrapped 95 % percentile confidence intervals.
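The reporting scheme can be illustrated with a dependency-free sketch of the IQM and a plain percentile bootstrap. This is not the evaluation code used for the paper; the resampling scheme (simple resampling over runs), the number of resamples, and the example success rates are assumptions made for illustration.

```python
import random
from statistics import mean

def iqm(values):
    """Interquartile mean: average of the middle 50% of the sorted values."""
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))  # number of values trimmed at each end
    return mean(ordered[cut:len(ordered) - cut])

def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval of the IQM."""
    rng = random.Random(seed)
    stats = sorted(
        iqm(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = int((1.0 + level) / 2.0 * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]

# Hypothetical final success rates of 10 RL runs; not measured results.
success = [0.0, 0.6, 0.7, 0.8, 0.8, 0.9, 0.9, 1.0, 1.0, 1.0]
point = iqm(success)
low, high = bootstrap_ci(success)
```

Because the IQM discards the lowest and highest quarter of the runs, a single failed run such as the 0.0 entry above does not affect the point estimate, which is one reason this statistic is suited to small numbers of RL runs.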

Reinforcement learning with keypoints. Figure 7(a) shows the RL performance with state representation $s^t$. We observe varying success rates depending on the SAE architecture, and the gradations in RL performance follow the order of SAEs by tracking capability, as seen in Figure 6. The DSAE architecture shows no RL progress. Although both Basic-vel-std-bg and Basic-kp32 have similar total tracking capabilities (cf. Fig. 6), the former performs better on the RL task. This is due to its ability to track the end-effector reasonably well, while Basic-kp32 tracks the target instead. End-effector tracking is critical, as moving the cube is otherwise impossible. The best-tracking KeyNet-vel-std-bg dominates the RL with learned keypoints. Still, this architecture does not reach the full-state performance. This is to be expected since the representation is limited to 2D space and lacks velocity information. The runs using 2D ground truth points, mimicking a perfect SAE, learn significantly earlier than KeyNet-vel-std-bg, but only achieve a slightly higher final success rate.

Figure 7: Success rate for different states: (a) keypoints only, state $s^t$, and (b) extended state $s^t_{\mathrm{ext}}$.

Figure 7(b) shows the RL runs using state $s^t_{\mathrm{ext}}$, i.e., including the end-effector’s 3D position and velocity in addition to keypoints. RL runs with KeyNet-vel-std-bg and with the full state show comparable final success rates close to $1.0$, indicating that 2D keypoints are a useful state representation. KeyNet-vel-std-bg additionally learns notably faster. We presume that the reduced 2D representations of target and cube positions accelerate RL training. As expected, learning progress is impossible with DSAE, which tracks neither target nor cube. Compared to the first experiment setup, Basic-kp32 and Basic-vel-std-bg switch positions in final RL performance. Although Basic-vel-std-bg improves with the more precise 3D end-effector position, it is still unable to track the target and, therefore, limited in performance. For Basic-kp32, the missing end-effector tracking is now compensated with ground truth 3D information. Using its notable target tracking capability, it achieves better final performance. Initially, Basic-vel-std-bg learns faster, supporting the assumption that 2D representations can accelerate RL training. These kinds of insights are facilitated by the tracking capability and would not have been possible via traditional SAE evaluation. The IQM success rates for runs with ground truth instead of keypoints show faster learning but do not quite reach the maximum of full state and KeyNet-vel-std-bg.

Answering Problem 2, we find a link between SAE tracking capability, including the tracking capability for individual objects, and downstream RL performance.

6 Conclusion

We propose a metric to evaluate SAE performance with respect to task-relevant objects. By means of this metric, we show that well-performing SAE architectures actually track positions of task-relevant objects. We find notable performance differences in SAE architectures and identify three components that reliably improve performance, leading to almost perfect object tracking. We show that SAE tracking performance is indicative of downstream RL performance for a representative robotic manipulation task. This allows identifying suitable SAEs after comparatively lightweight SAE pretraining and before computationally expensive RL training. In addition, troubleshooting is greatly facilitated by the ability to evaluate the performance of an SAE as a key component of the RL pipeline. We observe that an RL agent using keypoints as part of its state achieves RL performance comparable to an agent with full simulation state. Thus, we consider keypoints a suitable state representation for robotic RL. We have demonstrated that this straightforward metric is effective in evaluating SAE architectures. The metric can be used to analyze any 2D keypoint extractor and is not restricted to SAEs. Investigating alternative keypoint extractors and extensions to 3D keypoints is thus a promising avenue for future research. The code to reproduce all results is available at \href{https://github.com/Data-Science-in-Mechanical-Engineering/SAE-RL}{github.com/Data-Science-in-Mechanical-Engineering/SAE-RL} and can be used to inform future research.

Acknowledgments

We thank Paul Brunzema and Bernd Frauenknecht for their helpful comments. We also thank Robin Kupper for his contributions in the early stages of this research. This work was partially funded by the “Demonstrations- und Transfernetzwerk KI in der Produktion (ProKI-Netz)” initiative, funded by the German Federal Ministry of Education and Research (BMBF, grant number 02P22A010). Computations were performed with computing resources granted by RWTH Aachen University under project rwth1385.

References

  • Agarwal et al. (2021) Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Deep Reinforcement Learning at the Edge of the Statistical Precipice. In Advances in Neural Information Processing Systems, volume 34. Curran Associates, Inc., 2021.
  • Bleher et al. (2022) Steffen Bleher, Steve Heim, and Sebastian Trimpe. Learning fast and precise pixel-to-torque control: A platform for reproducible research of learning on hardware. 29(2):75–84, 2022.
  • Boney et al. (2021) Rinu Boney, Alexander Ilin, and Juho Kannala. Learning of feature points without additional supervision improves reinforcement learning from images. arXiv preprint arXiv:2106.07995, 2021.
  • Cabi et al. (2019) Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, and Mel Vecerik. Scaling data-driven robotics with reward sketching and batch reinforcement learning. arXiv preprint arXiv:1909.12200, 2019.
  • Carbonneau et al. (2022) Marc-Andre Carbonneau, Julian Zaidi, Jonathan Boilard, and Ghyslain Gagnon. Measuring Disentanglement: A Review of Metrics. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • Chen et al. (2023) Ling-Chen Chen, Chi-Kai Ho, and Chung-Ta King. KeyState: Improving Image-based Reinforcement Learning with Keypoint for Robot Control. In IEEE International Conference on Industrial Technology (ICIT), Orlando, FL, USA, 2023.
  • Dwibedi et al. (2018) Debidatta Dwibedi, Jonathan Tompson, Corey Lynch, and Pierre Sermanet. Learning actionable representations from visual observations. In IEEE/RSJ international conference on intelligent robots and systems (IROS), 2018.
  • Finn et al. (2016) Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep spatial autoencoders for visuomotor learning. In IEEE International Conference on Robotics and Automation (ICRA), 2016.
  • Gallouédec et al. (2021) Quentin Gallouédec, Nicolas Cazin, Emmanuel Dellandréa, and Liming Chen. panda-gym: Open-source goal-conditioned environments for robotic learning. arXiv preprint arXiv:2106.13687, 2021.
  • Ghadirzadeh et al. (2017) Ali Ghadirzadeh, Atsuto Maki, Danica Kragic, and Mårten Björkman. Deep predictive policy training using reinforcement learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
  • Gopalakrishnan et al. (2021) Anand Gopalakrishnan, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Unsupervised object keypoint learning using local spatial predictability. In International Conference on Learning Representations (ICLR), 2021.
  • Ha and Schmidhuber (2018) David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. Advances in neural information processing systems, 31, 2018.
  • Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning. PMLR, 2018.
  • Hafner et al. (2019) Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International conference on machine learning. PMLR, 2019.
  • Hafner et al. (2023) Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering Diverse Domains through World Models. arXiv preprint arXiv:2301.04104, 2023.
  • Jakab et al. (2018) Tomas Jakab, Ankush Gupta, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object landmarks through conditional image generation. In Advances in Neural Information Processing Systems (NIPS), volume 31, 2018.
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kulkarni et al. (2019) Tejas D. Kulkarni, Ankush Gupta, Catalin Ionescu, Sebastian Borgeaud, Malcolm Reynolds, Andrew Zisserman, and Volodymyr Mnih. Unsupervised learning of object keypoints for perception and control. In Advances in Neural Information Processing Systems (NIPS), volume 32, 2019.
  • Lesort et al. (2019) Timothée Lesort, Mathieu Seurin, Xinrui Li, Natalia Díaz-Rodríguez, and David Filliat. Deep unsupervised state representation learning with robotic priors: a robustness analysis. In International Joint Conference on Neural Networks (IJCNN). IEEE, 2019.
  • Levine et al. (2016) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1), 2016.
  • Li et al. (2022) Yunzhu Li, Shuang Li, Vincent Sitzmann, Pulkit Agrawal, and Antonio Torralba. 3d neural scene representations for visuomotor control. In Conference on Robot Learning. PMLR, 2022.
  • Lorenz et al. (2019) Dominik Lorenz, Leonard Bereska, Timo Milbich, and Bjorn Ommer. Unsupervised Part-Based Disentangling of Object Shape and Appearance. 2019.
  • Minderer et al. (2019) Matthias Minderer, Chen Sun, Ruben Villegas, Forrester Cole, Kevin P. Murphy, and Honglak Lee. Unsupervised learning of object structure and dynamics from videos. Advances in Neural Information Processing Systems, 32, 2019.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, and Luca Antiga. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Puang et al. (2020) En Yen Puang, Keng Peng Tee, and Wei Jing. Kovis: Keypoint-based visual servoing with zero-shot sim-to-real transfer for robotics manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
  • Qin et al. (2020) Zengyi Qin, Kuan Fang, Yuke Zhu, Li Fei-Fei, and Silvio Savarese. Keto: Learning keypoint representations for tool manipulation. In IEEE International Conference on Robotics and Automation (ICRA), 2020.
  • Rafailov et al. (2021) Rafael Rafailov, Tianhe Yu, Aravind Rajeswaran, and Chelsea Finn. Offline reinforcement learning from images with latent space models. In Learning for Dynamics and Control. PMLR, 2021.
  • Raffin et al. (2021) Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research, 22(268), 2021.
  • Rezaabad and Vishwanath (2020) Ali Lotfi Rezaabad and Sriram Vishwanath. Learning Representations by Maximizing Mutual Information in Variational Autoencoders. In IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 2020.
  • Seo et al. (2022) Younggyo Seo, Kimin Lee, Stephen L. James, and Pieter Abbeel. Reinforcement learning with action-free pre-training from videos. In International Conference on Machine Learning. PMLR, 2022.
  • Sun et al. (2022) Jennifer J. Sun, Serim Ryou, Roni H. Goldshmid, Brandon Weissbourd, John O. Dabiri, David J. Anderson, Ann Kennedy, Yisong Yue, and Pietro Perona. Self-Supervised Keypoint Discovery in Behavioral Videos. 2022.
  • Sun et al. (2023) Jennifer J. Sun, Lili Karashchuk, Amil Dravid, Serim Ryou, Sonia Fereidooni, John C. Tuthill, Aggelos Katsaggelos, Bingni W. Brunton, Georgia Gkioxari, Ann Kennedy, Yisong Yue, and Pietro Perona. BKinD-3D: Self-Supervised 3D Keypoint Discovery From Multi-View Videos. 2023.
  • Sutton and Barto (1998) Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT press, 1998.
  • Wang et al. (2022) Tianying Wang, En Yen Puang, Marcus Lee, Wei Jing, and Yan Wu. End-to-end Reinforcement Learning of Robotic Manipulation with Robust Keypoints Representation. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2022.
  • Yarats et al. (2021) Denis Yarats, Amy Zhang, Ilya Kostrikov, Brandon Amos, Joelle Pineau, and Rob Fergus. Improving sample efficiency in model-free reinforcement learning from images. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021. Issue: 12.
  • Zhang et al. (2018) Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, and Honglak Lee. Unsupervised discovery of object landmarks as structural representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • Zhao et al. (2023) Xiaoming Zhao, Xingming Wu, Jinyu Miao, Weihai Chen, Peter C. Y. Chen, and Zhengguo Li. ALIKE: Accurate and Lightweight Keypoint Detection and Descriptor Extraction. IEEE Transactions on Multimedia, 25, 2023.

Appendix A Implementation Details

The following section gives more details about our implementation and training framework.

A.1 SAE Experiment Setup

For SAE tracking evaluation, we conduct 24 training runs with different random seeds for each of the considered model architectures.

Training. Hyperparameters are kept fixed and are listed in Table 1. During SAE training, we add normally distributed noise with $\sigma=0.001$ to the input images for regularization. Images contain floating point RGB values in the range $[0,1]$.

Table 1: Hyperparameters for SAE training.
SAE Hyperparameter          Value
number of epochs            $500$
batch size                  32
learning rate               0.001
optimizer                   Adam (Kingma and Ba, 2014)
input shape                 $256 \times 256 \times 3$
target shape                $64 \times 64 \times 3$
latent spatial dimensions   $[-1,1] \times [-1,1]$

Datasets. The training, validation, and testing datasets consist of $5\,000$, $2\,500$, and $2\,500$ images, respectively. To keep all three datasets well-separated, we initially collect image sequences of 10 frames each and perform data splitting on these sequences. To capture sufficient variation in the randomly initialized object and target positions, episode lengths during image collection are limited to $20$ frames. Images are collected using a smoothed random policy, starting with a random initial action in $[-10,10] \times [-10,10] \times [-10,10]$ and sampling the next action from a Gaussian distribution with standard deviation $\sigma=1.0$ centered on the previous action.
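The data collection and splitting described above can be sketched as follows. This is a minimal illustration, not the code used for the paper: it assumes that the initial action is drawn uniformly from the cube, that no clipping is applied by the policy itself, and that sequences are assigned to the three datasets by a random shuffle; all function names are hypothetical.

```python
import random

def smoothed_random_actions(n_steps, sigma=1.0, bound=10.0, rng=None):
    """Random-walk actions: each 3D action is drawn around the previous one."""
    rng = rng or random.Random()
    # Assumption: the initial action is uniform in [-bound, bound]^3.
    action = [rng.uniform(-bound, bound) for _ in range(3)]
    actions = [action]
    for _ in range(n_steps - 1):
        action = [rng.gauss(a, sigma) for a in action]
        actions.append(action)
    return actions

def split_sequences(n_sequences, fractions=(0.5, 0.25, 0.25), rng=None):
    """Shuffle sequence indices and split them into train/val/test sets."""
    rng = rng or random.Random()
    indices = list(range(n_sequences))
    rng.shuffle(indices)
    n_train = int(fractions[0] * n_sequences)
    n_val = int(fractions[1] * n_sequences)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

rng = random.Random(0)
episode_actions = smoothed_random_actions(n_steps=20, rng=rng)
# 10 000 images in sequences of 10 frames correspond to 1 000 sequences.
train, val, test = split_sequences(1000, rng=rng)
```

Splitting at the level of sequences rather than individual images keeps near-identical consecutive frames from ending up in different datasets, which is the separation the paragraph above aims for.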

Evaluation. The tracking thresholds used for computing the tracking capability $\mathrm{TC}$ are chosen to be $\mu_{\mathrm{cube}}=0.015$, $\mu_{\mathrm{target}}=0.015$, and $\mu_{\mathrm{eef}}=0.1$. Tracking errors are evaluated in the latent spatial dimensions of size $[-1,1] \times [-1,1]$. Where point estimates are reported, they refer to values at the end of training. Figures showing the object-wise tracking errors over time are slightly smoothed by applying a Gaussian filter with $\sigma=2.5$ steps.

A.2 SAE Architectures

Existing and new SAE architectures were implemented by us in PyTorch (Paszke et al., 2019). All details of this implementation can be found in the accompanying source code.

Architectures from literature. Our DSAE implementation follows the layout outlined by Finn et al. (2016). We keep all mentioned hyperparameters, including a CNN encoder with three convolutional layers, the number of keypoints $N=16$, and a single fully connected layer as decoder. Our KeyNet implementation follows the implementation from Jakab et al. (2018). In their paper, the goal is to shift an object in one image with information from another image. For this, a second image is appended to the decoder input. We omit this second image since we assume that the scene does not change. Where applicable, we keep all mentioned hyperparameters. We choose the number of keypoints to be $N=30$, as primarily used in their experiments. We fix the number of convolutional blocks in the encoder as $4$ and in the decoder as $3$.

Basic. Our Basic SAE architecture is a CNN similar to KeyNet (Jakab et al., 2018). The encoder consists of three convolutional blocks. Each block consists of a 2D convolution followed by a 2D batch-normalization layer and another 2D convolution followed by a 2D max-pooling operation with kernel size and stride $2$ and a 2D batch-normalization layer again. The kernel size for every convolution is $3$. The number of output channels for both convolutions in a block is identical, with $32$ in the first, $64$ in the second, and $128$ output channels in the third block. The blocks are followed by a 2D $1 \times 1$-convolution to aggregate into $16$ feature maps. A spatial soft arg-max layer extracts $N=16$ 2D keypoints from these. Similarly to KeyNet, the decoder begins with 2D heatmaps with Gaussian kernels with $\sigma=0.1$ at the keypoint location for each of the $N=16$ feature maps. These maps are created at the target resolution of $64 \times 64$. Three convolutional layers follow, each with a kernel size of $3$ and $64$, $32$, and $16$ output channels, respectively. Each convolution is followed by a 2D batch-normalization layer. A final 2D $1 \times 1$-convolution again aggregates the feature maps into $3$ RGB image channels.
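The two operations at the bottleneck, the spatial soft arg-max and the Gaussian heatmap, can be illustrated with a dependency-free sketch for a single feature map. This is not the PyTorch implementation from the accompanying code; the softmax temperature of one, the unnormalized Gaussian, and the coordinate convention (x along columns, y along rows, both in $[-1,1]$) are assumptions.

```python
import math

def linspace(n):
    """n equally spaced coordinates covering [-1, 1]."""
    return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]

def spatial_soft_argmax(feature_map):
    """Expected (x, y) position under a softmax over one 2D feature map."""
    height, width = len(feature_map), len(feature_map[0])
    peak = max(max(row) for row in feature_map)  # for numerical stability
    weights = [[math.exp(v - peak) for v in row] for row in feature_map]
    total = sum(sum(row) for row in weights)
    xs, ys = linspace(width), linspace(height)
    x = sum(w * xs[j] for row in weights for j, w in enumerate(row)) / total
    y = sum(w * ys[i] for i, row in enumerate(weights) for w in row) / total
    return x, y

def gaussian_heatmap(keypoint, size=64, sigma=0.1):
    """Isotropic Gaussian kernel centered on a keypoint, on a size x size grid."""
    kx, ky = keypoint
    grid = linspace(size)
    return [
        [math.exp(-((gx - kx) ** 2 + (gy - ky) ** 2) / (2.0 * sigma ** 2))
         for gx in grid]
        for gy in grid
    ]

# A feature map with one strong activation in the top-right corner.
fmap = [[0.0] * 5 for _ in range(5)]
fmap[0][4] = 20.0
kp = spatial_soft_argmax(fmap)
heatmap = gaussian_heatmap(kp)
```

Both operations are differentiable with respect to the feature map and the keypoint, which is what allows the keypoint bottleneck to be trained end to end with the reconstruction loss.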

Modifications. We find three modifications to SAE architectures that improve tracking:

  1. Velocity loss term (-vel). In Finn et al. (2016), an additional loss term $g_{\mathrm{slow}}$ is added to the reconstruction loss with a weighting factor $\beta$. By penalizing a change of keypoint velocities in subsequent frame pairs, the velocity loss encourages the detection of temporally consistent keypoints. For the velocity loss term, we choose a weighting factor $\beta=0.1$, while the reconstruction loss remains weighted with factor $1.0$ and is here computed as the mean of the three samples passed through the encoder concurrently.

  2. Trainable Gaussian standard deviation (-std). We follow Jakab et al. (2018) and use Gaussian heatmaps as input to the first CNN decoder layer. We propose making the standard deviation $\sigma$ trainable. This enables the decoder to control a keypoint’s radius of influence. For the trainable Gaussian standard deviation, we use an initial value of $\sigma=0.1$, equal to the fixed standard deviation in the unmodified case.

  3. Background bias layer (-bg). We add a bias with the dimensions of the target image to the decoder’s output, giving the decoder a straightforward way to reconstruct a stationary background. This usually happens within the first epochs, and time-varying keypoints then focus on moving objects. The background bias layer is initialized to zeros.
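
To make the weighting in the -vel variant concrete, the following sketch assembles the loss for one frame triplet. It assumes that $g_{\mathrm{slow}}$ has the form used by Finn et al. (2016), i.e., the squared difference between the keypoint displacements of two consecutive frame pairs, summed over keypoints. The function names and the placeholder reconstruction losses are hypothetical, and whether the term is summed or averaged over keypoints is an assumption of this sketch.

```python
def velocity_loss(z_prev, z_curr, z_next):
    """Squared change in keypoint velocity across two consecutive frame pairs.

    Each argument is a list of (x, y) keypoints for one frame.
    """
    loss = 0.0
    for (x0, y0), (x1, y1), (x2, y2) in zip(z_prev, z_curr, z_next):
        ax = (x2 - x1) - (x1 - x0)  # change in x-velocity
        ay = (y2 - y1) - (y1 - y0)  # change in y-velocity
        loss += ax ** 2 + ay ** 2
    return loss


def total_loss(recon_losses, z_prev, z_curr, z_next, beta=0.1):
    """Mean reconstruction loss of the three frames plus the weighted velocity term."""
    recon = sum(recon_losses) / len(recon_losses)
    return 1.0 * recon + beta * velocity_loss(z_prev, z_curr, z_next)


# Usage: one keypoint moving at constant velocity versus one that jumps.
steady = [[(0.0, 0.0)], [(0.1, 0.0)], [(0.2, 0.0)]]
jumpy = [[(0.0, 0.0)], [(0.1, 0.0)], [(0.5, 0.0)]]
loss_steady = total_loss([0.02, 0.03, 0.04], *steady)  # velocity term vanishes
loss_jumpy = total_loss([0.02, 0.03, 0.04], *jumpy)    # velocity term adds 0.1 * 0.09
```

A keypoint that moves at constant velocity is not penalized at all; only abrupt changes between the two frame pairs raise the loss.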

A.3 RL Experiment Setup

For RL experiments with SAEs as state representation extractors, we randomly sample 5 trained SAEs of each architecture and conduct 2 randomly seeded RL training runs with each of them, yielding a total of 10 runs per SAE architecture.

Training. As an RL algorithm, we choose the SAC (Haarnoja et al., 2018) implementation from stable-baselines3 (Raffin et al., 2021). Except for the values in Table 2, hyperparameters are kept at default values.

Table 2: Hyperparameters for RL training.
\begin{tabular}{ll}
RL Hyperparameter & Value \\
\hline
number of steps & $3\,000\,000$ \\
steps after which learning starts & $100\,000$ \\
episode time limit (steps) & $100$ \\
number of parallel environments & $10$ \\
gradient descent steps per environment step & $1$ \\
\end{tabular}
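
As a rough guide to how the values in Table 2 could be passed to stable-baselines3, consider the sketch below. The keyword names `learning_starts`, `gradient_steps`, and `total_timesteps` belong to the stable-baselines3 SAC interface, but which table row maps to which keyword is our reading of the table and should be treated as an assumption. The same holds for `make_env`, a hypothetical environment factory, and for the use of the `gymnasium` `TimeLimit` wrapper, which presumes a recent stable-baselines3 version.

```python
# Values from Table 2; the mapping to keyword names is an assumption of this sketch.
SAC_KWARGS = {
    "learning_starts": 100_000,  # steps after which learning starts
    "gradient_steps": 1,         # gradient descent steps per environment step
}
TOTAL_TIMESTEPS = 3_000_000      # number of steps
MAX_EPISODE_STEPS = 100          # episode time limit (steps)
N_ENVS = 10                      # number of parallel environments


def train(make_env):
    """Train SAC on `make_env`, a hypothetical zero-argument environment factory."""
    # Imported lazily so the constants above can be inspected without the libraries.
    from gymnasium.wrappers import TimeLimit
    from stable_baselines3 import SAC
    from stable_baselines3.common.env_util import make_vec_env

    vec_env = make_vec_env(
        lambda: TimeLimit(make_env(), max_episode_steps=MAX_EPISODE_STEPS),
        n_envs=N_ENVS,
    )
    model = SAC("MlpPolicy", vec_env, **SAC_KWARGS)  # all other settings stay at defaults
    model.learn(total_timesteps=TOTAL_TIMESTEPS)
    return model
```

If the step budget counts transitions summed over all parallel environments, each of the 10 environments contributes 300 000 steps.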

Evaluation. Figures showing the evaluation success rate over time are slightly smoothed by applying a Gaussian filter with $\sigma=2.5$ evaluation steps. Evaluation steps are performed every $10\,000$ training steps with 10 evaluation episodes each.
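
The smoothing step can be sketched in pure Python as follows. This is an illustration rather than the plotting code we used: the function name is hypothetical, the kernel is truncated at four standard deviations, and samples beyond the ends of the curve are handled by renormalizing the kernel, which may differ from the boundary handling of library filters.

```python
import math


def gaussian_smooth(values, sigma=2.5, truncate=4.0):
    """Smooth a 1D sequence with a Gaussian kernel, renormalizing at the edges."""
    radius = int(truncate * sigma + 0.5)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    smoothed = []
    for i in range(len(values)):
        acc = weight = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(values):  # skip samples outside the curve
                acc += w * values[j]
                weight += w
        smoothed.append(acc / weight)
    return smoothed


# Usage: one evaluation every 10 000 of 3 000 000 training steps gives 300 points.
n_evaluations = 3_000_000 // 10_000
raw = [1.0 if i % 2 else 0.0 for i in range(n_evaluations)]  # noisy placeholder curve
smooth = gaussian_smooth(raw)
```

With $\sigma=2.5$ evaluation steps, the filter mainly averages over roughly the nearest ten evaluations, i.e., about $100\,000$ training steps.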

State representation. In our RL evaluation, we consider different state representations, which Table 3 summarizes. We distinguish between using SAE-encoded keypoints and using ground truth (GT) points, and between extending the state with end-effector state measurements (+ robot) or not. Additionally, we consider the full simulation state, including target position and cube position, velocity, orientation, and angular velocity. Note that $o$ and $\phi$ denote 3D positions and orientations, respectively, while $\dot{o}$ and $\dot{\phi}$ are their temporal derivatives.

Table 3: Specifications of state representation vector configurations used for RL experiments.
\begin{tabular}{ll}
Configuration & State \\
\hline
Keypoints & $s^{t}=z^{t}$ \\
Keypoints + robot & $s^{t}_{\mathrm{ext}}=(z^{t},o_{\mathrm{eef}},\dot{o}_{\mathrm{eef}})$ \\
GT points & $s^{t}=x^{t}$ \\
GT points + robot & $s^{t}_{\mathrm{ext}}=(x^{t},o_{\mathrm{eef}},\dot{o}_{\mathrm{eef}})$ \\
Full Simulation State & $s^{t}_{\mathrm{full}}=(o_{\mathrm{eef}},\dot{o}_{\mathrm{eef}},o_{\mathrm{cube}},\dot{o}_{\mathrm{cube}},\phi_{\mathrm{cube}},\dot{\phi}_{\mathrm{cube}},o_{\mathrm{target}})$ \\
\end{tabular}
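
For the keypoint-based configurations, assembling the state vector amounts to a concatenation, as the following sketch illustrates. The helper names and the numerical values are placeholders. The sketch assumes $N=16$ 2D keypoints and 3D end-effector position and velocity, which gives 32 and 38 entries, respectively.

```python
def flatten_points(points):
    """Flatten a list of (x, y) points into [x0, y0, x1, y1, ...]."""
    return [coord for point in points for coord in point]


def extended_state(points, o_eef, o_eef_dot):
    """Build s_ext = (points, o_eef, o_eef_dot) as one flat list."""
    return flatten_points(points) + list(o_eef) + list(o_eef_dot)


# Usage with N = 16 placeholder keypoints and a placeholder robot measurement.
z_t = [(0.0, 0.0)] * 16
s_t = flatten_points(z_t)                      # "Keypoints" configuration
s_ext_t = extended_state(                      # "Keypoints + robot" configuration
    z_t, o_eef=(0.4, 0.0, 0.2), o_eef_dot=(0.0, 0.0, 0.0)
)
```

The GT configurations follow the same pattern with the ground truth points $x^{t}$ in place of $z^{t}$.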

Appendix B Experimental Results

In the following, we discuss selected additional results we obtained. These are a comparison of our SAE metric with the reconstruction loss, qualitative results of applying the linear feature transformation, and a closer look at tracking error distributions for the considered SAE architecture modifications.

Reconstruction loss. For comparison with the tracking error and the tracking capability, we examine the reconstruction loss of two exemplary architectures in Figure 8. Overall, we observe the expected decrease in reconstruction loss over training epochs and a difference in final reconstruction loss between the two architectures. In contrast to the tracking error, however, the reconstruction loss cannot tell us whether either of the two architectures effectively tracks the ground truth objects. The curves show none of the notable jumps that appear in the tracking error over time, and they offer no way of differentiating between multiple objects. We therefore find the reconstruction loss to be an insufficient indicator for judging an SAE’s tracking performance.

Figure 8: Reconstruction loss over training epochs for (a) the Basic-kp32 architecture and (b) the KeyNet architecture.

Qualitative transformation results. Figure 9 shows trajectories from four example episodes for the end-effector and the cube. The trajectory of the best keypoint according to our tracking error formulation is shown in red. As expected, this trajectory can deviate significantly from the white ground truth point trajectory. When taking into account the proposed transformation for tracking evaluation, we obtain the blue trajectory, which comes much closer to the ground truth point trajectory we evaluate against. For the cube, these two trajectories coincide almost exactly. The larger deviation for the end-effector can be explained by its wider range of movement, over which the transformation is only approximated as linear and time-invariant.

Figure 9: Trajectories from sample episodes for (a) end-effector tracking and (b) cube tracking. Ground truth points (white), untransformed keypoints obtained with KeyNet-vel-std-bg (red), and transformed keypoints (blue) are plotted over time.
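
To illustrate what a linear, time-invariant transformation between a keypoint and a ground truth point can look like, the sketch below fits a single affine map by least squares over all time steps and applies it to the keypoint trajectory. This is one plausible reading and not a restatement of the transformation defined in the main text: the affine form, the least-squares fit via normal equations, and all function names are assumptions of this sketch.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_affine(keypoints, targets):
    """Least-squares fit of target = A keypoint + b; one weight row per output axis."""
    design = [[u, v, 1.0] for u, v in keypoints]
    gram = [[sum(row[i] * row[j] for row in design) for j in range(3)] for i in range(3)]
    weights = []
    for axis in range(2):
        rhs = [sum(row[i] * t[axis] for row, t in zip(design, targets)) for i in range(3)]
        weights.append(solve(gram, rhs))
    return weights


def apply_affine(weights, keypoint):
    """Map one (u, v) keypoint with the fitted, time-invariant affine transformation."""
    u, v = keypoint
    return tuple(w[0] * u + w[1] * v + w[2] for w in weights)


# Usage: synthetic keypoints that relate to the ground truth by an exact affine map.
keypoints = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.25)]
ground_truth = [(0.5 * u - 0.2 * v + 0.1, 0.3 * u + 0.8 * v - 0.4) for u, v in keypoints]
weights = fit_affine(keypoints, ground_truth)
mapped = [apply_affine(weights, kp) for kp in keypoints]
```

In this synthetic case the mapped trajectory matches the ground truth exactly. For real keypoints whose relation to the object position changes over a wide range of motion, a single fixed map leaves a residual, in line with the larger end-effector deviation discussed above.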

Modifications and ablations. Figure 10 shows the tracking capabilities of the Basic architecture with all combinations of modifications and the KeyNet architecture with the best-found combination. The velocity loss (-vel) generally improves the tracking capability for continuously moving objects, in our case, the end-effector. However, it can have a slightly negative effect on the tracking of more stationary objects. Making $\sigma$ trainable (-std) and adding a background bias layer (-bg) almost always yield an improved tracking capability. The combination vel-std-bg yields a consistent improvement in tracking capability. For Basic, the tracking capability increases from $\overline{\mathrm{TC}}=0.181$ to $\overline{\mathrm{TC}}=0.528$.

Figure 10: Model ablations: tracking capability $\overline{\mathrm{TC}_{k}}$ for $k=3$ ground truth objects of the adapted architectures.

Figure 11: KeyNet-vel-std-bg tracking errors $\overline{\mathrm{TC}_{k}}$ for $k=3$ ground truth objects over epochs.

KeyNet already tracks cube and target well and reaches $\overline{\mathrm{TC}}=0.681$. KeyNet-vel-std-bg has a near-perfect mean tracking capability of $\overline{\mathrm{TC}}=0.986$. The biggest change can be seen in end-effector tracking, improving from $\mathrm{TC}_{\mathrm{eef}}=0.167$ to $0.958$. This exceptional tracking performance is confirmed by the tracking error. For cube and target, the tracking error consistently shows a distinct drop below our threshold (cf. Figure 11). As explained in Section 5.1, the end-effector usually shows a shallower slope. Still, the tracking error consistently falls below the threshold, indicating that the KeyNet-vel-std-bg architecture learns to track the gripper.
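
As a rough consistency check on these numbers, suppose that $\overline{\mathrm{TC}}$ is the arithmetic mean of the three object-wise tracking capabilities (an assumption of this check, up to rounding of the reported values). The average over cube and target then follows as

\[
\frac{\mathrm{TC}_{\mathrm{cube}}+\mathrm{TC}_{\mathrm{target}}}{2}
=\frac{3\,\overline{\mathrm{TC}}-\mathrm{TC}_{\mathrm{eef}}}{2}
\approx
\begin{cases}
\dfrac{3\cdot 0.681-0.167}{2}\approx 0.94 & \text{(KeyNet)},\\[2ex]
\dfrac{3\cdot 0.986-0.958}{2}= 1.00 & \text{(KeyNet-vel-std-bg)}.
\end{cases}
\]

Under this reading, cube and target are tracked almost equally well by both variants, and the gain in $\overline{\mathrm{TC}}$ stems almost entirely from the end-effector.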

Due to its more directly interpretable nature, we primarily examine the tracking capability $\mathrm{TC}$ in our evaluation of architecture modifications. Figure 12 additionally shows box plots of the object-wise tracking error distributions over 24 runs for each model. We again observe the effectiveness of the combined modification -vel-std-bg for both the Basic and the KeyNet architecture.

Figure 12: Box plots of tracking errors over 24 runs of the considered architecture modifications.