Light-weight probing of unsupervised representations for Reinforcement Learning
Authors:
Wancong Zhang,
Anthony GX-Chen,
Vlad Sobal,
Yann LeCun,
Nicolas Carion
Abstract:
Unsupervised visual representation learning offers the opportunity to leverage large corpora of unlabeled trajectories to form useful visual representations, which can benefit the training of reinforcement learning (RL) algorithms. However, evaluating the fitness of such representations requires training RL algorithms which is computationally intensive and has high variance outcomes. Inspired by t…
▽ More
Unsupervised visual representation learning offers the opportunity to leverage large corpora of unlabeled trajectories to form useful visual representations, which can benefit the training of reinforcement learning (RL) algorithms. However, evaluating the fitness of such representations requires training RL algorithms which is computationally intensive and has high variance outcomes. Inspired by the vision community, we study whether linear probing can be a proxy evaluation task for the quality of unsupervised RL representation. Specifically, we probe for the observed reward in a given state and the action of an expert in a given state, both of which are generally applicable to many RL domains. Through rigorous experimentation, we show that the probing tasks are strongly rank correlated with the downstream RL performance on the Atari100k Benchmark, while having lower variance and up to 600x lower computational cost. This provides a more efficient method for exploring the space of pretraining algorithms and identifying promising pretraining recipes without the need to run RL evaluations for every setting. Leveraging this framework, we further improve existing self-supervised learning (SSL) recipes for RL, highlighting the importance of the forward model, the size of the visual backbone, and the precise formulation of the unsupervised objective.
△ Less
Submitted 31 May, 2024; v1 submitted 25 August, 2022;
originally announced August 2022.
A Generalized Bootstrap Target for Value-Learning, Efficiently Combining Value and Feature Predictions
Authors:
Anthony GX-Chen,
Veronica Chelu,
Blake A. Richards,
Joelle Pineau
Abstract:
Estimating value functions is a core component of reinforcement learning algorithms. Temporal difference (TD) learning algorithms use bootstrap**, i.e. they update the value function toward a learning target using value estimates at subsequent time-steps. Alternatively, the value function can be updated toward a learning target constructed by separately predicting successor features (SF)--a poli…
▽ More
Estimating value functions is a core component of reinforcement learning algorithms. Temporal difference (TD) learning algorithms use bootstrap**, i.e. they update the value function toward a learning target using value estimates at subsequent time-steps. Alternatively, the value function can be updated toward a learning target constructed by separately predicting successor features (SF)--a policy-dependent model--and linearly combining them with instantaneous rewards. We focus on bootstrap** targets used when estimating value functions, and propose a new backup target, the $η$-return mixture, which implicitly combines value-predictive knowledge (used by TD methods) with (successor) feature-predictive knowledge--with a parameter $η$ capturing how much to rely on each. We illustrate that incorporating predictive knowledge through an $ηγ$-discounted SF model makes more efficient use of sampled experience, compared to either extreme, i.e. bootstrap** entirely on the value function estimate, or bootstrap** on the product of separately estimated successor features and instantaneous reward models. We empirically show this approach leads to faster policy evaluation and better control performance, for tabular and nonlinear function approximations, indicating scalability and generality.
△ Less
Submitted 5 January, 2022;
originally announced January 2022.