Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms

Abstract: In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. As a consequence, the learning targets evolve with time and the policy optimization process must be efficient at unlearning what it previously learnt. In this paper, we discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. To increase the unlearning speed, we study a novel policy update: the gradient of the cross-entropy loss with respect to the action maximizing $q$, but find that such updates may lead to a decrease in value. Consequently, we introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions. Further, we assess standard policy updates and our cross-entropy policy updates along six analytical dimensions. Finally, we empirically validate our theoretical findings.

Submitted 15 February, 2022; originally announced February 2022.

Comments: 9p+appendix, accepted to AISTATS 2022

arXiv:2202.06828 [pdf, other]

On the Convergence of SARSA with Linear Function Approximation

Authors: Shangtong Zhang, Remi Tachet, Romain Laroche

Abstract: SARSA, a classical on-policy control algorithm for reinforcement learning, is known to chatter when combined with linear function approximation: SARSA does not diverge but oscillates in a bounded region. However, little is known about how fast SARSA converges to that region and how large the region is. In this paper, we make progress towards this open problem by showing the convergence rate of projected SARSA to a bounded region. Importantly, the region is much smaller than the region that we project into, provided that the magnitude of the reward is not too large. Existing works regarding the convergence of linear SARSA to a fixed point all require the Lipschitz constant of SARSA's policy improvement operator to be sufficiently small; our analysis instead applies to arbitrary Lipschitz constants and thus characterizes the behavior of linear SARSA for a new regime.

Submitted 12 May, 2023; v1 submitted 14 February, 2022; originally announced February 2022.

Comments: ICML 2023

arXiv:2111.02997 [pdf, other]

Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor Critic under State Distribution Mismatch

Authors: Shangtong Zhang, Remi Tachet, Romain Laroche

Abstract: In this paper, we establish the global optimality and convergence rate of an off-policy actor critic algorithm in the tabular setting without using density ratio to correct the discrepancy between the state distribution of the behavior policy and that of the target policy. Our work goes beyond existing works on the optimality of policy gradient methods in that existing works use the exact policy gradient for updating the policy parameters while we use an approximate and stochastic update step. Our update step is not a gradient update because we do not use a density ratio to correct the state distribution, which aligns well with what practitioners do. Our update is approximate because we use a learned critic instead of the true value function. Our update is stochastic because at each step the update is done for only the current state action pair. Moreover, we remove several restrictive assumptions from existing works in our analysis. Central to our work is the finite sample analysis of a generic stochastic approximation algorithm with time-inhomogeneous update operators on time-inhomogeneous Markov chains, based on its uniform contraction properties.

Submitted 24 October, 2022; v1 submitted 4 November, 2021; originally announced November 2021.

Comments: Journal of Machine Learning Research 2022

arXiv:2109.14727 [pdf, other]

Dr Jekyll and Mr Hyde: the Strange Case of Off-Policy Policy Updates

Authors: Romain Laroche, Remi Tachet

Abstract: The policy gradient theorem states that the policy should only be updated in states that are visited by the current policy, which leads to insufficient planning in the off-policy states, and thus to convergence to suboptimal policies. We tackle this planning issue by extending the policy gradient theory to policy updates with respect to any state density. Under these generalized policy updates, we show convergence to optimality under a necessary and sufficient condition on the updates' state densities, and thereby solve the aforementioned planning issue. We also prove asymptotic convergence rates that significantly improve those in the policy gradient literature. To implement the principles prescribed by our theory, we propose an agent, Dr Jekyll & Mr Hyde (JH), with a double personality: Dr Jekyll purely exploits while Mr Hyde purely explores. JH's independent policies allow to record two separate replay buffers: one on-policy (Dr Jekyll's) and one off-policy (Mr Hyde's), and therefore to update JH's models with a mixture of on-policy and off-policy updates. More than an algorithm, JH defines principles for actor-critic algorithms to satisfy the requirements we identify in our analysis. We extensively test on finite MDPs where JH demonstrates a superior ability to recover from converging to a suboptimal policy without impairing its speed of convergence. We also implement a deep version of the algorithm and test it on a simple problem where it shows promising results.

Submitted 29 September, 2021; originally announced September 2021.

Comments: accepted to NeurIPS as a poster

arXiv:2106.13401 [pdf, other]

Decomposed Mutual Information Estimation for Contrastive Representation Learning

Authors: Alessandro Sordoni, Nouha Dziri, Hannes Schulz, Geoff Gordon, Phil Bachman, Remi Tachet

Abstract: Recent contrastive representation learning methods rely on estimating mutual information (MI) between multiple views of an underlying context. E.g., we can derive multiple views of a given image by applying data augmentation, or we can split a sequence into views comprising the past and future of some step in the sequence. Contrastive lower bounds on MI are easy to optimize, but have a strong underestimation bias when estimating large amounts of MI. We propose decomposing the full MI estimation problem into a sum of smaller estimation problems by splitting one of the views into progressively more informed subviews and by applying the chain rule on MI between the decomposed views. This expression contains a sum of unconditional and conditional MI terms, each measuring modest chunks of the total MI, which facilitates approximation via contrastive bounds. To maximize the sum, we formulate a contrastive lower bound on the conditional MI which can be approximated efficiently. We refer to our general approach as Decomposed Estimation of Mutual Information (DEMI). We show that DEMI can capture a larger amount of MI than standard non-decomposed contrastive bounds in a synthetic setting, and learns better representations in a vision domain and for dialogue generation.

Submitted 24 June, 2021; originally announced June 2021.

Comments: ICML 2021

arXiv:2003.04475 [pdf, other]

Domain Adaptation with Conditional Distribution Matching and Generalized Label Shift

Authors: Remi Tachet, Han Zhao, Yu-Xiang Wang, Geoff Gordon

Abstract: Adversarial learning has demonstrated good performance in the unsupervised domain adaptation setting, by learning domain-invariant representations. However, recent work has shown limitations of this approach when label distributions differ between the source and target domains. In this paper, we propose a new assumption, generalized label shift ($GLS$), to improve robustness against mismatched label distributions. $GLS$ states that, conditioned on the label, there exists a representation of the input that is invariant between the source and target domains. Under $GLS$, we provide theoretical guarantees on the transfer performance of any classifier. We also devise necessary and sufficient conditions for $GLS$ to hold, by using an estimation of the relative class weights between domains and an appropriate reweighting of samples. Our weight estimation method could be straightforwardly and generically applied in existing domain adaptation (DA) algorithms that learn domain-invariant representations, with small computational overhead. In particular, we modify three DA algorithms, JAN, DANN and CDAN, and evaluate their performance on standard and artificial DA tasks. Our algorithms outperform the base versions, with vast improvements for large label distribution mismatches. Our code is available at https://tinyurl.com/y585xt6j.

Submitted 11 December, 2020; v1 submitted 9 March, 2020; originally announced March 2020.

Comments: Appeared in NeurIPS 2020

arXiv:2002.10948 [pdf, other]

doi 10.24963/ijcai.2020/394

Reinforcement Learning Framework for Deep Brain Stimulation Study

Authors: Dmitrii Krylov, Remi Tachet, Romain Laroche, Michael Rosenblum, Dmitry V. Dylov

Abstract: Malfunctioning neurons in the brain sometimes operate synchronously, reportedly causing many neurological diseases, e.g. Parkinson's. Suppression and control of this collective synchronous activity are therefore of great importance for neuroscience, and can only rely on limited engineering trials due to the need to experiment with live human brains. We present the first Reinforcement Learning gym framework that emulates this collective behavior of neurons and allows us to find suppression parameters for the environment of synthetic degenerate models of neurons. We successfully suppress synchrony via RL for three pathological signaling regimes, characterize the framework's stability to noise, and further remove the unwanted oscillations by engaging multiple PPO agents.

Submitted 22 February, 2020; originally announced February 2020.

Comments: 7 pages + 1 references, 7 figures. arXiv admin note: text overlap with arXiv:1909.12154

Journal ref: IJCAI 2020, pp. 2847-2854

arXiv:1911.03861 [pdf, other]

Increasing Robustness to Spurious Correlations using Forgettable Examples

Authors: Yadollah Yaghoobzadeh, Soroush Mehri, Remi Tachet, T. J. Hazen, Alessandro Sordoni

Abstract: Neural NLP models tend to rely on spurious correlations between labels and input features to perform their tasks. Minority examples, i.e., examples that contradict the spurious correlations present in the majority of data points, have been shown to increase the out-of-distribution generalization of pre-trained language models. In this paper, we first propose using example forgetting to find minority examples without prior knowledge of the spurious correlations present in the dataset. Forgettable examples are instances either learned and then forgotten during training or never learned. We empirically show how these examples are related to minorities in our training sets. Then, we introduce a new approach to robustify models by fine-tuning our models twice, first on the full training data and second on the minorities only. We obtain substantial improvements in out-of-distribution generalization when applying our approach to the MNLI, QQP, and FEVER datasets.

Submitted 1 February, 2021; v1 submitted 10 November, 2019; originally announced November 2019.

Comments: 14 pages, Accepted at EACL2021

arXiv:1809.06848 [pdf, other]

On the Learning Dynamics of Deep Neural Networks

Authors: Remi Tachet, Mohammad Pezeshki, Samira Shabanian, Aaron Courville, Yoshua Bengio

Abstract: While a lot of progress has been made in recent years, the dynamics of learning in deep nonlinear neural networks remain to this day largely misunderstood. In this work, we study the case of binary classification and prove various properties of learning in such networks under strong assumptions such as linear separability of the data. Extending existing results from the linear case, we confirm empirical observations by proving that the classification error also follows a sigmoidal shape in nonlinear architectures. We show that given proper initialization, learning expounds parallel independent modes and that certain regions of parameter space might lead to failed training. We also demonstrate that input norm and features' frequency in the dataset lead to distinct convergence speeds which might shed some light on the generalization capabilities of deep neural networks. We provide a comparison between the dynamics of learning with cross-entropy and hinge losses, which could prove useful to understand recent progress in the training of generative adversarial networks. Finally, we identify a phenomenon that we baptize gradient starvation where the most frequent features in a dataset prevent the learning of other less frequent but equally informative features.

Submitted 11 December, 2020; v1 submitted 18 September, 2018; originally announced September 2018.

Comments: 19 pages, 7 figures

arXiv:1809.02591 [pdf, other]

Learning Invariances for Policy Generalization

Authors: Remi Tachet, Philip Bachman, Harm van Seijen

Abstract: While recent progress has spawned very powerful machine learning systems, those agents remain extremely specialized and fail to transfer the knowledge they gain to similar yet unseen tasks. In this paper, we study a simple reinforcement learning problem and focus on learning policies that encode the proper invariances for generalization to different settings. We evaluate three potential methods for policy generalization: data augmentation, meta-learning and adversarial training. We find our data augmentation method to be effective, and study the potential of meta-learning and adversarial learning as alternative task-agnostic approaches.

Submitted 12 December, 2020; v1 submitted 7 September, 2018; originally announced September 2018.

Comments: 7 pages, 1 figure

arXiv:1710.04983 [pdf, other]

doi 10.1109/TITS.2018.2869085

Estimating savings in parking demand using shared vehicles for home-work commuting

Authors: Dániel Kondor, Hongmou Zhang, Remi Tachet, Paolo Santi, Carlo Ratti

Abstract: The increasing availability and adoption of shared vehicles as an alternative to personally-owned cars presents ample opportunities for achieving more efficient transportation in cities. With private cars spending on the average over 95\% of the time parked, one of the possible benefits of shared mobility is the reduced need for parking space. While widely discussed, a systematic quantification of these benefits as a function of mobility demand and sharing models is still mostly lacking in the literature. As a first step in this direction, this paper focuses on a type of private mobility which, although specific, is a major contributor to traffic congestion and parking needs, namely, home-work commuting. We develop a data-driven methodology for estimating commuter parking needs in different shared mobility models, including a model where self-driving vehicles are used to partially compensate flow imbalance typical of commuting, and further reduce parking infrastructure at the expense of increased traveled kilometers. We consider the city of Singapore as a case study, and produce very encouraging results showing that the gradual transition to shared mobility models will bring tangible reductions in parking infrastructure. In the future-looking, self-driving vehicle scenario, our analysis suggests that up to 50\% reduction in parking needs can be achieved at the expense of increasing total traveled kilometers of less than 2\%.

Submitted 21 October, 2018; v1 submitted 13 October, 2017; originally announced October 2017.

Comments: IEEE Transactions on Intelligent Transportation Systems, 2018

Showing 1–11 of 11 results for author: Tachet, R