Search | arXiv e-print repository

Probabilistic Inference in Reinforcement Learning Done Right

Authors: Jean Tarbouriech, Tor Lattimore, Brendan O'Donoghue

Abstract: A popular perspective in Reinforcement learning (RL) casts the problem as probabilistic inference on a graphical model of the Markov decision process (MDP). The core object of study is the probability of each state-action pair being visited under the optimal policy. Previous approaches to approximate this quantity can be arbitrarily poor, leading to algorithms that do not implement genuine statist… ▽ More A popular perspective in Reinforcement learning (RL) casts the problem as probabilistic inference on a graphical model of the Markov decision process (MDP). The core object of study is the probability of each state-action pair being visited under the optimal policy. Previous approaches to approximate this quantity can be arbitrarily poor, leading to algorithms that do not implement genuine statistical inference and consequently do not perform well in challenging problems. In this work, we undertake a rigorous Bayesian treatment of the posterior probability of state-action optimality and clarify how it flows through the MDP. We first reveal that this quantity can indeed be used to generate a policy that explores efficiently, as measured by regret. Unfortunately, computing it is intractable, so we derive a new variational Bayesian approximation yielding a tractable convex optimization problem and establish that the resulting policy also explores efficiently. We call our approach VAPOR and show that it has strong connections to Thompson sampling, K-learning, and maximum entropy exploration. We conclude with some experiments demonstrating the performance advantage of a deep RL version of VAPOR. △ Less

Submitted 22 November, 2023; originally announced November 2023.

Comments: NeurIPS 2023

arXiv:2302.09339 [pdf, other]

Efficient Exploration via Epistemic-Risk-Seeking Policy Optimization

Authors: Brendan O'Donoghue

Abstract: Exploration remains a key challenge in deep reinforcement learning (RL). Optimism in the face of uncertainty is a well-known heuristic with theoretical guarantees in the tabular setting, but how best to translate the principle to deep reinforcement learning, which involves online stochastic gradients and deep network function approximators, is not fully understood. In this paper we propose a new,… ▽ More Exploration remains a key challenge in deep reinforcement learning (RL). Optimism in the face of uncertainty is a well-known heuristic with theoretical guarantees in the tabular setting, but how best to translate the principle to deep reinforcement learning, which involves online stochastic gradients and deep network function approximators, is not fully understood. In this paper we propose a new, differentiable optimistic objective that when optimized yields a policy that provably explores efficiently, with guarantees even under function approximation. Our new objective is a zero-sum two-player game derived from endowing the agent with an epistemic-risk-seeking utility function, which converts uncertainty into value and encourages the agent to explore uncertain states. We show that the solution to this game minimizes an upper bound on the regret, with the 'players' each attempting to minimize one component of a particular regret decomposition. We derive a new model-free algorithm which we call 'epistemic-risk-seeking actor-critic' (ERSAC), which is simply an application of simultaneous stochastic gradient ascent-descent to the game. Finally, we discuss a recipe for incorporating off-policy data and show that combining the risk-seeking objective with replay data yields a double benefit in terms of statistical efficiency. We conclude with some results showing good performance of a deep RL agent using the technique on the challenging 'DeepSea' environment, showing significant performance improvements even over other efficient exploration techniques, as well as improved performance on the Atari benchmark. △ Less

Submitted 4 June, 2023; v1 submitted 18 February, 2023; originally announced February 2023.

arXiv:2302.01275 [pdf, other]

ReLOAD: Reinforcement Learning with Optimistic Ascent-Descent for Last-Iterate Convergence in Constrained MDPs

Authors: Ted Moskovitz, Brendan O'Donoghue, Vivek Veeriah, Sebastian Flennerhag, Satinder Singh, Tom Zahavy

Abstract: In recent years, Reinforcement Learning (RL) has been applied to real-world problems with increasing success. Such applications often require to put constraints on the agent's behavior. Existing algorithms for constrained RL (CRL) rely on gradient descent-ascent, but this approach comes with a caveat. While these algorithms are guaranteed to converge on average, they do not guarantee last-iterate… ▽ More In recent years, Reinforcement Learning (RL) has been applied to real-world problems with increasing success. Such applications often require to put constraints on the agent's behavior. Existing algorithms for constrained RL (CRL) rely on gradient descent-ascent, but this approach comes with a caveat. While these algorithms are guaranteed to converge on average, they do not guarantee last-iterate convergence, i.e., the current policy of the agent may never converge to the optimal solution. In practice, it is often observed that the policy alternates between satisfying the constraints and maximizing the reward, rarely accomplishing both objectives simultaneously. Here, we address this problem by introducing Reinforcement Learning with Optimistic Ascent-Descent (ReLOAD), a principled CRL method with guaranteed last-iterate convergence. We demonstrate its empirical effectiveness on a wide variety of CRL problems including discrete MDPs and continuous control. In the process we establish a benchmark of challenging CRL problems. △ Less

Submitted 5 March, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

arXiv:2301.03236 [pdf, other]

Optimistic Meta-Gradients

Authors: Sebastian Flennerhag, Tom Zahavy, Brendan O'Donoghue, Hado van Hasselt, András György, Satinder Singh

Abstract: We study the connection between gradient-based meta-learning and convex op-timisation. We observe that gradient descent with momentum is a special case of meta-gradients, and building on recent results in optimisation, we prove convergence rates for meta-learning in the single task setting. While a meta-learned update rule can yield faster convergence up to constant factor, it is not sufficient fo… ▽ More We study the connection between gradient-based meta-learning and convex op-timisation. We observe that gradient descent with momentum is a special case of meta-gradients, and building on recent results in optimisation, we prove convergence rates for meta-learning in the single task setting. While a meta-learned update rule can yield faster convergence up to constant factor, it is not sufficient for acceleration. Instead, some form of optimism is required. We show that optimism in meta-learning can be captured through Bootstrapped Meta-Gradients (Flennerhag et al., 2022), providing deeper insight into its underlying mechanics. △ Less

Submitted 9 January, 2023; originally announced January 2023.

arXiv:2212.14530 [pdf, other]

POMRL: No-Regret Learning-to-Plan with Increasing Horizons

Authors: Khimya Khetarpal, Claire Vernade, Brendan O'Donoghue, Satinder Singh, Tom Zahavy

Abstract: We study the problem of planning under model uncertainty in an online meta-reinforcement learning (RL) setting where an agent is presented with a sequence of related tasks with limited interactions per task. The agent can use its experience in each task and across tasks to estimate both the transition model and the distribution over tasks. We propose an algorithm to meta-learn the underlying struc… ▽ More We study the problem of planning under model uncertainty in an online meta-reinforcement learning (RL) setting where an agent is presented with a sequence of related tasks with limited interactions per task. The agent can use its experience in each task and across tasks to estimate both the transition model and the distribution over tasks. We propose an algorithm to meta-learn the underlying structure across tasks, utilize it to plan in each task, and upper-bound the regret of the planning loss. Our bound suggests that the average regret over tasks decreases as the number of tasks increases and as the tasks are more similar. In the classical single-task setting, it is known that the planning horizon should depend on the estimated model's accuracy, that is, on the number of samples within task. We generalize this finding to meta-RL and study this dependence of planning horizons on the number of tasks. Based on our theoretical findings, we derive heuristics for selecting slowly increasing discount factors, and we validate its significance empirically. △ Less

Submitted 29 December, 2022; originally announced December 2022.

Comments: 24 pages, 6 figures

arXiv:2210.12160 [pdf, ps, other]

On the connection between Bregman divergence and value in regularized Markov decision processes

Authors: Brendan O'Donoghue

Abstract: In this short note we derive a relationship between the Bregman divergence from the current policy to the optimal policy and the suboptimality of the current value function in a regularized Markov decision process. This result has implications for multi-task reinforcement learning, offline reinforcement learning, and regret analysis under function approximation, among others. In this short note we derive a relationship between the Bregman divergence from the current policy to the optimal policy and the suboptimality of the current value function in a regularized Markov decision process. This result has implications for multi-task reinforcement learning, offline reinforcement learning, and regret analysis under function approximation, among others. △ Less

Submitted 6 November, 2022; v1 submitted 21 October, 2022; originally announced October 2022.

arXiv:2110.15688 [pdf, other]

Variational Bayesian Optimistic Sampling

Authors: Brendan O'Donoghue, Tor Lattimore

Abstract: We consider online sequential decision problems where an agent must balance exploration and exploitation. We derive a set of Bayesian `optimistic' policies which, in the stochastic multi-armed bandit case, includes the Thompson sampling policy. We provide a new analysis showing that any algorithm producing policies in the optimistic set enjoys $\tilde O(\sqrt{AT})$ Bayesian regret for a problem wi… ▽ More We consider online sequential decision problems where an agent must balance exploration and exploitation. We derive a set of Bayesian `optimistic' policies which, in the stochastic multi-armed bandit case, includes the Thompson sampling policy. We provide a new analysis showing that any algorithm producing policies in the optimistic set enjoys $\tilde O(\sqrt{AT})$ Bayesian regret for a problem with $A$ actions after $T$ rounds. We extend the regret analysis for optimistic policies to bilinear saddle-point problems which include zero-sum matrix games and constrained bandits as special cases. In this case we show that Thompson sampling can produce policies outside of the optimistic set and suffer linear regret in some instances. Finding a policy inside the optimistic set amounts to solving a convex optimization problem and we call the resulting algorithm `variational Bayesian optimistic sampling' (VBOS). The procedure works for any posteriors, \ie, it does not require the posterior to have any special properties, such as log-concavity, unimodality, or smoothness. The variational view of the problem has many useful properties, including the ability to tune the exploration-exploitation tradeoff, add regularization, incorporate constraints, and linearly parameterize the policy. △ Less

Submitted 29 October, 2021; originally announced October 2021.

arXiv:2110.04629 [pdf, other]

The Neural Testbed: Evaluating Joint Predictions

Authors: Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Botao Hao, Morteza Ibrahimi, Dieterich Lawson, Xiuyuan Lu, Brendan O'Donoghue, Benjamin Van Roy

Abstract: Predictive distributions quantify uncertainties ignored by point estimates. This paper introduces The Neural Testbed: an open-source benchmark for controlled and principled evaluation of agents that generate such predictions. Crucially, the testbed assesses agents not only on the quality of their marginal predictions per input, but also on their joint predictions across many inputs. We evaluate a… ▽ More Predictive distributions quantify uncertainties ignored by point estimates. This paper introduces The Neural Testbed: an open-source benchmark for controlled and principled evaluation of agents that generate such predictions. Crucially, the testbed assesses agents not only on the quality of their marginal predictions per input, but also on their joint predictions across many inputs. We evaluate a range of agents using a simple neural network data generating process. Our results indicate that some popular Bayesian deep learning agents do not fare well with joint predictions, even when they can produce accurate marginal predictions. We also show that the quality of joint predictions drives performance in downstream decision tasks. We find these results are robust across choice a wide range of generative models, and highlight the practical importance of joint predictions to the community. △ Less

Submitted 1 November, 2022; v1 submitted 9 October, 2021; originally announced October 2021.

arXiv:2106.04756 [pdf, other]

Practical Large-Scale Linear Programming using Primal-Dual Hybrid Gradient

Authors: David Applegate, Mateo Díaz, Oliver Hinder, Haihao Lu, Miles Lubin, Brendan O'Donoghue, Warren Schudy

Abstract: We present PDLP, a practical first-order method for linear programming (LP) that can solve to the high levels of accuracy that are expected in traditional LP applications. In addition, it can scale to very large problems because its core operation is matrix-vector multiplications. PDLP is derived by applying the primal-dual hybrid gradient (PDHG) method, popularized by Chambolle and Pock (2011), t… ▽ More We present PDLP, a practical first-order method for linear programming (LP) that can solve to the high levels of accuracy that are expected in traditional LP applications. In addition, it can scale to very large problems because its core operation is matrix-vector multiplications. PDLP is derived by applying the primal-dual hybrid gradient (PDHG) method, popularized by Chambolle and Pock (2011), to a saddle-point formulation of LP. PDLP enhances PDHG for LP by combining several new techniques with older tricks from the literature; the enhancements include diagonal preconditioning, presolving, adaptive step sizes, and adaptive restarting. PDLP improves the state of the art for first-order methods applied to LP. We compare PDLP with SCS, an ADMM-based solver, on a set of 383 LP instances derived from MIPLIB 2017. With a target of $10^{-8}$ relative accuracy and 1 hour time limit, PDLP achieves a 6.3x reduction in the geometric mean of solve times and a 4.6x reduction in the number of instances unsolved (from 227 to 49). Furthermore, we highlight standard benchmark instances and a large-scale application (PageRank) where our open-source prototype of PDLP, written in Julia, outperforms a commercial LP solver. △ Less

Submitted 7 January, 2022; v1 submitted 8 June, 2021; originally announced June 2021.

Comments: NeurIPS 2021

arXiv:2106.00669 [pdf, other]

Discovering Diverse Nearly Optimal Policies with Successor Features

Authors: Tom Zahavy, Brendan O'Donoghue, Andre Barreto, Volodymyr Mnih, Sebastian Flennerhag, Satinder Singh

Abstract: Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose Diverse Successive Policies, a method for discovering policies that are diverse in the space of Successor Features, while ass… ▽ More Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose Diverse Successive Policies, a method for discovering policies that are diverse in the space of Successor Features, while assuring that they are near optimal. We formalize the problem as a Constrained Markov Decision Process (CMDP) where the goal is to find policies that maximize diversity, characterized by an intrinsic diversity reward, while remaining near-optimal with respect to the extrinsic reward of the MDP. We also analyze how recently proposed robustness and discrimination rewards perform and find that they are sensitive to the initialization of the procedure and may converge to sub-optimal solutions. To alleviate this, we propose new explicit diversity rewards that aim to minimize the correlation between the Successor Features of the policies in the set. We compare the different diversity mechanisms in the DeepMind Control Suite and find that the type of explicit diversity we are proposing is important to discover distinct behavior, like for example different locomotion patterns. △ Less

Submitted 4 January, 2022; v1 submitted 1 June, 2021; originally announced June 2021.

arXiv:2106.00661 [pdf, other]

Reward is enough for convex MDPs

Authors: Tom Zahavy, Brendan O'Donoghue, Guillaume Desjardins, Satinder Singh

Abstract: Maximising a cumulative reward function that is Markov and stationary, i.e., defined over state-action pairs and independent of time, is sufficient to capture many kinds of goals in a Markov decision process (MDP). However, not all goals can be captured in this manner. In this paper we study convex MDPs in which goals are expressed as convex functions of the stationary distribution and show that t… ▽ More Maximising a cumulative reward function that is Markov and stationary, i.e., defined over state-action pairs and independent of time, is sufficient to capture many kinds of goals in a Markov decision process (MDP). However, not all goals can be captured in this manner. In this paper we study convex MDPs in which goals are expressed as convex functions of the stationary distribution and show that they cannot be formulated using stationary reward functions. Convex MDPs generalize the standard reinforcement learning (RL) problem formulation to a larger framework that includes many supervised and unsupervised RL problems, such as apprenticeship learning, constrained MDPs, and so-called `pure exploration'. Our approach is to reformulate the convex MDP problem as a min-max game involving policy and cost (negative reward) `players', using Fenchel duality. We propose a meta-algorithm for solving this problem and show that it unifies many existing algorithms in the literature. △ Less

Submitted 2 June, 2023; v1 submitted 1 June, 2021; originally announced June 2021.

arXiv:2102.04323 [pdf, other]

Discovering a set of policies for the worst case reward

Authors: Tom Zahavy, Andre Barreto, Daniel J Mankowitz, Shaobo Hou, Brendan O'Donoghue, Iurii Kemaev, Satinder Singh

Abstract: We study the problem of how to construct a set of policies that can be composed together to solve a collection of reinforcement learning tasks. Each task is a different reward function defined as a linear combination of known features. We consider a specific class of policy compositions which we call set improving policies (SIPs): given a set of policies and a set of tasks, a SIP is any compositio… ▽ More We study the problem of how to construct a set of policies that can be composed together to solve a collection of reinforcement learning tasks. Each task is a different reward function defined as a linear combination of known features. We consider a specific class of policy compositions which we call set improving policies (SIPs): given a set of policies and a set of tasks, a SIP is any composition of the former whose performance is at least as good as that of its constituents across all the tasks. We focus on the most conservative instantiation of SIPs, set-max policies (SMPs), so our analysis extends to any SIP. This includes known policy-composition operators like generalized policy improvement. Our main contribution is a policy iteration algorithm that builds a set of policies in order to maximize the worst-case performance of the resulting SMP on the set of tasks. The algorithm works by successively adding new policies to the set. We show that the worst-case performance of the resulting SMP strictly improves at each iteration, and the algorithm only stops when there does not exist a policy that leads to improved performance. We empirically evaluate our algorithm on a grid world and also on a set of domains from the DeepMind control suite. We confirm our theoretical results regarding the monotonically improving performance of our algorithm. Interestingly, we also show empirically that the sets of policies computed by the algorithm are diverse, leading to different trajectories in the grid world and very distinct locomotion skills in the control suite. △ Less

Submitted 10 December, 2021; v1 submitted 8 February, 2021; originally announced February 2021.

arXiv:2012.13349 [pdf, other]

Solving Mixed Integer Programs Using Neural Networks

Authors: Vinod Nair, Sergey Bartunov, Felix Gimeno, Ingrid von Glehn, Pawel Lichocki, Ivan Lobov, Brendan O'Donoghue, Nicolas Sonnerat, Christian Tjandraatmadja, Pengming Wang, Ravichandra Addanki, Tharindi Hapuarachchi, Thomas Keck, James Keeling, Pushmeet Kohli, Ira Ktena, Yujia Li, Oriol Vinyals, Yori Zwols

Abstract: Mixed Integer Programming (MIP) solvers rely on an array of sophisticated heuristics developed with decades of research to solve large-scale MIP instances encountered in practice. Machine learning offers to automatically construct better heuristics from data by exploiting shared structure among instances in the data. This paper applies learning to the two key sub-tasks of a MIP solver, generating… ▽ More Mixed Integer Programming (MIP) solvers rely on an array of sophisticated heuristics developed with decades of research to solve large-scale MIP instances encountered in practice. Machine learning offers to automatically construct better heuristics from data by exploiting shared structure among instances in the data. This paper applies learning to the two key sub-tasks of a MIP solver, generating a high-quality joint variable assignment, and bounding the gap in objective value between that assignment and an optimal one. Our approach constructs two corresponding neural network-based components, Neural Diving and Neural Branching, to use in a base MIP solver such as SCIP. Neural Diving learns a deep neural network to generate multiple partial assignments for its integer variables, and the resulting smaller MIPs for un-assigned variables are solved with SCIP to construct high quality joint assignments. Neural Branching learns a deep neural network to make variable selection decisions in branch-and-bound to bound the objective value gap with a small tree. This is done by imitating a new variant of Full Strong Branching we propose that scales to large instances using GPUs. We evaluate our approach on six diverse real-world datasets, including two Google production datasets and MIPLIB, by training separate neural networks on each. Most instances in all the datasets combined have $10^3-10^6$ variables and constraints after presolve, which is significantly larger than previous learning approaches. Comparing solvers with respect to primal-dual gap averaged over a held-out set of instances, the learning-augmented SCIP is 2x to 10x better on all datasets except one on which it is $10^5$x better, at large time limits. To the best of our knowledge, ours is the first learning approach to demonstrate such large improvements over SCIP on both large-scale real-world application datasets and MIPLIB. △ Less

Submitted 29 July, 2021; v1 submitted 23 December, 2020; originally announced December 2020.

arXiv:2010.11364 [pdf, ps, other]

Sample Efficient Reinforcement Learning with REINFORCE

Authors: Junzi Zhang, Jongho Kim, Brendan O'Donoghue, Stephen Boyd

Abstract: Policy gradient methods are among the most effective methods for large-scale reinforcement learning, and their empirical success has prompted several works that develop the foundation of their global convergence theory. However, prior works have either required exact gradients or state-action visitation measure based mini-batch stochastic gradients with a diverging batch size, which limit their ap… ▽ More Policy gradient methods are among the most effective methods for large-scale reinforcement learning, and their empirical success has prompted several works that develop the foundation of their global convergence theory. However, prior works have either required exact gradients or state-action visitation measure based mini-batch stochastic gradients with a diverging batch size, which limit their applicability in practical scenarios. In this paper, we consider classical policy gradient methods that compute an approximate gradient with a single trajectory or a fixed size mini-batch of trajectories under soft-max parametrization and log-barrier regularization, along with the widely-used REINFORCE gradient estimation procedure. By controlling the number of "bad" episodes and resorting to the classical doubling trick, we establish an anytime sub-linear high probability regret bound as well as almost sure global convergence of the average regret with an asymptotically sub-linear rate. These provide the first set of global convergence and sample efficiency results for the well-known REINFORCE algorithm and contribute to a better understanding of its performance in practice. △ Less

Submitted 24 December, 2020; v1 submitted 21 October, 2020; originally announced October 2020.

Comments: Accepted to AAAI 2021. Fixed typos in constants and enriched the literature review

arXiv:2006.05145 [pdf, other]

Matrix games with bandit feedback

Authors: Brendan O'Donoghue, Tor Lattimore, Ian Osband

Abstract: We study a version of the classical zero-sum matrix game with unknown payoff matrix and bandit feedback, where the players only observe each others actions and a noisy payoff. This generalizes the usual matrix game, where the payoff matrix is known to the players. Despite numerous applications, this problem has received relatively little attention. Although adversarial bandit algorithms achieve lo… ▽ More We study a version of the classical zero-sum matrix game with unknown payoff matrix and bandit feedback, where the players only observe each others actions and a noisy payoff. This generalizes the usual matrix game, where the payoff matrix is known to the players. Despite numerous applications, this problem has received relatively little attention. Although adversarial bandit algorithms achieve low regret, they do not exploit the matrix structure and perform poorly relative to the new algorithms. The main contributions are regret analyses of variants of UCB and K-learning that hold for any opponent, e.g., even when the opponent adversarially plays the best-response to the learner's mixed strategy. Along the way, we show that Thompson fails catastrophically in this setting and provide empirical comparison to existing algorithms. △ Less

Submitted 12 June, 2021; v1 submitted 9 June, 2020; originally announced June 2020.

arXiv:2004.02177 [pdf, other]

Operator splitting for a homogeneous embedding of the linear complementarity problem

Authors: Brendan O'Donoghue

Abstract: The linear complementarity problem (LCP) is a general set membership problem that includes quadratic cone programming as a special case. In this work we consider a homogeneous embedding of the LCP, which encodes both the optimality conditions of the original LCP as well as certificates of infeasibility. The resulting problem can be expressed as a monotone inclusion problem involving the sum of two… ▽ More The linear complementarity problem (LCP) is a general set membership problem that includes quadratic cone programming as a special case. In this work we consider a homogeneous embedding of the LCP, which encodes both the optimality conditions of the original LCP as well as certificates of infeasibility. The resulting problem can be expressed as a monotone inclusion problem involving the sum of two maximal monotone operators, to which we apply Douglas-Rachford splitting. The resulting algorithm has almost identical per-iteration cost as applying the splitting to the LCP directly. Specifically, each iteration requires solving a linear system and projecting onto a convex cone. The resulting algorithm has no explicit hyper-parameters and is able to return a primal-dual solution should one exist or a certificate of infeasibility otherwise. If a sequence of related problems are solved it can easily be warm-started and make use of factorization caching of the linear system. We demonstrate on a range of public and synthetic datasets that for feasible problems our approach tends to be somewhat faster than applying operator splitting directly to the LCP, and in cases of infeasibility our approach can be significantly faster than alternative approaches based on diverging iterates. The algorithm we describe has been implemented in C and is available open-source as SCS-3.0. △ Less

Submitted 12 June, 2021; v1 submitted 5 April, 2020; originally announced April 2020.

arXiv:2001.00805 [pdf, other]

Making Sense of Reinforcement Learning and Probabilistic Inference

Authors: Brendan O'Donoghue, Ian Osband, Catalin Ionescu

Abstract: Reinforcement learning (RL) combines a control problem with statistical estimation: The system dynamics are not known to the agent, but can be learned through experience. A recent line of research casts `RL as inference' and suggests a particular framework to generalize the RL problem as probabilistic inference. Our paper surfaces a key shortcoming in that approach, and clarifies the sense in whic… ▽ More Reinforcement learning (RL) combines a control problem with statistical estimation: The system dynamics are not known to the agent, but can be learned through experience. A recent line of research casts `RL as inference' and suggests a particular framework to generalize the RL problem as probabilistic inference. Our paper surfaces a key shortcoming in that approach, and clarifies the sense in which RL can be coherently cast as an inference problem. In particular, an RL agent must consider the effects of its actions upon future rewards and observations: The exploration-exploitation tradeoff. In all but the most simple settings, the resulting inference is computationally intractable so that practical RL algorithms must resort to approximation. We demonstrate that the popular `RL as inference' approximation can perform poorly in even very basic problems. However, we show that with a small modification the framework does yield algorithms that can provably perform well, and we show that the resulting algorithm is equivalent to the recently proposed K-learning, which we further connect with Thompson sampling. △ Less

Submitted 4 November, 2020; v1 submitted 3 January, 2020; originally announced January 2020.

Comments: ICLR 2020

arXiv:1906.02608 [pdf, other]

Hamiltonian descent for composite objectives

Authors: Brendan O'Donoghue, Chris J. Maddison

Abstract: In optimization the duality gap between the primal and the dual problems is a measure of the suboptimality of any primal-dual point. In classical mechanics the equations of motion of a system can be derived from the Hamiltonian function, which is a quantity that describes the total energy of the system. In this paper we consider a convex optimization problem consisting of the sum of two convex fun… ▽ More In optimization the duality gap between the primal and the dual problems is a measure of the suboptimality of any primal-dual point. In classical mechanics the equations of motion of a system can be derived from the Hamiltonian function, which is a quantity that describes the total energy of the system. In this paper we consider a convex optimization problem consisting of the sum of two convex functions, sometimes referred to as a composite objective, and we identify the duality gap to be the 'energy' of the system. In the Hamiltonian formalism the energy is conserved, so we add a contractive term to the standard equations of motion so that this energy decreases linearly (i.e., geometrically) with time. This yields a continuous-time ordinary differential equation (ODE) in the primal and dual variables which converges to zero duality gap, i.e., optimality. This ODE has several useful properties: it induces a natural operator splitting; at convergence it yields both the primal and dual solutions; and it is invariant to affine transformation despite only using first order information. We provide several discretizations of this ODE, some of which are new algorithms and others correspond to known techniques, such as the alternating direction method of multipliers (ADMM). We conclude with some numerical examples that show the promise of our approach. We give an example where our technique can solve a convex quadratic minimization problem orders of magnitude faster than several commonly-used gradient methods, including conjugate gradient, when the conditioning of the problem is poor. Our framework provides new insights into previously known algorithms in the literature as well as providing a technique to generate new primal-dual algorithms. △ Less

Submitted 17 November, 2019; v1 submitted 6 June, 2019; originally announced June 2019.

arXiv:1902.09592 [pdf, other]

Verification of Non-Linear Specifications for Neural Networks

Authors: Chongli Qin, Krishnamurthy, Dvijotham, Brendan O'Donoghue, Rudy Bunel, Robert Stanforth, Sven Gowal, Jonathan Uesato, Grzegorz Swirszcz, Pushmeet Kohli

Abstract: Prior work on neural network verification has focused on specifications that are linear functions of the output of the network, e.g., invariance of the classifier output under adversarial perturbations of the input. In this paper, we extend verification algorithms to be able to certify richer properties of neural networks. To do this we introduce the class of convex-relaxable specifications, which… ▽ More Prior work on neural network verification has focused on specifications that are linear functions of the output of the network, e.g., invariance of the classifier output under adversarial perturbations of the input. In this paper, we extend verification algorithms to be able to certify richer properties of neural networks. To do this we introduce the class of convex-relaxable specifications, which constitute nonlinear specifications that can be verified using a convex relaxation. We show that a number of important properties of interest can be modeled within this class, including conservation of energy in a learned dynamics model of a physical system; semantic consistency of a classifier's output labels under adversarial perturbations and bounding errors in a system that predicts the summation of handwritten digits. Our experimental evaluation shows that our method is able to effectively verify these specifications. Moreover, our evaluation exposes the failure modes in models which cannot be verified to satisfy these specifications. Thus, emphasizing the importance of training models not just to fit training data but also to be consistent with specifications. △ Less

Submitted 25 February, 2019; originally announced February 2019.

Comments: ICLR conference paper

arXiv:1811.09300 [pdf, other]

Strength in Numbers: Trading-off Robustness and Computation via Adversarially-Trained Ensembles

Authors: Edward Grefenstette, Robert Stanforth, Brendan O'Donoghue, Jonathan Uesato, Grzegorz Swirszcz, Pushmeet Kohli

Abstract: While deep learning has led to remarkable results on a number of challenging problems, researchers have discovered a vulnerability of neural networks in adversarial settings, where small but carefully chosen perturbations to the input can make the models produce extremely inaccurate outputs. This makes these models particularly unsuitable for safety-critical application domains (e.g. self-driving… ▽ More While deep learning has led to remarkable results on a number of challenging problems, researchers have discovered a vulnerability of neural networks in adversarial settings, where small but carefully chosen perturbations to the input can make the models produce extremely inaccurate outputs. This makes these models particularly unsuitable for safety-critical application domains (e.g. self-driving cars) where robustness is extremely important. Recent work has shown that augmenting training with adversarially generated data provides some degree of robustness against test-time attacks. In this paper we investigate how this approach scales as we increase the computational budget given to the defender. We show that increasing the number of parameters in adversarially-trained models increases their robustness, and in particular that ensembling smaller models while adversarially training the entire ensemble as a single model is a more efficient way of spending said budget than simply using a larger single model. Crucially, we show that it is the adversarial training of the ensemble, rather than the ensembling of adversarially trained models, which provides robustness. △ Less

Submitted 22 November, 2018; originally announced November 2018.

Comments: 12 pages

arXiv:1809.05042 [pdf, other]

Hamiltonian Descent Methods

Authors: Chris J. Maddison, Daniel Paulin, Yee Whye Teh, Brendan O'Donoghue, Arnaud Doucet

Abstract: We propose a family of optimization methods that achieve linear convergence using first-order gradient information and constant step sizes on a class of convex functions much larger than the smooth and strongly convex ones. This larger class includes functions whose second derivatives may be singular or unbounded at their minima. Our methods are discretizations of conformal Hamiltonian dynamics, w… ▽ More We propose a family of optimization methods that achieve linear convergence using first-order gradient information and constant step sizes on a class of convex functions much larger than the smooth and strongly convex ones. This larger class includes functions whose second derivatives may be singular or unbounded at their minima. Our methods are discretizations of conformal Hamiltonian dynamics, which generalize the classical momentum method to model the motion of a particle with non-standard kinetic energy exposed to a dissipative force and the gradient field of the function of interest. They are first-order in the sense that they require only gradient computation. Yet, crucially the kinetic gradient map can be designed to incorporate information about the convex conjugate in a fashion that allows for linear convergence on convex functions that may be non-smooth or non-strongly convex. We study in detail one implicit and two explicit methods. For one explicit method, we provide conditions under which it converges to stationary points of non-convex functions. For all, we provide conditions on the convex function and kinetic energy pair that guarantee linear convergence, and show that these conditions can be satisfied by functions with power growth. In sum, these methods expand the class of convex functions on which linear convergence is possible with first-order computation. △ Less

Submitted 13 September, 2018; originally announced September 2018.

arXiv:1808.03971 [pdf, other]

Globally Convergent Type-I Anderson Acceleration for Non-Smooth Fixed-Point Iterations

Authors: Junzi Zhang, Brendan O'Donoghue, Stephen Boyd

Abstract: We consider the application of the type-I Anderson acceleration to solving general non-smooth fixed-point problems. By interleaving with safe-guarding steps, and employing a Powell-type regularization and a re-start checking for strong linear independence of the updates, we propose the first globally convergent variant of Anderson acceleration assuming only that the fixed-point iteration is non-ex… ▽ More We consider the application of the type-I Anderson acceleration to solving general non-smooth fixed-point problems. By interleaving with safe-guarding steps, and employing a Powell-type regularization and a re-start checking for strong linear independence of the updates, we propose the first globally convergent variant of Anderson acceleration assuming only that the fixed-point iteration is non-expansive. We show by extensive numerical experiments that many first order algorithms can be improved, especially in their terminal convergence, with the proposed algorithm. Our proposed method of acceleration is being implemented in SCS 2.0, one of the default solvers used in the convex optimization parser-solver CVXPY 1.0. △ Less

Submitted 12 August, 2018; originally announced August 2018.

Comments: 47 pages

arXiv:1807.09647 [pdf, other]

Variational Bayesian Reinforcement Learning with Regret Bounds

Authors: Brendan O'Donoghue

Abstract: In reinforcement learning the Q-values summarize the expected future rewards that the agent will attain. However, they cannot capture the epistemic uncertainty about those rewards. In this work we derive a new Bellman operator with associated fixed point we call the `knowledge values'. These K-values compress both the expected future rewards and the epistemic uncertainty into a single value, so th… ▽ More In reinforcement learning the Q-values summarize the expected future rewards that the agent will attain. However, they cannot capture the epistemic uncertainty about those rewards. In this work we derive a new Bellman operator with associated fixed point we call the `knowledge values'. These K-values compress both the expected future rewards and the epistemic uncertainty into a single value, so that high uncertainty, high reward, or both, can yield high K-values. The key principle is to endow the agent with a risk-seeking utility function that is carefully tuned to balance exploration and exploitation. When the agent follows a Boltzmann policy over the K-values it yields a Bayes regret bound of $\tilde O(L \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the total number of states, $A$ is the number of actions, and $T$ is the number of elapsed timesteps. We show deep connections of this approach to the soft-max and maximum-entropy strands of research in reinforcement learning. △ Less

Submitted 6 December, 2022; v1 submitted 25 July, 2018; originally announced July 2018.

arXiv:1805.10265 [pdf, other]

Training verified learners with learned verifiers

Authors: Krishnamurthy Dvijotham, Sven Gowal, Robert Stanforth, Relja Arandjelovic, Brendan O'Donoghue, Jonathan Uesato, Pushmeet Kohli

Abstract: This paper proposes a new algorithmic framework, predictor-verifier training, to train neural networks that are verifiable, i.e., networks that provably satisfy some desired input-output properties. The key idea is to simultaneously train two networks: a predictor network that performs the task at hand,e.g., predicting labels given inputs, and a verifier network that computes a bound on how well t… ▽ More This paper proposes a new algorithmic framework, predictor-verifier training, to train neural networks that are verifiable, i.e., networks that provably satisfy some desired input-output properties. The key idea is to simultaneously train two networks: a predictor network that performs the task at hand,e.g., predicting labels given inputs, and a verifier network that computes a bound on how well the predictor satisfies the properties being verified. Both networks can be trained simultaneously to optimize a weighted combination of the standard data-fitting loss and a term that bounds the maximum violation of the property. Experiments show that not only is the predictor-verifier architecture able to train networks to achieve state of the art verified robustness to adversarial examples with much shorter training times (outperforming previous algorithms on small datasets like MNIST and SVHN), but it can also be scaled to produce the first known (to the best of our knowledge) verifiably robust networks for CIFAR-10. △ Less

Submitted 29 May, 2018; v1 submitted 25 May, 2018; originally announced May 2018.

arXiv:1802.05666 [pdf, other]

Adversarial Risk and the Dangers of Evaluating Against Weak Attacks

Authors: Jonathan Uesato, Brendan O'Donoghue, Aaron van den Oord, Pushmeet Kohli

Abstract: This paper investigates recently proposed approaches for defending against adversarial examples and evaluating adversarial robustness. We motivate 'adversarial risk' as an objective for achieving models robust to worst-case inputs. We then frame commonly used attacks and evaluation metrics as defining a tractable surrogate objective to the true adversarial risk. This suggests that models may optim… ▽ More This paper investigates recently proposed approaches for defending against adversarial examples and evaluating adversarial robustness. We motivate 'adversarial risk' as an objective for achieving models robust to worst-case inputs. We then frame commonly used attacks and evaluation metrics as defining a tractable surrogate objective to the true adversarial risk. This suggests that models may optimize this surrogate rather than the true adversarial risk. We formalize this notion as 'obscurity to an adversary,' and develop tools and heuristics for identifying obscured models and designing transparent models. We demonstrate that this is a significant problem in practice by repurposing gradient-free optimization techniques into adversarial attacks, which we use to decrease the accuracy of several recently proposed defenses to near zero. Our hope is that our formulations and results will help researchers to develop more powerful defenses. △ Less

Submitted 12 June, 2018; v1 submitted 15 February, 2018; originally announced February 2018.

arXiv:1709.05380 [pdf, other]

The Uncertainty Bellman Equation and Exploration

Authors: Brendan O'Donoghue, Ian Osband, Remi Munos, Volodymyr Mnih

Abstract: We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar \textit{uncertainty} Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-s… ▽ More We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar \textit{uncertainty} Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the posterior distribution of the Q-values induced by any policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for $ε$-greedy improves DQN performance on 51 out of 57 games in the Atari suite. △ Less

Submitted 22 October, 2018; v1 submitted 15 September, 2017; originally announced September 2017.

arXiv:1611.01626 [pdf, other]

Combining policy gradient and Q-learning

Authors: Brendan O'Donoghue, Remi Munos, Koray Kavukcuoglu, Volodymyr Mnih

Abstract: Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. However, vanilla online variants are on-policy only and not able to take advantage of off-policy data. In this paper we describe a new technique that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer. This is motivated by making a connection between the f… ▽ More Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. However, vanilla online variants are on-policy only and not able to take advantage of off-policy data. In this paper we describe a new technique that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer. This is motivated by making a connection between the fixed points of the regularized policy gradient algorithm and the Q-values. This connection allows us to estimate the Q-values from the action preferences of the policy, to which we apply Q-learning updates. We refer to the new technique as 'PGQL', for policy gradient and Q-learning. We also establish an equivalency between action-value fitting techniques and actor-critic algorithms, showing that regularized policy gradient techniques can be interpreted as advantage function learning algorithms. We conclude with some numerical examples that demonstrate improved data efficiency and stability of PGQL. In particular, we tested PGQL on the full suite of Atari games and achieved performance exceeding that of both asynchronous advantage actor-critic (A3C) and Q-learning. △ Less

Submitted 7 April, 2017; v1 submitted 5 November, 2016; originally announced November 2016.

arXiv:1506.00749 [pdf, ps, other]

doi 10.1109/TSP.2015.2443731

Large-Scale Convex Optimization for Dense Wireless Cooperative Networks

Authors: Yuanming Shi, Jun Zhang, Brendan O'Donoghue, Khaled B. Letaief

Abstract: Convex optimization is a powerful tool for resource allocation and signal processing in wireless networks. As the network density is expected to drastically increase in order to accommodate the exponentially growing mobile data traffic, performance optimization problems are entering a new era characterized by a high dimension and/or a large number of constraints, which poses significant design and… ▽ More Convex optimization is a powerful tool for resource allocation and signal processing in wireless networks. As the network density is expected to drastically increase in order to accommodate the exponentially growing mobile data traffic, performance optimization problems are entering a new era characterized by a high dimension and/or a large number of constraints, which poses significant design and computational challenges. In this paper, we present a novel two-stage approach to solve large-scale convex optimization problems for dense wireless cooperative networks, which can effectively detect infeasibility and enjoy modeling flexibility. In the proposed approach, the original large-scale convex problem is transformed into a standard cone programming form in the first stage via matrix stuffing, which only needs to copy the problem parameters such as channel state information (CSI) and quality-of-service (QoS) requirements to the pre-stored structure of the standard form. The capability of yielding infeasibility certificates and enabling parallel computing is achieved by solving the homogeneous self-dual embedding of the primal-dual pair of the standard form. In the solving stage, the operator splitting method, namely, the alternating direction method of multipliers (ADMM), is adopted to solve the large-scale homogeneous self-dual embedding. Compared with second-order methods, ADMM can solve large-scale problems in parallel with modest accuracy within a reasonable amount of time. Simulation results will demonstrate the speedup, scalability, and reliability of the proposed framework compared with the state-of-the-art modeling frameworks and solvers. △ Less

Submitted 2 June, 2015; originally announced June 2015.

Comments: to appear in IEEE Trans. Signal Process., 2015. Simulation code is available at https://github.com/SHIYUANMING/large-scale-convex-optimization

arXiv:1312.3039 [pdf, ps, other]

Conic Optimization via Operator Splitting and Homogeneous Self-Dual Embedding

Authors: Brendan O'Donoghue, Eric Chu, Neal Parikh, Stephen Boyd

Abstract: We introduce a first order method for solving very large convex cone programs. The method uses an operator splitting method, the alternating directions method of multipliers, to solve the homogeneous self-dual embedding, an equivalent feasibility problem involving finding a nonzero point in the intersection of a subspace and a cone. This approach has several favorable properties. Compared to inter… ▽ More We introduce a first order method for solving very large convex cone programs. The method uses an operator splitting method, the alternating directions method of multipliers, to solve the homogeneous self-dual embedding, an equivalent feasibility problem involving finding a nonzero point in the intersection of a subspace and a cone. This approach has several favorable properties. Compared to interior-point methods, first-order methods scale to very large problems, at the cost of requiring more time to reach very high accuracy. Compared to other first-order methods for cone programs, our approach finds both primal and dual solutions when available or a certificate of infeasibility or unboundedness otherwise, is parameter-free, and the per-iteration cost of the method is the same as applying a splitting method to the primal or dual alone. We discuss efficient implementation of the method in detail, including direct and indirect methods for computing projection onto the subspace, scaling the original problem data, and stop** criteria. We describe an open-source implementation, which handles the usual (symmetric) non-negative, second-order, and semidefinite cones as well as the (non-self-dual) exponential and power cones and their duals. We report numerical results that show speedups over interior-point cone solvers for large problems, and scaling to very large general cone programs. △ Less

Submitted 25 July, 2016; v1 submitted 11 December, 2013; originally announced December 2013.

Comments: 23 pages, no figures

Journal ref: Journal of Optimization Theory and Applications, 169(3):1042-1068, June 2016

arXiv:1204.3982 [pdf, ps, other]

Adaptive Restart for Accelerated Gradient Schemes

Authors: Brendan O'Donoghue, Emmanuel Candes

Abstract: In this paper we demonstrate a simple heuristic adaptive restart technique that can dramatically improve the convergence rate of accelerated gradient schemes. The analysis of the technique relies on the observation that these schemes exhibit two modes of behavior depending on how much momentum is applied. In what we refer to as the 'high momentum' regime the iterates generated by an accelerated gr… ▽ More In this paper we demonstrate a simple heuristic adaptive restart technique that can dramatically improve the convergence rate of accelerated gradient schemes. The analysis of the technique relies on the observation that these schemes exhibit two modes of behavior depending on how much momentum is applied. In what we refer to as the 'high momentum' regime the iterates generated by an accelerated gradient scheme exhibit a periodic behavior, where the period is proportional to the square root of the local condition number of the objective function. This suggests a restart technique whereby we reset the momentum whenever we observe periodic behavior. We provide analysis to show that in many cases adaptively restarting allows us to recover the optimal rate of convergence with no prior knowledge of function parameters. △ Less

Submitted 18 April, 2012; originally announced April 2012.

Comments: 17 pages, 7 figures

Showing 1–30 of 30 results for author: O'Donoghue, B