
BAM: Bayes with Adaptive Memory

Authors: Josue Nassar, Jennifer Brennan, Ben Evans, Kendall Lowrey

Abstract: Online learning via Bayes' theorem allows new data to be continuously integrated into an agent's current beliefs. However, a naive application of Bayesian methods in non-stationary environments leads to slow adaptation and results in state estimates that may converge confidently to the wrong parameter value. A common solution when learning in changing environments is to discard or downweight past data; however, this simple mechanism of "forgetting" fails to account for the fact that many real-world environments involve revisiting similar states. We propose a new framework, Bayes with Adaptive Memory (BAM), that takes advantage of past experience by allowing the agent to choose which past observations to remember and which to forget. We demonstrate that BAM generalizes many popular Bayesian update rules for non-stationary environments. Through a variety of experiments, we demonstrate the ability of BAM to continuously adapt in an ever-changing world.

Submitted 8 February, 2022; v1 submitted 4 February, 2022; originally announced February 2022.

Comments: International Conference on Learning Representations (ICLR), 2022
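A toy Beta-Bernoulli model can show the contrast between uniform forgetting and choosing which observations to remember. This is our own illustration, not the BAM algorithm. `posterior_from_memory` is a hypothetical helper that weights each past observation by a memory coefficient in [0, 1]. The "selective" mask is picked by hand rather than inferred the way BAM would. Three standard rules fall out of particular masks:
- remembering everything recovers the naive Bayesian update;
- a 0/1 mask over recent data gives a sliding window;
- geometric weights give exponential forgetting.

```python
from typing import List, Sequence, Tuple


def posterior_from_memory(
    observations: Sequence[int],
    memory: Sequence[float],
    alpha0: float = 1.0,
    beta0: float = 1.0,
) -> Tuple[float, float]:
    """Beta(alpha, beta) posterior over a Bernoulli rate from weighted past data.

    memory[i] in [0, 1] says how much observation i is remembered:
    1.0 keeps it fully, 0.0 forgets it (hypothetical helper, not BAM's API).
    """
    if len(observations) != len(memory):
        raise ValueError("need exactly one memory weight per observation")
    successes = sum(w * x for x, w in zip(observations, memory))
    failures = sum(w * (1 - x) for x, w in zip(observations, memory))
    return alpha0 + successes, beta0 + failures


def posterior_mean(params: Tuple[float, float]) -> float:
    alpha, beta = params
    return alpha / (alpha + beta)


# Regime A (all 1s), then regime B (all 0s), then a return to regime A.
obs: List[int] = [1] * 10 + [0] * 10 + [1] * 3
T = len(obs)

remember_all = [1.0] * T                                  # naive Bayes
sliding_window = [0.0] * (T - 5) + [1.0] * 5              # keep only the last 5
exp_forgetting = [0.8 ** (T - 1 - i) for i in range(T)]   # geometric downweighting
selective = [1.0] * 10 + [0.0] * 10 + [1.0] * 3           # hand-picked: recall regime A

for name, mask in [("all", remember_all), ("window", sliding_window),
                   ("exp", exp_forgetting), ("selective", selective)]:
    print(name, round(posterior_mean(posterior_from_memory(obs, mask)), 3))
```

Under this setup, the posterior means compare as follows:
- Remembering everything: 14/25 = 0.56.
- A 5-step window: 4/7 ≈ 0.57.
- The hand-picked selective mask: 14/15 ≈ 0.93.

This matches the abstract's point that recalling a revisited regime can outperform uniform forgetting.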

arXiv:2106.15775

Koopman Spectrum Nonlinear Regulators and Efficient Online Learning

Authors: Motoya Ohnishi, Isao Ishikawa, Kendall Lowrey, Masahiro Ikeda, Sham Kakade, Yoshinobu Kawahara

Abstract: Most modern reinforcement learning algorithms optimize a cumulative single-step cost along a trajectory. The optimized motions are often 'unnatural', representing, for example, behaviors with sudden accelerations that waste energy and lack predictability. In this work, we present a novel paradigm of controlling nonlinear systems via the minimization of the Koopman spectrum cost: a cost over the Koopman operator of the controlled dynamics. This induces a broader class of dynamical behaviors that evolve over stable manifolds such as nonlinear oscillators, closed loops, and smooth movements. We demonstrate that some dynamics characterizations that are not possible with a cumulative cost are feasible in this paradigm, which generalizes the classical eigenstructure and pole assignments to nonlinear decision making. Moreover, we present a sample efficient online learning algorithm for our problem that enjoys a sub-linear regret bound under some structural assumptions.

Submitted 2 July, 2024; v1 submitted 29 June, 2021; originally announced June 2021.

Comments: 41 pages, 21 figures

Journal ref: Transactions on Machine Learning Research (https://openreview.net/forum?id=thfoUZugvS), 2024

arXiv:2012.06684

Faster Policy Learning with Continuous-Time Gradients

Authors: Samuel Ainsworth, Kendall Lowrey, John Thickstun, Zaid Harchaoui, Siddhartha Srinivasa

Abstract: We study the estimation of policy gradients for continuous-time systems with known dynamics. By reframing policy learning in continuous time, we show that it is possible to construct a more efficient and accurate gradient estimator. The standard back-propagation through time (BPTT) estimator computes exact gradients for a crude discretization of the continuous-time system. In contrast, we approximate continuous-time gradients in the original system. With the explicit goal of estimating continuous-time gradients, we are able to discretize adaptively and construct a more efficient policy gradient estimator, which we call the Continuous-Time Policy Gradient (CTPG). We show that replacing BPTT policy gradients with more efficient CTPG estimates results in faster and more robust learning in a variety of control tasks and simulators.

Submitted 24 June, 2021; v1 submitted 11 December, 2020; originally announced December 2020.

Journal ref: L4DC 2021
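A toy scalar system $\dot{x} = -kx$ shows the gap between exact gradients of a discretization and gradients of the continuous-time system. For this system the continuous-time sensitivity has a closed form, $\partial x(T)/\partial k = -T x_0 e^{-kT}$. The sketch below differentiates a fixed-step Euler rollout exactly, as BPTT would, and compares the result with the closed form. It is our own example, not the CTPG estimator. The function names and the choice of system are assumptions.

```python
import math


def euler_bptt_grad(k: float, x0: float = 1.0, T: float = 1.0, n_steps: int = 4) -> float:
    """Exact derivative of the Euler-discretized final state x_N with respect to k.

    Forward-mode differentiation of x_{n+1} = x_n * (1 - k * dt); this equals what
    back-propagation through the same fixed-step rollout would return.
    """
    dt = T / n_steps
    x, dx_dk = x0, 0.0
    for _ in range(n_steps):
        # Product rule on x_{n+1} = x_n * (1 - k*dt), using the pre-update x_n.
        dx_dk = dx_dk * (1.0 - k * dt) - x * dt
        x = x * (1.0 - k * dt)
    return dx_dk


def continuous_grad(k: float, x0: float = 1.0, T: float = 1.0) -> float:
    """Closed-form d x(T) / d k for x' = -k x, i.e. x(T) = x0 * exp(-k T)."""
    return -T * x0 * math.exp(-k * T)


coarse = euler_bptt_grad(2.0, n_steps=4)       # exact gradient of a crude discretization
fine = euler_bptt_grad(2.0, n_steps=10_000)    # same estimator, much finer grid
exact = continuous_grad(2.0)                   # continuous-time gradient
```

With four Euler steps and $k = 2$, the discretized gradient is exactly $-0.125$, while the continuous-time value is $-e^{-2} \approx -0.135$. The gap shrinks only as the uniform step count grows, which motivates discretizing adaptively instead.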

arXiv:2006.12466

Information Theoretic Regret Bounds for Online Nonlinear Control

Authors: Sham Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, Wen Sun

Abstract: This work studies the problem of sequential control in an unknown, nonlinear dynamical system, where we model the underlying system dynamics as an unknown function in a known Reproducing Kernel Hilbert Space. This framework yields a general setting that permits discrete and continuous control inputs as well as non-smooth, non-differentiable dynamics. Our main result, the Lower Confidence-based Continuous Control ($LC^3$) algorithm, enjoys a near-optimal $O(\sqrt{T})$ regret bound against the optimal controller in episodic settings, where $T$ is the number of episodes. The bound has no explicit dependence on the dimension of the system dynamics, which could be infinite, but instead depends only on information-theoretic quantities. We empirically show its application to a number of nonlinear control tasks and demonstrate the benefit of exploration for learning model dynamics.

Submitted 22 June, 2020; originally announced June 2020.
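For reference, the episodic regret that such a bound controls is usually defined as follows. The notation ($J$, $\pi^t$, $x_0^t$) is our assumption rather than the paper's, and the information-theoretic factors in the bound are omitted.

$$
\mathrm{Regret}_T = \sum_{t=1}^{T} \left( J^{\pi^t}(x_0^t) - J^{\pi^\star}(x_0^t) \right) = O(\sqrt{T}),
\qquad
\frac{\mathrm{Regret}_T}{T} = O\!\left(\frac{1}{\sqrt{T}}\right) \to 0,
$$

The symbols are:
- $J^{\pi}(x_0)$ is the expected cumulative cost of controller $\pi$ from initial state $x_0$.
- $\pi^t$ is the controller deployed in episode $t$.
- $\pi^\star$ is the optimal controller.

The second relation is why an $O(\sqrt{T})$ bound implies that the average per-episode gap to the optimal controller vanishes.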

arXiv:2006.03106

doi 10.1109/LRA.2020.2972836

Model-Based Generalization Under Parameter Uncertainty Using Path Integral Control

Authors: Ian Abraham, Ankur Handa, Nathan Ratliff, Kendall Lowrey, Todd D. Murphey, Dieter Fox

Abstract: This work addresses the problem of robot interaction in complex environments where online control and adaptation are necessary. By expanding the sample space in the free energy formulation of path integral control, we derive a natural extension to path integral control that embeds uncertainty into action and provides robustness for model-based robot planning. We apply our algorithm to a diverse set of tasks using different robots and validate our results in simulation and real-world experiments. We further show that our method is capable of running in real time without loss of performance. Videos of the experiments as well as additional implementation details can be found at https://sites.google.com/view/emppi.

Submitted 4 June, 2020; originally announced June 2020.

Journal ref: IEEE Robotics and Automation Letters, Volume 5, Issue 2, April 2020
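Path integral control is commonly implemented as an MPPI-style update, which has three steps:
1. Perturb a nominal control sequence with Gaussian noise.
2. Roll out each perturbation through the model.
3. Average the perturbations with weights proportional to $\exp(-\text{cost}/\lambda)$.

The sketch below shows only that generic update, on a hypothetical scalar system ($x_{t+1} = x_t + 0.1\,u_t$ with a quadratic cost). It does not reproduce the paper's extension, which expands the sample space to embed uncertainty. The names `rollout_cost` and `mppi_update` and all constants are illustrative assumptions.

```python
import numpy as np


def rollout_cost(x0: float, controls: np.ndarray) -> float:
    """Quadratic cost of a hypothetical scalar system x_{t+1} = x_t + 0.1 * u_t."""
    x, cost = x0, 0.0
    for u in controls:
        x = x + 0.1 * u
        cost += x * x + 0.01 * u * u
    return float(cost)


def mppi_update(x0: float, nominal: np.ndarray, rng: np.random.Generator,
                num_samples: int = 256, noise_std: float = 1.0, lam: float = 1.0) -> np.ndarray:
    """One path-integral (MPPI-style) update of a nominal control sequence."""
    noise = rng.normal(0.0, noise_std, size=(num_samples, nominal.shape[0]))
    costs = np.array([rollout_cost(x0, nominal + eps) for eps in noise])
    # Exponentiated-cost weights; subtracting the minimum keeps exp() from underflowing.
    weights = np.exp(-(costs - costs.min()) / lam)
    weights /= weights.sum()
    # Cost-weighted average of the sampled perturbations.
    return nominal + weights @ noise


rng = np.random.default_rng(0)
x0 = 1.0
nominal = np.zeros(20)                  # 20-step horizon, starting from zero control
initial_cost = rollout_cost(x0, nominal)
for _ in range(10):
    nominal = mppi_update(x0, nominal, rng)
final_cost = rollout_cost(x0, nominal)
```

The temperature `lam` trades off between averaging many samples (large values) and following only the best few (small values).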

arXiv:2001.07343

Lyceum: An efficient and scalable ecosystem for robot learning

Authors: Colin Summers, Kendall Lowrey, Aravind Rajeswaran, Siddhartha Srinivasa, Emanuel Todorov

Abstract: We introduce Lyceum, a high-performance computational ecosystem for robot learning. Lyceum is built on top of the Julia programming language and the MuJoCo physics simulator, combining the ease of use of a high-level programming language with the performance of native C. In addition, Lyceum has a straightforward API to support parallel computation across multiple cores and machines. Overall, depending on the complexity of the environment, Lyceum is 5-30x faster than other popular abstractions like OpenAI's Gym and DeepMind's dm-control. This substantially reduces training time for various reinforcement learning algorithms, and Lyceum is also fast enough to support real-time model predictive control through MuJoCo. The code, tutorials, and demonstration videos can be found at: www.lyceum.ml.

Submitted 21 January, 2020; originally announced January 2020.

arXiv:1811.01848

Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control

Authors: Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, Igor Mordatch

Abstract: We propose a plan online and learn offline (POLO) framework for the setting where an agent, with an internal model, needs to continually act and learn in the world. Our work builds on the synergistic relationship between local model-based control, global value function learning, and exploration. We study how local trajectory optimization can cope with approximation errors in the value function, and can stabilize and accelerate value function learning. Conversely, we also study how approximate value functions can help reduce the planning horizon and allow for better policies beyond local solutions. Finally, we demonstrate how trajectory optimization can be used to perform temporally coordinated exploration in conjunction with estimating uncertainty in value function approximation. This exploration is critical for fast and stable learning of the value function. Combining these components enables solutions to complex simulated control tasks, like humanoid locomotion and dexterous in-hand manipulation, in the equivalent of a few minutes of experience in the real world.

Submitted 28 January, 2019; v1 submitted 5 November, 2018; originally announced November 2018.

Comments: The first two authors contributed equally. Accepted at ICLR 2019. Supplementary videos available at: https://sites.google.com/view/polo-mpc
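A toy example shows how a value function can shorten the planning horizon. Each candidate action sequence is scored by its rewards over a short horizon plus a terminal value estimate. Only the first action is executed before replanning, as in model predictive control. Everything here is a hypothetical stand-in:
- the integer-step dynamics and the small action penalty;
- exhaustive search over three actions, in place of POLO's trajectory optimizer;
- the hand-written `goal_value`, in place of a learned value function.

```python
import itertools
from typing import Callable

ACTIONS = (-1.0, 0.0, 1.0)  # hypothetical discrete action set


def step(x: float, a: float) -> float:
    """Hypothetical known model: integer-step integrator."""
    return x + a


def reward(x: float, a: float) -> float:
    """Small action penalty; the goal is encoded only in the terminal value."""
    return -0.01 * a * a


def plan_first_action(x0: float, value_fn: Callable[[float], float],
                      horizon: int = 3, gamma: float = 1.0) -> float:
    """Exhaustive short-horizon search scored by rewards plus a terminal value.

    Returns only the first action of the best sequence (MPC: act, then replan).
    """
    best_score, best_first = -float("inf"), 0.0
    for seq in itertools.product(ACTIONS, repeat=horizon):
        x, score = x0, 0.0
        for t, a in enumerate(seq):
            score += gamma ** t * reward(x, a)
            x = step(x, a)
        score += gamma ** horizon * value_fn(x)
        if score > best_score:
            best_score, best_first = score, seq[0]
    return best_first


def goal_value(x: float) -> float:
    """Stand-in for a learned value function that prefers states near x = 10."""
    return -(x - 10.0) ** 2


def zero_value(x: float) -> float:
    """No value function: the planner sees only its 3-step horizon."""
    return 0.0
```

With `goal_value`, the 3-step planner moves toward $x = 10$ even though the goal is out of reach within the horizon. With `zero_value`, the action penalty makes standing still optimal.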

arXiv:1803.10371

Reinforcement learning for non-prehensile manipulation: Transfer from simulation to physical system

Authors: Kendall Lowrey, Svetoslav Kolev, Jeremy Dao, Aravind Rajeswaran, Emanuel Todorov

Abstract: Reinforcement learning has emerged as a promising methodology for training robot controllers. However, most results have been limited to simulation due to the need for a large number of samples and the lack of automated-yet-safe data collection methods. Model-based reinforcement learning methods provide an avenue to circumvent these challenges, but the traditional concern has been the mismatch between the simulator and the real world. Here, we show that control policies learned in simulation can successfully transfer to a physical system, composed of three Phantom robots pushing an object to various desired target positions. We use a modified form of the natural policy gradient algorithm for learning, applied to a carefully identified simulation model. The resulting policies, trained entirely in simulation, work well on the physical system without additional training. In addition, we show that training with an ensemble of models makes the learned policies more robust to modeling errors, thus compensating for difficulties in system identification.

Submitted 27 March, 2018; originally announced March 2018.

Comments: Accepted at IEEE SIMPAR 2018. Project page: https://sites.google.com/view/phantomsim2real

arXiv:1703.02660

Towards Generalization and Simplicity in Continuous Control

Authors: Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, Sham Kakade

Abstract: This work shows that policies with simple linear and RBF parameterizations can be trained to solve a variety of continuous control tasks, including the OpenAI gym benchmarks. The performance of these trained policies is competitive with state-of-the-art results obtained with more elaborate parameterizations such as fully connected neural networks. Furthermore, existing training and testing scenarios are shown to be very limited and prone to over-fitting, thus giving rise to only trajectory-centric policies. Training with a diverse initial state distribution is shown to produce more global policies with better generalization. This allows for interactive control scenarios where the system recovers from large online perturbations, as shown in the supplementary video.

Submitted 20 March, 2018; v1 submitted 7 March, 2017; originally announced March 2017.

Comments: NIPS 2017. Project page: https://sites.google.com/view/simple-pol
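One common way to build RBF-style features is with random Fourier features, $\phi(s) = \sin(Ps/\nu + \varphi)$, followed by a linear map to actions. The sketch below assumes that construction, with a Gaussian projection matrix $P$, uniform phases $\varphi$, and bandwidth $\nu$. The class name, feature count, and initialization are illustrative. Training (for example, by a policy gradient method) is not shown.

```python
import numpy as np


class RBFPolicy:
    """Linear policy on random Fourier features: a = W @ sin(P @ s / nu + phase) + b."""

    def __init__(self, obs_dim: int, act_dim: int, num_features: int = 64,
                 bandwidth: float = 1.0, seed: int = 0) -> None:
        rng = np.random.default_rng(seed)
        # Fixed random projection and phases define the feature map; only W and b are trained.
        self.P = rng.standard_normal((num_features, obs_dim))
        self.phase = rng.uniform(-np.pi, np.pi, size=num_features)
        self.bandwidth = bandwidth
        self.W = np.zeros((act_dim, num_features))
        self.b = np.zeros(act_dim)

    def features(self, obs: np.ndarray) -> np.ndarray:
        return np.sin(self.P @ obs / self.bandwidth + self.phase)

    def act(self, obs: np.ndarray) -> np.ndarray:
        return self.W @ self.features(obs) + self.b


def linear_policy(W: np.ndarray, b: np.ndarray, obs: np.ndarray) -> np.ndarray:
    """The simpler linear parameterization: a = W @ s + b."""
    return W @ obs + b


policy = RBFPolicy(obs_dim=4, act_dim=2)
action = policy.act(np.ones(4))   # zero-initialized weights give a zero action
```

Both parameterizations are linear in their trainable parameters. This keeps each policy small compared with a fully connected network.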

Showing 1–9 of 9 results for author: Lowrey, K