Search | arXiv e-print repository

SAIL: Self-Improving Efficient Online Alignment of Large Language Models

Authors: Mucong Ding, Souradip Chakraborty, Vibhu Agrawal, Zora Che, Alec Koppel, Mengdi Wang, Amrit Bedi, Furong Huang

Abstract: Reinforcement Learning from Human Feedback (RLHF) is a key method for aligning large language models (LLMs) with human preferences. However, current offline alignment approaches like DPO, IPO, and SLiC rely heavily on fixed preference datasets, which can lead to sub-optimal performance. On the other hand, recent literature has focused on designing online RLHF methods but still lacks a unified conc… ▽ More Reinforcement Learning from Human Feedback (RLHF) is a key method for aligning large language models (LLMs) with human preferences. However, current offline alignment approaches like DPO, IPO, and SLiC rely heavily on fixed preference datasets, which can lead to sub-optimal performance. On the other hand, recent literature has focused on designing online RLHF methods but still lacks a unified conceptual formulation and suffers from distribution shift issues. To address this, we establish that online LLM alignment is underpinned by bilevel optimization. By reducing this formulation to an efficient single-level first-order method (using the reward-policy equivalence), our approach generates new samples and iteratively refines model alignment by exploring responses and regulating preference labels. In doing so, we permit alignment methods to operate in an online and self-improving manner, as well as generalize prior online RLHF methods as special cases. Compared to state-of-the-art iterative RLHF methods, our approach significantly improves alignment performance on open-sourced datasets with minimal computational overhead. △ Less

Submitted 21 June, 2024; originally announced June 2024.

Comments: 24 pages, 6 figures, 3 tables

arXiv:2406.13992 [pdf, ps, other]

Robust Cooperative Multi-Agent Reinforcement Learning:A Mean-Field Type Game Perspective

Authors: Muhammad Aneeq uz Zaman, Mathieu Laurière, Alec Koppel, Tamer Başar

Abstract: In this paper, we study the problem of robust cooperative multi-agent reinforcement learning (RL) where a large number of cooperative agents with distributed information aim to learn policies in the presence of \emph{stochastic} and \emph{non-stochastic} uncertainties whose distributions are respectively known and unknown. Focusing on policy optimization that accounts for both types of uncertainti… ▽ More In this paper, we study the problem of robust cooperative multi-agent reinforcement learning (RL) where a large number of cooperative agents with distributed information aim to learn policies in the presence of \emph{stochastic} and \emph{non-stochastic} uncertainties whose distributions are respectively known and unknown. Focusing on policy optimization that accounts for both types of uncertainties, we formulate the problem in a worst-case (minimax) framework, which is is intractable in general. Thus, we focus on the Linear Quadratic setting to derive benchmark solutions. First, since no standard theory exists for this problem due to the distributed information structure, we utilize the Mean-Field Type Game (MFTG) paradigm to establish guarantees on the solution quality in the sense of achieved Nash equilibrium of the MFTG. This in turn allows us to compare the performance against the corresponding original robust multi-agent control problem. Then, we propose a Receding-horizon Gradient Descent Ascent RL algorithm to find the MFTG Nash equilibrium and we prove a non-asymptotic rate of convergence. Finally, we provide numerical experiments to demonstrate the efficacy of our approach relative to a baseline algorithm. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: Accepted for publication in L4DC 2024

arXiv:2405.07432 [pdf, other]

Compressed Online Learning of Conditional Mean Embedding

Authors: Boya Hou, Sina Sanjari, Alec Koppel, Subhonmesh Bose

Abstract: The conditional mean embedding (CME) encodes Markovian stochastic kernels through their actions on probability distributions embedded within the reproducing kernel Hilbert spaces (RKHS). The CME plays a key role in several well-known machine learning tasks such as reinforcement learning, analysis of dynamical systems, etc. We present an algorithm to learn the CME incrementally from data via an ope… ▽ More The conditional mean embedding (CME) encodes Markovian stochastic kernels through their actions on probability distributions embedded within the reproducing kernel Hilbert spaces (RKHS). The CME plays a key role in several well-known machine learning tasks such as reinforcement learning, analysis of dynamical systems, etc. We present an algorithm to learn the CME incrementally from data via an operator-valued stochastic gradient descent. As is well-known, function learning in RKHS suffers from scalability challenges from large data. We utilize a compression mechanism to counter the scalability challenge. The core contribution of this paper is a finite-sample performance guarantee on the last iterate of the online compressed operator learning algorithm with fast-mixing Markovian samples, when the target CME may not be contained in the hypothesis space. We illustrate the efficacy of our algorithm by applying it to the analysis of an example dynamical system. △ Less

Submitted 12 May, 2024; originally announced May 2024.

Comments: 39 pages

arXiv:2403.11925 [pdf, other]

Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles

Authors: Bhrij Patel, Wesley A. Suttle, Alec Koppel, Vaneet Aggarwal, Brian M. Sadler, Amrit Singh Bedi, Dinesh Manocha

Abstract: In the context of average-reward reinforcement learning, the requirement for oracle knowledge of the mixing time, a measure of the duration a Markov chain under a fixed policy needs to achieve its stationary distribution, poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic due to the difficulty and expense of estimating… ▽ More In the context of average-reward reinforcement learning, the requirement for oracle knowledge of the mixing time, a measure of the duration a Markov chain under a fixed policy needs to achieve its stationary distribution, poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic due to the difficulty and expense of estimating mixing time in environments with large state spaces, leading to the necessity of impractically long trajectories for effective gradient estimation in practical applications. To address this limitation, we consider the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level Monte-Carlo (MLMC) gradient estimator. With our approach, we effectively alleviate the dependency on mixing time knowledge, a first for average-reward MDPs global convergence. Furthermore, our approach exhibits the tightest available dependence of $\mathcal{O}\left( \sqrt{τ_{mix}} \right)$known from prior work. With a 2D grid world goal-reaching navigation experiment, we demonstrate that MAC outperforms the existing state-of-the-art policy gradient-based method for average reward settings. △ Less

Submitted 20 June, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

Comments: 26 Pages, 2 Figures

arXiv:2403.11345 [pdf, other]

Independent RL for Cooperative-Competitive Agents: A Mean-Field Perspective

Authors: Muhammad Aneeq uz Zaman, Alec Koppel, Mathieu Laurière, Tamer Başar

Abstract: We address in this paper Reinforcement Learning (RL) among agents that are grouped into teams such that there is cooperation within each team but general-sum (non-zero sum) competition across different teams. To develop an RL method that provably achieves a Nash equilibrium, we focus on a linear-quadratic structure. Moreover, to tackle the non-stationarity induced by multi-agent interactions in th… ▽ More We address in this paper Reinforcement Learning (RL) among agents that are grouped into teams such that there is cooperation within each team but general-sum (non-zero sum) competition across different teams. To develop an RL method that provably achieves a Nash equilibrium, we focus on a linear-quadratic structure. Moreover, to tackle the non-stationarity induced by multi-agent interactions in the finite population setting, we consider the case where the number of agents within each team is infinite, i.e., the mean-field setting. This results in a General-Sum LQ Mean-Field Type Game (GS-MFTGs). We characterize the Nash equilibrium (NE) of the GS-MFTG, under a standard invertibility condition. This MFTG NE is then shown to be $\mathcal{O}(1/M)$-NE for the finite population game where $M$ is a lower bound on the number of agents in each team. These structural results motivate an algorithm called Multi-player Receding-horizon Natural Policy Gradient (MRPG), where each team minimizes its cumulative cost independently in a receding-horizon manner. Despite the non-convexity of the problem, we establish that the resulting algorithm converges to a global NE through a novel problem decomposition into sub-problems using backward recursive discrete-time Hamilton-Jacobi-Isaacs (HJI) equations, in which independent natural policy gradient is shown to exhibit linear convergence under time-independent diagonal dominance. Experiments illuminate the merits of this approach in practice. △ Less

Submitted 17 March, 2024; originally announced March 2024.

arXiv:2403.08936 [pdf, other]

Beyond Joint Demonstrations: Personalized Expert Guidance for Efficient Multi-Agent Reinforcement Learning

Authors: Peihong Yu, Manav Mishra, Alec Koppel, Carl Busart, Priya Narayan, Dinesh Manocha, Amrit Bedi, Pratap Tokekar

Abstract: Multi-Agent Reinforcement Learning (MARL) algorithms face the challenge of efficient exploration due to the exponential increase in the size of the joint state-action space. While demonstration-guided learning has proven beneficial in single-agent settings, its direct applicability to MARL is hindered by the practical difficulty of obtaining joint expert demonstrations. In this work, we introduce… ▽ More Multi-Agent Reinforcement Learning (MARL) algorithms face the challenge of efficient exploration due to the exponential increase in the size of the joint state-action space. While demonstration-guided learning has proven beneficial in single-agent settings, its direct applicability to MARL is hindered by the practical difficulty of obtaining joint expert demonstrations. In this work, we introduce a novel concept of personalized expert demonstrations, tailored for each individual agent or, more broadly, each individual type of agent within a heterogeneous team. These demonstrations solely pertain to single-agent behaviors and how each agent can achieve personal goals without encompassing any cooperative elements, thus naively imitating them will not achieve cooperation due to potential conflicts. To this end, we propose an approach that selectively utilizes personalized expert demonstrations as guidance and allows agents to learn to cooperate, namely personalized expert-guided MARL (PegMARL). This algorithm utilizes two discriminators: the first provides incentives based on the alignment of policy behavior with demonstrations, and the second regulates incentives based on whether the behavior leads to the desired objective. We evaluate PegMARL using personalized demonstrations in both discrete and continuous environments. The results demonstrate that PegMARL learns near-optimal policies even when provided with suboptimal demonstrations, and outperforms state-of-the-art MARL algorithms in solving coordinated tasks. We also showcase PegMARL's capability to leverage joint demonstrations in the StarCraft scenario and converge effectively even with demonstrations from non-co-trained policies. △ Less

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2402.08925 [pdf, other]

MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences

Authors: Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang

Abstract: Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. However, such an approach overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result of alignment with single reward RLHF, thereby highlighting it… ▽ More Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. However, such an approach overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result of alignment with single reward RLHF, thereby highlighting its insufficiency in representing diverse human preferences. To provide an equitable solution to the problem, we learn a mixture of preference distributions via an expectation-maximization algorithm and propose a MaxMin alignment objective for policy learning inspired by the Egalitarian principle in social choice theory to better represent diverse human preferences. We elucidate the connection of our proposed approach to distributionally robust optimization and general utility RL, thereby highlighting the generality and robustness of our proposed solution. We present comprehensive experimental results on small-scale (GPT-2) and large-scale language models (with Tulu2-7B) and show the efficacy of the proposed approach in the presence of diversity among human preferences. Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms and improves the win-rate (accuracy) for minority groups by over 33% without compromising the performance of majority groups, showcasing the robustness and fairness of our approach. We remark that our findings in this work are not only limited to language models but also extend to reinforcement learning in general. △ Less

Submitted 13 February, 2024; originally announced February 2024.

arXiv:2311.10927 [pdf, other]

Learning Payment-Free Resource Allocation Mechanisms

Authors: Sihan Zeng, Sujay Bhatt, Eleonora Kreacic, Parisa Hassanzadeh, Alec Koppel, Sumitra Ganesh

Abstract: We consider the design of mechanisms that allocate limited resources among self-interested agents using neural networks. Unlike the recent works that leverage machine learning for revenue maximization in auctions, we consider welfare maximization as the key objective in the payment-free setting. Without payment exchange, it is unclear how we can align agents' incentives to achieve the desired obje… ▽ More We consider the design of mechanisms that allocate limited resources among self-interested agents using neural networks. Unlike the recent works that leverage machine learning for revenue maximization in auctions, we consider welfare maximization as the key objective in the payment-free setting. Without payment exchange, it is unclear how we can align agents' incentives to achieve the desired objectives of truthfulness and social welfare simultaneously, without resorting to approximations. Our work makes novel contributions by designing an approximate mechanism that desirably trade-off social welfare with truthfulness. Specifically, (i) we contribute a new end-to-end neural network architecture, ExS-Net, that accommodates the idea of "money-burning" for mechanism design without payments; (ii)~we provide a generalization bound that guarantees the mechanism performance when trained under finite samples; and (iii) we provide an experimental demonstration of the merits of the proposed mechanism. △ Less

Submitted 12 April, 2024; v1 submitted 17 November, 2023; originally announced November 2023.

arXiv:2310.07320 [pdf, other]

Byzantine-Resilient Decentralized Multi-Armed Bandits

Authors: **gxuan Zhu, Alec Koppel, Alvaro Velasquez, Ji Liu

Abstract: In decentralized cooperative multi-armed bandits (MAB), each agent observes a distinct stream of rewards, and seeks to exchange information with others to select a sequence of arms so as to minimize its regret. Agents in the cooperative setting can outperform a single agent running a MAB method such as Upper-Confidence Bound (UCB) independently. In this work, we study how to recover such salient b… ▽ More In decentralized cooperative multi-armed bandits (MAB), each agent observes a distinct stream of rewards, and seeks to exchange information with others to select a sequence of arms so as to minimize its regret. Agents in the cooperative setting can outperform a single agent running a MAB method such as Upper-Confidence Bound (UCB) independently. In this work, we study how to recover such salient behavior when an unknown fraction of the agents can be Byzantine, that is, communicate arbitrarily wrong information in the form of reward mean-estimates or confidence sets. This framework can be used to model attackers in computer networks, instigators of offensive content into recommender systems, or manipulators of financial markets. Our key contribution is the development of a fully decentralized resilient upper confidence bound (UCB) algorithm that fuses an information mixing step among agents with a truncation of inconsistent and extreme values. This truncation step enables us to establish that the performance of each normal agent is no worse than the classic single-agent UCB1 algorithm in terms of regret, and more importantly, the cumulative regret of all normal agents is strictly better than the non-cooperative case, provided that each agent has at least 3f+1 neighbors where f is the maximum possible Byzantine agents in each agent's neighborhood. Extensions to time-varying neighbor graphs, and minimax lower bounds are further established on the achievable regret. Experiments corroborate the merits of this framework in practice. △ Less

Submitted 11 October, 2023; originally announced October 2023.

arXiv:2308.02585 [pdf, other]

PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback

Authors: Souradip Chakraborty, Amrit Singh Bedi, Alec Koppel, Dinesh Manocha, Huazheng Wang, Mengdi Wang, Furong Huang

Abstract: We present a novel unified bilevel optimization-based framework, \textsf{PARL}, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning using utility or preference-based feedback. We identify a major gap within current algorithmic designs for solving policy alignment due to a lack of precise characterization of the dependence of the alignment obj… ▽ More We present a novel unified bilevel optimization-based framework, \textsf{PARL}, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning using utility or preference-based feedback. We identify a major gap within current algorithmic designs for solving policy alignment due to a lack of precise characterization of the dependence of the alignment objective on the data generated by policy trajectories. This shortfall contributes to the sub-optimal performance observed in contemporary algorithms. Our framework addressed these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable (optimal policy for the designed reward). Interestingly, from an optimization perspective, our formulation leads to a new class of stochastic bilevel problems where the stochasticity at the upper objective depends upon the lower-level variable. {True to our best knowledge, this work presents the first formulation of the RLHF as a bilevel optimization problem which generalizes the existing RLHF formulations and addresses the existing distribution shift issues in RLHF formulations.} To demonstrate the efficacy of our formulation in resolving alignment issues in RL, we devised an algorithm named \textsf{A-PARL} to solve PARL problem, establishing sample complexity bounds of order $\mathcal{O}(1/T)$. Our empirical results substantiate that the proposed \textsf{PARL} can address the alignment concerns in RL by showing significant improvements (up to 63\% in terms of required samples) for policy alignment in large-scale environments of the Deepmind control suite and Meta world tasks. △ Less

Submitted 30 April, 2024; v1 submitted 3 August, 2023; originally announced August 2023.

arXiv:2306.15444 [pdf, other]

Limited-Memory Greedy Quasi-Newton Method with Non-asymptotic Superlinear Convergence Rate

Authors: Zhan Gao, Aryan Mokhtari, Alec Koppel

Abstract: Non-asymptotic convergence analysis of quasi-Newton methods has gained attention with a landmark result establishing an explicit local superlinear rate of O$((1/\sqrt{t})^t)$. The methods that obtain this rate, however, exhibit a well-known drawback: they require the storage of the previous Hessian approximation matrix or all past curvature information to form the current Hessian inverse approxima… ▽ More Non-asymptotic convergence analysis of quasi-Newton methods has gained attention with a landmark result establishing an explicit local superlinear rate of O$((1/\sqrt{t})^t)$. The methods that obtain this rate, however, exhibit a well-known drawback: they require the storage of the previous Hessian approximation matrix or all past curvature information to form the current Hessian inverse approximation. Limited-memory variants of quasi-Newton methods such as the celebrated L-BFGS alleviate this issue by leveraging a limited window of past curvature information to construct the Hessian inverse approximation. As a result, their per iteration complexity and storage requirement is O$(τd)$ where $τ\le d$ is the size of the window and $d$ is the problem dimension reducing the O$(d^2)$ computational cost and memory requirement of standard quasi-Newton methods. However, to the best of our knowledge, there is no result showing a non-asymptotic superlinear convergence rate for any limited-memory quasi-Newton method. In this work, we close this gap by presenting a Limited-memory Greedy BFGS (LG-BFGS) method that can achieve an explicit non-asymptotic superlinear rate. We incorporate displacement aggregation, i.e., decorrelating projection, in post-processing gradient variations, together with a basis vector selection scheme on variable variations, which greedily maximizes a progress measure of the Hessian estimate to the true Hessian. Their combination allows past curvature information to remain in a sparse subspace while yielding a valid representation of the full history. Interestingly, our established non-asymptotic superlinear convergence rate demonstrates an explicit trade-off between the convergence speed and memory requirement, which to our knowledge, is the first of its kind. Numerical results corroborate our theoretical findings and demonstrate the effectiveness of our method. △ Less

Submitted 18 October, 2023; v1 submitted 27 June, 2023; originally announced June 2023.

arXiv:2306.06192 [pdf, other]

Ada-NAV: Adaptive Trajectory Length-Based Sample Efficient Policy Learning for Robotic Navigation

Authors: Bhrij Patel, Kasun Weerakoon, Wesley A. Suttle, Alec Koppel, Brian M. Sadler, Tianyi Zhou, Amrit Singh Bedi, Dinesh Manocha

Abstract: Trajectory length stands as a crucial hyperparameter within reinforcement learning (RL) algorithms, significantly contributing to the sample inefficiency in robotics applications. Motivated by the pivotal role trajectory length plays in the training process, we introduce Ada-NAV, a novel adaptive trajectory length scheme designed to enhance the training sample efficiency of RL algorithms in roboti… ▽ More Trajectory length stands as a crucial hyperparameter within reinforcement learning (RL) algorithms, significantly contributing to the sample inefficiency in robotics applications. Motivated by the pivotal role trajectory length plays in the training process, we introduce Ada-NAV, a novel adaptive trajectory length scheme designed to enhance the training sample efficiency of RL algorithms in robotic navigation tasks. Unlike traditional approaches that treat trajectory length as a fixed hyperparameter, we propose to dynamically adjust it based on the entropy of the underlying navigation policy. Interestingly, Ada-NAV can be applied to both existing on-policy and off-policy RL methods, which we demonstrate by empirically validating its efficacy on three popular RL methods: REINFORCE, Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC). We demonstrate through simulated and real-world robotic experiments that Ada-NAV outperforms conventional methods that employ constant or randomly sampled trajectory lengths. Specifically, for a fixed sample budget, Ada-NAV achieves an 18\% increase in navigation success rate, a 20-38\% reduction in navigation path length, and a 9.32\% decrease in elevation costs. Furthermore, we showcase the versatility of Ada-NAV by integrating it with the Clearpath Husky robot, illustrating its applicability in complex outdoor environments. △ Less

Submitted 20 March, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

Comments: 11 pages, 9 figures, 2 tables

arXiv:2306.05046 [pdf, other]

A Gradient-based Approach for Online Robust Deep Neural Network Training with Noisy Labels

Authors: Yifan Yang, Alec Koppel, Zheng Zhang

Abstract: Learning with noisy labels is an important topic for scalable training in many real-world scenarios. However, few previous research considers this problem in the online setting, where the arrival of data is streaming. In this paper, we propose a novel gradient-based approach to enable the detection of noisy labels for the online learning of model parameters, named Online Gradient-based Robust Sele… ▽ More Learning with noisy labels is an important topic for scalable training in many real-world scenarios. However, few previous research considers this problem in the online setting, where the arrival of data is streaming. In this paper, we propose a novel gradient-based approach to enable the detection of noisy labels for the online learning of model parameters, named Online Gradient-based Robust Selection (OGRS). In contrast to the previous sample selection approach for the offline training that requires the estimation of a clean ratio of the dataset before each epoch of training, OGRS can automatically select clean samples by steps of gradient update from datasets with varying clean ratios without changing the parameter setting. During the training process, the OGRS method selects clean samples at each iteration and feeds the selected sample to incrementally update the model parameters. We provide a detailed theoretical analysis to demonstrate data selection process is converging to the low-loss region of the sample space, by introducing and proving the sub-linear local Lagrangian regret of the non-convex constrained optimization problem. Experimental results show that it outperforms state-of-the-art methods in different settings. △ Less

Submitted 8 June, 2023; originally announced June 2023.

arXiv:2305.17568 [pdf, other]

Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with General Utilities

Authors: Donghao Ying, Yunkai Zhang, Yuhao Ding, Alec Koppel, Javad Lavaei

Abstract: We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints. The objective and constraints are described by {\it general utilities}, i.e., nonlinear functions of the long-term state-action occupancy measure, which encompass broader decision-making goals such as risk, exploratio… ▽ More We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints. The objective and constraints are described by {\it general utilities}, i.e., nonlinear functions of the long-term state-action occupancy measure, which encompass broader decision-making goals such as risk, exploration, or imitations. The exponential growth of the state-action space size with the number of agents presents challenges for global observability, further exacerbated by the global coupling arising from agents' safety constraints. To tackle this issue, we propose a primal-dual method utilizing shadow reward and $κ$-hop neighbor truncation under a form of correlation decay property, where $κ$ is the communication radius. In the exact setting, our algorithm converges to a first-order stationary point (FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$. In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(ε^{-3.5}\right)$ samples to achieve an $ε$-FOSP with an approximation error of $\mathcal{O}(φ_0^{2κ})$, where $φ_0\in (0,1)$. Finally, we demonstrate the effectiveness of our model through extensive numerical experiments. △ Less

Submitted 27 May, 2023; originally announced May 2023.

Comments: 50 pages

arXiv:2305.17283 [pdf, other]

Sharpened Lazy Incremental Quasi-Newton Method

Authors: Aakash Lahoti, Spandan Senapati, Ketan Rajawat, Alec Koppel

Abstract: The problem of minimizing the sum of $n$ functions in $d$ dimensions is ubiquitous in machine learning and statistics. In many applications where the number of observations $n$ is large, it is necessary to use incremental or stochastic methods, as their per-iteration cost is independent of $n$. Of these, Quasi-Newton (QN) methods strike a balance between the per-iteration cost and the convergence… ▽ More The problem of minimizing the sum of $n$ functions in $d$ dimensions is ubiquitous in machine learning and statistics. In many applications where the number of observations $n$ is large, it is necessary to use incremental or stochastic methods, as their per-iteration cost is independent of $n$. Of these, Quasi-Newton (QN) methods strike a balance between the per-iteration cost and the convergence rate. Specifically, they exhibit a superlinear rate with $O(d^2)$ cost in contrast to the linear rate of first-order methods with $O(d)$ cost and the quadratic rate of second-order methods with $O(d^3)$ cost. However, existing incremental methods have notable shortcomings: Incremental Quasi-Newton (IQN) only exhibits asymptotic superlinear convergence. In contrast, Incremental Greedy BFGS (IGS) offers explicit superlinear convergence but suffers from poor empirical performance and has a per-iteration cost of $O(d^3)$. To address these issues, we introduce the Sharpened Lazy Incremental Quasi-Newton Method (SLIQN) that achieves the best of both worlds: an explicit superlinear convergence rate, and superior empirical performance at a per-iteration $O(d^2)$ cost. SLIQN features two key changes: first, it incorporates a hybrid strategy of using both classic and greedy BFGS updates, allowing it to empirically outperform both IQN and IGS. Second, it employs a clever constant multiplicative factor along with a lazy propagation strategy, which enables it to have a cost of $O(d^2)$. Additionally, our experiments demonstrate the superiority of SLIQN over other incremental and stochastic Quasi-Newton variants and establish its competitiveness with second-order incremental methods. △ Less

Submitted 12 March, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

Comments: 36 pages, 2 figures; Accepted to AISTATS 2024

arXiv:2302.07938 [pdf, ps, other]

Scalable Multi-Agent Reinforcement Learning with General Utilities

Authors: Donghao Ying, Yuhao Ding, Alec Koppel, Javad Lavaei

Abstract: We study the scalable multi-agent reinforcement learning (MARL) with general utilities, defined as nonlinear functions of the team's long-term state-action occupancy measure. The objective is to find a localized policy that maximizes the average of the team's local utility functions without the full observability of each agent in the team. By exploiting the spatial correlation decay property of th… ▽ More We study the scalable multi-agent reinforcement learning (MARL) with general utilities, defined as nonlinear functions of the team's long-term state-action occupancy measure. The objective is to find a localized policy that maximizes the average of the team's local utility functions without the full observability of each agent in the team. By exploiting the spatial correlation decay property of the network structure, we propose a scalable distributed policy gradient algorithm with shadow reward and localized policy that consists of three steps: (1) shadow reward estimation, (2) truncated shadow Q-function estimation, and (3) truncated policy gradient estimation and policy update. Our algorithm converges, with high probability, to $ε$-stationarity with $\widetilde{\mathcal{O}}(ε^{-2})$ samples up to some approximation error that decreases exponentially in the communication radius. This is the first result in the literature on multi-agent RL with general utilities that does not require the full observability. △ Less

Submitted 26 August, 2023; v1 submitted 15 February, 2023; originally announced February 2023.

Comments: Supplementary material for the contribution to American Control Conference 2023 under the same title

arXiv:2301.12083 [pdf, other]

Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic

Authors: Wesley A. Suttle, Amrit Singh Bedi, Bhrij Patel, Brian M. Sadler, Alec Koppel, Dinesh Manocha

Abstract: Many existing reinforcement learning (RL) methods employ stochastic gradient iteration on the back end, whose stability hinges upon a hypothesis that the data-generating process mixes exponentially fast with a rate parameter that appears in the step-size selection. Unfortunately, this assumption is violated for large state spaces or settings with sparse rewards, and the mixing time is unknown, mak… ▽ More Many existing reinforcement learning (RL) methods employ stochastic gradient iteration on the back end, whose stability hinges upon a hypothesis that the data-generating process mixes exponentially fast with a rate parameter that appears in the step-size selection. Unfortunately, this assumption is violated for large state spaces or settings with sparse rewards, and the mixing time is unknown, making the step size inoperable. In this work, we propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward embedded within an actor-critic (AC) algorithm. This method, which we call \textbf{M}ulti-level \textbf{A}ctor-\textbf{C}ritic (MAC), is developed especially for infinite-horizon average-reward settings and neither relies on oracle knowledge of the mixing time in its parameter selection nor assumes its exponential decay; it, therefore, is readily applicable to applications with slower mixing times. Nonetheless, it achieves a convergence rate comparable to the state-of-the-art AC algorithms. We experimentally show that these alleviated restrictions on the technical conditions required for stability translate to superior performance in practice for RL problems with sparse rewards. △ Less

Submitted 1 February, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

arXiv:2301.12038 [pdf, other]

STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning

Authors: Souradip Chakraborty, Amrit Singh Bedi, Alec Koppel, Mengdi Wang, Furong Huang, Dinesh Manocha

Abstract: Directed Exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, seeks to do so by augmenting regret with information gain. However, estimating information gain is computationally intractable or relies on restrictive assumptions which prohibit its use in many practical instanc… ▽ More Directed Exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, seeks to do so by augmenting regret with information gain. However, estimating information gain is computationally intractable or relies on restrictive assumptions which prohibit its use in many practical instances. In this work, we posit an alternative exploration incentive in terms of the integral probability metric (IPM) between a current estimate of the transition model and the unknown optimal, which under suitable conditions, can be computed in closed form with the kernelized Stein discrepancy (KSD). Based on KSD, we develop a novel algorithm \algo: \textbf{STE}in information dir\textbf{E}cted exploration for model-based \textbf{R}einforcement Learn\textbf{ING}. To enable its derivation, we develop fundamentally new variants of KSD for discrete conditional distributions. {We further establish that {\algo} archives sublinear Bayesian regret, improving upon prior learning rates of information-augmented MBRL.} Experimentally, we show that the proposed algorithm is computationally affordable and outperforms several prior approaches. △ Less

Submitted 18 September, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

arXiv:2208.11639 [pdf, ps, other]

Oracle-free Reinforcement Learning in Mean-Field Games along a Single Sample Path

Authors: Muhammad Aneeq uz Zaman, Alec Koppel, Sujay Bhatt, Tamer Başar

Abstract: We consider online reinforcement learning in Mean-Field Games (MFGs). Unlike traditional approaches, we alleviate the need for a mean-field oracle by develo** an algorithm that approximates the Mean-Field Equilibrium (MFE) using the single sample path of the generic agent. We call this {\it Sandbox Learning}, as it can be used as a warm-start for any agent learning in a multi-agent non-cooperati… ▽ More We consider online reinforcement learning in Mean-Field Games (MFGs). Unlike traditional approaches, we alleviate the need for a mean-field oracle by develo** an algorithm that approximates the Mean-Field Equilibrium (MFE) using the single sample path of the generic agent. We call this {\it Sandbox Learning}, as it can be used as a warm-start for any agent learning in a multi-agent non-cooperative setting. We adopt a two time-scale approach in which an online fixed-point recursion for the mean-field operates on a slower time-scale, in tandem with a control policy update on a faster time-scale for the generic agent. Given that the underlying Markov Decision Process (MDP) of the agent is communicating, we provide finite sample convergence guarantees in terms of convergence of the mean-field and control policy to the mean-field equilibrium. The sample complexity of the Sandbox learning algorithm is $\tilde{\mathcal{O}}(ε^{-4})$ where $ε$ is the MFE approximation error. This is similar to works which assume access to oracle. Finally, we empirically demonstrate the effectiveness of the sandbox learning algorithm in diverse scenarios, including those where the MDP does not necessarily have a single communicating class. △ Less

Submitted 11 April, 2023; v1 submitted 24 August, 2022; originally announced August 2022.

Comments: Accepted for publication in AISTATS 2023

arXiv:2206.10815 [pdf, other]

FedBC: Calibrating Global and Local Models via Federated Learning Beyond Consensus

Authors: Amrit Singh Bedi, Chen Fan, Alec Koppel, Anit Kumar Sahu, Brian M. Sadler, Furong Huang, Dinesh Manocha

Abstract: In this work, we quantitatively calibrate the performance of global and local models in federated learning through a multi-criterion optimization-based framework, which we cast as a constrained program. The objective of a device is its local objective, which it seeks to minimize while satisfying nonlinear constraints that quantify the proximity between the local and the global model. By considerin… ▽ More In this work, we quantitatively calibrate the performance of global and local models in federated learning through a multi-criterion optimization-based framework, which we cast as a constrained program. The objective of a device is its local objective, which it seeks to minimize while satisfying nonlinear constraints that quantify the proximity between the local and the global model. By considering the Lagrangian relaxation of this problem, we develop a novel primal-dual method called Federated Learning Beyond Consensus (\texttt{FedBC}). Theoretically, we establish that \texttt{FedBC} converges to a first-order stationary point at rates that matches the state of the art, up to an additional error term that depends on a tolerance parameter introduced to scalarize the multi-criterion formulation. Finally, we demonstrate that \texttt{FedBC} balances the global and local model test accuracy metrics across a suite of datasets (Synthetic, MNIST, CIFAR-10, Shakespeare), achieving competitive performance with state-of-the-art. △ Less

Submitted 1 February, 2023; v1 submitted 21 June, 2022; originally announced June 2022.

arXiv:2206.05652 [pdf, other]

Dealing with Sparse Rewards in Continuous Control Robotics via Heavy-Tailed Policies

Authors: Souradip Chakraborty, Amrit Singh Bedi, Alec Koppel, Pratap Tokekar, Dinesh Manocha

Abstract: In this paper, we present a novel Heavy-Tailed Stochastic Policy Gradient (HT-PSG) algorithm to deal with the challenges of sparse rewards in continuous control problems. Sparse reward is common in continuous control robotics tasks such as manipulation and navigation, and makes the learning problem hard due to non-trivial estimation of value functions over the state space. This demands either rewa… ▽ More In this paper, we present a novel Heavy-Tailed Stochastic Policy Gradient (HT-PSG) algorithm to deal with the challenges of sparse rewards in continuous control problems. Sparse reward is common in continuous control robotics tasks such as manipulation and navigation, and makes the learning problem hard due to non-trivial estimation of value functions over the state space. This demands either reward sha** or expert demonstrations for the sparse reward environment. However, obtaining high-quality demonstrations is quite expensive and sometimes even impossible. We propose a heavy-tailed policy parametrization along with a modified momentum-based policy gradient tracking scheme (HT-SPG) to induce a stable exploratory behavior to the algorithm. The proposed algorithm does not require access to expert demonstrations. We test the performance of HT-SPG on various benchmark tasks of continuous control with sparse rewards such as 1D Mario, Pathological Mountain Car, Sparse Pendulum in OpenAI Gym, and Sparse MuJoCo environments (Hopper-v2). We show consistent performance improvement across all tasks in terms of high average cumulative reward. HT-SPG also demonstrates improved convergence speed with minimum samples, thereby emphasizing the sample efficiency of our proposed algorithm. △ Less

Submitted 12 June, 2022; originally announced June 2022.

arXiv:2206.01162 [pdf, other]

Posterior Coreset Construction with Kernelized Stein Discrepancy for Model-Based Reinforcement Learning

Authors: Souradip Chakraborty, Amrit Singh Bedi, Alec Koppel, Brian M. Sadler, Furong Huang, Pratap Tokekar, Dinesh Manocha

Abstract: Model-based approaches to reinforcement learning (MBRL) exhibit favorable performance in practice, but their theoretical guarantees in large spaces are mostly restricted to the setting when transition model is Gaussian or Lipschitz, and demands a posterior estimate whose representational complexity grows unbounded with time. In this work, we develop a novel MBRL method (i) which relaxes the assump… ▽ More Model-based approaches to reinforcement learning (MBRL) exhibit favorable performance in practice, but their theoretical guarantees in large spaces are mostly restricted to the setting when transition model is Gaussian or Lipschitz, and demands a posterior estimate whose representational complexity grows unbounded with time. In this work, we develop a novel MBRL method (i) which relaxes the assumptions on the target transition model to belong to a generic family of mixture models; (ii) is applicable to large-scale training by incorporating a compression step such that the posterior estimate consists of a Bayesian coreset of only statistically significant past state-action pairs; and (iii) exhibits a sublinear Bayesian regret. To achieve these results, we adopt an approach based upon Stein's method, which, under a smoothness condition on the constructed posterior and target, allows distributional distance to be evaluated in closed form as the kernelized Stein discrepancy (KSD). The aforementioned compression step is then computed in terms of greedily retaining only those samples which are more than a certain KSD away from the previous model estimate. Experimentally, we observe that this approach is competitive with several state-of-the-art RL methodologies, and can achieve up-to 50 percent reduction in wall clock time in some continuous control environments. △ Less

Submitted 4 May, 2023; v1 submitted 2 June, 2022; originally announced June 2022.

arXiv:2203.00851 [pdf, other]

Distributed Riemannian Optimization with Lazy Communication for Collaborative Geometric Estimation

Authors: Yulun Tian, Amrit Singh Bedi, Alec Koppel, Miguel Calvo-Fullana, David M. Rosen, Jonathan P. How

Abstract: We present the first distributed optimization algorithm with lazy communication for collaborative geometric estimation, the backbone of modern collaborative simultaneous localization and map** (SLAM) and structure-from-motion (SfM) applications. Our method allows agents to cooperatively reconstruct a shared geometric model on a central server by fusing individual observations, but without the ne… ▽ More We present the first distributed optimization algorithm with lazy communication for collaborative geometric estimation, the backbone of modern collaborative simultaneous localization and map** (SLAM) and structure-from-motion (SfM) applications. Our method allows agents to cooperatively reconstruct a shared geometric model on a central server by fusing individual observations, but without the need to transmit potentially sensitive information about the agents themselves (such as their locations). Furthermore, to alleviate the burden of communication during iterative optimization, we design a set of communication triggering conditions that enable agents to selectively upload a targeted subset of local information that is useful to global optimization. Our approach thus achieves significant communication reduction with minimal impact on optimization performance. As our main theoretical contribution, we prove that our method converges to first-order critical points with a global sublinear convergence rate. Numerical evaluations on bundle adjustment problems from collaborative SLAM and SfM datasets show that our method performs competitively against existing distributed techniques, while achieving up to 78% total communication reduction. △ Less

Submitted 29 July, 2022; v1 submitted 1 March, 2022; originally announced March 2022.

Comments: technical report (17 pages, 3 figures); to appear at IROS 2022

arXiv:2202.10538 [pdf, other]

Sharpened Quasi-Newton Methods: Faster Superlinear Rate and Larger Local Convergence Neighborhood

Authors: Qiujiang **, Alec Koppel, Ketan Rajawat, Aryan Mokhtari

Abstract: Non-asymptotic analysis of quasi-Newton methods have gained traction recently. In particular, several works have established a non-asymptotic superlinear rate of $\mathcal{O}((1/\sqrt{t})^t)$ for the (classic) BFGS method by exploiting the fact that its error of Newton direction approximation approaches zero. Moreover, a greedy variant of BFGS was recently proposed which accelerates its convergenc… ▽ More Non-asymptotic analysis of quasi-Newton methods have gained traction recently. In particular, several works have established a non-asymptotic superlinear rate of $\mathcal{O}((1/\sqrt{t})^t)$ for the (classic) BFGS method by exploiting the fact that its error of Newton direction approximation approaches zero. Moreover, a greedy variant of BFGS was recently proposed which accelerates its convergence by directly approximating the Hessian, instead of the Newton direction, and achieves a fast local quadratic convergence rate. Alas, the local quadratic convergence of Greedy-BFGS requires way more updates compared to the number of iterations that BFGS requires for a local superlinear rate. This is due to the fact that in Greedy-BFGS the Hessian is directly approximated and the Newton direction approximation may not be as accurate as the one for BFGS. In this paper, we close this gap and present a novel BFGS method that has the best of both worlds in that it leverages the approximation ideas of both BFGS and Greedy-BFGS to properly approximate the Newton direction and the Hessian matrix simultaneously. Our theoretical results show that our method out-performs both BFGS and Greedy-BFGS in terms of convergence rate, while it reaches its quadratic convergence rate with fewer steps compared to Greedy-BFGS. Numerical experiments on various datasets also confirm our theoretical findings. △ Less

Submitted 15 June, 2022; v1 submitted 21 February, 2022; originally announced February 2022.

arXiv:2201.12332 [pdf, other]

On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces

Authors: Amrit Singh Bedi, Souradip Chakraborty, Anjaly Parayil, Brian Sadler, Pratap Tokekar, Alec Koppel

Abstract: We focus on parameterized policy search for reinforcement learning over continuous action spaces. Typically, one assumes the score function associated with a policy is bounded, which fails to hold even for Gaussian policies. To properly address this issue, one must introduce an exploration tolerance parameter to quantify the region in which it is bounded. Doing so incurs a persistent bias that app… ▽ More We focus on parameterized policy search for reinforcement learning over continuous action spaces. Typically, one assumes the score function associated with a policy is bounded, which fails to hold even for Gaussian policies. To properly address this issue, one must introduce an exploration tolerance parameter to quantify the region in which it is bounded. Doing so incurs a persistent bias that appears in the attenuation rate of the expected policy gradient norm, which is inversely proportional to the radius of the action space. To mitigate this hidden bias, heavy-tailed policy parameterizations may be used, which exhibit a bounded score function, but doing so can cause instability in algorithmic updates. To address these issues, in this work, we study the convergence of policy gradient algorithms under heavy-tailed parameterizations, which we propose to stabilize with a combination of mirror ascent-type updates and gradient tracking. Our main theoretical contribution is the establishment that this scheme converges with constant step and batch sizes, whereas prior works require these parameters to respectively shrink to null or grow to infinity. Experimentally, this scheme under a heavy-tailed policy parameterization yields improved reward accumulation across a variety of settings as compared with standard benchmarks. △ Less

Submitted 30 January, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

arXiv:2201.08832 [pdf, other]

Occupancy Information Ratio: Infinite-Horizon, Information-Directed, Parameterized Policy Search

Authors: Wesley A. Suttle, Alec Koppel, Ji Liu

Abstract: In this work, we propose an information-directed objective for infinite-horizon reinforcement learning (RL), called the occupancy information ratio (OIR), inspired by the information ratio objectives used in previous information-directed sampling schemes for multi-armed bandits and Markov decision processes as well as recent advances in general utility RL. The OIR, comprised of a ratio between the… ▽ More In this work, we propose an information-directed objective for infinite-horizon reinforcement learning (RL), called the occupancy information ratio (OIR), inspired by the information ratio objectives used in previous information-directed sampling schemes for multi-armed bandits and Markov decision processes as well as recent advances in general utility RL. The OIR, comprised of a ratio between the average cost of a policy and the entropy of its induced state occupancy measure, enjoys rich underlying structure and presents an objective to which scalable, model-free policy search methods naturally apply. Specifically, we show by leveraging connections between quasiconcave optimization and the linear programming theory for Markov decision processes that the OIR problem can be transformed and solved via concave programming methods when the underlying model is known. Since model knowledge is typically lacking in practice, we lay the foundations for model-free OIR policy search methods by establishing a corresponding policy gradient theorem. Building on this result, we subsequently derive REINFORCE- and actor-critic-style algorithms for solving the OIR problem in policy parameter space. Crucially, exploiting the powerful hidden quasiconcavity property implied by the concave programming transformation of the OIR problem, we establish finite-time convergence of the REINFORCE-style scheme to global optimality and asymptotic convergence of the actor-critic-style scheme to (near) global optimality under suitable conditions. Finally, we experimentally illustrate the utility of OIR-based methods over vanilla methods in sparse-reward settings, supporting the OIR as an alternative to existing RL objectives. △ Less

Submitted 28 December, 2023; v1 submitted 21 January, 2022; originally announced January 2022.

arXiv:2201.07130 [pdf, other]

Online, Informative MCMC Thinning with Kernelized Stein Discrepancy

Authors: Cole Hawkins, Alec Koppel, Zheng Zhang

Abstract: A fundamental challenge in Bayesian inference is efficient representation of a target distribution. Many non-parametric approaches do so by sampling a large number of points using variants of Markov Chain Monte Carlo (MCMC). We propose an MCMC variant that retains only those posterior samples which exceed a KSD threshold, which we call KSD Thinning. We establish the convergence and complexity trad… ▽ More A fundamental challenge in Bayesian inference is efficient representation of a target distribution. Many non-parametric approaches do so by sampling a large number of points using variants of Markov Chain Monte Carlo (MCMC). We propose an MCMC variant that retains only those posterior samples which exceed a KSD threshold, which we call KSD Thinning. We establish the convergence and complexity tradeoffs for several settings of KSD Thinning as a function of the KSD threshold parameter, sample size, and other problem parameters. Finally, we provide experimental comparisons against other online nonparametric Bayesian methods that generate low-complexity posterior representations, and observe superior consistency/complexity tradeoffs. Code is available at github.com/colehawkins/KSD-Thinning. △ Less

Submitted 21 April, 2022; v1 submitted 18 January, 2022; originally announced January 2022.

Comments: New version corrects error in notation regarding dictionary size growth. Little o to Omega

arXiv:2110.12929 [pdf, other]

Convergence Rates of Average-Reward Multi-agent Reinforcement Learning via Randomized Linear Programming

Authors: Alec Koppel, Amrit Singh Bedi, Bhargav Ganguly, Vaneet Aggarwal

Abstract: In tabular multi-agent reinforcement learning with average-cost criterion, a team of agents sequentially interacts with the environment and observes local incentives. We focus on the case that the global reward is a sum of local rewards, the joint policy factorizes into agents' marginals, and full state observability. To date, few global optimality guarantees exist even for this simple setting, as… ▽ More In tabular multi-agent reinforcement learning with average-cost criterion, a team of agents sequentially interacts with the environment and observes local incentives. We focus on the case that the global reward is a sum of local rewards, the joint policy factorizes into agents' marginals, and full state observability. To date, few global optimality guarantees exist even for this simple setting, as most results yield convergence to stationarity for parameterized policies in large/possibly continuous spaces. To solidify the foundations of MARL, we build upon linear programming (LP) reformulations, for which stochastic primal-dual methods yields a model-free approach to achieve \emph{optimal sample complexity} in the centralized case. We develop multi-agent extensions, whereby agents solve their local saddle point problems and then perform local weighted averaging. We establish that the sample complexity to obtain near-globally optimal solutions matches tight dependencies on the cardinality of the state and action spaces, and exhibits classical scalings with respect to the network in accordance with multi-agent optimization. Experiments corroborate these results in practice. △ Less

Submitted 21 October, 2021; originally announced October 2021.

arXiv:2110.06401 [pdf, other]

Distributed Gaussian Process Map** for Robot Teams with Time-varying Communication

Authors: James Di, Ehsan Zobeidi, Alec Koppel, Nikolay Atanasov

Abstract: Multi-agent map** is a fundamentally important capability for autonomous robot task coordination and execution in complex environments. While successful algorithms have been proposed for map** using individual platforms, cooperative online map** for teams of robots remains largely a challenge. We focus on probabilistic variants of map** due to its potential utility in downstream tasks such… ▽ More Multi-agent map** is a fundamentally important capability for autonomous robot task coordination and execution in complex environments. While successful algorithms have been proposed for map** using individual platforms, cooperative online map** for teams of robots remains largely a challenge. We focus on probabilistic variants of map** due to its potential utility in downstream tasks such as uncertainty-aware path-planning. A critical question to enabling this capability is how to process and aggregate incrementally observed local information among individual platforms, especially when their ability to communicate is intermittent. We put forth an Incremental Sparse Gaussian Process (GP) methodology for multi-robot map**, where the regression is over a truncated signed-distance field (TSDF). Doing so permits each robot in the network to track a local estimate of a pseudo-point approximation GP posterior and perform weighted averaging of its parameters with those of its (possibly time-varying) set of neighbors. We establish conditions on the pseudo-point representation, as well as communication protocol, such that robots' local GPs converge to the one with globally aggregated information. We further provide experiments that corroborate our theoretical findings for probabilistic multi-robot map**. △ Less

Submitted 12 October, 2021; originally announced October 2021.

arXiv:2109.06332 [pdf, other]

Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Primal-Dual Approach

Authors: Qinbo Bai, Amrit Singh Bedi, Mridul Agarwal, Alec Koppel, Vaneet Aggarwal

Abstract: Reinforcement learning is widely used in applications where one needs to perform sequential decisions while interacting with the environment. The problem becomes more challenging when the decision requirement includes satisfying some safety constraints. The problem is mathematically formulated as constrained Markov decision process (CMDP). In the literature, various algorithms are available to sol… ▽ More Reinforcement learning is widely used in applications where one needs to perform sequential decisions while interacting with the environment. The problem becomes more challenging when the decision requirement includes satisfying some safety constraints. The problem is mathematically formulated as constrained Markov decision process (CMDP). In the literature, various algorithms are available to solve CMDP problems in a model-free manner to achieve $ε$-optimal cumulative reward with $ε$ feasible policies. An $ε$-feasible policy implies that it suffers from constraint violation. An important question here is whether we can achieve $ε$-optimal cumulative reward with zero constraint violations or not. To achieve that, we advocate the use of randomized primal-dual approach to solve the CMDP problems and propose a conservative stochastic primal-dual algorithm (CSPDA) which is shown to exhibit $\tilde{\mathcal{O}}\left(1/ε^2\right)$ sample complexity to achieve $ε$-optimal cumulative reward with zero constraint violations. In the prior works, the best available sample complexity for the $ε$-optimal policy with zero constraint violation is $\tilde{\mathcal{O}}\left(1/ε^5\right)$. Hence, the proposed algorithm provides a significant improvement as compared to the state of the art. △ Less

Submitted 13 July, 2022; v1 submitted 13 September, 2021; originally announced September 2021.

Comments: This paper is the arXiv version with Appendices of the published AAAI paper: "Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Primal-Dual Approach," in Proc. AAAI, Feb 2022. The paper has been further extended with concave utilities and constraints in v2

Journal ref: AAAI 2022

arXiv:2107.12797 [pdf, other]

Wasserstein-Splitting Gaussian Process Regression for Heterogeneous Online Bayesian Inference

Authors: Michael E. Kepler, Alec Koppel, Amrit Singh Bedi, Daniel J. Stilwell

Abstract: Gaussian processes (GPs) are a well-known nonparametric Bayesian inference technique, but they suffer from scalability problems for large sample sizes, and their performance can degrade for non-stationary or spatially heterogeneous data. In this work, we seek to overcome these issues through (i) employing variational free energy approximations of GPs operating in tandem with online expectation pro… ▽ More Gaussian processes (GPs) are a well-known nonparametric Bayesian inference technique, but they suffer from scalability problems for large sample sizes, and their performance can degrade for non-stationary or spatially heterogeneous data. In this work, we seek to overcome these issues through (i) employing variational free energy approximations of GPs operating in tandem with online expectation propagation steps; and (ii) introducing a local splitting step which instantiates a new GP whenever the posterior distribution changes significantly as quantified by the Wasserstein metric over posterior distributions. Over time, then, this yields an ensemble of sparse GPs which may be updated incrementally, and adapts to locality, heterogeneity, and non-stationarity in training data. △ Less

Submitted 26 July, 2021; originally announced July 2021.

arXiv:2106.08414 [pdf, other]

On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control

Authors: Amrit Singh Bedi, Anjaly Parayil, Junyu Zhang, Mengdi Wang, Alec Koppel

Abstract: Reinforcement learning is a framework for interactive decision-making with incentives sequentially revealed across time without a system dynamics model. Due to its scaling to continuous spaces, we focus on policy search where one iteratively improves a parameterized policy with stochastic policy gradient (PG) updates. In tabular Markov Decision Problems (MDPs), under persistent exploration and sui… ▽ More Reinforcement learning is a framework for interactive decision-making with incentives sequentially revealed across time without a system dynamics model. Due to its scaling to continuous spaces, we focus on policy search where one iteratively improves a parameterized policy with stochastic policy gradient (PG) updates. In tabular Markov Decision Problems (MDPs), under persistent exploration and suitable parameterization, global optimality may be obtained. By contrast, in continuous space, the non-convexity poses a pathological challenge as evidenced by existing convergence results being mostly limited to stationarity or arbitrary local extrema. To close this gap, we step towards persistent exploration in continuous space through policy parameterizations defined by distributions of heavier tails defined by tail-index parameter alpha, which increases the likelihood of jum** in state space. Doing so invalidates smoothness conditions of the score function common to PG. Thus, we establish how the convergence rate to stationarity depends on the policy's tail index alpha, a Holder continuity parameter, integrability conditions, and an exploration tolerance parameter introduced here for the first time. Further, we characterize the dependence of the set of local maxima on the tail index through an exit and transition time analysis of a suitably defined Markov chain, identifying that policies associated with Levy Processes of a heavier tail converge to wider peaks. This phenomenon yields improved stability to perturbations in supervised learning, which we corroborate also manifests in improved performance of policy search, especially when myopic and farsighted incentives are misaligned. △ Less

Submitted 2 January, 2023; v1 submitted 15 June, 2021; originally announced June 2021.

arXiv:2106.00543 [pdf, other]

MARL with General Utilities via Decentralized Shadow Reward Actor-Critic

Authors: Junyu Zhang, Amrit Singh Bedi, Mengdi Wang, Alec Koppel

Abstract: We posit a new mechanism for cooperation in multi-agent reinforcement learning (MARL) based upon any nonlinear function of the team's long-term state-action occupancy measure, i.e., a \emph{general utility}. This subsumes the cumulative return but also allows one to incorporate risk-sensitivity, exploration, and priors. % We derive the {\bf D}ecentralized {\bf S}hadow Reward {\bf A}ctor-{\bf C}rit… ▽ More We posit a new mechanism for cooperation in multi-agent reinforcement learning (MARL) based upon any nonlinear function of the team's long-term state-action occupancy measure, i.e., a \emph{general utility}. This subsumes the cumulative return but also allows one to incorporate risk-sensitivity, exploration, and priors. % We derive the {\bf D}ecentralized {\bf S}hadow Reward {\bf A}ctor-{\bf C}ritic (DSAC) in which agents alternate between policy evaluation (critic), weighted averaging with neighbors (information mixing), and local gradient updates for their policy parameters (actor). DSAC augments the classic critic step by requiring agents to (i) estimate their local occupancy measure in order to (ii) estimate the derivative of the local utility with respect to their occupancy measure, i.e., the "shadow reward". DSAC converges to $ε$-stationarity in $\mathcal{O}(1/ε^{2.5})$ (Theorem \ref{theorem:final}) or faster $\mathcal{O}(1/ε^{2})$ (Corollary \ref{corollary:communication}) steps with high probability, depending on the amount of communications. We further establish the non-existence of spurious stationary points for this problem, that is, DSAC finds the globally optimal policy (Corollary \ref{corollary:global}). Experiments demonstrate the merits of goals beyond the cumulative return in cooperative MARL. △ Less

Submitted 24 June, 2021; v1 submitted 29 May, 2021; originally announced June 2021.

arXiv:2103.16170 [pdf, other]

Dense Incremental Metric-Semantic Map** for Multi-Agent Systems via Sparse Gaussian Process Regression

Authors: Ehsan Zobeidi, Alec Koppel, Nikolay Atanasov

Abstract: We develop an online probabilistic metric-semantic map** approach for mobile robot teams relying on streaming RGB-D observations. The generated maps contain full continuous distributional information about the geometric surfaces and semantic labels (e.g., chair, table, wall). Our approach is based on online Gaussian Process (GP) training and inference, and avoids the complexity of GP classificat… ▽ More We develop an online probabilistic metric-semantic map** approach for mobile robot teams relying on streaming RGB-D observations. The generated maps contain full continuous distributional information about the geometric surfaces and semantic labels (e.g., chair, table, wall). Our approach is based on online Gaussian Process (GP) training and inference, and avoids the complexity of GP classification by regressing a truncated signed distance function (TSDF) of the regions occupied by different semantic classes. Online regression is enabled through a sparse pseudo-point approximation of the GP posterior. To scale to large environments, we further consider spatial domain partitioning via an octree data structure with overlap** leaves. An extension to the multi-robot setting is developed by having each robot execute its own online measurement update and then combine its posterior parameters via local weighted geometric averaging with those of its neighbors. This yields a distributed information processing architecture in which the GP map estimates of all robots converge to a common map of the environment while relying only on local one-hop communication. Our experiments demonstrate the effectiveness of the probabilistic metric-semantic map** technique in 2-D and 3-D environments in both single and multi-robot settings. △ Less

Submitted 30 March, 2021; originally announced March 2021.

arXiv:2103.14474 [pdf, other]

doi 10.1109/IROS.2018.8594065

Composable Learning with Sparse Kernel Representations

Authors: Ekaterina Tolstaya, Ethan Stump, Alec Koppel, Alejandro Ribeiro

Abstract: We present a reinforcement learning algorithm for learning sparse non-parametric controllers in a Reproducing Kernel Hilbert Space. We improve the sample complexity of this approach by imposing a structure of the state-action function through a normalized advantage function (NAF). This representation of the policy enables efficiently composing multiple learned models without additional training sa… ▽ More We present a reinforcement learning algorithm for learning sparse non-parametric controllers in a Reproducing Kernel Hilbert Space. We improve the sample complexity of this approach by imposing a structure of the state-action function through a normalized advantage function (NAF). This representation of the policy enables efficiently composing multiple learned models without additional training samples or interaction with the environment. We demonstrate the performance of this algorithm on learning obstacle-avoidance policies in multiple simulations of a robot equipped with a laser scanner while navigating in a 2D environment. We apply the composition operation to various policy combinations and test them to show that the composed policies retain the performance of their components. We also transfer the composed policy directly to a physical platform operating in an arena with obstacles in order to demonstrate a degree of generalization. △ Less

Submitted 29 March, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

arXiv:2011.07142 [pdf, other]

Sparse Representations of Positive Functions via First and Second-Order Pseudo-Mirror Descent

Authors: Abhishek Chakraborty, Ketan Rajawat, Alec Koppel

Abstract: We consider expected risk minimization problems when the range of the estimator is required to be nonnegative, motivated by the settings of maximum likelihood estimation (MLE) and trajectory optimization. To facilitate nonlinear interpolation, we hypothesize that the search space is a Reproducing Kernel Hilbert Space (RKHS). We develop first and second-order variants of stochastic mirror descent e… ▽ More We consider expected risk minimization problems when the range of the estimator is required to be nonnegative, motivated by the settings of maximum likelihood estimation (MLE) and trajectory optimization. To facilitate nonlinear interpolation, we hypothesize that the search space is a Reproducing Kernel Hilbert Space (RKHS). We develop first and second-order variants of stochastic mirror descent employing (i) \emph{pseudo-gradients} and (ii) complexity-reducing projections. Compressive projection in the first-order scheme is executed via kernel orthogonal matching pursuit (KOMP), which overcomes the fact that the vanilla RKHS parameterization grows unbounded with the iteration index in the stochastic setting. Moreover, pseudo-gradients are needed when gradient estimates for cost are only computable up to some numerical error, which arise in, e.g., integral approximations. Under constant step-size and compression budget, we establish tradeoffs between the radius of convergence of the expected sub-optimality and the projection budget parameter, as well as non-asymptotic bounds on the model complexity. To refine the solution's precision, we develop a second-order extension which employs recursively averaged pseudo-gradient outer-products to approximate the Hessian inverse, whose convergence in mean is established under an additional eigenvalue decay condition on the Hessian of the optimal RKHS element, which is unique to this work. Experiments demonstrate favorable performance on inhomogeneous Poisson Process intensity estimation in practice. △ Less

Submitted 3 May, 2022; v1 submitted 13 November, 2020; originally announced November 2020.

arXiv:2009.04950 [pdf, other]

A Markov Decision Process Approach to Active Meta Learning

Authors: Bingjia Wang, Alec Koppel, Vikram Krishnamurthy

Abstract: In supervised learning, we fit a single statistical model to a given data set, assuming that the data is associated with a singular task, which yields well-tuned models for specific use, but does not adapt well to new contexts. By contrast, in meta-learning, the data is associated with numerous tasks, and we seek a model that may perform well on all tasks simultaneously, in pursuit of greater gene… ▽ More In supervised learning, we fit a single statistical model to a given data set, assuming that the data is associated with a singular task, which yields well-tuned models for specific use, but does not adapt well to new contexts. By contrast, in meta-learning, the data is associated with numerous tasks, and we seek a model that may perform well on all tasks simultaneously, in pursuit of greater generalization. One challenge in meta-learning is how to exploit relationships between tasks and classes, which is overlooked by commonly used random or cyclic passes through data. In this work, we propose actively selecting samples on which to train by discerning covariates inside and between meta-training sets. Specifically, we cast the problem of selecting a sample from a number of meta-training sets as either a multi-armed bandit or a Markov Decision Process (MDP), depending on how one encapsulates correlation across tasks. We develop scheduling schemes based on Upper Confidence Bound (UCB), Gittins Index and tabular Markov Decision Problems (MDPs) solved with linear programming, where the reward is the scaled statistical accuracy to ensure it is a time-invariant function of state and action. Across a variety of experimental contexts, we observe significant reductions in sample complexity of active selection scheme relative to cyclic or i.i.d. sampling, demonstrating the merit of exploiting covariates in practice. △ Less

Submitted 10 September, 2020; originally announced September 2020.

arXiv:2007.02151 [pdf, other]

Variational Policy Gradient Method for Reinforcement Learning with General Utilities

Authors: Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, Mengdi Wang

Abstract: In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes se… ▽ More In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation. As this means that dynamic programming no longer works, we focus on direct policy search. Analogously to the Policy Gradient Theorem \cite{sutton2000policy} available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Carlo gradient estimation algorithm to compute the policy gradient based on sample paths. We prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, though the optimization problem is nonconvex. We also establish its rate of convergence of the order $O(1/t)$ by exploiting the hidden convexity of the problem, and proves that it converges exponentially when the problem admits hidden strong convexity. Our analysis applies to the standard RL problem with cumulative rewards as a special case, in which case our result improves the available convergence rate. △ Less

Submitted 4 July, 2020; originally announced July 2020.

arXiv:2007.01219 [pdf, ps, other]

Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems

Authors: Zhan Gao, Alec Koppel, Alejandro Ribeiro

Abstract: Stochastic gradient descent is a canonical tool for addressing stochastic optimization problems, and forms the bedrock of modern machine learning and statistics. In this work, we seek to balance the fact that attenuating step-size is required for exact asymptotic convergence with the fact that constant step-size learns faster in finite time up to an error. To do so, rather than fixing the mini-bat… ▽ More Stochastic gradient descent is a canonical tool for addressing stochastic optimization problems, and forms the bedrock of modern machine learning and statistics. In this work, we seek to balance the fact that attenuating step-size is required for exact asymptotic convergence with the fact that constant step-size learns faster in finite time up to an error. To do so, rather than fixing the mini-batch and the step-size at the outset, we propose a strategy to allow parameters to evolve adaptively. Specifically, the batch-size is set to be a piecewise-constant increasing sequence where the increase occurs when a suitable error criterion is satisfied. Moreover, the step-size is selected as that which yields the fastest convergence. The overall algorithm, two scale adaptive (TSA) scheme, is developed for both convex and non-convex stochastic optimization problems. It inherits the exact asymptotic convergence of stochastic gradient method. More importantly, the optimal error decreasing rate is achieved theoretically, as well as an overall reduction in computational cost. Experimentally, we observe that TSA attains a favorable tradeoff relative to standard SGD that fixes the mini-batch and the step-size, or simply allowing one to increase or decrease respectively. △ Less

Submitted 9 July, 2020; v1 submitted 2 July, 2020; originally announced July 2020.

arXiv:2004.11094 [pdf, other]

Consistent Online Gaussian Process Regression Without the Sample Complexity Bottleneck

Authors: Alec Koppel, Hrusikesha Pradhan, Ketan Rajawat

Abstract: Gaussian processes provide a framework for nonlinear nonparametric Bayesian inference widely applicable across science and engineering. Unfortunately, their computational burden scales cubically with the training sample size, which in the case that samples arrive in perpetuity, approaches infinity. This issue necessitates approximations for use with streaming data, which to date mostly lack conver… ▽ More Gaussian processes provide a framework for nonlinear nonparametric Bayesian inference widely applicable across science and engineering. Unfortunately, their computational burden scales cubically with the training sample size, which in the case that samples arrive in perpetuity, approaches infinity. This issue necessitates approximations for use with streaming data, which to date mostly lack convergence guarantees. Thus, we develop the first online Gaussian process approximation that preserves convergence to the population posterior, i.e., asymptotic posterior consistency, while ameliorating its intractable complexity growth with the sample size. We propose an online compression scheme that, following each a posteriori update, fixes an error neighborhood with respect to the Hellinger metric centered at the current posterior, and greedily tosses out past kernel dictionary elements until its boundary is hit. We call the resulting method Parsimonious Online Gaussian Processes (POG). For diminishing error radius, exact asymptotic consistency is preserved (Theorem 1(i)) at the cost of unbounded memory in the limit. On the other hand, for constant error radius, POG converges to a neighborhood of the population posterior (Theorem 1(ii))but with finite memory at-worst determined by the metric entropy of the feature space (Theorem 2). Experimental results are presented on several nonlinear regression problems which illuminates the merits of this approach as compared with alternatives that fix the subspace dimension defining the history of past points. △ Less

Submitted 15 July, 2021; v1 submitted 23 April, 2020; originally announced April 2020.

arXiv:2004.04843 [pdf, other]

Policy Gradient using Weak Derivatives for Reinforcement Learning

Authors: Sujay Bhatt, Alec Koppel, Vikram Krishnamurthy

Abstract: This paper considers policy search in continuous state-action reinforcement learning problems. Typically, one computes search directions using a classic expression for the policy gradient called the Policy Gradient Theorem, which decomposes the gradient of the value function into two factors: the score function and the Q-function. This paper presents four results:(i) an alternative policy gradient… ▽ More This paper considers policy search in continuous state-action reinforcement learning problems. Typically, one computes search directions using a classic expression for the policy gradient called the Policy Gradient Theorem, which decomposes the gradient of the value function into two factors: the score function and the Q-function. This paper presents four results:(i) an alternative policy gradient theorem using weak (measure-valued) derivatives instead of score-function is established; (ii) the stochastic gradient estimates thus derived are shown to be unbiased and to yield algorithms that converge almost surely to stationary points of the non-convex value function of the reinforcement learning problem; (iii) the sample complexity of the algorithm is derived and is shown to be $O(1/\sqrt(k))$; (iv) finally, the expected variance of the gradient estimates obtained using weak derivatives is shown to be lower than those obtained using the popular score-function approach. Experiments on OpenAI gym pendulum environment show superior performance of the proposed algorithm. △ Less

Submitted 9 April, 2020; originally announced April 2020.

Comments: 1 figure

arXiv:2003.12637 [pdf, other]

Collaborative Beamforming Under Localization Errors: A Discrete Optimization Approach

Authors: Erfaun Noorani, Yagiz Savas, Alec Koppel, John Baras, Ufuk Topcu, Brian M. Sadler

Abstract: We consider a network of agents that locate themselves in an environment through sensor measurements and aim to transmit a message signal to a base station via collaborative beamforming. The agents' sensor measurements result in localization errors, which degrade the quality of service at the base station due to unknown phase offsets that arise in the agents' communication channels. Assuming that… ▽ More We consider a network of agents that locate themselves in an environment through sensor measurements and aim to transmit a message signal to a base station via collaborative beamforming. The agents' sensor measurements result in localization errors, which degrade the quality of service at the base station due to unknown phase offsets that arise in the agents' communication channels. Assuming that each agent's localization error follows a Gaussian distribution, we study the problem of forming a reliable communication link between the agents and the base station despite the localization errors. In particular, we formulate a discrete optimization problem to choose only a subset of agents to transmit the message signal so that the variance of the signal-to-noise ratio (SNR) received by the base station is minimized while the expected SNR exceeds a desired threshold. When the variances of the localization errors are below a certain threshold characterized in terms of the carrier frequency, we show that greedy algorithms can be used to globally minimize the variance of the received SNR. On the other hand, when some agents have localization errors with large variances, we show that the variance of the received SNR can be locally minimized by exploiting the supermodularity of the mean and variance of the received SNR. In numerical simulations, we demonstrate that the proposed algorithms have the potential to synthesize beamformers orders of magnitude faster than convex optimization-based approaches while achieving comparable performances using less number of agents. △ Less

Submitted 17 March, 2021; v1 submitted 27 March, 2020; originally announced March 2020.

arXiv:2003.10550 [pdf, other]

Regret and Belief Complexity Trade-off in Gaussian Process Bandits via Information Thresholding

Authors: Amrit Singh Bedi, Dheeraj Peddireddy, Vaneet Aggarwal, Brian M. Sadler, Alec Koppel

Abstract: Bayesian optimization is a framework for global search via maximum a posteriori updates rather than simulated annealing, and has gained prominence for decision-making under uncertainty. In this work, we cast Bayesian optimization as a multi-armed bandit problem, where the payoff function is sampled from a Gaussian process (GP). Further, we focus on action selections via upper confidence bound (UCB… ▽ More Bayesian optimization is a framework for global search via maximum a posteriori updates rather than simulated annealing, and has gained prominence for decision-making under uncertainty. In this work, we cast Bayesian optimization as a multi-armed bandit problem, where the payoff function is sampled from a Gaussian process (GP). Further, we focus on action selections via upper confidence bound (UCB) or expected improvement (EI) due to their prevalent use in practice. Prior works using GPs for bandits cannot allow the iteration horizon $T$ to be large, as the complexity of computing the posterior parameters scales cubically with the number of past observations. To circumvent this computational burden, we propose a simple statistical test: only incorporate an action into the GP posterior when its conditional entropy exceeds an $ε$ threshold. Doing so permits us to precisely characterize the trade-off between regret bounds of GP bandit algorithms and complexity of the posterior distributions depending on the compression parameter $ε$ for both discrete and continuous action sets. To best of our knowledge, this is the first result which allows us to obtain sublinear regret bounds while still maintaining sublinear growth rate of the complexity of the posterior which is linear in the existing literature. Moreover, a provably finite bound on the complexity could be achieved but the algorithm would result in $ε$-regret which means $\textbf{Reg}_T/T \rightarrow \mathcal{O}(ε)$ as $T\rightarrow \infty$. Experimentally, we observe state of the art accuracy and complexity trade-offs for GP bandit algorithms applied to global optimization, suggesting the merits of compressed GPs in bandit settings. △ Less

Submitted 21 March, 2022; v1 submitted 23 March, 2020; originally announced March 2020.

arXiv:2003.03281 [pdf, ps, other]

doi 10.1109/LRA.2020.3010216

Asynchronous and Parallel Distributed Pose Graph Optimization

Authors: Yulun Tian, Alec Koppel, Amrit Singh Bedi, Jonathan P. How

Abstract: We present Asynchronous Stochastic Parallel Pose Graph Optimization (ASAPP), the first asynchronous algorithm for distributed pose graph optimization (PGO) in multi-robot simultaneous localization and map**. By enabling robots to optimize their local trajectory estimates without synchronization, ASAPP offers resiliency against communication delays and alleviates the need to wait for stragglers i… ▽ More We present Asynchronous Stochastic Parallel Pose Graph Optimization (ASAPP), the first asynchronous algorithm for distributed pose graph optimization (PGO) in multi-robot simultaneous localization and map**. By enabling robots to optimize their local trajectory estimates without synchronization, ASAPP offers resiliency against communication delays and alleviates the need to wait for stragglers in the network. Furthermore, ASAPP can be applied on the rank-restricted relaxations of PGO, a crucial class of non-convex Riemannian optimization problems that underlies recent breakthroughs on globally optimal PGO. Under bounded delay, we establish the global first-order convergence of ASAPP using a sufficiently small stepsize. The derived stepsize depends on the worst-case delay and inherent problem sparsity, and furthermore matches known result for synchronous algorithms when there is no delay. Numerical evaluations on simulated and real-world datasets demonstrate favorable performance compared to state-of-the-art synchronous approach, and show ASAPP's resilience against a wide range of delays in practice. △ Less

Submitted 30 June, 2023; v1 submitted 6 March, 2020; originally announced March 2020.

Comments: full paper with appendices

arXiv:2002.12475 [pdf, other]

Cautious Reinforcement Learning via Distributional Risk in the Dual Domain

Authors: Junyu Zhang, Amrit Singh Bedi, Mengdi Wang, Alec Koppel

Abstract: We study the estimation of risk-sensitive policies in reinforcement learning problems defined by a Markov Decision Process (MDPs) whose state and action spaces are countably finite. Prior efforts are predominately afflicted by computational challenges associated with the fact that risk-sensitive MDPs are time-inconsistent. To ameliorate this issue, we propose a new definition of risk, which we cal… ▽ More We study the estimation of risk-sensitive policies in reinforcement learning problems defined by a Markov Decision Process (MDPs) whose state and action spaces are countably finite. Prior efforts are predominately afflicted by computational challenges associated with the fact that risk-sensitive MDPs are time-inconsistent. To ameliorate this issue, we propose a new definition of risk, which we call caution, as a penalty function added to the dual objective of the linear programming (LP) formulation of reinforcement learning. The caution measures the distributional risk of a policy, which is a function of the policy's long-term state occupancy distribution. To solve this problem in an online model-free manner, we propose a stochastic variant of primal-dual method that uses Kullback-Lieber (KL) divergence as its proximal term. We establish that the number of iterations/samples required to attain approximately optimal solutions of this scheme matches tight dependencies on the cardinality of the state and action spaces, but differs in its dependence on the infinity norm of the gradient of the risk measure. Experiments demonstrate the merits of this approach for improving the reliability of reward accumulation without additional computational burdens. △ Less

Submitted 27 February, 2020; originally announced February 2020.

arXiv:1910.08412 [pdf, other]

On the Sample Complexity of Actor-Critic Method for Reinforcement Learning with Function Approximation

Authors: Harshat Kumar, Alec Koppel, Alejandro Ribeiro

Abstract: Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Due to the fact that the updates exhibit correlated noise and biased gradient updates, only the asym… ▽ More Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Due to the fact that the updates exhibit correlated noise and biased gradient updates, only the asymptotic behavior of actor-critic is known by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations. As a result, we are able to provide for the first time the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to stochastic gradient method for non-convex problems or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference. These learning rates are then corroborated on a navigation problem involving an obstacle and the pendulum problem which provide insight into the interplay between optimization and generalization in reinforcement learning. △ Less

Submitted 27 January, 2023; v1 submitted 18 October, 2019; originally announced October 2019.

arXiv:1909.11555 [pdf, other]

Optimally Compressed Nonparametric Online Learning

Authors: Alec Koppel, Amrit Singh Bedi, Ketan Rajawat, Brian M. Sadler

Abstract: Batch training of machine learning models based on neural networks is now well established, whereas to date streaming methods are largely based on linear models. To go beyond linear in the online setting, nonparametric methods are of interest due to their universality and ability to stably incorporate new information via convexity or Bayes' Rule. Unfortunately, when used online, nonparametric meth… ▽ More Batch training of machine learning models based on neural networks is now well established, whereas to date streaming methods are largely based on linear models. To go beyond linear in the online setting, nonparametric methods are of interest due to their universality and ability to stably incorporate new information via convexity or Bayes' Rule. Unfortunately, when used online, nonparametric methods suffer a "curse of dimensionality" which precludes their use: their complexity scales at least with the time index. We survey online compression tools which bring their memory under control and attain approximate convergence. The asymptotic bias depends on a compression parameter that trades off memory and accuracy. Further, the applications to robotics, communications, economics, and power are discussed, as well as extensions to multi-agent systems. △ Less

Submitted 17 January, 2020; v1 submitted 25 September, 2019; originally announced September 2019.

arXiv:1909.10279 [pdf, other]

Nearly Consistent Finite Particle Estimates in Streaming Importance Sampling

Authors: Alec Koppel, Amrit Singh Bedi, Brian M. Sadler, Victor Elvira

Abstract: In Bayesian inference, we seek to compute information about random variables such as moments or quantiles on the basis of {available data} and prior information. When the distribution of random variables is {intractable}, Monte Carlo (MC) sampling is usually required. {Importance sampling is a standard MC tool that approximates this unavailable distribution with a set of weighted samples.} This pr… ▽ More In Bayesian inference, we seek to compute information about random variables such as moments or quantiles on the basis of {available data} and prior information. When the distribution of random variables is {intractable}, Monte Carlo (MC) sampling is usually required. {Importance sampling is a standard MC tool that approximates this unavailable distribution with a set of weighted samples.} This procedure is asymptotically consistent as the number of MC samples (particles) go to infinity. However, retaining infinitely many particles is intractable. Thus, we propose a way to only keep a \emph{finite representative subset} of particles and their augmented importance weights that is \emph{nearly consistent}. To do so in {an online manner}, we (1) embed the posterior density estimate in a reproducing kernel Hilbert space (RKHS) through its kernel mean embedding; and (2) sequentially project this RKHS element onto a lower-dimensional subspace in RKHS using the maximum mean discrepancy, an integral probability metric. Theoretically, we establish that this scheme results in a bias determined by a compression parameter, which yields a tunable tradeoff between consistency and memory. In experiments, we observe the compressed estimates achieve comparable performance to the dense ones with substantial reductions in representational complexity. △ Less

Submitted 5 April, 2021; v1 submitted 23 September, 2019; originally announced September 2019.

arXiv:1909.05442 [pdf, other]

Nonstationary Nonparametric Online Learning: Balancing Dynamic Regret and Model Parsimony

Authors: Amrit Singh Bedi, Alec Koppel, Ketan Rajawat, Brian M. Sadler

Abstract: An open challenge in supervised learning is \emph{conceptual drift}: a data point begins as classified according to one label, but over time the notion of that label changes. Beyond linear autoregressive models, transfer and meta learning address drift, but require data that is representative of disparate domains at the outset of training. To relax this requirement, we propose a memory-efficient \… ▽ More An open challenge in supervised learning is \emph{conceptual drift}: a data point begins as classified according to one label, but over time the notion of that label changes. Beyond linear autoregressive models, transfer and meta learning address drift, but require data that is representative of disparate domains at the outset of training. To relax this requirement, we propose a memory-efficient \emph{online} universal function approximator based on compressed kernel methods. Our approach hinges upon viewing non-stationary learning as online convex optimization with dynamic comparators, for which performance is quantified by dynamic regret. Prior works control dynamic regret growth only for linear models. In contrast, we hypothesize actions belong to reproducing kernel Hilbert spaces (RKHS). We propose a functional variant of online gradient descent (OGD) operating in tandem with greedy subspace projections. Projections are necessary to surmount the fact that RKHS functions have complexity proportional to time. For this scheme, we establish sublinear dynamic regret growth in terms of both loss variation and functional path length, and that the memory of the function sequence remains moderate. Experiments demonstrate the usefulness of the proposed technique for online nonlinear regression and classification problems with non-stationary data. △ Less

Submitted 11 September, 2019; originally announced September 2019.

arXiv:1908.00510 [pdf, ps, other]

Adaptive Kernel Learning in Heterogeneous Networks

Authors: Hrusikesha Pradhan, Amrit Singh Bedi, Alec Koppel, Ketan Rajawat

Abstract: We consider learning in decentralized heterogeneous networks: agents seek to minimize a convex functional that aggregates data across the network, while only having access to their local data streams. We focus on the case where agents seek to estimate a regression \emph{function} that belongs to a reproducing kernel Hilbert space (RKHS). To incentivize coordination while respecting network heterog… ▽ More We consider learning in decentralized heterogeneous networks: agents seek to minimize a convex functional that aggregates data across the network, while only having access to their local data streams. We focus on the case where agents seek to estimate a regression \emph{function} that belongs to a reproducing kernel Hilbert space (RKHS). To incentivize coordination while respecting network heterogeneity, we impose nonlinear proximity constraints. To solve the constrained stochastic program, we propose applying a functional variant of stochastic primal-dual (Arrow-Hurwicz) method which yields a decentralized algorithm. To handle the fact that agents' functions have complexity proportional to time (owing to the RKHS parameterization), we project the primal iterates onto subspaces greedily constructed from kernel evaluations of agents' local observations. The resulting scheme, dubbed Heterogeneous Adaptive Learning with Kernels (HALK), when used with constant step-sizes, yields $\mathcal{O}(\sqrt{T})$ attenuation in sub-optimality and exactly satisfies the constraints in the long run, which improves upon the state of the art rates for vector-valued problems. △ Less

Submitted 1 June, 2021; v1 submitted 1 August, 2019; originally announced August 2019.

Showing 1–50 of 64 results for author: Koppel, A