Search | arXiv e-print repository

Learning the Target Network in Function Space

Authors: Kavosh Asadi, Yao Liu, Shoham Sabach, Ming Yin, Rasool Fakoor

Abstract: We focus on the task of learning the value function in the reinforcement learning (RL) setting. This task is often solved by updating a pair of online and target networks while ensuring that the parameters of these two networks are equivalent. We propose Lookahead-Replicate (LR), a new value-function approximation algorithm that is agnostic to this parameter-space equivalence. Instead, the LR algo… ▽ More We focus on the task of learning the value function in the reinforcement learning (RL) setting. This task is often solved by updating a pair of online and target networks while ensuring that the parameters of these two networks are equivalent. We propose Lookahead-Replicate (LR), a new value-function approximation algorithm that is agnostic to this parameter-space equivalence. Instead, the LR algorithm is designed to maintain an equivalence between the two networks in the function space. This value-based equivalence is obtained by employing a new target-network update. We show that LR leads to a convergent behavior in learning the value function. We also present empirical results demonstrating that LR-based target-network updates significantly improve deep RL on the Atari benchmark. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: Accepted to International Conference on Machine Learning (ICML24)

arXiv:2310.05905 [pdf, other]

TAIL: Task-specific Adapters for Imitation Learning with Large Pretrained Models

Authors: Zuxin Liu, Jesse Zhang, Kavosh Asadi, Yao Liu, Ding Zhao, Shoham Sabach, Rasool Fakoor

Abstract: The full potential of large pretrained models remains largely untapped in control domains like robotics. This is mainly because of the scarcity of data and the computational challenges associated with training or fine-tuning these large models for such applications. Prior work mainly emphasizes either effective pretraining of large models for decision-making or single-task adaptation. But real-wor… ▽ More The full potential of large pretrained models remains largely untapped in control domains like robotics. This is mainly because of the scarcity of data and the computational challenges associated with training or fine-tuning these large models for such applications. Prior work mainly emphasizes either effective pretraining of large models for decision-making or single-task adaptation. But real-world problems will require data-efficient, continual adaptation for new control tasks. Recognizing these constraints, we introduce TAIL (Task-specific Adapters for Imitation Learning), a framework for efficient adaptation to new control tasks. Inspired by recent advancements in parameter-efficient fine-tuning in language domains, we explore efficient fine-tuning techniques -- e.g., Bottleneck Adapters, P-Tuning, and Low-Rank Adaptation (LoRA) -- in TAIL to adapt large pretrained models for new tasks with limited demonstration data. Our extensive experiments in large-scale language-conditioned manipulation tasks comparing prevalent parameter-efficient fine-tuning techniques and adaptation baselines suggest that TAIL with LoRA can achieve the best post-adaptation performance with only 1\% of the trainable parameters of full fine-tuning, while avoiding catastrophic forgetting and preserving adaptation plasticity in continual learning settings. △ Less

Submitted 8 March, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

Comments: Published on ICLR 2024

arXiv:2306.17833 [pdf, other]

Resetting the Optimizer in Deep RL: An Empirical Study

Authors: Kavosh Asadi, Rasool Fakoor, Shoham Sabach

Abstract: We focus on the task of approximating the optimal value function in deep reinforcement learning. This iterative process is comprised of solving a sequence of optimization problems where the loss function changes per iteration. The common approach to solving this sequence of problems is to employ modern variants of the stochastic gradient descent algorithm such as Adam. These optimizers maintain th… ▽ More We focus on the task of approximating the optimal value function in deep reinforcement learning. This iterative process is comprised of solving a sequence of optimization problems where the loss function changes per iteration. The common approach to solving this sequence of problems is to employ modern variants of the stochastic gradient descent algorithm such as Adam. These optimizers maintain their own internal parameters such as estimates of the first-order and the second-order moments of the gradient, and update them over time. Therefore, information obtained in previous iterations is used to solve the optimization problem in the current iteration. We demonstrate that this can contaminate the moment estimates because the optimization landscape can change arbitrarily from one iteration to the next one. To hedge against this negative effect, a simple idea is to reset the internal parameters of the optimizer when starting a new iteration. We empirically investigate this resetting idea by employing various optimizers in conjunction with the Rainbow algorithm. We demonstrate that this simple modification significantly improves the performance of deep RL on the Atari benchmark. △ Less

Submitted 14 November, 2023; v1 submitted 30 June, 2023; originally announced June 2023.

Comments: Accepted at Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)

arXiv:2306.17750 [pdf, other]

TD Convergence: An Optimization Perspective

Authors: Kavosh Asadi, Shoham Sabach, Yao Liu, Omer Gottesman, Rasool Fakoor

Abstract: We study the convergence behavior of the celebrated temporal-difference (TD) learning algorithm. By looking at the algorithm through the lens of optimization, we first argue that TD can be viewed as an iterative optimization algorithm where the function to be minimized changes per iteration. By carefully investigating the divergence displayed by TD on a classical counter example, we identify two f… ▽ More We study the convergence behavior of the celebrated temporal-difference (TD) learning algorithm. By looking at the algorithm through the lens of optimization, we first argue that TD can be viewed as an iterative optimization algorithm where the function to be minimized changes per iteration. By carefully investigating the divergence displayed by TD on a classical counter example, we identify two forces that determine the convergent or divergent behavior of the algorithm. We next formalize our discovery in the linear TD setting with quadratic loss and prove that convergence of TD hinges on the interplay between these two forces. We extend this optimization perspective to prove convergence of TD in a much broader setting than just linear approximation and squared loss. Our results provide a theoretical explanation for the successful application of TD in reinforcement learning. △ Less

Submitted 8 November, 2023; v1 submitted 30 June, 2023; originally announced June 2023.

Comments: Accepted at Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)

arXiv:2205.05588 [pdf, other]

Characterizing the Action-Generalization Gap in Deep Q-Learning

Authors: Zhiyuan Zhou, Cameron Allen, Kavosh Asadi, George Konidaris

Abstract: We study the action generalization ability of deep Q-learning in discrete action spaces. Generalization is crucial for efficient reinforcement learning (RL) because it allows agents to use knowledge learned from past experiences on new tasks. But while function approximation provides deep RL agents with a natural way to generalize over state inputs, the same generalization mechanism does not apply… ▽ More We study the action generalization ability of deep Q-learning in discrete action spaces. Generalization is crucial for efficient reinforcement learning (RL) because it allows agents to use knowledge learned from past experiences on new tasks. But while function approximation provides deep RL agents with a natural way to generalize over state inputs, the same generalization mechanism does not apply to discrete action outputs. And yet, surprisingly, our experiments indicate that Deep Q-Networks (DQN), which use exactly this type of function approximator, are still able to achieve modest action generalization. Our main contribution is twofold: first, we propose a method of evaluating action generalization using expert knowledge of action similarity, and empirically confirm that action generalization leads to faster learning; second, we characterize the action-generalization gap (the difference in learning performance between DQN and the expert) in different domains. We find that DQN can indeed generalize over actions in several simple domains, but that its ability to do so decreases as the action space grows larger. △ Less

Submitted 11 May, 2022; originally announced May 2022.

Comments: To appear at the 5th Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM2022)

arXiv:2112.05848 [pdf, other]

Faster Deep Reinforcement Learning with Slower Online Network

Authors: Kavosh Asadi, Rasool Fakoor, Omer Gottesman, Taesup Kim, Michael L. Littman, Alexander J. Smola

Abstract: Deep reinforcement learning algorithms often use two networks for value function optimization: an online network, and a target network that tracks the online network with some delay. Using two separate networks enables the agent to hedge against issues that arise when performing bootstrap**. In this paper we endow two popular deep reinforcement learning algorithms, namely DQN and Rainbow, with u… ▽ More Deep reinforcement learning algorithms often use two networks for value function optimization: an online network, and a target network that tracks the online network with some delay. Using two separate networks enables the agent to hedge against issues that arise when performing bootstrap**. In this paper we endow two popular deep reinforcement learning algorithms, namely DQN and Rainbow, with updates that incentivize the online network to remain in the proximity of the target network. This improves the robustness of deep reinforcement learning in presence of noisy updates. The resultant agents, called DQN Pro and Rainbow Pro, exhibit significant performance improvements over their original counterparts on the Atari benchmark demonstrating the effectiveness of this simple idea in deep reinforcement learning. The code for our paper is available here: Github.com/amazon-research/fast-rl-with-slow-updates. △ Less

Submitted 17 April, 2023; v1 submitted 10 December, 2021; originally announced December 2021.

Comments: Published at the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022)

arXiv:2110.12276 [pdf, other]

Coarse-Grained Smoothness for RL in Metric Spaces

Authors: Omer Gottesman, Kavosh Asadi, Cameron Allen, Sam Lobel, George Konidaris, Michael Littman

Abstract: Principled decision-making in continuous state--action spaces is impossible without some assumptions. A common approach is to assume Lipschitz continuity of the Q-function. We show that, unfortunately, this property fails to hold in many typical domains. We propose a new coarse-grained smoothness definition that generalizes the notion of Lipschitz continuity, is more widely applicable, and allows… ▽ More Principled decision-making in continuous state--action spaces is impossible without some assumptions. A common approach is to assume Lipschitz continuity of the Q-function. We show that, unfortunately, this property fails to hold in many typical domains. We propose a new coarse-grained smoothness definition that generalizes the notion of Lipschitz continuity, is more widely applicable, and allows us to compute significantly tighter bounds on Q-functions, leading to improved learning. We provide a theoretical analysis of our new smoothness definition, and discuss its implications and impact on control and exploration in continuous domains. △ Less

Submitted 23 October, 2021; originally announced October 2021.

arXiv:2109.07054 [pdf, other]

Convergence of a Human-in-the-Loop Policy-Gradient Algorithm With Eligibility Trace Under Reward, Policy, and Advantage Feedback

Authors: Ishaan Shah, David Halpern, Kavosh Asadi, Michael L. Littman

Abstract: Fluid human-agent communication is essential for the future of human-in-the-loop reinforcement learning. An agent must respond appropriately to feedback from its human trainer even before they have significant experience working together. Therefore, it is important that learning agents respond well to various feedback schemes human trainers are likely to provide. This work analyzes the COnvergent… ▽ More Fluid human-agent communication is essential for the future of human-in-the-loop reinforcement learning. An agent must respond appropriately to feedback from its human trainer even before they have significant experience working together. Therefore, it is important that learning agents respond well to various feedback schemes human trainers are likely to provide. This work analyzes the COnvergent Actor-Critic by Humans (COACH) algorithm under three different types of feedback-policy feedback, reward feedback, and advantage feedback. For these three feedback types, we find that COACH can behave sub-optimally. We propose a variant of COACH, episodic COACH (E-COACH), which we prove converges for all three types. We compare our COACH variant with two other reinforcement-learning algorithms: Q-learning and TAMER. △ Less

Submitted 14 September, 2021; originally announced September 2021.

Comments: Accepted into ICML 2021 workshops Human-AI Collaboration in Sequential Decision-Making and Human in the Loop Learning

arXiv:2102.09225 [pdf, other]

Continuous Doubly Constrained Batch Reinforcement Learning

Authors: Rasool Fakoor, Jonas Mueller, Kavosh Asadi, Pratik Chaudhari, Alexander J. Smola

Abstract: Reliant on too many experiments to learn good actions, current Reinforcement Learning (RL) algorithms have limited applicability in real-world settings, which can be too expensive to allow exploration. We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment. The limited data in batch RL produc… ▽ More Reliant on too many experiments to learn good actions, current Reinforcement Learning (RL) algorithms have limited applicability in real-world settings, which can be too expensive to allow exploration. We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment. The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data. This leads to particularly severe extrapolation when our candidate policies diverge from one that generated the data. We propose to mitigate this issue via two straightforward penalties: a policy-constraint to reduce this divergence and a value-constraint that discourages overly optimistic estimates. Over a comprehensive set of 32 continuous-action batch RL benchmarks, our approach compares favorably to state-of-the-art methods, regardless of how the offline data were collected. △ Less

Submitted 6 December, 2021; v1 submitted 18 February, 2021; originally announced February 2021.

Comments: NeurIPS 2021 conference paper

arXiv:2002.05518 [pdf, other]

Learning State Abstractions for Transfer in Continuous Control

Authors: Kavosh Asadi, David Abel, Michael L. Littman

Abstract: Can simple algorithms with a good representation solve challenging reinforcement learning problems? In this work, we answer this question in the affirmative, where we take "simple learning algorithm" to be tabular Q-Learning, the "good representations" to be a learned state abstraction, and "challenging problems" to be continuous control tasks. Our main contribution is a learning algorithm that ab… ▽ More Can simple algorithms with a good representation solve challenging reinforcement learning problems? In this work, we answer this question in the affirmative, where we take "simple learning algorithm" to be tabular Q-Learning, the "good representations" to be a learned state abstraction, and "challenging problems" to be continuous control tasks. Our main contribution is a learning algorithm that abstracts a continuous state-space into a discrete one. We transfer this learned representation to unseen problems to enable effective learning. We provide theory showing that learned abstractions maintain a bounded value loss, and we report experiments showing that the abstractions empower tabular Q-Learning to learn efficiently in unseen tasks. △ Less

Submitted 8 February, 2020; originally announced February 2020.

arXiv:2002.01883 [pdf, other]

Deep Radial-Basis Value Functions for Continuous Control

Authors: Kavosh Asadi, Neev Parikh, Ronald E. Parr, George D. Konidaris, Michael L. Littman

Abstract: A core operation in reinforcement learning (RL) is finding an action that is optimal with respect to a learned value function. This operation is often challenging when the learned value function takes continuous actions as input. We introduce deep radial-basis value functions (RBVFs): value functions learned using a deep network with a radial-basis function (RBF) output layer. We show that the max… ▽ More A core operation in reinforcement learning (RL) is finding an action that is optimal with respect to a learned value function. This operation is often challenging when the learned value function takes continuous actions as input. We introduce deep radial-basis value functions (RBVFs): value functions learned using a deep network with a radial-basis function (RBF) output layer. We show that the maximum action-value with respect to a deep RBVF can be approximated easily and accurately. Moreover, deep RBVFs can represent any true value function owing to their support for universal function approximation. We extend the standard DQN algorithm to continuous control by endowing the agent with a deep RBVF. We show that the resultant agent, called RBF-DQN, significantly outperforms value-function-only baselines, and is competitive with state-of-the-art actor-critic algorithms. △ Less

Submitted 13 March, 2021; v1 submitted 5 February, 2020; originally announced February 2020.

Comments: In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI)

arXiv:2001.05411 [pdf, other]

Lipschitz Lifelong Reinforcement Learning

Authors: Erwan Lecarpentier, David Abel, Kavosh Asadi, Yuu **nai, Emmanuel Rachelson, Michael L. Littman

Abstract: We consider the problem of knowledge transfer when an agent is facing a series of Reinforcement Learning (RL) tasks. We introduce a novel metric between Markov Decision Processes (MDPs) and establish that close MDPs have close optimal value functions. Formally, the optimal value functions are Lipschitz continuous with respect to the tasks space. These theoretical results lead us to a value-transfe… ▽ More We consider the problem of knowledge transfer when an agent is facing a series of Reinforcement Learning (RL) tasks. We introduce a novel metric between Markov Decision Processes (MDPs) and establish that close MDPs have close optimal value functions. Formally, the optimal value functions are Lipschitz continuous with respect to the tasks space. These theoretical results lead us to a value-transfer method for Lifelong RL, which we use to build a PAC-MDP algorithm with improved convergence rate. Further, we show the method to experience no negative transfer with high probability. We illustrate the benefits of the method in Lifelong RL experiments. △ Less

Submitted 22 March, 2021; v1 submitted 15 January, 2020; originally announced January 2020.

Comments: In proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI 2021), 21 pages, 11 figures

arXiv:1905.13320 [pdf, other]

Combating the Compounding-Error Problem with a Multi-step Model

Authors: Kavosh Asadi, Dipendra Misra, Seungchan Kim, Michel L. Littman

Abstract: Model-based reinforcement learning is an appealing framework for creating agents that learn, plan, and act in sequential environments. Model-based algorithms typically involve learning a transition model that takes a state and an action and outputs the next state---a one-step model. This model can be composed with itself to enable predicting multiple steps into the future, but one-step prediction… ▽ More Model-based reinforcement learning is an appealing framework for creating agents that learn, plan, and act in sequential environments. Model-based algorithms typically involve learning a transition model that takes a state and an action and outputs the next state---a one-step model. This model can be composed with itself to enable predicting multiple steps into the future, but one-step prediction errors can get magnified, leading to unacceptable inaccuracy. This compounding-error problem plagues planning and undermines model-based reinforcement learning. In this paper, we address the compounding-error problem by introducing a multi-step model that directly outputs the outcome of executing a sequence of actions. Novel theoretical and empirical results indicate that the multi-step model is more conducive to efficient value-function estimation, and it yields better action selection compared to the one-step model. These results make a strong case for using multi-step models in the context of model-based reinforcement learning. △ Less

Submitted 30 May, 2019; originally announced May 2019.

arXiv:1902.05099 [pdf]

Virtual Manipulation in an Immersive Virtual Environment: Simulation of Virtual Assembly

Authors: Mojtaba Noghabaei, Khashayar Asadi, Kevin Han

Abstract: To fill the lack of research efforts in virtual assembly of modules and training, this paper presents a virtual manipulation of building objects in an Immersive Virtual Environment (IVE). A worker wearing a Virtual Reality (VR) head-mounted device (HMD) virtually perform an assembly of multiple modules while identifying any issues. Hand motions of the worker are tracked by a motion sensor mounted… ▽ More To fill the lack of research efforts in virtual assembly of modules and training, this paper presents a virtual manipulation of building objects in an Immersive Virtual Environment (IVE). A worker wearing a Virtual Reality (VR) head-mounted device (HMD) virtually perform an assembly of multiple modules while identifying any issues. Hand motions of the worker are tracked by a motion sensor mounted on the HMD. The worker can be graded based on his/her overall performance and speed during this VR simulation. The developed VR simulation can ultimately enable workers to identify unforeseen issues (e.g., not enough clearance for an object to be installed). The presented method can solve current deficiencies in discrepancy detection in 3D scanned models of elements. The developed VR platform can also be used for interactive training and simulation sessions that can potentially improve efficiency and help achieve better work performance for assemblies of complex systems. △ Less

Submitted 30 January, 2019; originally announced February 2019.

Comments: 8 pages, 4 figures

arXiv:1901.11078 [pdf, other]

Real-world Map** of Gaze Fixations Using Instance Segmentation for Road Construction Safety Applications

Authors: Idris Jeelani, Khashayar Asadi, Hariharan Ramshankar, Kevin Han, Alex Albert

Abstract: Research studies have shown that a large proportion of hazards remain unrecognized, which expose construction workers to unanticipated safety risks. Recent studies have also found that a strong correlation exists between viewing patterns of workers, captured using eye-tracking devices, and their hazard recognition performance. Therefore, it is important to analyze the viewing patterns of workers t… ▽ More Research studies have shown that a large proportion of hazards remain unrecognized, which expose construction workers to unanticipated safety risks. Recent studies have also found that a strong correlation exists between viewing patterns of workers, captured using eye-tracking devices, and their hazard recognition performance. Therefore, it is important to analyze the viewing patterns of workers to gain a better understanding of their hazard recognition performance. This paper proposes a method that can automatically map the gaze fixations collected using a wearable eye-tracker to the predefined areas of interests. The proposed method detects these areas or objects (i.e., hazards) of interests through a computer vision-based segmentation technique and transfer learning. The mapped fixation data is then used to analyze the viewing behaviors of workers and compute their attention distribution. The proposed method is implemented on an under construction road as a case study to evaluate the performance of the proposed method. △ Less

Submitted 1 February, 2019; v1 submitted 30 January, 2019; originally announced January 2019.

Comments: 2019 TRB Annual meeting

arXiv:1901.08630 [pdf]

Real-time Scene Segmentation Using a Light Deep Neural Network Architecture for Autonomous Robot Navigation on Construction Sites

Authors: Khashayar Asadi, Pengyu Chen, Kevin Han, Tianfu Wu, Edgar Lobaton

Abstract: Camera-equipped unmanned vehicles (UVs) have received a lot of attention in data collection for construction monitoring applications. To develop an autonomous platform, the UV should be able to process multiple modules (e.g., context-awareness, control, localization, and map**) on an embedded platform. Pixel-wise semantic segmentation provides a UV with the ability to be contextually aware of it… ▽ More Camera-equipped unmanned vehicles (UVs) have received a lot of attention in data collection for construction monitoring applications. To develop an autonomous platform, the UV should be able to process multiple modules (e.g., context-awareness, control, localization, and map**) on an embedded platform. Pixel-wise semantic segmentation provides a UV with the ability to be contextually aware of its surrounding environment. However, in the case of mobile robotic systems with limited computing resources, the large size of the segmentation model and high memory usage requires high computing resources, which a major challenge for mobile UVs (e.g., a small-scale vehicle with limited payload and space). To overcome this challenge, this paper presents a light and efficient deep neural network architecture to run on an embedded platform in real-time. The proposed model segments navigable space on an image sequence (i.e., a video stream), which is essential for an autonomous vehicle that is based on machine vision. The results demonstrate the performance efficiency of the proposed architecture compared to the existing models and suggest possible improvements that could make the model even more efficient, which is necessary for the future development of the autonomous robotics systems. △ Less

Submitted 24 January, 2019; originally announced January 2019.

Comments: The 2019 ASCE International Conference on Computing in Civil Engineering

arXiv:1901.08180 [pdf]

Vision-based Obstacle Removal System for Autonomous Ground Vehicles Using a Robotic Arm

Authors: Khashayar Asadi, Rahul Jain, Ziqian Qin, Mingda Sun, Mojtaba Noghabaei, Jeremy Cole, Kevin Han, Edgar Lobaton

Abstract: Over the past few years, the use of camera-equipped robotic platforms for data collection and visually monitoring applications has exponentially grown. Cluttered construction sites with many objects (e.g., bricks, pipes, etc.) on the ground are challenging environments for a mobile unmanned ground vehicle (UGV) to navigate. To address this issue, this study presents a mobile UGV equipped with a st… ▽ More Over the past few years, the use of camera-equipped robotic platforms for data collection and visually monitoring applications has exponentially grown. Cluttered construction sites with many objects (e.g., bricks, pipes, etc.) on the ground are challenging environments for a mobile unmanned ground vehicle (UGV) to navigate. To address this issue, this study presents a mobile UGV equipped with a stereo camera and a robotic arm that can remove obstacles along the UGV's path. To achieve this objective, the surrounding environment is captured by the stereo camera and obstacles are detected. The obstacle's relative location to the UGV is sent to the robotic arm module through Robot Operating System (ROS). Then, the robotic arm picks up and removes the obstacle. The proposed method will greatly enhance the degree of automation and the frequency of data collection for construction monitoring. The proposed system is validated through two case studies. The results successfully demonstrate the detection and removal of obstacles, serving as one of the enabling factors for develo** an autonomous UGV with various construction operating applications. △ Less

Submitted 23 January, 2019; originally announced January 2019.

Comments: The 2019 ASCE International Conference on Computing in Civil Engineering

arXiv:1812.01129 [pdf, other]

Mitigating Planner Overfitting in Model-Based Reinforcement Learning

Authors: Dilip Arumugam, David Abel, Kavosh Asadi, Nakul Gopalan, Christopher Grimm, Jun Ki Lee, Lucas Lehnert, Michael L. Littman

Abstract: An agent with an inaccurate model of its environment faces a difficult choice: it can ignore the errors in its model and act in the real world in whatever way it determines is optimal with respect to its model. Alternatively, it can take a more conservative stance and eschew its model in favor of optimizing its behavior solely via real-world interaction. This latter approach can be exceedingly slo… ▽ More An agent with an inaccurate model of its environment faces a difficult choice: it can ignore the errors in its model and act in the real world in whatever way it determines is optimal with respect to its model. Alternatively, it can take a more conservative stance and eschew its model in favor of optimizing its behavior solely via real-world interaction. This latter approach can be exceedingly slow to learn from experience, while the former can lead to "planner overfitting" - aspects of the agent's behavior are optimized to exploit errors in its model. This paper explores an intermediate position in which the planner seeks to avoid overfitting through a kind of regularization of the plans it considers. We present three different approaches that demonstrably mitigate planner overfitting in reinforcement-learning environments. △ Less

Submitted 19 March, 2020; v1 submitted 3 December, 2018; originally announced December 2018.

arXiv:1811.00128 [pdf, other]

Towards a Simple Approach to Multi-step Model-based Reinforcement Learning

Authors: Kavosh Asadi, Evan Cater, Dipendra Misra, Michael L. Littman

Abstract: When environmental interaction is expensive, model-based reinforcement learning offers a solution by planning ahead and avoiding costly mistakes. Model-based agents typically learn a single-step transition model. In this paper, we propose a multi-step model that predicts the outcome of an action sequence with variable length. We show that this model is easy to learn, and that the model can make po… ▽ More When environmental interaction is expensive, model-based reinforcement learning offers a solution by planning ahead and avoiding costly mistakes. Model-based agents typically learn a single-step transition model. In this paper, we propose a multi-step model that predicts the outcome of an action sequence with variable length. We show that this model is easy to learn, and that the model can make policy-conditional predictions. We report preliminary results that show a clear advantage for the multi-step model compared to its one-step counterpart. △ Less

Submitted 31 October, 2018; originally announced November 2018.

arXiv:1806.01265 [pdf, ps, other]

Equivalence Between Wasserstein and Value-Aware Loss for Model-based Reinforcement Learning

Authors: Kavosh Asadi, Evan Cater, Dipendra Misra, Michael L. Littman

Abstract: Learning a generative model is a key component of model-based reinforcement learning. Though learning a good model in the tabular setting is a simple task, learning a useful model in the approximate setting is challenging. In this context, an important question is the loss function used for model learning as varying the loss function can have a remarkable impact on effectiveness of planning. Recen… ▽ More Learning a generative model is a key component of model-based reinforcement learning. Though learning a good model in the tabular setting is a simple task, learning a useful model in the approximate setting is challenging. In this context, an important question is the loss function used for model learning as varying the loss function can have a remarkable impact on effectiveness of planning. Recently Farahmand et al. (2017) proposed a value-aware model learning (VAML) objective that captures the structure of value function during model learning. Using tools from Asadi et al. (2018), we show that minimizing the VAML objective is in fact equivalent to minimizing the Wasserstein metric. This equivalence improves our understanding of value-aware models, and also creates a theoretical foundation for applications of Wasserstein in model-based reinforcement~learning. △ Less

Submitted 8 July, 2018; v1 submitted 1 June, 2018; originally announced June 2018.

Comments: Accepted at the FAIM workshop "Prediction and Generative Modeling in Reinforcement Learning", Stockholm, Sweden, 2018

arXiv:1804.07193 [pdf, other]

Lipschitz Continuity in Model-based Reinforcement Learning

Authors: Kavosh Asadi, Dipendra Misra, Michael L. Littman

Abstract: We examine the impact of learning Lipschitz continuous models in the context of model-based reinforcement learning. We provide a novel bound on multi-step prediction error of Lipschitz models where we quantify the error using the Wasserstein metric. We go on to prove an error bound for the value-function estimate arising from Lipschitz models and show that the estimated value function is itself Li… ▽ More We examine the impact of learning Lipschitz continuous models in the context of model-based reinforcement learning. We provide a novel bound on multi-step prediction error of Lipschitz models where we quantify the error using the Wasserstein metric. We go on to prove an error bound for the value-function estimate arising from Lipschitz models and show that the estimated value function is itself Lipschitz. We conclude with empirical results that show the benefits of controlling the Lipschitz constant of neural-network models. △ Less

Submitted 27 July, 2018; v1 submitted 19 April, 2018; originally announced April 2018.

Comments: Accepted for the 35th International Conference on Machine Learning (ICML 2018)

arXiv:1803.01745 [pdf, other]

Building an Integrated Mobile Robotic System for Real-Time Applications in Construction

Authors: Khashayar Asadi, Hariharan Ramshankar, Harish Pullagurla, Aishwarya Bhandare, Suraj Shanbhag, Pooja Mehta, Spondon Kundu, Kevin Han, Edgar Lobaton, Tianfu Wu

Abstract: One of the major challenges of a real-time autonomous robotic system for construction monitoring is to simultaneously localize, map, and navigate over the lifetime of the robot, with little or no human intervention. Past research on Simultaneous Localization and Map** (SLAM) and context-awareness are two active research areas in the computer vision and robotics communities. The studies that inte… ▽ More One of the major challenges of a real-time autonomous robotic system for construction monitoring is to simultaneously localize, map, and navigate over the lifetime of the robot, with little or no human intervention. Past research on Simultaneous Localization and Map** (SLAM) and context-awareness are two active research areas in the computer vision and robotics communities. The studies that integrate both in real-time into a single modular framework for construction monitoring still need further investigation. A monocular vision system and real-time scene understanding are computationally heavy and the major state-of-the-art algorithms are tested on high-end desktops and/or servers with a high CPU- and/or GPU- computing capabilities, which affect their mobility and deployment for real-world applications. To address these challenges and achieve automation, this paper proposes an integrated robotic computer vision system, which generates a real-world spatial map of the obstacles and traversable space present in the environment in near real-time. This is done by integrating contextual Awareness and visual SLAM into a ground robotics agent. This paper presents the hardware utilization and performance of the aforementioned system for three different outdoor environments, which represent the applicability of this pipeline to diverse outdoor scenes in near real-time. The entire system is also self-contained and does not require user input, which demonstrates the potential of this computer vision system for autonomous navigation. △ Less

Submitted 18 April, 2018; v1 submitted 5 March, 2018; originally announced March 2018.

arXiv:1709.00503 [pdf, other]

Mean Actor Critic

Authors: Cameron Allen, Kavosh Asadi, Melrose Roderick, Abdel-rahman Mohamed, George Konidaris, Michael Littman

Abstract: We propose a new algorithm, Mean Actor-Critic (MAC), for discrete-action continuous-state reinforcement learning. MAC is a policy gradient algorithm that uses the agent's explicit representation of all action values to estimate the gradient of the policy, rather than using only the actions that were actually executed. We prove that this approach reduces variance in the policy gradient estimate rel… ▽ More We propose a new algorithm, Mean Actor-Critic (MAC), for discrete-action continuous-state reinforcement learning. MAC is a policy gradient algorithm that uses the agent's explicit representation of all action values to estimate the gradient of the policy, rather than using only the actions that were actually executed. We prove that this approach reduces variance in the policy gradient estimate relative to traditional actor-critic methods. We show empirical results on two control domains and on six Atari games, where MAC is competitive with state-of-the-art policy search algorithms. △ Less

Submitted 22 May, 2018; v1 submitted 1 September, 2017; originally announced September 2017.

arXiv:1702.03274 [pdf, other]

Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning

Authors: Jason D. Williams, Kavosh Asadi, Geoffrey Zweig

Abstract: End-to-end learning of recurrent neural networks (RNNs) is an attractive solution for dialog systems; however, current techniques are data-intensive and require thousands of dialogs to learn simple behaviors. We introduce Hybrid Code Networks (HCNs), which combine an RNN with domain-specific knowledge encoded as software and system action templates. Compared to existing end-to-end approaches, HCNs… ▽ More End-to-end learning of recurrent neural networks (RNNs) is an attractive solution for dialog systems; however, current techniques are data-intensive and require thousands of dialogs to learn simple behaviors. We introduce Hybrid Code Networks (HCNs), which combine an RNN with domain-specific knowledge encoded as software and system action templates. Compared to existing end-to-end approaches, HCNs considerably reduce the amount of training data required, while retaining the key benefit of inferring a latent representation of dialog state. In addition, HCNs can be optimized with supervised learning, reinforcement learning, or a mixture of both. HCNs attain state-of-the-art performance on the bAbI dialog dataset, and outperform two commercially deployed customer-facing dialog systems. △ Less

Submitted 24 April, 2017; v1 submitted 10 February, 2017; originally announced February 2017.

Comments: Accepted as a long paper for the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017)

arXiv:1612.06000 [pdf, other]

Sample-efficient Deep Reinforcement Learning for Dialog Control

Authors: Kavosh Asadi, Jason D. Williams

Abstract: Representing a dialog policy as a recurrent neural network (RNN) is attractive because it handles partial observability, infers a latent representation of state, and can be optimized with supervised learning (SL) or reinforcement learning (RL). For RL, a policy gradient approach is natural, but is sample inefficient. In this paper, we present 3 methods for reducing the number of dialogs required t… ▽ More Representing a dialog policy as a recurrent neural network (RNN) is attractive because it handles partial observability, infers a latent representation of state, and can be optimized with supervised learning (SL) or reinforcement learning (RL). For RL, a policy gradient approach is natural, but is sample inefficient. In this paper, we present 3 methods for reducing the number of dialogs required to optimize an RNN-based dialog policy with RL. The key idea is to maintain a second RNN which predicts the value of the current policy, and to apply experience replay to both networks. On two tasks, these methods reduce the number of dialogs/episodes required by about a third, vs. standard policy gradient methods. △ Less

Submitted 18 December, 2016; originally announced December 2016.

arXiv:1612.05628 [pdf, other]

An Alternative Softmax Operator for Reinforcement Learning

Authors: Kavosh Asadi, Michael L. Littman

Abstract: A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one's weight behind a single maximum utility decision. The Boltzmann softmax operator is the most commonly… ▽ More A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one's weight behind a single maximum utility decision. The Boltzmann softmax operator is the most commonly used softmax operator in this setting, but we show that this operator is prone to misbehavior. In this work, we study a differentiable softmax operator that, among other properties, is a non-expansion ensuring a convergent behavior in learning and planning. We introduce a variant of SARSA algorithm that, by utilizing the new operator, computes a Boltzmann policy with a state-dependent temperature parameter. We show that the algorithm is convergent and that it performs favorably in practice. △ Less

Submitted 14 June, 2017; v1 submitted 16 December, 2016; originally announced December 2016.

Showing 1–26 of 26 results for author: Asadi, K