Search | arXiv e-print repository

Grounding Multimodal Large Language Models in Actions

Authors: Andrew Szot, Bogdan Mazoure, Harsh Agrawal, Devon Hjelm, Zsolt Kira, Alexander Toshev

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how to best ground a MLLM into different embodiments and their associated action spaces, with the goal of leveraging the multimodal world knowledge of the MLLM. We first generalize a number of methods through a unified architecture and the lens… ▽ More Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how to best ground a MLLM into different embodiments and their associated action spaces, with the goal of leveraging the multimodal world knowledge of the MLLM. We first generalize a number of methods through a unified architecture and the lens of action space adaptors. For continuous actions, we show that a learned tokenization allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action space adapters on five different environments, encompassing over 114 embodied tasks. △ Less

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2310.17722 [pdf, other]

Large Language Models as Generalizable Policies for Embodied Tasks

Authors: Andrew Szot, Max Schwarzer, Harsh Agrawal, Bogdan Mazoure, Walter Talbott, Katherine Metcalf, Natalie Mackraz, Devon Hjelm, Alexander Toshev

Abstract: We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks. Our approach, called Large LAnguage model Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take as input text instructions and visual egocentric observations and output actions directly in the environment. Using reinforcement learning, we train LLaRP to see and… ▽ More We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks. Our approach, called Large LAnguage model Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take as input text instructions and visual egocentric observations and output actions directly in the environment. Using reinforcement learning, we train LLaRP to see and act solely through environmental interactions. We show that LLaRP is robust to complex paraphrasings of task instructions and can generalize to new tasks that require novel optimal behavior. In particular, on 1,000 unseen tasks it achieves 42% success rate, 1.7x the success rate of other common learned baselines or zero-shot applications of LLMs. Finally, to aid the community in studying language conditioned, massively multi-task, embodied AI problems we release a novel benchmark, Language Rearrangement, consisting of 150,000 training and 1,000 testing tasks for language-conditioned rearrangement. Video examples of LLaRP in unseen Language Rearrangement instructions are at https://llm-rl.github.io. △ Less

Submitted 16 April, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

arXiv:2306.07290 [pdf, other]

Value function estimation using conditional diffusion models for control

Authors: Bogdan Mazoure, Walter Talbott, Miguel Angel Bautista, Devon Hjelm, Alexander Toshev, Josh Susskind

Abstract: A fairly reliable trend in deep reinforcement learning is that the performance scales with the number of parameters, provided a complimentary scaling in amount of training data. As the appetite for large models increases, it is imperative to address, sooner than later, the potential problem of running out of high-quality demonstrations. In this case, instead of collecting only new data via costly… ▽ More A fairly reliable trend in deep reinforcement learning is that the performance scales with the number of parameters, provided a complimentary scaling in amount of training data. As the appetite for large models increases, it is imperative to address, sooner than later, the potential problem of running out of high-quality demonstrations. In this case, instead of collecting only new data via costly human demonstrations or risking a simulation-to-real transfer with uncertain effects, it would be beneficial to leverage vast amounts of readily-available low-quality data. Since classical control algorithms such as behavior cloning or temporal difference learning cannot be used on reward-free or action-free data out-of-the-box, this solution warrants novel training paradigms for continuous control. We propose a simple algorithm called Diffused Value Function (DVF), which learns a joint multi-step model of the environment-robot interaction dynamics using a diffusion model. This model can be efficiently learned from state sequences (i.e., without access to reward functions nor actions), and subsequently used to estimate the value of each action out-of-the-box. We show how DVF can be used to efficiently capture the state visitation measure for multiple controllers, and show promising qualitative and quantitative results on challenging robotics benchmarks. △ Less

Submitted 9 June, 2023; originally announced June 2023.

arXiv:2304.00046 [pdf, other]

Accelerating exploration and representation learning with offline pre-training

Authors: Bogdan Mazoure, Jake Bruce, Doina Precup, Rob Fergus, Ankit Anand

Abstract: Sequential decision-making agents struggle with long horizon tasks, since solving them requires multi-step reasoning. Most reinforcement learning (RL) algorithms address this challenge by improved credit assignment, introducing memory capability, altering the agent's intrinsic motivation (i.e. exploration) or its worldview (i.e. knowledge representation). Many of these components could be learned… ▽ More Sequential decision-making agents struggle with long horizon tasks, since solving them requires multi-step reasoning. Most reinforcement learning (RL) algorithms address this challenge by improved credit assignment, introducing memory capability, altering the agent's intrinsic motivation (i.e. exploration) or its worldview (i.e. knowledge representation). Many of these components could be learned from offline data. In this work, we follow the hypothesis that exploration and representation learning can be improved by separately learning two different models from a single offline dataset. We show that learning a state representation using noise-contrastive estimation and a model of auxiliary reward separately from a single collection of human demonstrations can significantly improve the sample efficiency on the challenging NetHack benchmark. We also ablate various components of our experimental setting and highlight crucial insights. △ Less

Submitted 31 March, 2023; originally announced April 2023.

arXiv:2211.02100 [pdf, other]

Contrastive Value Learning: Implicit Models for Simple Offline RL

Authors: Bogdan Mazoure, Benjamin Eysenbach, Ofir Nachum, Jonathan Tompson

Abstract: Model-based reinforcement learning (RL) methods are appealing in the offline setting because they allow an agent to reason about the consequences of actions without interacting with the environment. Prior methods learn a 1-step dynamics model, which predicts the next state given the current state and action. These models do not immediately tell the agent which actions to take, but must be integrat… ▽ More Model-based reinforcement learning (RL) methods are appealing in the offline setting because they allow an agent to reason about the consequences of actions without interacting with the environment. Prior methods learn a 1-step dynamics model, which predicts the next state given the current state and action. These models do not immediately tell the agent which actions to take, but must be integrated into a larger RL framework. Can we model the environment dynamics in a different way, such that the learned model does directly indicate the value of each action? In this paper, we propose Contrastive Value Learning (CVL), which learns an implicit, multi-step model of the environment dynamics. This model can be learned without access to reward functions, but nonetheless can be used to directly estimate the value of each action, without requiring any TD learning. Because this model represents the multi-step transitions implicitly, it avoids having to predict high-dimensional observations and thus scales to high-dimensional tasks. Our experiments demonstrate that CVL outperforms prior offline RL methods on complex continuous control benchmarks. △ Less

Submitted 3 November, 2022; originally announced November 2022.

Comments: Deep Reinforcement Learning Workshop, NeurIPS 2022

arXiv:2206.03923 [pdf, other]

Sequential Density Estimation via Nonlinear Continuous Weighted Finite Automata

Authors: Tianyu Li, Bogdan Mazoure, Guillaume Rabusseau

Abstract: Weighted finite automata (WFAs) have been widely applied in many fields. One of the classic problems for WFAs is probability distribution estimation over sequences of discrete symbols. Although WFAs have been extended to deal with continuous input data, namely continuous WFAs (CWFAs), it is still unclear how to approximate density functions over sequences of continuous random variables using WFA-b… ▽ More Weighted finite automata (WFAs) have been widely applied in many fields. One of the classic problems for WFAs is probability distribution estimation over sequences of discrete symbols. Although WFAs have been extended to deal with continuous input data, namely continuous WFAs (CWFAs), it is still unclear how to approximate density functions over sequences of continuous random variables using WFA-based models, due to the limitation on the expressiveness of the model as well as the tractability of approximating density functions via CWFAs. In this paper, we propose a nonlinear extension to the CWFA model to first improve its expressiveness, we refer to it as the nonlinear continuous WFAs (NCWFAs). Then we leverage the so-called RNADE method, which is a well-known density estimator based on neural networks, and propose the RNADE-NCWFA model. The RNADE-NCWFA model computes a density function by design. We show that this model is strictly more expressive than the Gaussian HMM model, which CWFA cannot approximate. Empirically, we conduct a synthetic experiment using Gaussian HMM generated data. We focus on evaluating the model's ability to estimate densities for sequences of varying lengths (longer length than the training data). We observe that our model performs the best among the compared baseline methods. △ Less

Submitted 12 December, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

arXiv:2203.10351 [pdf, other]

The Sandbox Environment for Generalizable Agent Research (SEGAR)

Authors: R Devon Hjelm, Bogdan Mazoure, Florian Golemo, Felipe Frujeri, Mihai Jalobeanu, Andrey Kolobov

Abstract: A broad challenge of research on generalization for sequential decision-making tasks in interactive environments is designing benchmarks that clearly landmark progress. While there has been notable headway, current benchmarks either do not provide suitable exposure nor intuitive control of the underlying factors, are not easy-to-implement, customizable, or extensible, or are computationally expens… ▽ More A broad challenge of research on generalization for sequential decision-making tasks in interactive environments is designing benchmarks that clearly landmark progress. While there has been notable headway, current benchmarks either do not provide suitable exposure nor intuitive control of the underlying factors, are not easy-to-implement, customizable, or extensible, or are computationally expensive to run. We built the Sandbox Environment for Generalizable Agent Research (SEGAR) with all of these things in mind. SEGAR improves the ease and accountability of generalization research in RL, as generalization objectives can be easy designed by specifying task distributions, which in turns allows the researcher to measure the nature of the generalization objective. We present an overview of SEGAR and how it contributes to these goals, as well as experiments that demonstrate a few types of research questions SEGAR can help answer. △ Less

Submitted 19 March, 2022; originally announced March 2022.

arXiv:2111.14629 [pdf, other]

Improving Zero-shot Generalization in Offline Reinforcement Learning using Generalized Similarity Functions

Authors: Bogdan Mazoure, Ilya Kostrikov, Ofir Nachum, Jonathan Tompson

Abstract: Reinforcement learning (RL) agents are widely used for solving complex sequential decision making tasks, but still exhibit difficulty in generalizing to scenarios not seen during training. While prior online approaches demonstrated that using additional signals beyond the reward function can lead to better generalization capabilities in RL agents, i.e. using self-supervised learning (SSL), they st… ▽ More Reinforcement learning (RL) agents are widely used for solving complex sequential decision making tasks, but still exhibit difficulty in generalizing to scenarios not seen during training. While prior online approaches demonstrated that using additional signals beyond the reward function can lead to better generalization capabilities in RL agents, i.e. using self-supervised learning (SSL), they struggle in the offline RL setting, i.e. learning from a static dataset. We show that performance of online algorithms for generalization in RL can be hindered in the offline setting due to poor estimation of similarity between observations. We propose a new theoretically-motivated framework called Generalized Similarity Functions (GSF), which uses contrastive learning to train an offline RL agent to aggregate observations based on the similarity of their expected future behavior, where we quantify this similarity using \emph{generalized value functions}. We show that GSF is general enough to recover existing SSL objectives while also improving zero-shot generalization performance on a complex offline RL benchmark, offline Procgen. △ Less

Submitted 29 November, 2021; originally announced November 2021.

Comments: Offline RL workshop at NeurIPS 2021

arXiv:2106.02193 [pdf, other]

Cross-Trajectory Representation Learning for Zero-Shot Generalization in RL

Authors: Bogdan Mazoure, Ahmed M. Ahmed, Patrick MacAlpine, R Devon Hjelm, Andrey Kolobov

Abstract: A highly desirable property of a reinforcement learning (RL) agent -- and a major difficulty for deep RL approaches -- is the ability to generalize policies learned on a few tasks over a high-dimensional observation space to similar tasks not seen during training. Many promising approaches to this challenge consider RL as a process of training two functions simultaneously: a complex nonlinear enco… ▽ More A highly desirable property of a reinforcement learning (RL) agent -- and a major difficulty for deep RL approaches -- is the ability to generalize policies learned on a few tasks over a high-dimensional observation space to similar tasks not seen during training. Many promising approaches to this challenge consider RL as a process of training two functions simultaneously: a complex nonlinear encoder that maps high-dimensional observations to a latent representation space, and a simple linear policy over this space. We posit that a superior encoder for zero-shot generalization in RL can be trained by using solely an auxiliary SSL objective if the training process encourages the encoder to map behaviorally similar observations to similar representations, as reward-based signal can cause overfitting in the encoder (Raileanu et al., 2021). We propose Cross-Trajectory Representation Learning (CTRL), a method that runs within an RL agent and conditions its encoder to recognize behavioral similarity in observations by applying a novel SSL objective to pairs of trajectories from the agent's policies. CTRL can be viewed as having the same effect as inducing a pseudo-bisimulation metric but, crucially, avoids the use of rewards and associated overfitting risks. Our experiments ablate various components of CTRL and demonstrate that in combination with PPO it achieves better generalization performance on the challenging Procgen benchmark suite (Cobbe et al., 2020). △ Less

Submitted 16 March, 2022; v1 submitted 3 June, 2021; originally announced June 2021.

Comments: ICLR 2022

arXiv:2106.00589 [pdf, other]

Improving Long-Term Metrics in Recommendation Systems using Short-Horizon Reinforcement Learning

Authors: Bogdan Mazoure, Paul Mineiro, Pavithra Srinath, Reza Sharifi Sedeh, Doina Precup, Adith Swaminathan

Abstract: We study session-based recommendation scenarios where we want to recommend items to users during sequential interactions to improve their long-term utility. Optimizing a long-term metric is challenging because the learning signal (whether the recommendations achieved their desired goals) is delayed and confounded by other user interactions with the system. Targeting immediately measurable proxies… ▽ More We study session-based recommendation scenarios where we want to recommend items to users during sequential interactions to improve their long-term utility. Optimizing a long-term metric is challenging because the learning signal (whether the recommendations achieved their desired goals) is delayed and confounded by other user interactions with the system. Targeting immediately measurable proxies such as clicks can lead to suboptimal recommendations due to misalignment with the long-term metric. We develop a new reinforcement learning algorithm called Short Horizon Policy Improvement (SHPI) that approximates policy-induced drift in user behavior across sessions. SHPI is a straightforward modification of episodic RL algorithms for session-based recommendation, that additionally gives an appropriate termination bonus in each session. Empirical results on four recommendation tasks show that SHPI can outperform state-of-the-art recommendation techniques like matrix factorization with offline proxy signals, bandits with myopic online proxies, and RL baselines with limited amounts of user interaction. △ Less

Submitted 14 September, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

arXiv:2010.04003 [pdf, other]

A Theoretical Analysis of Catastrophic Forgetting through the NTK Overlap Matrix

Authors: Thang Doan, Mehdi Bennani, Bogdan Mazoure, Guillaume Rabusseau, Pierre Alquier

Abstract: Continual learning (CL) is a setting in which an agent has to learn from an incoming stream of data during its entire lifetime. Although major advances have been made in the field, one recurring problem which remains unsolved is that of Catastrophic Forgetting (CF). While the issue has been extensively studied empirically, little attention has been paid from a theoretical angle. In this paper, we… ▽ More Continual learning (CL) is a setting in which an agent has to learn from an incoming stream of data during its entire lifetime. Although major advances have been made in the field, one recurring problem which remains unsolved is that of Catastrophic Forgetting (CF). While the issue has been extensively studied empirically, little attention has been paid from a theoretical angle. In this paper, we show that the impact of CF increases as two tasks increasingly align. We introduce a measure of task similarity called the NTK overlap matrix which is at the core of CF. We analyze common projected gradient algorithms and demonstrate how they mitigate forgetting. Then, we propose a variant of Orthogonal Gradient Descent (OGD) which leverages structure of the data through Principal Component Analysis (PCA). Experiments support our theoretical findings and show how our method can help reduce CF on classical CL datasets. △ Less

Submitted 25 February, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

Comments: Accepted to AISTATS 2021. Keywords: continual learning, catastrophic forgetting, NTK regime, orthgonal gradient descent

Journal ref: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS 2021)

arXiv:2006.07217 [pdf, other]

Deep Reinforcement and InfoMax Learning

Authors: Bogdan Mazoure, Remi Tachet des Combes, Thang Doan, Philip Bachman, R Devon Hjelm

Abstract: We begin with the hypothesis that a model-free agent whose representations are predictive of properties of future states (beyond expected rewards) will be more capable of solving and adapting to new RL problems. To test that hypothesis, we introduce an objective based on Deep InfoMax (DIM) which trains the agent to predict the future by maximizing the mutual information between its internal repres… ▽ More We begin with the hypothesis that a model-free agent whose representations are predictive of properties of future states (beyond expected rewards) will be more capable of solving and adapting to new RL problems. To test that hypothesis, we introduce an objective based on Deep InfoMax (DIM) which trains the agent to predict the future by maximizing the mutual information between its internal representation of successive timesteps. We test our approach in several synthetic settings, where it successfully learns representations that are predictive of the future. Finally, we augment C51, a strong RL baseline, with our temporal DIM objective and demonstrate improved performance on a continual learning task and on the recently introduced Procgen environment. △ Less

Submitted 16 November, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

Comments: NeurIPS 2020

arXiv:2002.02863 [pdf, other]

Representation of Reinforcement Learning Policies in Reproducing Kernel Hilbert Spaces

Authors: Bogdan Mazoure, Thang Doan, Tianyu Li, Vladimir Makarenkov, Joelle Pineau, Doina Precup, Guillaume Rabusseau

Abstract: We propose a general framework for policy representation for reinforcement learning tasks. This framework involves finding a low-dimensional embedding of the policy on a reproducing kernel Hilbert space (RKHS). The usage of RKHS based methods allows us to derive strong theoretical guarantees on the expected return of the reconstructed policy. Such guarantees are typically lacking in black-box mode… ▽ More We propose a general framework for policy representation for reinforcement learning tasks. This framework involves finding a low-dimensional embedding of the policy on a reproducing kernel Hilbert space (RKHS). The usage of RKHS based methods allows us to derive strong theoretical guarantees on the expected return of the reconstructed policy. Such guarantees are typically lacking in black-box models, but are very desirable in tasks requiring stability. We conduct several experiments on classic RL domains. The results confirm that the policies can be robustly embedded in a low-dimensional space while the embedded policy incurs almost no decrease in return. △ Less

Submitted 15 October, 2020; v1 submitted 7 February, 2020; originally announced February 2020.

arXiv:1911.05010 [pdf, other]

Efficient Planning under Partial Observability with Unnormalized Q Functions and Spectral Learning

Authors: Tianyu Li, Bogdan Mazoure, Doina Precup, Guillaume Rabusseau

Abstract: Learning and planning in partially-observable domains is one of the most difficult problems in reinforcement learning. Traditional methods consider these two problems as independent, resulting in a classical two-stage paradigm: first learn the environment dynamics and then plan accordingly. This approach, however, disconnects the two problems and can consequently lead to algorithms that are sample… ▽ More Learning and planning in partially-observable domains is one of the most difficult problems in reinforcement learning. Traditional methods consider these two problems as independent, resulting in a classical two-stage paradigm: first learn the environment dynamics and then plan accordingly. This approach, however, disconnects the two problems and can consequently lead to algorithms that are sample inefficient and time consuming. In this paper, we propose a novel algorithm that combines learning and planning together. Our algorithm is closely related to the spectral learning algorithm for predicitive state representations and offers appealing theoretical guarantees and time complexity. We empirically show on two domains that our approach is more sample and time efficient compared to classical methods. △ Less

Submitted 21 November, 2019; v1 submitted 12 November, 2019; originally announced November 2019.

arXiv:1909.07543 [pdf, other]

Attraction-Repulsion Actor-Critic for Continuous Control Reinforcement Learning

Authors: Thang Doan, Bogdan Mazoure, Moloud Abdar, Audrey Durand, Joelle Pineau, R Devon Hjelm

Abstract: Continuous control tasks in reinforcement learning are important because they provide an important framework for learning in high-dimensional state spaces with deceptive rewards, where the agent can easily become trapped into suboptimal solutions. One way to avoid local optima is to use a population of agents to ensure coverage of the policy space, yet learning a population with the "best" coverag… ▽ More Continuous control tasks in reinforcement learning are important because they provide an important framework for learning in high-dimensional state spaces with deceptive rewards, where the agent can easily become trapped into suboptimal solutions. One way to avoid local optima is to use a population of agents to ensure coverage of the policy space, yet learning a population with the "best" coverage is still an open problem. In this work, we present a novel approach to population-based RL in continuous control that leverages properties of normalizing flows to perform attractive and repulsive operations between current members of the population and previously observed policies. Empirical results on the MuJoCo suite demonstrate a high performance gain for our algorithm compared to prior work, including Soft-Actor Critic (SAC). △ Less

Submitted 9 July, 2020; v1 submitted 16 September, 2019; originally announced September 2019.

arXiv:1906.02719 [pdf, other]

Learning Gaussian Graphical Models with Ordered Weighted L1 Regularization

Authors: Cody Mazza-Anthony, Bogdan Mazoure, Mark Coates

Abstract: We address the task of identifying densely connected subsets of multivariate Gaussian random variables within a graphical model framework. We propose two novel estimators based on the Ordered Weighted $\ell_1$ (OWL) norm: 1) The Graphical OWL (GOWL) is a penalized likelihood method that applies the OWL norm to the lower triangle components of the precision matrix. 2) The column-by-column Graphical… ▽ More We address the task of identifying densely connected subsets of multivariate Gaussian random variables within a graphical model framework. We propose two novel estimators based on the Ordered Weighted $\ell_1$ (OWL) norm: 1) The Graphical OWL (GOWL) is a penalized likelihood method that applies the OWL norm to the lower triangle components of the precision matrix. 2) The column-by-column Graphical OWL (ccGOWL) estimates the precision matrix by performing OWL regularized linear regressions. Both methods can simultaneously identify highly correlated groups of variables and control the sparsity in the resulting precision matrix. We formulate GOWL such that it solves a composite optimization problem and establish that the estimator has a unique global solution. In addition, we prove sufficient grou** conditions for each column of the ccGOWL precision matrix estimate. We propose proximal descent algorithms to find the optimum for both estimators. For synthetic data where group structure is present, the ccGOWL estimator requires significantly reduced computation and achieves similar or greater accuracy than state-of-the-art estimators. Timing comparisons are presented and demonstrates the superior computational efficiency of the ccGOWL. We illustrate the grou** performance of the ccGOWL method on a cancer gene expression data set and an equities data set. △ Less

Submitted 19 November, 2020; v1 submitted 6 June, 2019; originally announced June 2019.

Comments: Published in IEEE Transactions on Signal Processing

arXiv:1905.06893 [pdf, other]

Leveraging exploration in off-policy algorithms via normalizing flows

Authors: Bogdan Mazoure, Thang Doan, Audrey Durand, R Devon Hjelm, Joelle Pineau

Abstract: The ability to discover approximately optimal policies in domains with sparse rewards is crucial to applying reinforcement learning (RL) in many real-world scenarios. Approaches such as neural density models and continuous exploration (e.g., Go-Explore) have been proposed to maintain the high exploration rate necessary to find high performing and generalizable policies. Soft actor-critic(SAC) is a… ▽ More The ability to discover approximately optimal policies in domains with sparse rewards is crucial to applying reinforcement learning (RL) in many real-world scenarios. Approaches such as neural density models and continuous exploration (e.g., Go-Explore) have been proposed to maintain the high exploration rate necessary to find high performing and generalizable policies. Soft actor-critic(SAC) is another method for improving exploration that aims to combine efficient learning via off-policy updates while maximizing the policy entropy. In this work, we extend SAC to a richer class of probability distributions (e.g., multimodal) through normalizing flows (NF) and show that this significantly improves performance by accelerating the discovery of good policies while using much smaller policy representations. Our approach, which we call SAC-NF, is a simple, efficient,easy-to-implement modification and improvement to SAC on continuous control baselines such as MuJoCo and PyBullet Roboschool domains. Finally, SAC-NF does this while being significantly parameter efficient, using as few as 5.5% the parameters for an equivalent SAC model. △ Less

Submitted 24 September, 2019; v1 submitted 16 May, 2019; originally announced May 2019.

Comments: Accepted to 3rd Conference on Robot Learning (CoRL 2019); Keywords: Exploration, soft actor-critic, normalizing flow, off-policy; maximum entropy, reinforcement learning; deceptive reward; sparse reward; inverse autoregressive flow

arXiv:1902.00570 [pdf, other]

Exploring attention mechanism for acoustic-based classification of speech utterances into system-directed and non-system-directed

Authors: Atta Norouzian, Bogdan Mazoure, Dermot Connolly, Daniel Willett

Abstract: Voice controlled virtual assistants (VAs) are now available in smartphones, cars, and standalone devices in homes. In most cases, the user needs to first "wake-up" the VA by saying a particular word/phrase every time he or she wants the VA to do something. Eliminating the need for saying the wake-up word for every interaction could improve the user experience. This would require the VA to have the… ▽ More Voice controlled virtual assistants (VAs) are now available in smartphones, cars, and standalone devices in homes. In most cases, the user needs to first "wake-up" the VA by saying a particular word/phrase every time he or she wants the VA to do something. Eliminating the need for saying the wake-up word for every interaction could improve the user experience. This would require the VA to have the capability to detect the speech that is being directed at it and respond accordingly. In other words, the challenge is to distinguish between system-directed and non-system-directed speech utterances. In this paper, we present a number of neural network architectures for tackling this classification problem based on using only acoustic features. These architectures are based on using convolutional, recurrent and feed-forward layers. In addition, we investigate the use of an attention mechanism applied to the output of the convolutional and the recurrent layers. It is shown that incorporating the proposed attention mechanism into the models always leads to significant improvement in classification accuracy. The best model achieved equal error rates of 16.25 and 15.62 percents on two distinct realistic datasets. △ Less

Submitted 1 February, 2019; originally announced February 2019.

Comments: Accpeted for presentation at ICASSP2019

arXiv:1808.00020 [pdf, other]

On-line Adaptative Curriculum Learning for GANs

Authors: Thang Doan, Joao Monteiro, Isabela Albuquerque, Bogdan Mazoure, Audrey Durand, Joelle Pineau, R Devon Hjelm

Abstract: Generative Adversarial Networks (GANs) can successfully approximate a probability distribution and produce realistic samples. However, open questions such as sufficient convergence conditions and mode collapse still persist. In this paper, we build on existing work in the area by proposing a novel framework for training the generator against an ensemble of discriminator networks, which can be seen… ▽ More Generative Adversarial Networks (GANs) can successfully approximate a probability distribution and produce realistic samples. However, open questions such as sufficient convergence conditions and mode collapse still persist. In this paper, we build on existing work in the area by proposing a novel framework for training the generator against an ensemble of discriminator networks, which can be seen as a one-student/multiple-teachers setting. We formalize this problem within the full-information adversarial bandit framework, where we evaluate the capability of an algorithm to select mixtures of discriminators for providing the generator with feedback during learning. To this end, we propose a reward function which reflects the progress made by the generator and dynamically update the mixture weights allocated to each discriminator. We also draw connections between our algorithm and stochastic optimization methods and then show that existing approaches using multiple discriminators in literature can be recovered from our framework. We argue that less expressive discriminators are smoother and have a general coarse grained view of the modes map, which enforces the generator to cover a wide portion of the data distribution support. On the other hand, highly expressive discriminators ensure samples quality. Finally, experimental results show that our approach improves samples quality and diversity over existing baselines by effectively learning a curriculum. These results also support the claim that weaker discriminators have higher entropy improving modes coverage. Keywords: multiple discriminators, curriculum learning, multiple resolutions discriminators, multi-armed bandits, generative adversarial networks, smooth discriminators, multi-discriminator gan training, multiple experts. △ Less

Submitted 11 March, 2019; v1 submitted 31 July, 2018; originally announced August 2018.

Comments: Accepted to the Thirty-Third AAAI Conference On Artificial Intelligence, 2019 (Added 128x128 CelebA samples to the end of the appendix)

Journal ref: Proceedings of 33rd AAAI Conference on Artificial Intelligence (AAAI 2019)

arXiv:1805.04874 [pdf, other]

GAN Q-learning

Authors: Thang Doan, Bogdan Mazoure, Clare Lyle

Abstract: Distributional reinforcement learning (distributional RL) has seen empirical success in complex Markov Decision Processes (MDPs) in the setting of nonlinear function approximation. However, there are many different ways in which one can leverage the distributional approach to reinforcement learning. In this paper, we propose GAN Q-learning, a novel distributional RL method based on generative adve… ▽ More Distributional reinforcement learning (distributional RL) has seen empirical success in complex Markov Decision Processes (MDPs) in the setting of nonlinear function approximation. However, there are many different ways in which one can leverage the distributional approach to reinforcement learning. In this paper, we propose GAN Q-learning, a novel distributional RL method based on generative adversarial networks (GANs) and analyze its performance in simple tabular environments, as well as OpenAI Gym. We empirically show that our algorithm leverages the flexibility and blackbox approach of deep learning models while providing a viable alternative to traditional methods. △ Less

Submitted 20 July, 2018; v1 submitted 13 May, 2018; originally announced May 2018.

arXiv:1711.04345 [pdf, other]

Alpha-Divergences in Variational Dropout

Authors: Bogdan Mazoure, Riashat Islam

Abstract: We investigate the use of alternative divergences to Kullback-Leibler (KL) in variational inference(VI), based on the Variational Dropout \cite{kingma2015}. Stochastic gradient variational Bayes (SGVB) \cite{aevb} is a general framework for estimating the evidence lower bound (ELBO) in Variational Bayes. In this work, we extend the SGVB estimator with using Alpha-Divergences, which are alternative… ▽ More We investigate the use of alternative divergences to Kullback-Leibler (KL) in variational inference(VI), based on the Variational Dropout \cite{kingma2015}. Stochastic gradient variational Bayes (SGVB) \cite{aevb} is a general framework for estimating the evidence lower bound (ELBO) in Variational Bayes. In this work, we extend the SGVB estimator with using Alpha-Divergences, which are alternative to divergences to VI' KL objective. The Gaussian dropout can be seen as a local reparametrization trick of the SGVB objective. We extend the Variational Dropout to use alpha divergences for variational inference. Our results compare $α$-divergence variational dropout with standard variational dropout with correlated and uncorrelated weight noise. We show that the $α$-divergence with $α\rightarrow 1$ (or KL divergence) is still a good measure for use in variational inference, in spite of the efficient use of Alpha-divergences for Dropout VI \cite{Li17}. $α\rightarrow 1$ can yield the lowest training error, and optimizes a good lower bound for the evidence lower bound (ELBO) among all values of the parameter $α\in [0,\infty)$. △ Less

Submitted 12 November, 2017; originally announced November 2017.

Comments: Bogdan Mazoure and Riashat Islam contributed equally

Showing 1–21 of 21 results for author: Mazoure, B