-
Singular value automata and approximate minimization
Authors:
Borja Balle,
Prakash Panangaden,
Doina Precup
Abstract:
The present paper uses spectral theory of linear operators to construct approximately minimal realizations of weighted languages. Our new contributions are: (i) a new algorithm for the SVD decomposition of infinite Hankel matrices based on their representation in terms of weighted automata, (ii) a new canonical form for weighted automata arising from the SVD of its corresponding Hankel matrix and…
▽ More
The present paper uses spectral theory of linear operators to construct approximately minimal realizations of weighted languages. Our new contributions are: (i) a new algorithm for the SVD decomposition of infinite Hankel matrices based on their representation in terms of weighted automata, (ii) a new canonical form for weighted automata arising from the SVD of its corresponding Hankel matrix and (iii) an algorithm to construct approximate minimizations of given weighted automata by truncating the canonical form. We give bounds on the quality of our approximation.
△ Less
Submitted 27 May, 2019; v1 submitted 16 November, 2017;
originally announced November 2017.
-
Learning with Options that Terminate Off-Policy
Authors:
Anna Harutyunyan,
Peter Vrancx,
Pierre-Luc Bacon,
Doina Precup,
Ann Nowe
Abstract:
A temporally abstract action, or an option, is specified by a policy and a termination condition: the policy guides option behavior, and the termination condition roughly determines its length. Generally, learning with longer options (like learning with multi-step returns) is known to be more efficient. However, if the option set for the task is not ideal, and cannot express the primitive optimal…
▽ More
A temporally abstract action, or an option, is specified by a policy and a termination condition: the policy guides option behavior, and the termination condition roughly determines its length. Generally, learning with longer options (like learning with multi-step returns) is known to be more efficient. However, if the option set for the task is not ideal, and cannot express the primitive optimal policy exactly, shorter options offer more flexibility and can yield a better solution. Thus, the termination condition puts learning efficiency at odds with solution quality. We propose to resolve this dilemma by decoupling the behavior and target terminations, just like it is done with policies in off-policy learning. To this end, we give a new algorithm, Q(β), that learns the solution with respect to any termination condition, regardless of how the options actually terminate. We derive Q(β) by casting learning with options into a common framework with well-studied multi-step off-policy learning. We validate our algorithm empirically, and show that it holds up to its motivating claims.
△ Less
Submitted 2 December, 2017; v1 submitted 10 November, 2017;
originally announced November 2017.
-
OptionGAN: Learning Joint Reward-Policy Options using Generative Adversarial Inverse Reinforcement Learning
Authors:
Peter Henderson,
Wei-Di Chang,
Pierre-Luc Bacon,
David Meger,
Joelle Pineau,
Doina Precup
Abstract:
Reinforcement learning has shown promise in learning policies that can solve complex problems. However, manually specifying a good reward function can be difficult, especially for intricate tasks. Inverse reinforcement learning offers a useful paradigm to learn the underlying reward function directly from expert demonstrations. Yet in reality, the corpus of demonstrations may contain trajectories…
▽ More
Reinforcement learning has shown promise in learning policies that can solve complex problems. However, manually specifying a good reward function can be difficult, especially for intricate tasks. Inverse reinforcement learning offers a useful paradigm to learn the underlying reward function directly from expert demonstrations. Yet in reality, the corpus of demonstrations may contain trajectories arising from a diverse set of underlying reward functions rather than a single one. Thus, in inverse reinforcement learning, it is useful to consider such a decomposition. The options framework in reinforcement learning is specifically designed to decompose policies in a similar light. We therefore extend the options framework and propose a method to simultaneously recover reward options in addition to policy options. We leverage adversarial methods to learn joint reward-policy options using only observed expert states. We show that this approach works well in both simple and complex continuous control tasks and shows significant performance increases in one-shot transfer learning.
△ Less
Submitted 24 November, 2017; v1 submitted 19 September, 2017;
originally announced September 2017.
-
Deep Reinforcement Learning that Matters
Authors:
Peter Henderson,
Riashat Islam,
Philip Bachman,
Joelle Pineau,
Doina Precup,
David Meger
Abstract:
In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL). Reproducing existing work and accurately judging the improvements offered by novel methods is vital to sustaining this progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determ…
▽ More
In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL). Reproducing existing work and accurately judging the improvements offered by novel methods is vital to sustaining this progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.
△ Less
Submitted 29 January, 2019; v1 submitted 19 September, 2017;
originally announced September 2017.
-
When Waiting is not an Option : Learning Options with a Deliberation Cost
Authors:
Jean Harb,
Pierre-Luc Bacon,
Martin Klissarov,
Doina Precup
Abstract:
Recent work has shown that temporally extended actions (options) can be learned fully end-to-end as opposed to being specified in advance. While the problem of "how" to learn options is increasingly well understood, the question of "what" good options should be has remained elusive. We formulate our answer to what "good" options should be in the bounded rationality framework (Simon, 1957) through…
▽ More
Recent work has shown that temporally extended actions (options) can be learned fully end-to-end as opposed to being specified in advance. While the problem of "how" to learn options is increasingly well understood, the question of "what" good options should be has remained elusive. We formulate our answer to what "good" options should be in the bounded rationality framework (Simon, 1957) through the notion of deliberation cost. We then derive practical gradient-based learning algorithms to implement this objective. Our results in the Arcade Learning Environment (ALE) show increased performance and interpretability.
△ Less
Submitted 13 September, 2017;
originally announced September 2017.
-
Neural Network Based Nonlinear Weighted Finite Automata
Authors:
Tianyu Li,
Guillaume Rabusseau,
Doina Precup
Abstract:
Weighted finite automata (WFA) can expressively model functions defined over strings but are inherently linear models. Given the recent successes of nonlinear models in machine learning, it is natural to wonder whether ex-tending WFA to the nonlinear setting would be beneficial. In this paper, we propose a novel model of neural network based nonlinearWFA model (NL-WFA) along with a learning algori…
▽ More
Weighted finite automata (WFA) can expressively model functions defined over strings but are inherently linear models. Given the recent successes of nonlinear models in machine learning, it is natural to wonder whether ex-tending WFA to the nonlinear setting would be beneficial. In this paper, we propose a novel model of neural network based nonlinearWFA model (NL-WFA) along with a learning algorithm. Our learning algorithm is inspired by the spectral learning algorithm for WFAand relies on a nonlinear decomposition of the so-called Hankel matrix, by means of an auto-encoder network. The expressive power of NL-WFA and the proposed learning algorithm are assessed on both synthetic and real-world data, showing that NL-WFA can lead to smaller model sizes and infer complex grammatical structures from data.
△ Less
Submitted 21 December, 2017; v1 submitted 13 September, 2017;
originally announced September 2017.
-
Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control
Authors:
Riashat Islam,
Peter Henderson,
Maziar Gomrokchi,
Doina Precup
Abstract:
Policy gradient methods in reinforcement learning have become increasingly prevalent for state-of-the-art performance in continuous control tasks. Novel methods typically benchmark against a few key algorithms such as deep deterministic policy gradients and trust region policy optimization. As such, it is important to present and use consistent baselines experiments. However, this can be difficult…
▽ More
Policy gradient methods in reinforcement learning have become increasingly prevalent for state-of-the-art performance in continuous control tasks. Novel methods typically benchmark against a few key algorithms such as deep deterministic policy gradients and trust region policy optimization. As such, it is important to present and use consistent baselines experiments. However, this can be difficult due to general variance in the algorithms, hyper-parameter tuning, and environment stochasticity. We investigate and discuss: the significance of hyper-parameters in policy gradients for continuous control, general variance in the algorithms, and reproducibility of reported results. We provide guidelines on reporting novel results as comparisons against baseline methods such that future researchers can make informed decisions when investigating novel methods.
△ Less
Submitted 10 August, 2017;
originally announced August 2017.
-
Independently Controllable Factors
Authors:
Valentin Thomas,
Jules Pondard,
Emmanuel Bengio,
Marc Sarfati,
Philippe Beaudoin,
Marie-Jean Meurs,
Joelle Pineau,
Doina Precup,
Yoshua Bengio
Abstract:
It has been postulated that a good representation is one that disentangles the underlying explanatory factors of variation. However, it remains an open question what kind of training framework could potentially achieve that. Whereas most previous work focuses on the static setting (e.g., with images), we postulate that some of the causal factors could be discovered if the learner is allowed to int…
▽ More
It has been postulated that a good representation is one that disentangles the underlying explanatory factors of variation. However, it remains an open question what kind of training framework could potentially achieve that. Whereas most previous work focuses on the static setting (e.g., with images), we postulate that some of the causal factors could be discovered if the learner is allowed to interact with its environment. The agent can experiment with different actions and observe their effects. More specifically, we hypothesize that some of these factors correspond to aspects of the environment which are independently controllable, i.e., that there exists a policy and a learnable feature for each such aspect of the environment, such that this policy can yield changes in that feature with minimal changes to other features that explain the statistical variations in the observed data. We propose a specific objective function to find such factors and verify experimentally that it can indeed disentangle independently controllable aspects of the environment without any extrinsic reward signal.
△ Less
Submitted 25 August, 2017; v1 submitted 3 August, 2017;
originally announced August 2017.
-
Variational Generative Stochastic Networks with Collaborative Sha**
Authors:
Philip Bachman,
Doina Precup
Abstract:
We develop an approach to training generative models based on unrolling a variational auto-encoder into a Markov chain, and sha** the chain's trajectories using a technique inspired by recent work in Approximate Bayesian computation. We show that the global minimizer of the resulting objective is achieved when the generative model reproduces the target distribution. To allow finer control over t…
▽ More
We develop an approach to training generative models based on unrolling a variational auto-encoder into a Markov chain, and sha** the chain's trajectories using a technique inspired by recent work in Approximate Bayesian computation. We show that the global minimizer of the resulting objective is achieved when the generative model reproduces the target distribution. To allow finer control over the behavior of the models, we add a regularization term inspired by techniques used for regularizing certain types of policy search in reinforcement learning. We present empirical results on the MNIST and TFD datasets which show that our approach offers state-of-the-art performance, both quantitatively and from a qualitative point of view.
△ Less
Submitted 2 August, 2017;
originally announced August 2017.
-
Convergent Tree Backup and Retrace with Function Approximation
Authors:
Ahmed Touati,
Pierre-Luc Bacon,
Doina Precup,
Pascal Vincent
Abstract:
Off-policy learning is key to scaling up reinforcement learning as it allows to learn about a target policy from the experience generated by a different behavior policy. Unfortunately, it has been challenging to combine off-policy learning with function approximation and multi-step bootstrap** in a way that leads to both stable and efficient algorithms. In this work, we show that the \textsc{Tre…
▽ More
Off-policy learning is key to scaling up reinforcement learning as it allows to learn about a target policy from the experience generated by a different behavior policy. Unfortunately, it has been challenging to combine off-policy learning with function approximation and multi-step bootstrap** in a way that leads to both stable and efficient algorithms. In this work, we show that the \textsc{Tree Backup} and \textsc{Retrace} algorithms are unstable with linear function approximation, both in theory and in practice with specific examples. Based on our analysis, we then derive stable and efficient gradient-based algorithms using a quadratic convex-concave saddle-point formulation. By exploiting the problem structure proper to these algorithms, we are able to provide convergence guarantees and finite-sample bounds. The applicability of our new analysis also goes beyond \textsc{Tree Backup} and \textsc{Retrace} and allows us to provide new convergence rates for the GTD and GTD2 algorithms without having recourse to projections or Polyak averaging.
△ Less
Submitted 22 October, 2018; v1 submitted 25 May, 2017;
originally announced May 2017.
-
Investigating Recurrence and Eligibility Traces in Deep Q-Networks
Authors:
Jean Harb,
Doina Precup
Abstract:
Eligibility traces in reinforcement learning are used as a bias-variance trade-off and can often speed up training time by propagating knowledge back over time-steps in a single update. We investigate the use of eligibility traces in combination with recurrent networks in the Atari domain. We illustrate the benefits of both recurrent nets and eligibility traces in some Atari games, and highlight a…
▽ More
Eligibility traces in reinforcement learning are used as a bias-variance trade-off and can often speed up training time by propagating knowledge back over time-steps in a single update. We investigate the use of eligibility traces in combination with recurrent networks in the Atari domain. We illustrate the benefits of both recurrent nets and eligibility traces in some Atari games, and highlight also the importance of the optimization used in the training.
△ Less
Submitted 18 April, 2017;
originally announced April 2017.
-
Independently Controllable Features
Authors:
Emmanuel Bengio,
Valentin Thomas,
Joelle Pineau,
Doina Precup,
Yoshua Bengio
Abstract:
Finding features that disentangle the different causes of variation in real data is a difficult task, that has nonetheless received considerable attention in static domains like natural images. Interactive environments, in which an agent can deliberately take actions, offer an opportunity to tackle this task better, because the agent can experiment with different actions and observe their effects.…
▽ More
Finding features that disentangle the different causes of variation in real data is a difficult task, that has nonetheless received considerable attention in static domains like natural images. Interactive environments, in which an agent can deliberately take actions, offer an opportunity to tackle this task better, because the agent can experiment with different actions and observe their effects. We introduce the idea that in interactive environments, latent factors that control the variation in observed data can be identified by figuring out what the agent can control. We propose a naive method to find factors that explain or measure the effect of the actions of a learner, and test it in illustrative experiments.
△ Less
Submitted 22 March, 2017;
originally announced March 2017.
-
Multi-Timescale, Gradient Descent, Temporal Difference Learning with Linear Options
Authors:
Peeyush Kumar,
Doina Precup
Abstract:
Deliberating on large or continuous state spaces have been long standing challenges in reinforcement learning. Temporal Abstraction have somewhat made this possible, but efficiently planing using temporal abstraction still remains an issue. Moreover using spatial abstractions to learn policies for various situations at once while using temporal abstraction models is an open problem. We propose her…
▽ More
Deliberating on large or continuous state spaces have been long standing challenges in reinforcement learning. Temporal Abstraction have somewhat made this possible, but efficiently planing using temporal abstraction still remains an issue. Moreover using spatial abstractions to learn policies for various situations at once while using temporal abstraction models is an open problem. We propose here an efficient algorithm which is convergent under linear function approximation while planning using temporally abstract actions. We show how this algorithm can be used along with randomly generated option models over multiple time scales to plan agents which need to act real time. Using these randomly generated option models over multiple time scales are shown to reduce number of decision epochs required to solve the given task, hence effectively reducing the time needed for deliberation.
△ Less
Submitted 19 March, 2017;
originally announced March 2017.
-
A Matrix Splitting Perspective on Planning with Options
Authors:
Pierre-Luc Bacon,
Doina Precup
Abstract:
We show that the Bellman operator underlying the options framework leads to a matrix splitting, an approach traditionally used to speed up convergence of iterative solvers for large linear systems of equations. Based on standard comparison theorems for matrix splittings, we then show how the asymptotic rate of convergence varies as a function of the inherent timescales of the options. This new per…
▽ More
We show that the Bellman operator underlying the options framework leads to a matrix splitting, an approach traditionally used to speed up convergence of iterative solvers for large linear systems of equations. Based on standard comparison theorems for matrix splittings, we then show how the asymptotic rate of convergence varies as a function of the inherent timescales of the options. This new perspective highlights a trade-off between asymptotic performance and the cost of computation associated with building a good set of options.
△ Less
Submitted 10 July, 2017; v1 submitted 2 December, 2016;
originally announced December 2016.
-
The Option-Critic Architecture
Authors:
Pierre-Luc Bacon,
Jean Harb,
Doina Precup
Abstract:
Temporal abstraction is key to scaling up learning and planning in reinforcement learning. While planning with temporally extended actions is well understood, creating such abstractions autonomously from data has remained challenging. We tackle this problem in the framework of options [Sutton, Precup & Singh, 1999; Precup, 2000]. We derive policy gradient theorems for options and propose a new opt…
▽ More
Temporal abstraction is key to scaling up learning and planning in reinforcement learning. While planning with temporally extended actions is well understood, creating such abstractions autonomously from data has remained challenging. We tackle this problem in the framework of options [Sutton, Precup & Singh, 1999; Precup, 2000]. We derive policy gradient theorems for options and propose a new option-critic architecture capable of learning both the internal policies and the termination conditions of options, in tandem with the policy over options, and without the need to provide any additional rewards or subgoals. Experimental results in both discrete and continuous environments showcase the flexibility and efficiency of the framework.
△ Less
Submitted 2 December, 2016; v1 submitted 16 September, 2016;
originally announced September 2016.
-
Leveraging Lexical Resources for Learning Entity Embeddings in Multi-Relational Data
Authors:
Teng Long,
Ryan Lowe,
Jackie Chi Kit Cheung,
Doina Precup
Abstract:
Recent work in learning vector-space embeddings for multi-relational data has focused on combining relational information derived from knowledge bases with distributional information derived from large text corpora. We propose a simple approach that leverages the descriptions of entities or phrases available in lexical resources, in conjunction with distributional semantics, in order to derive a b…
▽ More
Recent work in learning vector-space embeddings for multi-relational data has focused on combining relational information derived from knowledge bases with distributional information derived from large text corpora. We propose a simple approach that leverages the descriptions of entities or phrases available in lexical resources, in conjunction with distributional semantics, in order to derive a better initialization for training relational models. Applying this initialization to the TransE model results in significant new state-of-the-art performances on the WordNet dataset, decreasing the mean rank from the previous best of 212 to 51. It also results in faster convergence of the entity representations. We find that there is a trade-off between improving the mean rank and the hits@10 with this approach. This illustrates that much remains to be understood regarding performance improvements in relational models.
△ Less
Submitted 17 May, 2016;
originally announced May 2016.
-
Differentially Private Policy Evaluation
Authors:
Borja Balle,
Maziar Gomrokchi,
Doina Precup
Abstract:
We present the first differentially private algorithms for reinforcement learning, which apply to the task of evaluating a fixed policy. We establish two approaches for achieving differential privacy, provide a theoretical analysis of the privacy and utility of the two algorithms, and show promising results on simple empirical examples.
We present the first differentially private algorithms for reinforcement learning, which apply to the task of evaluating a fixed policy. We establish two approaches for achieving differential privacy, provide a theoretical analysis of the privacy and utility of the two algorithms, and show promising results on simple empirical examples.
△ Less
Submitted 7 March, 2016;
originally announced March 2016.
-
Policy Gradient Methods for Off-policy Control
Authors:
Lucas Lehnert,
Doina Precup
Abstract:
Off-policy learning refers to the problem of learning the value function of a way of behaving, or policy, while following a different policy. Gradient-based off-policy learning algorithms, such as GTD and TDC/GQ, converge even when using function approximation and incremental updates. However, they have been developed for the case of a fixed behavior policy. In control problems, one would like to…
▽ More
Off-policy learning refers to the problem of learning the value function of a way of behaving, or policy, while following a different policy. Gradient-based off-policy learning algorithms, such as GTD and TDC/GQ, converge even when using function approximation and incremental updates. However, they have been developed for the case of a fixed behavior policy. In control problems, one would like to adapt the behavior policy over time to become more greedy with respect to the existing value function. In this paper, we present the first gradient-based learning algorithms for this problem, which rely on the framework of policy gradient in order to modify the behavior policy. We present derivations of the algorithms, a convergence theorem, and empirical evidence showing that they compare favorably to existing approaches.
△ Less
Submitted 13 December, 2015;
originally announced December 2015.
-
Conditional Computation in Neural Networks for faster models
Authors:
Emmanuel Bengio,
Pierre-Luc Bacon,
Joelle Pineau,
Doina Precup
Abstract:
Deep learning has become the state-of-art tool in many applications, but the evaluation and training of deep models can be time-consuming and computationally expensive. The conditional computation approach has been proposed to tackle this problem (Bengio et al., 2013; Davis & Arel, 2013). It operates by selectively activating only parts of the network at a time. In this paper, we use reinforcement…
▽ More
Deep learning has become the state-of-art tool in many applications, but the evaluation and training of deep models can be time-consuming and computationally expensive. The conditional computation approach has been proposed to tackle this problem (Bengio et al., 2013; Davis & Arel, 2013). It operates by selectively activating only parts of the network at a time. In this paper, we use reinforcement learning as a tool to optimize conditional computation policies. More specifically, we cast the problem of learning activation-dependent policies for drop** out blocks of units as a reinforcement learning problem. We propose a learning scheme motivated by computation speed, capturing the idea of wanting to have parsimonious activations while maintaining prediction accuracy. We apply a policy gradient algorithm for learning policies that optimize this loss function and propose a regularization mechanism that encourages diversification of the dropout policy. We present encouraging empirical results showing that this approach improves the speed of computation without impacting the quality of the approximation.
△ Less
Submitted 7 January, 2016; v1 submitted 19 November, 2015;
originally announced November 2015.
-
Testing Visual Attention in Dynamic Environments
Authors:
Philip Bachman,
David Krueger,
Doina Precup
Abstract:
We investigate attention as the active pursuit of useful information. This contrasts with attention as a mechanism for the attenuation of irrelevant information. We also consider the role of short-term memory, whose use is critical to any model incapable of simultaneously perceiving all information on which its output depends. We present several simple synthetic tasks, which become considerably mo…
▽ More
We investigate attention as the active pursuit of useful information. This contrasts with attention as a mechanism for the attenuation of irrelevant information. We also consider the role of short-term memory, whose use is critical to any model incapable of simultaneously perceiving all information on which its output depends. We present several simple synthetic tasks, which become considerably more interesting when we impose strong constraints on how a model can interact with its input, and on how long it can take to produce its output. We develop a model with a different structure from those seen in previous work, and we train it using stochastic variational inference with a learned proposal distribution.
△ Less
Submitted 29 October, 2015;
originally announced October 2015.
-
Data Generation as Sequential Decision Making
Authors:
Philip Bachman,
Doina Precup
Abstract:
We connect a broad class of generative models through their shared reliance on sequential decision making. Motivated by this view, we develop extensions to an existing model, and then explore the idea further in the context of data imputation -- perhaps the simplest setting in which to investigate the relation between unconditional and conditional generative modelling. We formulate data imputation…
▽ More
We connect a broad class of generative models through their shared reliance on sequential decision making. Motivated by this view, we develop extensions to an existing model, and then explore the idea further in the context of data imputation -- perhaps the simplest setting in which to investigate the relation between unconditional and conditional generative modelling. We formulate data imputation as an MDP and develop models capable of representing effective policies for it. We construct the models using neural networks and train them using a form of guided policy search. Our models generate predictions through an iterative process of feedback and refinement. We show that this approach can learn effective policies for imputation problems of varying difficulty and across multiple datasets.
△ Less
Submitted 2 November, 2015; v1 submitted 10 June, 2015;
originally announced June 2015.
-
A Canonical Form for Weighted Automata and Applications to Approximate Minimization
Authors:
Borja Balle,
Prakash Panangaden,
Doina Precup
Abstract:
We study the problem of constructing approximations to a weighted automaton. Weighted finite automata (WFA) are closely related to the theory of rational series. A rational series is a function from strings to real numbers that can be computed by a finite WFA. Among others, this includes probability distributions generated by hidden Markov models and probabilistic automata. The relationship betwee…
▽ More
We study the problem of constructing approximations to a weighted automaton. Weighted finite automata (WFA) are closely related to the theory of rational series. A rational series is a function from strings to real numbers that can be computed by a finite WFA. Among others, this includes probability distributions generated by hidden Markov models and probabilistic automata. The relationship between rational series and WFA is analogous to the relationship between regular languages and ordinary automata. Associated with such rational series are infinite matrices called Hankel matrices which play a fundamental role in the theory of minimal WFA. Our contributions are: (1) an effective procedure for computing the singular value decomposition (SVD) of such infinite Hankel matrices based on their representation in terms of finite WFA; (2) a new canonical form for finite WFA based on this SVD decomposition; and, (3) an algorithm to construct approximate minimizations of a given WFA. The goal of our approximate minimization algorithm is to start from a minimal WFA and produce a smaller WFA that is close to the given one in a certain sense. The desired size of the approximating automaton is given as input. We give bounds describing how well the approximation emulates the behavior of the original WFA.
△ Less
Submitted 24 April, 2015; v1 submitted 27 January, 2015;
originally announced January 2015.
-
Learning with Pseudo-Ensembles
Authors:
Philip Bachman,
Ouais Alsharif,
Doina Precup
Abstract:
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior o…
▽ More
We formalize the notion of a pseudo-ensemble, a (possibly infinite) collection of child models spawned from a parent model by perturbing it according to some noise process. E.g., dropout (Hinton et. al, 2012) in a deep neural network trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network. We present a novel regularizer based on making the behavior of a pseudo-ensemble robust with respect to the noise process generating it. In the fully-supervised setting, our regularizer matches the performance of dropout. But, unlike dropout, our regularizer naturally extends to the semi-supervised setting, where it produces state-of-the-art results. We provide a case study in which we transform the Recursive Neural Tensor Network of (Socher et. al, 2013) into a pseudo-ensemble, which significantly improves its performance on a real-world sentiment analysis benchmark.
△ Less
Submitted 15 December, 2014;
originally announced December 2014.
-
Practical Kernel-Based Reinforcement Learning
Authors:
André M. S. Barreto,
Doina Precup,
Joelle Pineau
Abstract:
Kernel-based reinforcement learning (KBRL) stands out among reinforcement learning algorithms for its strong theoretical guarantees. By casting the learning problem as a local kernel approximation, KBRL provides a way of computing a decision policy which is statistically consistent and converges to a unique solution. Unfortunately, the model constructed by KBRL grows with the number of sample tran…
▽ More
Kernel-based reinforcement learning (KBRL) stands out among reinforcement learning algorithms for its strong theoretical guarantees. By casting the learning problem as a local kernel approximation, KBRL provides a way of computing a decision policy which is statistically consistent and converges to a unique solution. Unfortunately, the model constructed by KBRL grows with the number of sample transitions, resulting in a computational cost that precludes its application to large-scale or on-line domains. In this paper we introduce an algorithm that turns KBRL into a practical reinforcement learning tool. Kernel-based stochastic factorization (KBSF) builds on a simple idea: when a transition matrix is represented as the product of two stochastic matrices, one can swap the factors of the multiplication to obtain another transition matrix, potentially much smaller, which retains some fundamental properties of its precursor. KBSF exploits such an insight to compress the information contained in KBRL's model into an approximator of fixed size. This makes it possible to build an approximation that takes into account both the difficulty of the problem and the associated computational cost. KBSF's computational complexity is linear in the number of sample transitions, which is the best one can do without discarding data. Moreover, the algorithm's simple mechanics allow for a fully incremental implementation that makes the amount of memory used independent of the number of sample transitions. The result is a kernel-based reinforcement learning algorithm that can be applied to large-scale problems in both off-line and on-line regimes. We derive upper bounds for the distance between the value functions computed by KBRL and KBSF using the same data. We also illustrate the potential of our algorithm in an extensive empirical study in which KBSF is applied to difficult tasks based on real-world data.
△ Less
Submitted 20 July, 2014;
originally announced July 2014.
-
Classification-based Approximate Policy Iteration: Experiments and Extended Discussions
Authors:
Amir-massoud Farahmand,
Doina Precup,
André M. S. Barreto,
Mohammad Ghavamzadeh
Abstract:
Tackling large approximate dynamic programming or reinforcement learning problems requires methods that can exploit regularities, or intrinsic structure, of the problem in hand. Most current methods are geared towards exploiting the regularities of either the value function or the policy. We introduce a general classification-based approximate policy iteration (CAPI) framework, which encompasses a…
▽ More
Tackling large approximate dynamic programming or reinforcement learning problems requires methods that can exploit regularities, or intrinsic structure, of the problem in hand. Most current methods are geared towards exploiting the regularities of either the value function or the policy. We introduce a general classification-based approximate policy iteration (CAPI) framework, which encompasses a large class of algorithms that can exploit regularities of both the value function and the policy space, depending on what is advantageous. This framework has two main components: a generic value function estimator and a classifier that learns a policy based on the estimated value function. We establish theoretical guarantees for the sample complexity of CAPI-style algorithms, which allow the policy evaluation step to be performed by a wide variety of algorithms (including temporal-difference-style methods), and can handle nonparametric representations of policies. Our bounds on the estimation error of the performance loss are tighter than existing results. We also illustrate this approach empirically on several problems, including a large HIV control task.
△ Less
Submitted 1 July, 2014;
originally announced July 2014.
-
Algorithms for multi-armed bandit problems
Authors:
Volodymyr Kuleshov,
Doina Precup
Abstract:
Although many algorithms for the multi-armed bandit problem are well-understood theoretically, empirical confirmation of their effectiveness is generally scarce. This paper presents a thorough empirical study of the most popular multi-armed bandit algorithms. Three important observations can be made from our results. Firstly, simple heuristics such as epsilon-greedy and Boltzmann exploration outpe…
▽ More
Although many algorithms for the multi-armed bandit problem are well-understood theoretically, empirical confirmation of their effectiveness is generally scarce. This paper presents a thorough empirical study of the most popular multi-armed bandit algorithms. Three important observations can be made from our results. Firstly, simple heuristics such as epsilon-greedy and Boltzmann exploration outperform theoretically sound algorithms on most settings by a significant margin. Secondly, the performance of most algorithms varies dramatically with the parameters of the bandit problem. Our study identifies for each algorithm the settings where it performs well, and the settings where it performs poorly. Thirdly, the algorithms' performance relative each to other is affected only by the number of bandit arms and the variance of the rewards. This finding may guide the design of subsequent empirical evaluations. In the second part of the paper, we turn our attention to an important area of application of bandit algorithms: clinical trials. Although the design of clinical trials has been one of the principal practical problems motivating research on multi-armed bandits, bandit algorithms have never been evaluated as potential treatment allocation strategies. Using data from a real study, we simulate the outcome that a 2001-2002 clinical trial would have had if bandit algorithms had been used to allocate patients to treatments. We find that an adaptive trial would have successfully treated at least 50% more patients, while significantly reducing the number of adverse effects and increasing patient retention. At the end of the trial, the best treatment could have still been identified with a high level of statistical confidence. Our findings demonstrate that bandit algorithms are attractive alternatives to current adaptive treatment allocation strategies.
△ Less
Submitted 24 February, 2014;
originally announced February 2014.
-
Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
Authors:
Mahdi Milani Fard,
Yuri Grinberg,
Amir-massoud Farahmand,
Joelle Pineau,
Doina Precup
Abstract:
We address the problem of automatic generation of features for value function approximation. Bellman Error Basis Functions (BEBFs) have been shown to improve the error of policy evaluation with function approximation, with a convergence rate similar to that of value iteration. We propose a simple, fast and robust algorithm based on random projections to generate BEBFs for sparse feature spaces. We…
▽ More
We address the problem of automatic generation of features for value function approximation. Bellman Error Basis Functions (BEBFs) have been shown to improve the error of policy evaluation with function approximation, with a convergence rate similar to that of value iteration. We propose a simple, fast and robust algorithm based on random projections to generate BEBFs for sparse feature spaces. We provide a finite sample analysis of the proposed method, and prove that projections logarithmic in the dimension of the original space are enough to guarantee contraction in the error. Empirical results demonstrate the strength of this method.
△ Less
Submitted 21 September, 2012; v1 submitted 23 July, 2012;
originally announced July 2012.
-
Metrics for Finite Markov Decision Processes
Authors:
Norman Ferns,
Prakash Panangaden,
Doina Precup
Abstract:
We present metrics for measuring the similarity of states in a finite Markov decision process (MDP). The formulation of our metrics is based on the notion of bisimulation for MDPs, with an aim towards solving discounted infinite horizon reinforcement learning tasks. Such metrics can be used to aggregate states, as well as to better structure other value function approximators (e.g., memory-based o…
▽ More
We present metrics for measuring the similarity of states in a finite Markov decision process (MDP). The formulation of our metrics is based on the notion of bisimulation for MDPs, with an aim towards solving discounted infinite horizon reinforcement learning tasks. Such metrics can be used to aggregate states, as well as to better structure other value function approximators (e.g., memory-based or nearest-neighbor approximators). We provide bounds that relate our metric distances to the optimal values of states in the given MDP.
△ Less
Submitted 11 July, 2012;
originally announced July 2012.
-
Metrics for Markov Decision Processes with Infinite State Spaces
Authors:
Norman Ferns,
Prakash Panangaden,
Doina Precup
Abstract:
We present metrics for measuring state similarity in Markov decision processes (MDPs) with infinitely many states, including MDPs with continuous state spaces. Such metrics provide a stable quantitative analogue of the notion of bisimulation for MDPs, and are suitable for use in MDP approximation. We show that the optimal value function associated with a discounted infinite horizon planning task v…
▽ More
We present metrics for measuring state similarity in Markov decision processes (MDPs) with infinitely many states, including MDPs with continuous state spaces. Such metrics provide a stable quantitative analogue of the notion of bisimulation for MDPs, and are suitable for use in MDP approximation. We show that the optimal value function associated with a discounted infinite horizon planning task varies continuously with respect to our metric distances.
△ Less
Submitted 4 July, 2012;
originally announced July 2012.
-
Methods for computing state similarity in Markov Decision Processes
Authors:
Norman Ferns,
Pablo Samuel Castro,
Doina Precup,
Prakash Panangaden
Abstract:
A popular approach to solving large probabilistic systems relies on aggregating states based on a measure of similarity. Many approaches in the literature are heuristic. A number of recent methods rely instead on metrics based on the notion of bisimulation, or behavioral equivalence between states (Givan et al, 2001, 2003; Ferns et al, 2004). An integral component of such metrics is the Kantorovic…
▽ More
A popular approach to solving large probabilistic systems relies on aggregating states based on a measure of similarity. Many approaches in the literature are heuristic. A number of recent methods rely instead on metrics based on the notion of bisimulation, or behavioral equivalence between states (Givan et al, 2001, 2003; Ferns et al, 2004). An integral component of such metrics is the Kantorovich metric between probability distributions. However, while this metric enables many satisfying theoretical properties, it is costly to compute in practice. In this paper, we use techniques from network optimization and statistical sampling to overcome this problem. We obtain in this manner a variety of distance functions for MDP state aggregation, which differ in the tradeoff between time and space complexity, as well as the quality of the aggregation. We provide an empirical evaluation of these trade-offs.
△ Less
Submitted 27 June, 2012;
originally announced June 2012.
-
Improved Estimation in Time Varying Models
Authors:
Doina Precup,
Philip Bachman
Abstract:
Locally adapted parameterizations of a model (such as locally weighted regression) are expressive but often suffer from high variance. We describe an approach for reducing the variance, based on the idea of estimating simultaneously a transformed space for the model, as well as locally adapted parameterizations in this new space. We present a new problem formulation that captures this idea and ill…
▽ More
Locally adapted parameterizations of a model (such as locally weighted regression) are expressive but often suffer from high variance. We describe an approach for reducing the variance, based on the idea of estimating simultaneously a transformed space for the model, as well as locally adapted parameterizations in this new space. We present a new problem formulation that captures this idea and illustrate it in the important context of time varying models. We develop an algorithm for learning a set of bases for approximating a time varying sparse network; each learned basis constitutes an archetypal sparse network structure. We also provide an extension for learning task-driven bases. We present empirical results on synthetic data sets, as well as on a BCI EEG classification task.
△ Less
Submitted 27 June, 2012;
originally announced June 2012.