-
Offline Learning from Demonstrations and Unlabeled Experience
Authors:
Konrad Zolna,
Alexander Novikov,
Ksenia Konyushkova,
Caglar Gulcehre,
Ziyu Wang,
Yusuf Aytar,
Misha Denil,
Nando de Freitas,
Scott Reed
Abstract:
Behavior cloning (BC) is often practical for robot learning because it allows a policy to be trained offline without rewards, by supervised learning on expert demonstrations. However, BC does not effectively leverage what we will refer to as unlabeled experience: data of mixed and unknown quality without reward annotations. This unlabeled data can be generated by a variety of sources such as human…
▽ More
Behavior cloning (BC) is often practical for robot learning because it allows a policy to be trained offline without rewards, by supervised learning on expert demonstrations. However, BC does not effectively leverage what we will refer to as unlabeled experience: data of mixed and unknown quality without reward annotations. This unlabeled data can be generated by a variety of sources such as human teleoperation, scripted policies and other agents on the same robot. Towards data-driven offline robot learning that can use this unlabeled experience, we introduce Offline Reinforced Imitation Learning (ORIL). ORIL first learns a reward function by contrasting observations from demonstrator and unlabeled trajectories, then annotates all data with the learned reward, and finally trains an agent via offline reinforcement learning. Across a diverse set of continuous control and simulated robotic manipulation tasks, we show that ORIL consistently outperforms comparable BC agents by effectively leveraging unlabeled experience.
△ Less
Submitted 27 November, 2020;
originally announced November 2020.
-
Hyperparameter Selection for Offline Reinforcement Learning
Authors:
Tom Le Paine,
Cosmin Paduraru,
Andrea Michi,
Caglar Gulcehre,
Konrad Zolna,
Alexander Novikov,
Ziyu Wang,
Nando de Freitas
Abstract:
Offline reinforcement learning (RL purely from logged data) is an important avenue for deploying RL techniques in real-world scenarios. However, existing hyperparameter selection methods for offline RL break the offline assumption by evaluating policies corresponding to each hyperparameter setting in the environment. This online execution is often infeasible and hence undermines the main aim of of…
▽ More
Offline reinforcement learning (RL purely from logged data) is an important avenue for deploying RL techniques in real-world scenarios. However, existing hyperparameter selection methods for offline RL break the offline assumption by evaluating policies corresponding to each hyperparameter setting in the environment. This online execution is often infeasible and hence undermines the main aim of offline RL. Therefore, in this work, we focus on \textit{offline hyperparameter selection}, i.e. methods for choosing the best policy from a set of many policies trained using different hyperparameters, given only logged data. Through large-scale empirical evaluation we show that: 1) offline RL algorithms are not robust to hyperparameter choices, 2) factors such as the offline RL algorithm and method for estimating Q values can have a big impact on hyperparameter selection, and 3) when we control those factors carefully, we can reliably rank policies across hyperparameter choices, and therefore choose policies which are close to the best policy in the set. Overall, our results present an optimistic view that offline hyperparameter selection is within reach, even in challenging tasks with pixel observations, high dimensional action spaces, and long horizon.
△ Less
Submitted 17 July, 2020;
originally announced July 2020.
-
Critic Regularized Regression
Authors:
Ziyu Wang,
Alexander Novikov,
Konrad Zolna,
Jost Tobias Springenberg,
Scott Reed,
Bobak Shahriari,
Noah Siegel,
Josh Merel,
Caglar Gulcehre,
Nicolas Heess,
Nando de Freitas
Abstract:
Offline reinforcement learning (RL), also known as batch RL, offers the prospect of policy optimization from large pre-recorded datasets without online environment interaction. It addresses challenges with regard to the cost of data collection and safety, both of which are particularly pertinent to real-world applications of RL. Unfortunately, most off-policy algorithms perform poorly when learnin…
▽ More
Offline reinforcement learning (RL), also known as batch RL, offers the prospect of policy optimization from large pre-recorded datasets without online environment interaction. It addresses challenges with regard to the cost of data collection and safety, both of which are particularly pertinent to real-world applications of RL. Unfortunately, most off-policy algorithms perform poorly when learning from a fixed dataset. In this paper, we propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR). We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces -- outperforming several state-of-the-art offline RL algorithms by a significant margin on a wide range of benchmark tasks.
△ Less
Submitted 22 September, 2021; v1 submitted 26 June, 2020;
originally announced June 2020.
-
RL Unplugged: A Suite of Benchmarks for Offline Reinforcement Learning
Authors:
Caglar Gulcehre,
Ziyu Wang,
Alexander Novikov,
Tom Le Paine,
Sergio Gomez Colmenarejo,
Konrad Zolna,
Rishabh Agarwal,
Josh Merel,
Daniel Mankowitz,
Cosmin Paduraru,
Gabriel Dulac-Arnold,
Jerry Li,
Mohammad Norouzi,
Matt Hoffman,
Ofir Nachum,
George Tucker,
Nicolas Heess,
Nando de Freitas
Abstract:
Offline methods for reinforcement learning have a potential to help bridge the gap between reinforcement learning research and real-world applications. They make it possible to learn policies from offline datasets, thus overcoming concerns associated with online data collection in the real-world, including cost, safety, or ethical concerns. In this paper, we propose a benchmark called RL Unplugged…
▽ More
Offline methods for reinforcement learning have a potential to help bridge the gap between reinforcement learning research and real-world applications. They make it possible to learn policies from offline datasets, thus overcoming concerns associated with online data collection in the real-world, including cost, safety, or ethical concerns. In this paper, we propose a benchmark called RL Unplugged to evaluate and compare offline RL methods. RL Unplugged includes data from a diverse range of domains including games (e.g., Atari benchmark) and simulated motor control problems (e.g., DM Control Suite). The datasets include domains that are partially or fully observable, use continuous or discrete actions, and have stochastic vs. deterministic dynamics. We propose detailed evaluation protocols for each domain in RL Unplugged and provide an extensive analysis of supervised learning and offline RL methods using these protocols. We will release data for all our tasks and open-source all algorithms presented in this paper. We hope that our suite of benchmarks will increase the reproducibility of experiments and make it possible to study challenging tasks with a limited computational budget, thus making RL research both more systematic and more accessible across the community. Moving forward, we view RL Unplugged as a living benchmark suite that will evolve and grow with datasets contributed by the research community and ourselves. Our project page is available on https://git.io/JJUhd.
△ Less
Submitted 12 February, 2021; v1 submitted 24 June, 2020;
originally announced June 2020.
-
Combating False Negatives in Adversarial Imitation Learning
Authors:
Konrad Zolna,
Chitwan Saharia,
Leonard Boussioux,
David Yu-Tung Hui,
Maxime Chevalier-Boisvert,
Dzmitry Bahdanau,
Yoshua Bengio
Abstract:
In adversarial imitation learning, a discriminator is trained to differentiate agent episodes from expert demonstrations representing the desired behavior. However, as the trained policy learns to be more successful, the negative examples (the ones produced by the agent) become increasingly similar to expert ones. Despite the fact that the task is successfully accomplished in some of the agent's t…
▽ More
In adversarial imitation learning, a discriminator is trained to differentiate agent episodes from expert demonstrations representing the desired behavior. However, as the trained policy learns to be more successful, the negative examples (the ones produced by the agent) become increasingly similar to expert ones. Despite the fact that the task is successfully accomplished in some of the agent's trajectories, the discriminator is trained to output low values for them. We hypothesize that this inconsistent training signal for the discriminator can impede its learning, and consequently leads to worse overall performance of the agent. We show experimental evidence for this hypothesis and that the 'False Negatives' (i.e. successful agent episodes) significantly hinder adversarial imitation learning, which is the first contribution of this paper. Then, we propose a method to alleviate the impact of false negatives and test it on the BabyAI environment. This method consistently improves sample efficiency over the baselines by at least an order of magnitude.
△ Less
Submitted 2 February, 2020;
originally announced February 2020.
-
Task-Relevant Adversarial Imitation Learning
Authors:
Konrad Zolna,
Scott Reed,
Alexander Novikov,
Sergio Gomez Colmenarejo,
David Budden,
Serkan Cabi,
Misha Denil,
Nando de Freitas,
Ziyu Wang
Abstract:
We show that a critical vulnerability in adversarial imitation is the tendency of discriminator networks to learn spurious associations between visual features and expert labels. When the discriminator focuses on task-irrelevant features, it does not provide an informative reward signal, leading to poor task performance. We analyze this problem in detail and propose a solution that outperforms sta…
▽ More
We show that a critical vulnerability in adversarial imitation is the tendency of discriminator networks to learn spurious associations between visual features and expert labels. When the discriminator focuses on task-irrelevant features, it does not provide an informative reward signal, leading to poor task performance. We analyze this problem in detail and propose a solution that outperforms standard Generative Adversarial Imitation Learning (GAIL). Our proposed method, Task-Relevant Adversarial Imitation Learning (TRAIL), uses constrained discriminator optimization to learn informative rewards. In comprehensive experiments, we show that TRAIL can solve challenging robotic manipulation tasks from pixels by imitating human operators without access to any task rewards, and clearly outperforms comparable baseline imitation agents, including those trained via behaviour cloning and conventional GAIL.
△ Less
Submitted 12 November, 2020; v1 submitted 2 October, 2019;
originally announced October 2019.
-
The Dynamics of Handwriting Improves the Automated Diagnosis of Dysgraphia
Authors:
Konrad Zolna,
Thibault Asselborn,
Caroline Jolly,
Laurence Casteran,
Marie-Ange~Nguyen-Morel,
Wafa Johal,
Pierre Dillenbourg
Abstract:
Handwriting disorder (termed dysgraphia) is a far from a singular problem as nearly 8.6% of the population in France is considered dysgraphic. Moreover, research highlights the fundamental importance to detect and remediate these handwriting difficulties as soon as possible as they may affect a child's entire life, undermining performance and self-confidence in a wide variety of school activities.…
▽ More
Handwriting disorder (termed dysgraphia) is a far from a singular problem as nearly 8.6% of the population in France is considered dysgraphic. Moreover, research highlights the fundamental importance to detect and remediate these handwriting difficulties as soon as possible as they may affect a child's entire life, undermining performance and self-confidence in a wide variety of school activities. At the moment, the detection of handwriting difficulties is performed through a standard test called BHK. This detection, performed by therapists, is laborious because of its high cost and subjectivity. We present a digital approach to identify and characterize handwriting difficulties via a Recurrent Neural Network model (RNN). The child under investigation is asked to write on a graphics tablet all the letters of the alphabet as well as the ten digits. Once complete, the RNN delivers a diagnosis in a few milliseconds and demonstrates remarkable efficiency as it correctly identifies more than 90% of children diagnosed as dysgraphic using the BHK test. The main advantage of our tablet-based system is that it captures the dynamic features of writing -- something a human expert, such as a teacher, is unable to do. We show that incorporating the dynamic information available by the use of tablet is highly beneficial to our digital test to discriminate between typically-develo** and dysgraphic children.
△ Less
Submitted 12 June, 2019;
originally announced June 2019.
-
Split Batch Normalization: Improving Semi-Supervised Learning under Domain Shift
Authors:
Michał Zając,
Konrad Zolna,
Stanisław Jastrzębski
Abstract:
Recent work has shown that using unlabeled data in semi-supervised learning is not always beneficial and can even hurt generalization, especially when there is a class mismatch between the unlabeled and labeled examples. We investigate this phenomenon for image classification on the CIFAR-10 and the ImageNet datasets, and with many other forms of domain shifts applied (e.g. salt-and-pepper noise).…
▽ More
Recent work has shown that using unlabeled data in semi-supervised learning is not always beneficial and can even hurt generalization, especially when there is a class mismatch between the unlabeled and labeled examples. We investigate this phenomenon for image classification on the CIFAR-10 and the ImageNet datasets, and with many other forms of domain shifts applied (e.g. salt-and-pepper noise). Our main contribution is Split Batch Normalization (Split-BN), a technique to improve SSL when the additional unlabeled data comes from a shifted distribution. We achieve it by using separate batch normalization statistics for unlabeled examples. Due to its simplicity, we recommend it as a standard practice. Finally, we analyse how domain shift affects the SSL training process. In particular, we find that during training the statistics of hidden activations in late layers become markedly different between the unlabeled and the labeled examples.
△ Less
Submitted 6 April, 2019;
originally announced April 2019.
-
Reinforced Imitation in Heterogeneous Action Space
Authors:
Konrad Zolna,
Negar Rostamzadeh,
Yoshua Bengio,
Sung** Ahn,
Pedro O. Pinheiro
Abstract:
Imitation learning is an effective alternative approach to learn a policy when the reward function is sparse. In this paper, we consider a challenging setting where an agent and an expert use different actions from each other. We assume that the agent has access to a sparse reward function and state-only expert observations. We propose a method which gradually balances between the imitation learni…
▽ More
Imitation learning is an effective alternative approach to learn a policy when the reward function is sparse. In this paper, we consider a challenging setting where an agent and an expert use different actions from each other. We assume that the agent has access to a sparse reward function and state-only expert observations. We propose a method which gradually balances between the imitation learning cost and the reinforcement learning objective. In addition, this method adapts the agent's policy based on either mimicking expert behavior or maximizing sparse reward. We show, through navigation scenarios, that (i) an agent is able to efficiently leverage sparse rewards to outperform standard state-only imitation learning, (ii) it can learn a policy even when its actions are different from the expert, and (iii) the performance of the agent is not bounded by that of the expert, due to the optimized usage of sparse rewards.
△ Less
Submitted 26 August, 2019; v1 submitted 6 April, 2019;
originally announced April 2019.
-
Adversarial Framing for Image and Video Classification
Authors:
Konrad Zolna,
Michal Zajac,
Negar Rostamzadeh,
Pedro O. Pinheiro
Abstract:
Neural networks are prone to adversarial attacks. In general, such attacks deteriorate the quality of the input by either slightly modifying most of its pixels, or by occluding it with a patch. In this paper, we propose a method that keeps the image unchanged and only adds an adversarial framing on the border of the image. We show empirically that our method is able to successfully attack state-of…
▽ More
Neural networks are prone to adversarial attacks. In general, such attacks deteriorate the quality of the input by either slightly modifying most of its pixels, or by occluding it with a patch. In this paper, we propose a method that keeps the image unchanged and only adds an adversarial framing on the border of the image. We show empirically that our method is able to successfully attack state-of-the-art methods on both image and video classification problems. Notably, the proposed method results in a universal attack which is very fast at test time. Source code can be found at https://github.com/zajaczajac/adv_framing .
△ Less
Submitted 17 October, 2019; v1 submitted 11 December, 2018;
originally announced December 2018.
-
Focused Hierarchical RNNs for Conditional Sequence Processing
Authors:
Nan Rosemary Ke,
Konrad Zolna,
Alessandro Sordoni,
Zhouhan Lin,
Adam Trischler,
Yoshua Bengio,
Joelle Pineau,
Laurent Charlin,
Chris Pal
Abstract:
Recurrent Neural Networks (RNNs) with attention mechanisms have obtained state-of-the-art results for many sequence processing tasks. Most of these models use a simple form of encoder with attention that looks over the entire sequence and assigns a weight to each token independently. We present a mechanism for focusing RNN encoders for sequence modelling tasks which allows them to attend to key pa…
▽ More
Recurrent Neural Networks (RNNs) with attention mechanisms have obtained state-of-the-art results for many sequence processing tasks. Most of these models use a simple form of encoder with attention that looks over the entire sequence and assigns a weight to each token independently. We present a mechanism for focusing RNN encoders for sequence modelling tasks which allows them to attend to key parts of the input as needed. We formulate this using a multi-layer conditional sequence encoder that reads in one token at a time and makes a discrete decision on whether the token is relevant to the context or question being asked. The discrete gating mechanism takes in the context embedding and the current hidden state as inputs and controls information flow into the layer above. We train it using policy gradient methods. We evaluate this method on several types of tasks with different attributes. First, we evaluate the method on synthetic tasks which allow us to evaluate the model for its generalization ability and probe the behavior of the gates in more controlled settings. We then evaluate this approach on large scale Question Answering tasks including the challenging MS MARCO and SearchQA tasks. Our models shows consistent improvements for both tasks over prior work and our baselines. It has also shown to generalize significantly better on synthetic tasks as compared to the baselines.
△ Less
Submitted 12 June, 2018;
originally announced June 2018.
-
Classifier-agnostic saliency map extraction
Authors:
Konrad Zolna,
Krzysztof J. Geras,
Kyunghyun Cho
Abstract:
Currently available methods for extracting saliency maps identify parts of the input which are the most important to a specific fixed classifier. We show that this strong dependence on a given classifier hinders their performance. To address this problem, we propose classifier-agnostic saliency map extraction, which finds all parts of the image that any classifier could use, not just one given in…
▽ More
Currently available methods for extracting saliency maps identify parts of the input which are the most important to a specific fixed classifier. We show that this strong dependence on a given classifier hinders their performance. To address this problem, we propose classifier-agnostic saliency map extraction, which finds all parts of the image that any classifier could use, not just one given in advance. We observe that the proposed approach extracts higher quality saliency maps than prior work while being conceptually simple and easy to implement. The method sets the new state of the art result for localization task on the ImageNet data, outperforming all existing weakly-supervised localization techniques, despite not using the ground truth labels at the inference time. The code reproducing the results is available at https://github.com/kondiz/casme .
The final version of this manuscript is published in Computer Vision and Image Understanding and is available online at https://doi.org/10.1016/j.cviu.2020.102969 .
△ Less
Submitted 19 July, 2020; v1 submitted 21 May, 2018;
originally announced May 2018.
-
Fraternal Dropout
Authors:
Konrad Zolna,
Devansh Arpit,
Dendi Suhubdy,
Yoshua Bengio
Abstract:
Recurrent neural networks (RNNs) are important class of architectures among neural networks useful for language modeling and sequential prediction. However, optimizing RNNs is known to be harder compared to feed-forward neural networks. A number of techniques have been proposed in literature to address this problem. In this paper we propose a simple technique called fraternal dropout that takes ad…
▽ More
Recurrent neural networks (RNNs) are important class of architectures among neural networks useful for language modeling and sequential prediction. However, optimizing RNNs is known to be harder compared to feed-forward neural networks. A number of techniques have been proposed in literature to address this problem. In this paper we propose a simple technique called fraternal dropout that takes advantage of dropout to achieve this goal. Specifically, we propose to train two identical copies of an RNN (that share parameters) with different dropout masks while minimizing the difference between their (pre-softmax) predictions. In this way our regularization encourages the representations of RNNs to be invariant to dropout mask, thus being robust. We show that our regularization term is upper bounded by the expectation-linear dropout objective which has been shown to address the gap due to the difference between the train and inference phases of dropout. We evaluate our model and achieve state-of-the-art results in sequence modeling tasks on two benchmark datasets - Penn Treebank and Wikitext-2. We also show that our approach leads to performance improvement by a significant margin in image captioning (Microsoft COCO) and semi-supervised (CIFAR-10) tasks.
△ Less
Submitted 28 March, 2018; v1 submitted 31 October, 2017;
originally announced November 2017.
-
Improving the Performance of Neural Networks in Regression Tasks Using Drawering
Authors:
Konrad Zolna
Abstract:
The method presented extends a given regression neural network to make its performance improve. The modification affects the learning procedure only, hence the extension may be easily omitted during evaluation without any change in prediction. It means that the modified model may be evaluated as quickly as the original one but tends to perform better.
This improvement is possible because the mod…
▽ More
The method presented extends a given regression neural network to make its performance improve. The modification affects the learning procedure only, hence the extension may be easily omitted during evaluation without any change in prediction. It means that the modified model may be evaluated as quickly as the original one but tends to perform better.
This improvement is possible because the modification gives better expressive power, provides better behaved gradients and works as a regularization. The knowledge gained by the temporarily extended neural network is contained in the parameters shared with the original neural network.
The only cost is an increase in learning time.
△ Less
Submitted 5 December, 2016;
originally announced December 2016.