-
Mirror-mediated ultralong-range atomic dipole-dipole interactions
Authors:
Nicholas Furtak-Wells,
Benjamin Dawson,
Thomas Mann,
Gin Jose,
Almut Beige
Abstract:
In three dimensions, dipole-dipole interactions which alter atomic level shifts and spontaneous decay rates only persist over distances comparable to the wavelength of the emitted light. In this paper we show that it is possible to significantly extend the range of these interactions with the help of a partially transparent asymmetric mirror interface. Suppose two two-level atoms are placed on opposite sides of the interface, each at the position of the mirror image of the other. In this case, their emitted light interferes almost exactly as it would when the atoms are right next to each other. Hence their dipole-dipole interaction assumes an additional maximum, even when the actual distance of the atoms is several orders of magnitude larger than the transition wavelength. Although the resulting ultralong-range interactions are in general relatively weak, we expect them to find applications in quantum technology, like non-invasive quantum sensing.
Submitted 27 June, 2024; v1 submitted 30 May, 2023;
originally announced May 2023.
-
MuZero with Self-competition for Rate Control in VP9 Video Compression
Authors:
Amol Mandhane,
Anton Zhernov,
Maribeth Rauh,
Chenjie Gu,
Miaosen Wang,
Flora Xue,
Wendy Shang,
Derek Pang,
Rene Claus,
Ching-Han Chiang,
Cheng Chen,
Jingning Han,
Angie Chen,
Daniel J. Mankowitz,
Jackson Broshear,
Julian Schrittwieser,
Thomas Hubert,
Oriol Vinyals,
Timothy Mann
Abstract:
Video streaming usage has seen a significant rise as entertainment, education, and business increasingly rely on online video. Optimizing video compression has the potential to increase access and quality of content to users, and reduce energy use and costs overall. In this paper, we present an application of the MuZero algorithm to the challenge of video compression. Specifically, we target the problem of learning a rate control policy to select the quantization parameters (QP) in the encoding process of libvpx, an open source VP9 video compression library widely used by popular video-on-demand (VOD) services. We treat this as a sequential decision making problem to maximize the video quality with an episodic constraint imposed by the target bitrate. Notably, we introduce a novel self-competition based reward mechanism to solve constrained RL with variable constraint satisfaction difficulty, which is challenging for existing constrained RL methods. We demonstrate that the MuZero-based rate control achieves an average 6.28% reduction in size of the compressed videos for the same delivered video quality level (measured as PSNR BD-rate) compared to libvpx's two-pass VBR rate control policy, while having better constraint satisfaction behavior.
Submitted 14 February, 2022;
originally announced February 2022.
-
Data Augmentation Can Improve Robustness
Authors:
Sylvestre-Alvise Rebuffi,
Sven Gowal,
Dan A. Calian,
Florian Stimberg,
Olivia Wiles,
Timothy Mann
Abstract:
Adversarial training suffers from robust overfitting, a phenomenon where the robust test accuracy starts to decrease during training. In this paper, we focus on reducing robust overfitting by using common data augmentation schemes. We demonstrate that, contrary to previous findings, when combined with model weight averaging, data augmentation can significantly boost robust accuracy. Furthermore, we compare various augmentation techniques and observe that spatial composition techniques work the best for adversarial training. Finally, we evaluate our approach on CIFAR-10 against $\ell_\infty$ and $\ell_2$ norm-bounded perturbations of size $ε= 8/255$ and $ε= 128/255$, respectively. We show large absolute improvements of +2.93% and +2.16% in robust accuracy compared to previous state-of-the-art methods. In particular, against $\ell_\infty$ norm-bounded perturbations of size $ε= 8/255$, our model reaches 60.07% robust accuracy without using any external data. We also achieve a significant performance boost with this approach while using other architectures and datasets such as CIFAR-100, SVHN and TinyImageNet.
Submitted 9 November, 2021;
originally announced November 2021.
-
Improving Robustness using Generated Data
Authors:
Sven Gowal,
Sylvestre-Alvise Rebuffi,
Olivia Wiles,
Florian Stimberg,
Dan Andrei Calian,
Timothy Mann
Abstract:
Recent work argues that robust training requires substantially larger datasets than those required for standard classification. On CIFAR-10 and CIFAR-100, this translates into a sizable robust-accuracy gap between models trained solely on data from the original training set and those trained with additional data extracted from the "80 Million Tiny Images" dataset (TI-80M). In this paper, we explore how generative models trained solely on the original training set can be leveraged to artificially increase the size of the original training set and improve adversarial robustness to $\ell_p$ norm-bounded perturbations. We identify the sufficient conditions under which incorporating additional generated data can improve robustness, and demonstrate that it is possible to significantly reduce the robust-accuracy gap to models trained with additional real data. Surprisingly, we show that even the addition of non-realistic random data (generated by Gaussian sampling) can improve robustness. We evaluate our approach on CIFAR-10, CIFAR-100, SVHN and TinyImageNet against $\ell_\infty$ and $\ell_2$ norm-bounded perturbations of size $ε= 8/255$ and $ε= 128/255$, respectively. We show large absolute improvements in robust accuracy compared to previous state-of-the-art methods. Against $\ell_\infty$ norm-bounded perturbations of size $ε= 8/255$, our models achieve 66.10% and 33.49% robust accuracy on CIFAR-10 and CIFAR-100, respectively (improving upon the state-of-the-art by +8.96% and +3.29%). Against $\ell_2$ norm-bounded perturbations of size $ε= 128/255$, our model achieves 78.31% on CIFAR-10 (+3.81%). These results beat most prior works that use external data.
Submitted 14 December, 2021; v1 submitted 18 October, 2021;
originally announced October 2021.
-
The quantum optics of asymmetric mirrors with coherent light absorption
Authors:
Benjamin Dawson,
Nicholas Furtak-Wells,
Thomas Mann,
Gin Jose,
Almut Beige
Abstract:
The local observables of the quantised electromagnetic field near a mirror-coated interface depend strongly on the properties of the media on {\em both} sides. In macroscopic quantum electrodynamics, this fact is taken into account with the help of optical Green's functions which correlate the position of an observer with all other spatial positions and photon frequencies. Here we present an alternative, more intuitive approach and obtain the local field observables with the help of a quantum mirror image detector method [Furtak-Wells et al., Phys. Rev. A 97, 043827 (2018)]. In order to correctly normalise electric field operators, we demand that spontaneous atomic decay rates simplify to their respective free space values far away from the reflecting surface. Our approach is interesting, since mirror-coated interfaces constitute a common basic building block for quantum photonic devices.
Submitted 2 July, 2021;
originally announced July 2021.
-
Defending Against Image Corruptions Through Adversarial Augmentations
Authors:
Dan A. Calian,
Florian Stimberg,
Olivia Wiles,
Sylvestre-Alvise Rebuffi,
Andras Gyorgy,
Timothy Mann,
Sven Gowal
Abstract:
Modern neural networks excel at image classification, yet they remain vulnerable to common image corruptions such as blur, speckle noise or fog. Recent methods that focus on this problem, such as AugMix and DeepAugment, introduce defenses that operate in expectation over a distribution of image corruptions. In contrast, the literature on $\ell_p$-norm bounded perturbations focuses on defenses against worst-case corruptions. In this work, we reconcile both approaches by proposing AdversarialAugment, a technique which optimizes the parameters of image-to-image models to generate adversarially corrupted augmented images. We theoretically motivate our method and give sufficient conditions for the consistency of its idealized version as well as that of DeepAugment. Our classifiers improve upon the state-of-the-art on common image corruption benchmarks conducted in expectation on CIFAR-10-C and improve worst-case performance against $\ell_p$-norm bounded perturbations on both CIFAR-10 and ImageNet.
Submitted 16 December, 2021; v1 submitted 2 April, 2021;
originally announced April 2021.
-
Fixing Data Augmentation to Improve Adversarial Robustness
Authors:
Sylvestre-Alvise Rebuffi,
Sven Gowal,
Dan A. Calian,
Florian Stimberg,
Olivia Wiles,
Timothy Mann
Abstract:
Adversarial training suffers from robust overfitting, a phenomenon where the robust test accuracy starts to decrease during training. In this paper, we focus on both heuristics-driven and data-driven augmentations as a means to reduce robust overfitting. First, we demonstrate that, contrary to previous findings, when combined with model weight averaging, data augmentation can significantly boost robust accuracy. Second, we explore how state-of-the-art generative models can be leveraged to artificially increase the size of the training set and further improve adversarial robustness. Finally, we evaluate our approach on CIFAR-10 against $\ell_\infty$ and $\ell_2$ norm-bounded perturbations of size $ε= 8/255$ and $ε= 128/255$, respectively. We show large absolute improvements of +7.06% and +5.88% in robust accuracy compared to previous state-of-the-art methods. In particular, against $\ell_\infty$ norm-bounded perturbations of size $ε= 8/255$, our model reaches 64.20% robust accuracy without using any external data, beating most prior works that use external data.
Submitted 18 October, 2021; v1 submitted 2 March, 2021;
originally announced March 2021.
-
Robust Constrained Reinforcement Learning for Continuous Control with Model Misspecification
Authors:
Daniel J. Mankowitz,
Dan A. Calian,
Rae Jeong,
Cosmin Paduraru,
Nicolas Heess,
Sumanth Dathathri,
Martin Riedmiller,
Timothy Mann
Abstract:
Many real-world physical control systems are required to satisfy constraints upon deployment. Furthermore, real-world systems are often subject to effects such as non-stationarity, wear-and-tear, uncalibrated sensors and so on. Such effects effectively perturb the system dynamics and can cause a policy trained successfully in one domain to perform poorly when deployed to a perturbed version of the same domain. This can affect a policy's ability to maximize future rewards as well as the extent to which it satisfies constraints. We refer to this as constrained model misspecification. We present an algorithm that mitigates this form of misspecification, and showcase its performance in multiple simulated MuJoCo tasks from the Real World Reinforcement Learning (RWRL) suite.
Submitted 3 March, 2021; v1 submitted 20 October, 2020;
originally announced October 2020.
-
Balancing Constraints and Rewards with Meta-Gradient D4PG
Authors:
Dan A. Calian,
Daniel J. Mankowitz,
Tom Zahavy,
Zhongwen Xu,
Junhyuk Oh,
Nir Levine,
Timothy Mann
Abstract:
Deploying Reinforcement Learning (RL) agents to solve real-world applications often requires satisfying complex system constraints. Often the constraint thresholds are incorrectly set due to the complex nature of a system or the inability to verify the thresholds offline (e.g., no simulator or reasonable offline evaluation procedure exists). This results in solutions where a task cannot be solved without violating the constraints. However, in many real-world cases, constraint violations are undesirable yet they are not catastrophic, motivating the need for soft-constrained RL approaches. We present a soft-constrained RL approach that utilizes meta-gradients to find a good trade-off between expected return and minimizing constraint violations. We demonstrate the effectiveness of this approach by showing that it consistently outperforms the baselines across four different MuJoCo domains.
Submitted 27 November, 2020; v1 submitted 13 October, 2020;
originally announced October 2020.
-
Uncovering the Limits of Adversarial Training against Norm-Bounded Adversarial Examples
Authors:
Sven Gowal,
Chongli Qin,
Jonathan Uesato,
Timothy Mann,
Pushmeet Kohli
Abstract:
Adversarial training and its variants have become de facto standards for learning robust deep neural networks. In this paper, we explore the landscape around adversarial training in a bid to uncover its limits. We systematically study the effect of different training losses, model sizes, activation functions, the addition of unlabeled data (through pseudo-labeling) and other factors on adversarial robustness. We discover that it is possible to train robust models that go well beyond state-of-the-art results by combining larger models, Swish/SiLU activations and model weight averaging. We demonstrate large improvements on CIFAR-10 and CIFAR-100 against $\ell_\infty$ and $\ell_2$ norm-bounded perturbations of size $8/255$ and $128/255$, respectively. In the setting with additional unlabeled data, we obtain an accuracy under attack of 65.88% against $\ell_\infty$ perturbations of size $8/255$ on CIFAR-10 (+6.35% with respect to prior art). Without additional data, we obtain an accuracy under attack of 57.20% (+3.46%). To test the generality of our findings and without any additional modifications, we obtain an accuracy under attack of 80.53% (+7.62%) against $\ell_2$ perturbations of size $128/255$ on CIFAR-10, and of 36.88% (+8.46%) against $\ell_\infty$ perturbations of size $8/255$ on CIFAR-100. All models are available at https://github.com/deepmind/deepmind-research/tree/master/adversarial_robustness.
Submitted 30 March, 2021; v1 submitted 7 October, 2020;
originally announced October 2020.
-
Non-Stationary Delayed Bandits with Intermediate Observations
Authors:
Claire Vernade,
Andras Gyorgy,
Timothy Mann
Abstract:
Online recommender systems often face long delays in receiving feedback, especially when optimizing for some long-term metrics. While mitigating the effects of delays in learning is well-understood in stationary environments, the problem becomes much more challenging when the environment changes. In fact, if the timescale of the change is comparable to the delay, it is impossible to learn about the environment, since the available observations are already obsolete. However, the arising issues can be addressed if intermediate signals are available without delay, such that given those signals, the long-term behavior of the system is stationary. To model this situation, we introduce the problem of stochastic, non-stationary, delayed bandits with intermediate observations. We develop a computationally efficient algorithm based on UCRL, and prove sublinear regret guarantees for its performance. Experimental results demonstrate that our method is able to learn in non-stationary delayed environments where existing methods fail.
Submitted 11 August, 2020; v1 submitted 3 June, 2020;
originally announced June 2020.
-
Spontaneous emission of atomic dipoles near two-sided semi-transparent mirrors
Authors:
Benjamin Dawson,
Nicholas Furtak-Wells,
Thomas Mann,
Gin Jose,
Almut Beige
Abstract:
Atom-field interactions near optical interfaces have a wide range of applications in quantum technology. Motivated by this, this paper revisits the spontaneous emission of atomic dipoles in the presence of a two-sided semi-transparent mirror. First we review the main properties of the quantised electromagnetic field near a semi-transparent mirror. To do so, we employ a quantum mirror image detector method which maps the experimental setup considered here onto analogous free space scenarios. We emphasise that the local density of states of the electromagnetic field depends on the reflection rates of both sides of the mirror surface. Hence it is not surprising that the spontaneous decay rate of an atomic dipole in front of a semi-transparent mirror also depends on both reflectance rates. Although the effect which we describe here only holds for relatively short atom-mirror distances, it can aid the design of novel photonics devices.
Submitted 22 April, 2020;
originally announced April 2020.
-
Achieving Robustness in the Wild via Adversarial Mixing with Disentangled Representations
Authors:
Sven Gowal,
Chongli Qin,
Po-Sen Huang,
Taylan Cemgil,
Krishnamurthy Dvijotham,
Timothy Mann,
Pushmeet Kohli
Abstract:
Recent research has made the surprising finding that state-of-the-art deep learning models sometimes fail to generalize to small variations of the input. Adversarial training has been shown to be an effective approach to overcome this problem. However, its application has been limited to enforcing invariance to analytically defined transformations like $\ell_p$-norm bounded perturbations. Such perturbations do not necessarily cover plausible real-world variations that preserve the semantics of the input (such as a change in lighting conditions). In this paper, we propose a novel approach to express and formalize robustness to these kinds of real-world transformations of the input. The two key ideas underlying our formulation are (1) leveraging disentangled representations of the input to define different factors of variations, and (2) generating new input images by adversarially composing the representations of different images. We use a StyleGAN model to demonstrate the efficacy of this framework. Specifically, we leverage the disentangled latent representations computed by a StyleGAN model to generate perturbations of an image that are similar to real-world variations (like adding make-up, or changing the skin-tone of a person) and train models to be invariant to these perturbations. Extensive experiments show that our method improves generalization and reduces the effect of spurious correlations (reducing the error rate of a "smile" detector by 21% for example).
Submitted 25 March, 2020; v1 submitted 6 December, 2019;
originally announced December 2019.
-
An Alternative Surrogate Loss for PGD-based Adversarial Testing
Authors:
Sven Gowal,
Jonathan Uesato,
Chongli Qin,
Po-Sen Huang,
Timothy Mann,
Pushmeet Kohli
Abstract:
Adversarial testing methods based on Projected Gradient Descent (PGD) are widely used for searching for norm-bounded perturbations that cause the inputs of neural networks to be misclassified. This paper takes a deeper look at these methods and explains the effect of different hyperparameters (i.e., optimizer, step size and surrogate loss). We introduce the concept of MultiTargeted testing, which makes clever use of alternative surrogate losses, and explain when and how MultiTargeted is guaranteed to find optimal perturbations. Finally, we demonstrate that MultiTargeted outperforms more sophisticated methods and often requires fewer iterative steps than other variants of PGD found in the literature. Notably, MultiTargeted ranks first on MadryLab's white-box MNIST and CIFAR-10 leaderboards, reducing the accuracy of their MNIST model to 88.36% (with $\ell_\infty$ perturbations of $ε= 0.3$) and the accuracy of their CIFAR-10 model to 44.03% (at $ε= 8/255$). MultiTargeted also ranks first on the TRADES leaderboard reducing the accuracy of their CIFAR-10 model to 53.07% (with $\ell_\infty$ perturbations of $ε= 0.031$).
Submitted 21 October, 2019;
originally announced October 2019.
-
Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates
Authors:
Hugo Penedones,
Carlos Riquelme,
Damien Vincent,
Hartmut Maennel,
Timothy Mann,
Andre Barreto,
Sylvain Gelly,
Gergely Neu
Abstract:
We consider the core reinforcement-learning problem of on-policy value function approximation from a batch of trajectory data, and focus on various issues of Temporal Difference (TD) learning and Monte Carlo (MC) policy evaluation. The two methods are known to achieve complementary bias-variance trade-off properties, with TD tending to achieve lower variance but potentially higher bias. In this paper, we argue that the larger bias of TD can be a result of the amplification of local approximation errors. We address this by proposing an algorithm that adaptively switches between TD and MC in each state, thus mitigating the propagation of errors. Our method is based on learned confidence intervals that detect biases of TD estimates. We demonstrate in a variety of policy evaluation tasks that this simple adaptive algorithm performs competitively with the best approach in hindsight, suggesting that learned confidence intervals are a powerful technique for adapting policy evaluation to use TD or MC returns in a data-driven way.
Submitted 19 June, 2019;
originally announced June 2019.
-
Robust Reinforcement Learning for Continuous Control with Model Misspecification
Authors:
Daniel J. Mankowitz,
Nir Levine,
Rae Jeong,
Yuanyuan Shi,
Jackie Kay,
Abbas Abdolmaleki,
Jost Tobias Springenberg,
Timothy Mann,
Todd Hester,
Martin Riedmiller
Abstract:
We provide a framework for incorporating robustness -- to perturbations in the transition dynamics which we refer to as model misspecification -- into continuous control Reinforcement Learning (RL) algorithms. We specifically focus on incorporating robustness into a state-of-the-art continuous control RL algorithm called Maximum a-posteriori Policy Optimization (MPO). We achieve this by learning a policy that optimizes for a worst case expected return objective and derive a corresponding robust entropy-regularized Bellman contraction operator. In addition, we introduce a less conservative, soft-robust, entropy-regularized objective with a corresponding Bellman operator. We show that both robust and soft-robust policies outperform their non-robust counterparts in nine MuJoCo domains with environment perturbations. In addition, we show improved robust performance on a high-dimensional, simulated, dexterous robotic hand. Finally, we present multiple investigative experiments that provide a deeper insight into the robustness framework. This includes an adaptation to another continuous control RL algorithm as well as learning the uncertainty set from offline data. Performance videos can be found online at https://sites.google.com/view/robust-rl.
Submitted 11 February, 2020; v1 submitted 18 June, 2019;
originally announced June 2019.
-
A Bayesian Approach to Robust Reinforcement Learning
Authors:
Esther Derman,
Daniel Mankowitz,
Timothy Mann,
Shie Mannor
Abstract:
Robust Markov Decision Processes (RMDPs) intend to ensure robustness with respect to changing or adversarial system behavior. In this framework, transitions are modeled as arbitrary elements of a known and properly structured uncertainty set and a robust optimal policy can be derived under the worst-case scenario. In this study, we address the issue of learning in RMDPs using a Bayesian approach. We introduce the Uncertainty Robust Bellman Equation (URBE) which encourages safe exploration for adapting the uncertainty set to new observations while preserving robustness. We propose a URBE-based algorithm, DQN-URBE, that scales this method to higher dimensional domains. Our experiments show that the derived URBE-based strategy leads to a better trade-off between less conservative solutions and robustness in the presence of model misspecification. In addition, we show that the DQN-URBE algorithm can adapt significantly faster to changing dynamics online compared to existing robust techniques with fixed uncertainty sets.
Submitted 23 July, 2019; v1 submitted 20 May, 2019;
originally announced May 2019.
-
On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models
Authors:
Sven Gowal,
Krishnamurthy Dvijotham,
Robert Stanforth,
Rudy Bunel,
Chongli Qin,
Jonathan Uesato,
Relja Arandjelovic,
Timothy Mann,
Pushmeet Kohli
Abstract:
Recent work has shown that it is possible to train deep neural networks that are provably robust to norm-bounded adversarial perturbations. Most of these methods are based on minimizing an upper bound on the worst-case loss over all possible adversarial perturbations. While these techniques show promise, they often result in difficult optimization procedures that remain hard to scale to larger networks. Through a comprehensive analysis, we show how a simple bounding technique, interval bound propagation (IBP), can be exploited to train large provably robust neural networks that beat the state-of-the-art in verified accuracy. While the upper bound computed by IBP can be quite weak for general networks, we demonstrate that an appropriate loss and clever hyper-parameter schedule allow the network to adapt such that the IBP bound is tight. This results in a fast and stable learning algorithm that outperforms more sophisticated methods and achieves state-of-the-art results on MNIST, CIFAR-10 and SVHN. It also allows us to train the largest model to be verified beyond vacuous bounds on a downscaled version of ImageNet.
Submitted 29 August, 2019; v1 submitted 30 October, 2018;
originally announced October 2018.
-
Learning from Delayed Outcomes via Proxies with Applications to Recommender Systems
Authors:
Timothy A. Mann,
Sven Gowal,
András György,
Ray Jiang,
Huiyi Hu,
Balaji Lakshminarayanan,
Prav Srinivasan
Abstract:
Predicting delayed outcomes is an important problem in recommender systems (e.g., if customers will finish reading an ebook). We formalize the problem as an adversarial, delayed online learning problem and consider how a proxy for the delayed outcome (e.g., if customers read a third of the book in 24 hours) can help minimize regret, even though the proxy is not available when making a prediction. Motivated by our regret analysis, we propose two neural network architectures: Factored Forecaster (FF) which is ideal if the proxy is informative of the outcome in hindsight, and Residual Factored Forecaster (RFF) that is robust to a non-informative proxy. Experiments on two real-world datasets for predicting human behavior show that RFF outperforms both FF and a direct forecaster that does not make use of the proxy. Our results suggest that exploiting proxies by factorization is a promising way to mitigate the impact of long delays in human-behavior prediction tasks.
Submitted 15 October, 2019; v1 submitted 24 July, 2018;
originally announced July 2018.
-
Temporal Difference Learning with Neural Networks - Study of the Leakage Propagation Problem
Authors:
Hugo Penedones,
Damien Vincent,
Hartmut Maennel,
Sylvain Gelly,
Timothy Mann,
Andre Barreto
Abstract:
Temporal-Difference learning (TD) [Sutton, 1988] with function approximation can converge to solutions that are worse than those obtained by Monte-Carlo regression, even in the simple case of on-policy evaluation. To increase our understanding of the problem, we investigate the issue of approximation errors in areas of sharp discontinuities of the value function being further propagated by bootstrap updates. We show empirical evidence of this leakage propagation, and show analytically that it must occur, in a simple Markov chain, when function approximation errors are present. For reversible policies, the result can be interpreted as the tension between two terms of the loss function that TD minimises, as recently described by [Ollivier, 2018]. We show that the upper bounds from [Tsitsiklis and Van Roy, 1997] hold, but they do not imply that leakage propagation occurs and under what conditions. Finally, we test whether the problem could be mitigated with a better state representation, and whether it can be learned in an unsupervised manner, without rewards or privileged information.
Submitted 9 July, 2018;
originally announced July 2018.
-
A Dual Approach to Scalable Verification of Deep Networks
Authors:
Krishnamurthy Dvijotham,
Robert Stanforth,
Sven Gowal,
Timothy Mann,
Pushmeet Kohli
Abstract:
This paper addresses the problem of formally verifying desirable properties of neural networks, i.e., obtaining provable guarantees that neural networks satisfy specifications relating their inputs and outputs (robustness to bounded norm adversarial perturbations, for example). Most previous work on this topic was limited in its applicability by the size of the network, network architecture and the complexity of properties to be verified. In contrast, our framework applies to a general class of activation functions and specifications on neural network inputs and outputs. We formulate verification as an optimization problem (seeking to find the largest violation of the specification) and solve a Lagrangian relaxation of the optimization problem to obtain an upper bound on the worst case violation of the specification being verified. Our approach is anytime i.e. it can be stopped at any time and a valid bound on the maximum violation can be obtained. We develop specialized verification algorithms with provable tightness guarantees under special assumptions and demonstrate the practical significance of our general verification approach on a variety of verification tasks.
Submitted 3 August, 2018; v1 submitted 17 March, 2018;
originally announced March 2018.
-
Soft-Robust Actor-Critic Policy-Gradient
Authors:
Esther Derman,
Daniel J. Mankowitz,
Timothy A. Mann,
Shie Mannor
Abstract:
Robust Reinforcement Learning aims to derive optimal behavior that accounts for model uncertainty in dynamical systems. However, previous studies have shown that by considering the worst case scenario, robust policies can be overly conservative. Our soft-robust framework is an attempt to overcome this issue. In this paper, we present a novel Soft-Robust Actor-Critic algorithm (SR-AC). It learns an optimal policy with respect to a distribution over an uncertainty set and stays robust to model uncertainty but avoids the conservativeness of robust strategies. We show the convergence of SR-AC and test the efficiency of our approach on different domains by comparing it against regular learning methods and their robust formulations.
Submitted 24 October, 2018; v1 submitted 11 March, 2018;
originally announced March 2018.
-
Beyond Greedy Ranking: Slate Optimization via List-CVAE
Authors:
Ray Jiang,
Sven Gowal,
Timothy A. Mann,
Danilo J. Rezende
Abstract:
The conventional solution to the recommendation problem greedily ranks individual document candidates by prediction scores. However, this method fails to optimize the slate as a whole, and hence, often struggles to capture biases caused by the page layout and document interdependencies. The slate recommendation problem aims to directly find the optimally ordered subset of documents (i.e. slates) that best serve users' interests. Solving this problem is hard due to the combinatorial explosion in all combinations of document candidates and their display positions on the page. Therefore we propose a paradigm shift from the traditional viewpoint of solving a ranking problem to a direct slate generation framework. In this paper, we introduce List Conditional Variational Auto-Encoders (List-CVAE), which learns the joint distribution of documents on the slate conditioned on user responses, and directly generates full slates. Experiments on simulated and real-world data show that List-CVAE outperforms popular comparable ranking methods consistently on various scales of document corpora.
Submitted 23 February, 2019; v1 submitted 5 March, 2018;
originally announced March 2018.
-
Learning Robust Options
Authors:
Daniel J. Mankowitz,
Timothy A. Mann,
Pierre-Luc Bacon,
Doina Precup,
Shie Mannor
Abstract:
Robust reinforcement learning aims to produce policies that have strong guarantees even in the face of environments/transition models whose parameters have strong uncertainty. Existing work uses value-based methods and the usual primitive action setting. In this paper, we propose robust methods for learning temporally abstract actions, in the framework of options. We present a Robust Options Policy Iteration (ROPI) algorithm with convergence guarantees, which learns options that are robust to model uncertainty. We utilize ROPI to learn robust options with the Robust Options Deep Q Network (RO-DQN) that solves multiple tasks and mitigates model misspecification due to model uncertainty. We present experimental results which suggest that policy iteration with linear features may have an inherent form of robustness when using coarse feature representations. In addition, we present experimental results which demonstrate that robustness helps policy iteration implemented on top of deep neural networks to generalize over a much broader range of dynamics than non-robust policy iteration.
Submitted 9 February, 2018;
originally announced February 2018.
-
Adaptive Lambda Least-Squares Temporal Difference Learning
Authors:
Timothy A. Mann,
Hugo Penedones,
Shie Mannor,
Todd Hester
Abstract:
Temporal Difference learning or TD($λ$) is a fundamental algorithm in the field of reinforcement learning. However, setting TD's $λ$ parameter, which controls the timescale of TD updates, is generally left up to the practitioner. We formalize the $λ$ selection problem as a bias-variance trade-off where the solution is the value of $λ$ that leads to the smallest Mean Squared Value Error (MSVE). To solve this trade-off we suggest applying Leave-One-Trajectory-Out Cross-Validation (LOTO-CV) to search the space of $λ$ values. Unfortunately, this approach is too computationally expensive for most practical applications. For Least Squares TD (LSTD) we show that LOTO-CV can be implemented efficiently to automatically tune $λ$ and apply function optimization methods to efficiently search the space of $λ$ values. The resulting algorithm, ALLSTD, is parameter free and our experiments demonstrate that ALLSTD is significantly computationally faster than the naïve LOTO-CV implementation while achieving similar performance.
Submitted 30 December, 2016;
originally announced December 2016.
-
Adaptive Skills, Adaptive Partitions (ASAP)
Authors:
Daniel J. Mankowitz,
Timothy A. Mann,
Shie Mannor
Abstract:
We introduce the Adaptive Skills, Adaptive Partitions (ASAP) framework that (1) learns skills (i.e., temporally extended actions or options) as well as (2) where to apply them. We believe that both (1) and (2) are necessary for a truly general skill learning framework, which is a key building block needed to scale up to lifelong learning agents. The ASAP framework can also solve related new tasks simply by adapting where it applies its existing learned skills. We prove that ASAP converges to a local optimum under natural conditions. Finally, our experimental results, which include a RoboCup domain, demonstrate the ability of ASAP to learn where to reuse skills as well as solve multiple tasks with considerably less experience than solving each task from scratch.
Submitted 7 June, 2016; v1 submitted 10 February, 2016;
originally announced February 2016.
-
Iterative Hierarchical Optimization for Misspecified Problems (IHOMP)
Authors:
Daniel J. Mankowitz,
Timothy A. Mann,
Shie Mannor
Abstract:
For complex, high-dimensional Markov Decision Processes (MDPs), it may be necessary to represent the policy with function approximation. A problem is misspecified whenever the representation cannot express any policy with acceptable performance. We introduce IHOMP: an approach for solving misspecified problems. IHOMP iteratively learns a set of context-specialized options and combines these options to solve an otherwise misspecified problem. Our main contribution is proving that IHOMP enjoys theoretical convergence guarantees. In addition, we extend IHOMP to exploit Option Interruption (OI), enabling it to decide where the learned options can be reused. Our experiments demonstrate that IHOMP can find near-optimal solutions to otherwise misspecified problems and that OI can further improve the solutions.
Submitted 7 June, 2016; v1 submitted 10 February, 2016;
originally announced February 2016.
-
Deep Reinforcement Learning in Large Discrete Action Spaces
Authors:
Gabriel Dulac-Arnold,
Richard Evans,
Hado van Hasselt,
Peter Sunehag,
Timothy Lillicrap,
Jonathan Hunt,
Timothy Mann,
Theophane Weber,
Thomas Degris,
Ben Coppin
Abstract:
Being able to reason in an environment with a large number of discrete actions is essential to bringing reinforcement learning to a larger class of problems. Recommender systems, industrial plants and language models are only some of the many real-world tasks involving large numbers of discrete actions for which current methods are difficult or even often impossible to apply. An ability to generalize over the set of actions as well as sub-linear complexity relative to the size of the set are both necessary to handle such tasks. Current approaches are not able to provide both of these, which motivates the work in this paper. Our proposed approach leverages prior information about the actions to embed them in a continuous space upon which it can generalize. Additionally, approximate nearest-neighbor methods allow for logarithmic-time lookup complexity relative to the number of actions, which is necessary for time-wise tractable training. This combined approach allows reinforcement learning methods to be applied to large-scale learning problems previously intractable with current methods. We demonstrate our algorithm's abilities on a series of tasks having up to one million actions.
Submitted 4 April, 2016; v1 submitted 23 December, 2015;
originally announced December 2015.
-
Bootstrapping Skills
Authors:
Daniel J. Mankowitz,
Timothy A. Mann,
Shie Mannor
Abstract:
The monolithic approach to policy representation in Markov Decision Processes (MDPs) looks for a single policy that can be represented as a function from states to actions. For the monolithic approach to succeed (and this is not always possible), a complex feature representation is often necessary since the policy is a complex object that has to prescribe what actions to take all over the state space. This is especially true in large domains with complicated dynamics. It is also computationally inefficient to both learn and plan in MDPs using a complex monolithic approach. We present a different approach where we restrict the policy space to policies that can be represented as combinations of simpler, parameterized skills, a type of temporally extended action with a simple policy representation. We introduce Learning Skills via Bootstrapping (LSB) that can use a broad family of Reinforcement Learning (RL) algorithms as a "black box" to iteratively learn parameterized skills. Initially, the learned skills are short-sighted but each iteration of the algorithm allows the skills to bootstrap off one another, improving each skill in the process. We prove that this bootstrapping process returns a near-optimal policy. Furthermore, our experiments demonstrate that LSB can solve MDPs that, given the same representational power, could not be solved by a monolithic approach. Thus, planning with learned skills results in better policies without requiring complex policy representations.
Submitted 11 June, 2015;
originally announced June 2015.
-
MEIC Design Summary
Authors:
S. Abeyratne,
D. Barber,
A. Bogacz,
P. Brindza,
Y. Cai,
A. Camsonne,
A. Castilla,
P. Chevtsov,
E. Daly,
Y. S. Derbenev,
D. Douglas,
V. Dudnikov,
R. Ent,
B. Erdelyi,
Y. Filatov,
D. Gaskell,
J. Grames,
J. Guo,
L. Harwood,
A. Hutton,
C. Hyde,
K. Jordan,
A. Kimber,
G. A. Krafft,
A. Kondratenko,
et al. (30 additional authors not shown)
Abstract:
This document summarizes the design of Jefferson Lab's electron-ion collider, MEIC, as of January 20, 2015, and describes the facility whose cost was estimated for the United States Department of Energy Nuclear Sciences Advisory Committee EIC cost review of January 26-28, 2015. In particular, each of the main technical systems within the collider is presented to the level of the best current information.
Submitted 29 April, 2015;
originally announced April 2015.
-
Actively Learning to Attract Followers on Twitter
Authors:
Nir Levine,
Timothy A. Mann,
Shie Mannor
Abstract:
Twitter, a popular social network, presents great opportunities for on-line machine learning research. However, previous research has focused almost entirely on learning from passively collected data. We study the problem of learning to acquire followers through normative user behavior, as opposed to the mass following policies applied by many bots. We formalize the problem as a contextual bandit problem, in which we consider retweeting content to be the action chosen and each tweet (content) is accompanied by context. We design reward signals based on the change in followers. The result of our month long experiment with 60 agents suggests that (1) aggregating experience across agents can adversely impact prediction accuracy and (2) the Twitter community's response to different actions is non-stationary. Our findings suggest that actively learning on-line can provide deeper insights about how to attract followers than machine learning over passively collected data alone.
Submitted 16 April, 2015;
originally announced April 2015.
-
Off-policy evaluation for MDPs with unknown structure
Authors:
Assaf Hallak,
François Schnitzler,
Timothy Mann,
Shie Mannor
Abstract:
Off-policy learning in dynamic decision problems is essential for providing strong evidence that a new policy is better than the one in use. But how can we prove superiority without testing the new policy? To answer this question, we introduce the G-SCOPE algorithm that evaluates a new policy based on data generated by the existing policy. Our algorithm is both computationally and sample efficient because it greedily learns to exploit factored structure in the dynamics of the environment. We present a finite sample analysis of our approach and show through experiments that the algorithm scales well on high-dimensional problems with few samples.
Submitted 11 February, 2015;
originally announced February 2015.
-
Higgs Factory and 100 TeV Hadron Collider: Opportunity for a New World Laboratory within a Decade
Authors:
Saeed Assadi,
Chase Collins,
Peter McIntyre,
James Gerity,
Joshua Kellams,
Thomas Mann,
Christopher Mathewson,
Nathaniel Pogue,
Akhdiyor Sattarov,
Richard York
Abstract:
Suggestions have been made for an 80-100 km circumference Future Circular Collider (FCC) that could ultimately contain a circular e+e- ring collider operating as a Higgs Factory as well as a 100 TeV hadron collider. Those suggestions have motivated us to propose an approach in which the project is sited at the location of the SSC tunnel, which has the lowest tunnel cost ever. The low tunnel cost would make it cost-effective to locate the 100 TeV Hadron Collider in a 270 km circumference tunnel, using 4.5 Tesla superconducting magnets. The SSC tunnel itself would be used to house the Higgs Factory and the injector for the Hadron Collider. The injector for the Higgs Factory would also be used as a driver for an X-ray Free Electron Laser with unique capabilities for protein crystallography. The location of the project at a site with favorable geotechnology for minimum-cost tunneling, and low-cost/low-risk technology for the SRF and superconducting magnets, open the possibility to build the proposed laboratory within a decade.
Submitted 24 February, 2014;
originally announced February 2014.