-
tsGT: Stochastic Time Series Modeling With Transformer
Authors:
Łukasz Kuciński,
Witold Drzewakowski,
Mateusz Olko,
Piotr Kozakowski,
Łukasz Maziarka,
Marta Emilia Nowakowska,
Łukasz Kaiser,
Piotr Miłoś
Abstract:
Time series methods are of fundamental importance in virtually any field of science that deals with temporally structured data. Recently, there has been a surge of deterministic transformer models with time series-specific architectural biases. In this paper, we go in a different direction by introducing tsGT, a stochastic time series model built on a general-purpose transformer architecture. We focus on using a well-known and theoretically justified rolling window backtesting and evaluation protocol. We show that tsGT outperforms the state-of-the-art models on MAD and RMSE, and surpasses its stochastic peers on QL and CRPS, on four commonly used datasets. We complement these results with a detailed analysis of tsGT's ability to model the data distribution and predict marginal quantile values.
Submitted 3 April, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Efficient Numerical Wave Propagation Enhanced By An End-to-End Deep Learning Model
Authors:
Luis Kaiser,
Richard Tsai,
Christian Klingenberg
Abstract:
Recent advances in wave modeling use sufficiently accurate fine solver outputs to train a neural network that enhances the accuracy of a fast but inaccurate coarse solver. In this paper we build upon the work of Nguyen and Tsai (2023) and present a novel unified system that integrates a numerical solver with a deep learning component into an end-to-end framework. In the proposed setting, we investigate refinements to the network architecture and data generation algorithm. A stable and fast solver further allows the use of Parareal, a parallel-in-time algorithm to correct high-frequency wave components. Our results show that the cohesive structure improves performance without sacrificing speed, and demonstrate the importance of temporal dynamics, as well as Parareal, for accurate wave propagation.
Submitted 13 February, 2024; v1 submitted 3 February, 2024;
originally announced February 2024.
-
Stochastic modelling of cosmic ray sources for diffuse high-energy gamma-rays and neutrinos
Authors:
Anton Stall,
Leonard Kaiser,
Philipp Mertsch
Abstract:
Cosmic rays of energies up to a few PeV are believed to be of galactic origin, yet individual sources have still not been firmly identified. Due to inelastic collisions with the interstellar gas, cosmic-ray nuclei produce a diffuse flux of high-energy gamma-rays and neutrinos. Fermi-LAT has provided maps of galactic gamma-rays at GeV energies which can be produced by both hadronic and leptonic processes. Neutrinos, on the other hand, are exclusively produced by the sought-after hadronic processes, yet they can be detected above backgrounds only at hundreds of TeV. Oftentimes, diffuse emission maps are extrapolated from GeV to PeV energies, but the sources contributing at either energy likely differ. We have modelled the production of diffuse emission from GeV through PeV energies in a Monte Carlo approach, taking into consideration the discrete nature of sources. We can generate realisations of the diffuse sky in a matter of seconds, thus allowing for characterising correlations in direction and energy. At hundreds of TeV, relevant for observations with LHAASO, Tibet AS-gamma, IceCube and the upcoming SWGO, variations between different realisations are sizeable. Specifically, we show that extrapolations of diffuse emission from GeV to PeV energies must fail and apply our results to the recent experimental findings.
Submitted 6 September, 2023;
originally announced September 2023.
-
GPT-4 Technical Report
Authors:
OpenAI,
Josh Achiam,
Steven Adler,
Sandhini Agarwal,
Lama Ahmad,
Ilge Akkaya,
Florencia Leoni Aleman,
Diogo Almeida,
Janko Altenschmidt,
Sam Altman,
Shyamal Anadkat,
Red Avila,
Igor Babuschkin,
Suchir Balaji,
Valerie Balcom,
Paul Baltescu,
Haiming Bao,
Mohammad Bavarian,
Jeff Belgum,
Irwan Bello,
Jake Berdine,
Gabriel Bernadett-Shapiro,
Christopher Berner,
Lenny Bogdonoff,
Oleg Boiko
, et al. (256 additional authors not shown)
Abstract:
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
Submitted 4 March, 2024; v1 submitted 15 March, 2023;
originally announced March 2023.
-
Modelling the response of a turbulent jet flame to acoustic forcing in a linearized framework using an active flame approach
Authors:
Thomas Ludwig Kaiser,
Gregoire Varillon,
Wolfgang Polifke,
Feichi Zhang,
Thorsten Zirwes,
Henning Bockhorn,
Kilian Oberleithner
Abstract:
This study performs a linear analysis of a turbulent reacting methane-air jet flame, with the goal of predicting the response of the reacting flow to upstream acoustic actuation. Accounting for heat release fluctuations is a vital component when investigating thermoacoustic instabilities and flame noise in a linearized framework. Unlike previous studies this work develops and applies an active flame approach, meaning the heat release oscillations of the flame resulting from the acoustic fluctuations are taken into account. To yield an active flame approach in the linear framework, a combustion model needs to be linearized. It is demonstrated that linearizing Large Eddy Simulation (LES) and Direct Numerical Simulation (DNS) combustion models leads to closure problems, making their application in the linearized framework troublesome. Reynolds-averaged Navier Stokes (RANS) combustion models, however, prove to circumvent this problem, which makes them suitable candidates for this purpose. The RANS combustion models are linearized around the temporal mean flow of the turbulent jet flame, which is obtained by LES. An a priori analysis shows that a linearized RANS-Eddy Break Up (EBU) model is the best suited among all investigated combustion models for the investigated set-up and reproduces with high accuracy the fluctuations in reaction rate obtained in the LES. Furthermore, the linearized governing equations of the flow including the linearized EBU model for the reaction rate are solved for incoming acoustic perturbations. The response modes show that the reaction rate oscillations are caused by Kelvin-Helmholtz vortex rings, which perturb the jet flame. The results are in good agreement with the LES simulations in terms of the mode shapes of both reaction rate and velocity fluctuations.
Submitted 1 December, 2022; v1 submitted 24 November, 2022;
originally announced November 2022.
-
Mean flow data assimilation based on physics-informed neural networks
Authors:
Jakob G. R. von Saldern,
Johann Moritz Reumschüssel,
Thomas L. Kaiser,
Moritz Sieber,
Kilian Oberleithner
Abstract:
Physics-informed neural networks (PINNs) can be used to solve partial differential equations (PDEs) and identify hidden variables by incorporating the governing equations into neural network training. In this study, we apply PINNs to the assimilation of turbulent mean flow data and investigate the method's ability to identify inaccessible variables and closure terms from sparse data. Using high-fidelity large-eddy simulation (LES) data and particle image velocimetry (PIV) measured mean fields, we show that PINNs are suitable for simultaneously identifying multiple missing quantities in turbulent flows and providing continuous and differentiable mean fields consistent with the provided PDEs. In this way, consistent and complete mean states can be provided, which are essential for linearized mean field methods. The presented method does not require a grid or discretization scheme, is easy to implement, and can be used for a wide range of applications, making it a very promising tool for mean field-based methods in fluid mechanics.
Submitted 8 December, 2022; v1 submitted 5 August, 2022;
originally announced August 2022.
-
Sparse is Enough in Scaling Transformers
Authors:
Sebastian Jaszczur,
Aakanksha Chowdhery,
Afroz Mohiuddin,
Łukasz Kaiser,
Wojciech Gajewski,
Henryk Michalewski,
Jonni Kanerva
Abstract:
Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study sparse variants for all layers in the Transformer and propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer as we scale up the model size. Surprisingly, the sparse layers are enough to obtain the same perplexity as the standard Transformer with the same number of parameters. We also integrate with prior sparsity approaches to attention and enable fast inference on long sequences even with limited memory. This results in performance competitive to the state-of-the-art on long text summarization.
Submitted 24 November, 2021;
originally announced November 2021.
-
Shared Model of Sense-making for Human-Machine Collaboration
Authors:
Gheorghe Tecuci,
Dorin Marcu,
Louis Kaiser,
Mihai Boicu
Abstract:
We present a model of sense-making that greatly facilitates the collaboration between an intelligent analyst and a knowledge-based agent. It is a general model grounded in the science of evidence and the scientific method of hypothesis generation and testing, where sense-making hypotheses that explain an observation are generated, relevant evidence is then discovered, and the hypotheses are tested based on the discovered evidence. We illustrate how the model enables an analyst to directly instruct the agent to understand situations involving the possible production of weapons (e.g., chemical warfare agents) and how the agent becomes increasingly more competent in understanding other situations from that domain (e.g., possible production of centrifuge-enriched uranium or of stealth fighter aircraft).
Submitted 5 November, 2021;
originally announced November 2021.
-
Training Verifiers to Solve Math Word Problems
Authors:
Karl Cobbe,
Vineet Kosaraju,
Mohammad Bavarian,
Mark Chen,
Heewoo Jun,
Lukasz Kaiser,
Matthias Plappert,
Jerry Tworek,
Jacob Hilton,
Reiichiro Nakano,
Christopher Hesse,
John Schulman
Abstract:
State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
Submitted 17 November, 2021; v1 submitted 27 October, 2021;
originally announced October 2021.
-
Hierarchical Transformers Are More Efficient Language Models
Authors:
Piotr Nawrot,
Szymon Tworkowski,
Michał Tyrolski,
Łukasz Kaiser,
Yuhuai Wu,
Christian Szegedy,
Henryk Michalewski
Abstract:
Transformer models yield impressive results on many NLP and sequence modeling tasks. Remarkably, Transformers can handle long sequences which allows them to produce long coherent outputs: full paragraphs produced by GPT-3 or well-structured images produced by DALL-E. These large language models are impressive but also very inefficient and costly, which limits their applications and accessibility. We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences. To verify this claim, we first study different ways to downsample and upsample activations in Transformers so as to make them hierarchical. We use the best performing upsampling and downsampling layers to create Hourglass - a hierarchical Transformer language model. Hourglass improves upon the Transformer baseline given the same amount of computation and can yield the same results as Transformers more efficiently. In particular, Hourglass sets new state-of-the-art for Transformer models on the ImageNet32 generation task and improves language modeling efficiency on the widely studied enwik8 benchmark.
Submitted 16 April, 2022; v1 submitted 26 October, 2021;
originally announced October 2021.
-
Measuring the photoelectron emission delay in the molecular frame
Authors:
Jonas Rist,
Kim Klyssek,
Nikolay M. Novikovskiy,
Max Kircher,
Isabel Vela-Pérez,
Daniel Trabert,
Sven Grundmann,
Dimitrios Tsitsonis,
Juliane Siebert,
Angelina Geyer,
Niklas Melzer,
Christian Schwarz,
Nils Anders,
Leon Kaiser,
Kilian Fehre,
Alexander Hartung,
Sebastian Eckart,
Lothar Ph. H. Schmidt,
Markus S. Schöffler,
Vernon T. Davis,
Joshua B. Williams,
Florian Trinter,
Reinhard Dörner,
Philipp V. Demekhin,
Till Jahnke
Abstract:
If matter absorbs a photon of sufficient energy it emits an electron. The question of the duration of the emission process has intrigued scientists for decades. With the advent of attosecond metrology, experiments addressing such ultrashort intervals became possible. While these types of studies require attosecond experimental precision, we present here a novel measurement approach that avoids those experimental difficulties. We instead extract the emission delay from the interference pattern generated as the emitted photoelectron is diffracted by the parent ion's potential. Targeting core electrons in CO, we measured a 2d map of photoelectron emission delays in the molecular frame over a wide range of electron energies. The measured emission times depend drastically on the emission direction and exhibit characteristic changes along the shape resonance of the molecule. Our approach can be routinely extended to other electron orbitals and more complex molecules.
Submitted 13 July, 2021;
originally announced July 2021.
-
Evaluating Large Language Models Trained on Code
Authors:
Mark Chen,
Jerry Tworek,
Heewoo Jun,
Qiming Yuan,
Henrique Ponde de Oliveira Pinto,
Jared Kaplan,
Harri Edwards,
Yuri Burda,
Nicholas Joseph,
Greg Brockman,
Alex Ray,
Raul Puri,
Gretchen Krueger,
Michael Petrov,
Heidy Khlaaf,
Girish Sastry,
Pamela Mishkin,
Brooke Chan,
Scott Gray,
Nick Ryder,
Mikhail Pavlov,
Alethea Power,
Lukasz Kaiser,
Mohammad Bavarian,
Clemens Winter
, et al. (33 additional authors not shown)
Abstract:
We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.
Submitted 14 July, 2021; v1 submitted 7 July, 2021;
originally announced July 2021.
-
Q-Value Weighted Regression: Reinforcement Learning with Limited Data
Authors:
Piotr Kozakowski,
Łukasz Kaiser,
Henryk Michalewski,
Afroz Mohiuddin,
Katarzyna Kańska
Abstract:
Sample efficiency and performance in the offline setting have emerged as significant challenges of deep reinforcement learning. We introduce Q-Value Weighted Regression (QWR), a simple RL algorithm that excels in these aspects. QWR is an extension of Advantage Weighted Regression (AWR), an off-policy actor-critic algorithm that performs very well on continuous control tasks, also in the offline setting, but has low sample efficiency and struggles with high-dimensional observation spaces. We perform an analysis of AWR that explains its shortcomings and use these insights to motivate QWR. We show experimentally that QWR matches the state-of-the-art algorithms both on tasks with continuous and discrete actions. In particular, QWR yields results on par with SAC on the MuJoCo suite and - with the same set of hyperparameters - yields results on par with a highly tuned Rainbow implementation on a set of Atari games. We also verify that QWR performs well in the offline RL setting.
Submitted 12 February, 2021;
originally announced February 2021.
-
Zeptosecond Birth Time Delay in Molecular Photoionization
Authors:
Sven Grundmann,
Daniel Trabert,
Kilian Fehre,
Nico Strenger,
Andreas Pier,
Leon Kaiser,
Max Kircher,
Miriam Weller,
Sebastian Eckart,
Lothar Ph. H. Schmidt,
Florian Trinter,
Till Jahnke,
Markus S. Schöffler,
Reinhard Dörner
Abstract:
Photoionization is one of the fundamental light-matter interaction processes in which the absorption of a photon launches the escape of an electron. The time scale of the process poses many open questions. Experiments found time delays in the attosecond ($10^{-18}$ s) domain between electron ejection from different orbitals, electronic bands, or in different directions. Here, we demonstrate that across a molecular orbital the electron is not launched at the same time. The birth time rather depends on the travel time of the photon across the molecule, which is 247 zeptoseconds ($10^{-21}$ s) for the average bond length of H$_2$. Using an electron interferometric technique, we resolve this birth time delay between electron emission from the two centers of the hydrogen molecule.
Submitted 16 October, 2020;
originally announced October 2020.
-
Rethinking Attention with Performers
Authors:
Krzysztof Choromanski,
Valerii Likhosherstov,
David Dohan,
Xingyou Song,
Andreea Gane,
Tamas Sarlos,
Peter Hawkins,
Jared Davis,
Afroz Mohiuddin,
Lukasz Kaiser,
David Belanger,
Lucy Colwell,
Adrian Weller
Abstract:
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
Submitted 19 November, 2022; v1 submitted 30 September, 2020;
originally announced September 2020.
-
Revealing the Two-Electron Cusp in the Ground States of He and H2 via Quasifree Double Photoionization
Authors:
S. Grundmann,
V. Serov,
F. Trinter,
K. Fehre,
N. Strenger,
A. Pier,
M. Kircher,
D. Trabert,
M. Weller,
J. Rist,
L. Kaiser,
A. W. Bray,
L. Ph. H. Schmidt,
J. B. Williams,
T. Jahnke,
R. Dörner,
M. S. Schöffler,
A. S. Kheifets
Abstract:
We report on kinematically complete measurements and ab initio non-perturbative calculations of double ionization of He and H2 by a single 800 eV circularly polarized photon. We confirm the quasifree mechanism of photoionization for H2 and show how it originates from the two-electron cusp in the ground state of a two-electron target. Our approach establishes a new method for mapping electrons relative to each other and provides valuable insight into photoionization beyond the electric-dipole approximation.
Submitted 1 July, 2020; v1 submitted 21 January, 2020;
originally announced January 2020.
-
Reformer: The Efficient Transformer
Authors:
Nikita Kitaev,
Łukasz Kaiser,
Anselm Levskaya
Abstract:
Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O($L^2$) to O($L\log L$), where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
Submitted 18 February, 2020; v1 submitted 13 January, 2020;
originally announced January 2020.
-
Parallel Scheduled Sampling
Authors:
Daniel Duckworth,
Arvind Neelakantan,
Ben Goodrich,
Lukasz Kaiser,
Samy Bengio
Abstract:
Auto-regressive models are widely used in sequence generation problems. The output sequence is typically generated in a predetermined order, one discrete unit (pixel or word or character) at a time. The models are trained by teacher-forcing where ground-truth history is fed to the model as input, which at test time is replaced by the model prediction. Scheduled Sampling aims to mitigate this discrepancy between train and test time by randomly replacing some discrete units in the history with the model's prediction. While teacher-forced training works well with ML accelerators as the computation can be parallelized across time, Scheduled Sampling involves undesirable sequential processing. In this paper, we introduce a simple technique to parallelize Scheduled Sampling across time. Experimentally, we find the proposed technique leads to equivalent or better performance on image generation, summarization, dialog generation, and translation compared to teacher-forced training. In dialog response generation task, Parallel Scheduled Sampling achieves 1.6 BLEU score (11.5%) improvement over teacher-forcing while in image generation it achieves 20% and 13.8% improvement in Frechet Inception Distance (FID) and Inception Score (IS) respectively. Further, we discuss the effects of different hyper-parameters associated with Scheduled Sampling on the model performance.
Submitted 21 October, 2019; v1 submitted 10 June, 2019;
originally announced June 2019.
-
Sample Efficient Text Summarization Using a Single Pre-Trained Transformer
Authors:
Urvashi Khandelwal,
Kevin Clark,
Dan Jurafsky,
Lukasz Kaiser
Abstract:
Language model (LM) pre-training has resulted in impressive performance and sample efficiency on a variety of language understanding tasks. However, it remains unclear how to best use pre-trained LMs for generation tasks such as abstractive summarization, particularly to enhance sample efficiency. In these sequence-to-sequence settings, prior work has experimented with loading pre-trained weights into the encoder and/or decoder networks, but used non-pre-trained encoder-decoder attention weights. We instead use a pre-trained decoder-only network, where the same Transformer LM both encodes the source and generates the summary. This ensures that all parameters in the network, including those governing attention over source states, have been pre-trained before the fine-tuning step. Experiments on the CNN/Daily Mail dataset show that our pre-trained Transformer LM substantially improves over pre-trained Transformer encoder-decoder networks in limited-data settings. For instance, it achieves 13.1 ROUGE-2 using only 1% of the training data (~3000 examples), while pre-trained encoder-decoder models score 2.3 ROUGE-2.
Submitted 21 May, 2019;
originally announced May 2019.
-
Model-Based Reinforcement Learning for Atari
Authors:
Lukasz Kaiser,
Mohammad Babaeizadeh,
Piotr Milos,
Blazej Osinski,
Roy H Campbell,
Konrad Czechowski,
Dumitru Erhan,
Chelsea Finn,
Piotr Kozakowski,
Sergey Levine,
Afroz Mohiuddin,
Ryan Sepassi,
George Tucker,
Henryk Michalewski
Abstract:
Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction -- substantially more, in fact, than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and predict which actions will lead to desirable outcomes. In this paper, we explore how video prediction models can similarly enable agents to solve Atari games with fewer interactions than model-free methods. We describe Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models and present a comparison of several model architectures, including a novel architecture that yields the best results in our setting. Our experiments evaluate SimPLe on a range of Atari games in low data regime of 100k interactions between the agent and the environment, which corresponds to two hours of real-time play. In most games SimPLe outperforms state-of-the-art model-free algorithms, in some games by over an order of magnitude.
Submitted 3 April, 2024; v1 submitted 1 March, 2019;
originally announced March 2019.
-
Area Attention
Authors:
Yang Li,
Lukasz Kaiser,
Samy Bengio,
Si Si
Abstract:
Existing attention mechanisms are trained to attend to individual items in a collection (the memory) with a predefined, fixed granularity, e.g., a word token or an image grid. We propose area attention: a way to attend to areas in the memory, where each area contains a group of items that are structurally adjacent, e.g., spatially for a 2D memory such as images, or temporally for a 1D memory such as natural language sentences. Importantly, the shape and the size of an area are dynamically determined via learning, which enables a model to attend to information with varying granularity. Area attention can easily work with existing model architectures such as multi-head attention for simultaneously attending to multiple areas in the memory. We evaluate area attention on two tasks: neural machine translation (both character and token-level) and image captioning, and improve upon strong (state-of-the-art) baselines in all the cases. These improvements are obtainable with a basic form of area attention that is parameter free.
Submitted 7 May, 2020; v1 submitted 23 October, 2018;
originally announced October 2018.
-
Co-Arg: Cogent Argumentation with Crowd Elicitation
Authors:
Mihai Boicu,
Dorin Marcu,
Gheorghe Tecuci,
Lou Kaiser,
Chirag Uttamsingh,
Navya Kalale
Abstract:
This paper presents Co-Arg, a new type of cognitive assistant to an intelligence analyst that enables the synergistic integration of analyst imagination and expertise, computer knowledge and critical reasoning, and crowd wisdom, to draw defensible and persuasive conclusions from masses of evidence of all types, in a world that is changing all the time. Co-Arg's goal is to improve the quality of the analytic results and enhance their understandability for both experts and novices. The performed analysis is based on a sound and transparent argumentation that links evidence to conclusions in a way that shows very clearly how the conclusions have been reached, what evidence was used and how, what is not known, and what assumptions have been made. The analytic results are presented in a report that describes the analytic conclusion and its probability, the main favoring and disfavoring arguments, the justification of the key judgments and assumptions, and the missing information that might increase the accuracy of the solution.
Submitted 2 October, 2018;
originally announced October 2018.
-
Universal Transformers
Authors:
Mostafa Dehghani,
Stephan Gouws,
Oriol Vinyals,
Jakob Uszkoreit,
Łukasz Kaiser
Abstract:
Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. However, their inherently sequential computation makes them slow to train. Feed-forward and convolutional architectures have recently been shown to achieve superior results on some sequence modeling tasks such as machine translation, with the added advantage that they concurrently process all inputs in the sequence, leading to easy parallelization and faster training times. Despite these successes, however, popular feed-forward sequence models like the Transformer fail to generalize in many simple tasks that recurrent models handle with ease, e.g. copying strings or even simple logical inference when the string or formula lengths exceed those observed at training time. We propose the Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses these issues. UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. We also add a dynamic per-position halting mechanism and find that it improves accuracy on several tasks. In contrast to the standard Transformer, under certain assumptions, UTs can be shown to be Turing-complete. Our experiments show that UTs outperform standard Transformers on a wide range of algorithmic and language understanding tasks, including the challenging LAMBADA language modeling task where UTs achieve a new state of the art, and machine translation where UTs achieve a 0.9 BLEU improvement over Transformers on the WMT14 En-De dataset.
Submitted 5 March, 2019; v1 submitted 10 July, 2018;
originally announced July 2018.
-
Tensor2Tensor for Neural Machine Translation
Authors:
Ashish Vaswani,
Samy Bengio,
Eugene Brevdo,
Francois Chollet,
Aidan N. Gomez,
Stephan Gouws,
Llion Jones,
Łukasz Kaiser,
Nal Kalchbrenner,
Niki Parmar,
Ryan Sepassi,
Noam Shazeer,
Jakob Uszkoreit
Abstract:
Tensor2Tensor is a library for deep learning models that is well-suited for neural machine translation and includes the reference implementation of the state-of-the-art Transformer model.
Submitted 16 March, 2018;
originally announced March 2018.
-
Fast Decoding in Sequence Models using Discrete Latent Variables
Authors:
Łukasz Kaiser,
Aurko Roy,
Ashish Vaswani,
Niki Parmar,
Samy Bengio,
Jakob Uszkoreit,
Noam Shazeer
Abstract:
Autoregressive sequence models based on deep neural networks, such as RNNs, WaveNet and the Transformer, attain state-of-the-art results on many tasks. However, they are difficult to parallelize and are thus slow at processing long sequences. RNNs lack parallelism both during training and decoding, while architectures like WaveNet and Transformer are much more parallelizable during training, yet still operate sequentially during decoding.
Inspired by [arxiv:1711.00937], we present a method to extend sequence models using discrete latent variables that makes decoding much more parallelizable. We first auto-encode the target sequence into a shorter sequence of discrete latent variables, which at inference time is generated autoregressively, and finally decode the output sequence from this shorter latent sequence in parallel. To this end, we introduce a novel method for constructing a sequence of discrete latent variables and compare it with previously introduced methods. Finally, we evaluate our model end-to-end on the task of neural machine translation, where it is an order of magnitude faster at decoding than comparable autoregressive models. While lower in BLEU than purely autoregressive models, our model achieves higher scores than previously proposed non-autoregressive translation models.
Submitted 7 June, 2018; v1 submitted 8 March, 2018;
originally announced March 2018.
-
Image Transformer
Authors:
Niki Parmar,
Ashish Vaswani,
Jakob Uszkoreit,
Łukasz Kaiser,
Noam Shazeer,
Alexander Ku,
Dustin Tran
Abstract:
Image generation has been successfully cast as an autoregressive sequence generation or transformation problem. Recent work has shown that self-attention is an effective way of modeling textual sequences. In this work, we generalize a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood. By restricting the self-attention mechanism to attend to local neighborhoods we significantly increase the size of images the model can process in practice, despite maintaining significantly larger receptive fields per layer than typical convolutional neural networks. While conceptually simple, our generative models significantly outperform the current state of the art in image generation on ImageNet, improving the best published negative log-likelihood on ImageNet from 3.83 to 3.77. We also present results on image super-resolution with a large magnification ratio, applying an encoder-decoder configuration of our architecture. In a human evaluation study, we find that images generated by our super-resolution model fool human observers three times more often than the previous state of the art.
Submitted 15 June, 2018; v1 submitted 15 February, 2018;
originally announced February 2018.
-
Generating Wikipedia by Summarizing Long Sequences
Authors:
Peter J. Liu,
Mohammad Saleh,
Etienne Pot,
Ben Goodrich,
Ryan Sepassi,
Lukasz Kaiser,
Noam Shazeer
Abstract:
We show that generating English Wikipedia articles can be approached as a multi-document summarization of source documents. We use extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. For the abstractive model, we introduce a decoder-only architecture that can scalably attend to very long sequences, much longer than typical encoder-decoder architectures used in sequence transduction. We show that this model can generate fluent, coherent multi-sentence paragraphs and even whole Wikipedia articles. When given reference documents, we show it can extract relevant factual information as reflected in perplexity, ROUGE scores and human evaluations.
Submitted 30 January, 2018;
originally announced January 2018.
-
Discrete Autoencoders for Sequence Models
Authors:
Łukasz Kaiser,
Samy Bengio
Abstract:
Recurrent models for sequences have been recently successful at many tasks, especially for language modeling and machine translation. Nevertheless, it remains challenging to extract good representations from these models. For instance, even though language has a clear hierarchical structure going from characters through words to sentences, it is not apparent in current language models. We propose to improve the representation in sequence models by augmenting current approaches with an autoencoder that is forced to compress the sequence through an intermediate discrete latent space. In order to propagate gradients through this discrete representation we introduce an improved semantic hashing technique. We show that this technique performs well on a newly proposed quantitative efficiency measure. We also analyze latent codes produced by the model showing how they correspond to words and phrases. Finally, we present an application of the autoencoder-augmented model to generating diverse translations.
Submitted 29 January, 2018;
originally announced January 2018.
-
Unsupervised Cipher Cracking Using Discrete GANs
Authors:
Aidan N. Gomez,
Sicong Huang,
Ivan Zhang,
Bryan M. Li,
Muhammad Osama,
Lukasz Kaiser
Abstract:
This work details CipherGAN, an architecture inspired by CycleGAN used for inferring the underlying cipher mapping given banks of unpaired ciphertext and plaintext. We demonstrate that CipherGAN is capable of cracking language data enciphered using shift and Vigenere ciphers to a high degree of fidelity and for vocabularies much larger than previously achieved. We present how CycleGAN can be made compatible with discrete data and train in a stable way. We then prove that the technique used in CipherGAN avoids the common problem of uninformative discrimination associated with GANs applied to discrete data.
Submitted 15 January, 2018;
originally announced January 2018.
-
One Model To Learn Them All
Authors:
Lukasz Kaiser,
Aidan N. Gomez,
Noam Shazeer,
Ashish Vaswani,
Niki Parmar,
Llion Jones,
Jakob Uszkoreit
Abstract:
Deep learning yields great results across many fields, from speech recognition, image classification, to translation. But for each problem, getting a deep model to work well involves research into the architecture and a long period of tuning. We present a single model that yields good results on a number of problems spanning multiple domains. In particular, this single model is trained concurrently on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task. Our model architecture incorporates building blocks from multiple domains. It contains convolutional layers, an attention mechanism, and sparsely-gated layers. Each of these computational blocks is crucial for a subset of the tasks we train on. Interestingly, even if a block is not crucial for a task, we observe that adding it never hurts performance and in most cases improves it on all tasks. We also show that tasks with less data benefit largely from joint training with other tasks, while performance on large tasks degrades only slightly if at all.
Submitted 15 June, 2017;
originally announced June 2017.
-
Attention Is All You Need
Authors:
Ashish Vaswani,
Noam Shazeer,
Niki Parmar,
Jakob Uszkoreit,
Llion Jones,
Aidan N. Gomez,
Lukasz Kaiser,
Illia Polosukhin
Abstract:
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Submitted 1 August, 2023; v1 submitted 12 June, 2017;
originally announced June 2017.
-
Depthwise Separable Convolutions for Neural Machine Translation
Authors:
Lukasz Kaiser,
Aidan N. Gomez,
Francois Chollet
Abstract:
Depthwise separable convolutions reduce the number of parameters and computation used in convolutional operations while increasing representational efficiency. They have been shown to be successful in image classification models, both in obtaining better models than previously possible for a given parameter count (the Xception architecture) and considerably reducing the number of parameters required to perform at a given level (the MobileNets family of architectures). Recently, convolutional sequence-to-sequence networks have been applied to machine translation tasks with good results. In this work, we study how depthwise separable convolutions can be applied to neural machine translation. We introduce a new architecture inspired by Xception and ByteNet, called SliceNet, which enables a significant reduction of the parameter count and amount of computation needed to obtain results like ByteNet, and, with a similar parameter count, achieves new state-of-the-art results. In addition to showing that depthwise separable convolutions perform well for machine translation, we investigate the architectural changes that they enable: we observe that thanks to depthwise separability, we can increase the length of convolution windows, removing the need for filter dilation. We also introduce a new "super-separable" convolution operation that further reduces the number of parameters and computational cost for obtaining state-of-the-art results.
Submitted 15 June, 2017; v1 submitted 9 June, 2017;
originally announced June 2017.
-
Learning to Remember Rare Events
Authors:
Łukasz Kaiser,
Ofir Nachum,
Aurko Roy,
Samy Bengio
Abstract:
Despite recent advances, memory-augmented deep neural networks are still limited when it comes to life-long and one-shot learning, especially in remembering rare events. We present a large-scale life-long memory module for use in deep learning. The module exploits fast nearest-neighbor algorithms for efficiency and thus scales to large memory sizes. Except for the nearest-neighbor query, the module is fully differentiable and trained end-to-end with no extra supervision. It operates in a life-long manner, i.e., without the need to reset it during training.
Our memory module can be easily added to any part of a supervised neural network. To show its versatility we add it to a number of networks, from simple convolutional ones tested on image classification to deep sequence-to-sequence and recurrent-convolutional models. In all cases, the enhanced network gains the ability to remember and do life-long one-shot learning. Our module remembers training examples shown many thousands of steps in the past and it can successfully generalize from them. We set new state-of-the-art for one-shot learning on the Omniglot dataset and demonstrate, for the first time, life-long one-shot learning in recurrent neural networks on a large-scale machine translation task.
Submitted 8 March, 2017;
originally announced March 2017.
-
Random Spatial Networks: Small Worlds without Clustering, Traveling Waves, and Hop-and-Spread Disease Dynamics
Authors:
John Lang,
Hans De Sterck,
Jamieson L. Kaiser,
Joel C. Miller
Abstract:
Random network models play a prominent role in modeling, analyzing and understanding complex phenomena on real-life networks. However, a key property of networks is often neglected: many real-world networks exhibit spatial structure, the tendency of a node to select neighbors with a probability depending on physical distance. Here, we introduce a class of random spatial networks (RSNs) which generalizes many existing random network models but adds spatial structure. In these networks, nodes are placed randomly in space and joined in edges with a probability depending on their distance and their individual expected degrees, in a manner that crucially remains analytically tractable. We use this network class to propose a new generalization of small-world networks, where the average shortest path lengths in the graph are small, as in classical Watts-Strogatz small-world networks, but with close spatial proximity of nodes that are neighbors in the network playing the role of large clustering. Small-world effects are demonstrated on these spatial small-world networks without clustering. We are able to derive partial integro-differential equations governing susceptible-infectious-recovered disease spreading through an RSN, and we demonstrate the existence of traveling wave solutions. If the distance kernel governing edge placement decays slower than exponential, the population-scale dynamics are dominated by long-range hops followed by local spread of traveling waves. This provides a theoretical modeling framework for recent observations of how epidemics like Ebola evolve in modern connected societies, with long-range connections seeding new focal points from which the epidemic locally spreads in a wavelike manner.
Submitted 4 February, 2017;
originally announced February 2017.
-
Regularizing Neural Networks by Penalizing Confident Output Distributions
Authors:
Gabriel Pereyra,
George Tucker,
Jan Chorowski,
Łukasz Kaiser,
Geoffrey Hinton
Abstract:
We systematically explore regularizing neural networks by penalizing low entropy output distributions. We show that penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning. Furthermore, we connect a maximum entropy based confidence penalty to label smoothing through the direction of the KL divergence. We exhaustively evaluate the proposed confidence penalty and label smoothing on 6 common benchmarks: image classification (MNIST and Cifar-10), language modeling (Penn Treebank), machine translation (WMT'14 English-to-German), and speech recognition (TIMIT and WSJ). We find that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.
Submitted 23 January, 2017;
originally announced January 2017.
-
Can Active Memory Replace Attention?
Authors:
Łukasz Kaiser,
Samy Bengio
Abstract:
Several mechanisms to focus attention of a neural network on selected parts of its input or memory have been used successfully in deep learning models in recent years. Attention has improved image classification, image captioning, speech recognition, generative models, and learning algorithmic tasks, but it had probably the largest impact on neural machine translation.
Recently, similar improvements have been obtained using alternative mechanisms that do not focus on a single part of a memory but operate on all of it in parallel, in a uniform way. Such a mechanism, which we call active memory, improved over attention in algorithmic tasks, image processing, and in generative modelling.
So far, however, active memory has not improved over attention for most natural language processing tasks, in particular for machine translation. We analyze this shortcoming in this paper and propose an extended model of active memory that matches existing attention models on neural machine translation and generalizes better to longer sentences. We investigate this model and explain why previous active memory models did not succeed. Finally, we discuss when active memory brings most benefits and where attention can be a better choice.
Submitted 6 March, 2017; v1 submitted 27 October, 2016;
originally announced October 2016.
-
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Authors:
Yonghui Wu,
Mike Schuster,
Zhifeng Chen,
Quoc V. Le,
Mohammad Norouzi,
Wolfgang Macherey,
Maxim Krikun,
Yuan Cao,
Qin Gao,
Klaus Macherey,
Jeff Klingner,
Apurva Shah,
Melvin Johnson,
Xiaobing Liu,
Łukasz Kaiser,
Stephan Gouws,
Yoshikiyo Kato,
Taku Kudo,
Hideto Kazawa,
Keith Stevens,
George Kurian,
Nishant Patil,
Wei Wang,
Cliff Young,
Jason Smith
, et al. (6 additional authors not shown)
Abstract:
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NMT's use in practical deployments and services, where both accuracy and speed are essential. In this work, we present GNMT, Google's Neural Machine Translation system, which attempts to address many of these issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder layers using attention and residual connections. To improve parallelism and therefore decrease training time, our attention mechanism connects the bottom layer of the decoder to the top layer of the encoder. To accelerate the final translation speed, we employ low-precision arithmetic during inference computations. To improve handling of rare words, we divide words into a limited set of common sub-word units ("wordpieces") for both input and output. This method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models, naturally handles translation of rare words, and ultimately improves the overall accuracy of the system. Our beam search technique employs a length-normalization procedure and uses a coverage penalty, which encourages generation of an output sentence that is most likely to cover all the words in the source sentence. On the WMT'14 English-to-French and English-to-German benchmarks, GNMT achieves competitive results to state-of-the-art. Using a human side-by-side evaluation on a set of isolated simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
Submitted 8 October, 2016; v1 submitted 26 September, 2016;
originally announced September 2016.
-
Machine Learning with Guarantees using Descriptive Complexity and SMT Solvers
Authors:
Charles Jordan,
Łukasz Kaiser
Abstract:
Machine learning is a thriving part of computer science. There are many efficient approaches to machine learning that do not provide strong theoretical guarantees, and a beautiful general learning theory. Unfortunately, machine learning approaches that give strong theoretical guarantees have not been efficient enough to be applicable. In this paper we introduce a logical approach to machine learning. Models are represented by tuples of logical formulas and inputs and outputs are logical structures. We present our framework together with several applications where we evaluate it using SAT and SMT solvers. We argue that this approach to machine learning is particularly suited to bridge the gap between efficiency and theoretical soundness. We exploit results from descriptive complexity theory to prove strong theoretical guarantees for our approach. To show its applicability, we present experimental results including learning complexity-theoretic reductions rules for board games. We also explain how neural networks fit into our framework, although the current implementation does not scale to provide guarantees for real-world neural networks.
Submitted 9 September, 2016;
originally announced September 2016.
-
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Authors:
Martín Abadi,
Ashish Agarwal,
Paul Barham,
Eugene Brevdo,
Zhifeng Chen,
Craig Citro,
Greg S. Corrado,
Andy Davis,
Jeffrey Dean,
Matthieu Devin,
Sanjay Ghemawat,
Ian Goodfellow,
Andrew Harp,
Geoffrey Irving,
Michael Isard,
Yangqing Jia,
Rafal Jozefowicz,
Lukasz Kaiser,
Manjunath Kudlur,
Josh Levenberg,
Dan Mane,
Rajat Monga,
Sherry Moore,
Derek Murray,
Chris Olah
, et al. (15 additional authors not shown)
Abstract:
TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.
Submitted 16 March, 2016; v1 submitted 14 March, 2016;
originally announced March 2016.
-
Neural GPUs Learn Algorithms
Authors:
Łukasz Kaiser,
Ilya Sutskever
Abstract:
Learning an algorithm from examples is a fundamental problem that has been widely studied. Recently it has been addressed using neural networks, in particular by Neural Turing Machines (NTMs). These are fully differentiable computers that use backpropagation to learn their own programming. Despite their appeal, NTMs have a weakness that is caused by their sequential nature: they are not parallel and are hard to train due to their large depth when unfolded.
We present a neural network architecture to address this problem: the Neural GPU. It is based on a type of convolutional gated recurrent unit and, like the NTM, is computationally universal. Unlike the NTM, the Neural GPU is highly parallel which makes it easier to train and efficient to run.
An essential property of algorithms is their ability to handle inputs of arbitrary size. We show that the Neural GPU can be trained on short instances of an algorithmic task and successfully generalize to long instances. We verified it on a number of tasks including long addition and long multiplication of numbers represented in binary. We train the Neural GPU on numbers with up to 20 bits and observe no errors whatsoever while testing it, even on much longer numbers.
To achieve these results we introduce a technique for training deep recurrent networks: parameter sharing relaxation. We also found a small amount of dropout and gradient noise to have a large positive effect on learning and generalization.
Submitted 14 March, 2016; v1 submitted 25 November, 2015;
originally announced November 2015.
-
Adding Gradient Noise Improves Learning for Very Deep Networks
Authors:
Arvind Neelakantan,
Luke Vilnis,
Quoc V. Le,
Ilya Sutskever,
Lukasz Kaiser,
Karol Kurach,
James Martens
Abstract:
Deep feedforward and recurrent networks have achieved impressive results in many perception and language processing applications. This success is partially attributed to architectural innovations such as convolutional and long short-term memory networks. The main motivation for these architectural innovations is that they capture better domain knowledge, and importantly are easier to optimize than more basic architectures. Recently, more complex architectures such as Neural Turing Machines and Memory Networks have been proposed for tasks including question answering and general computation, creating a new set of optimization challenges. In this paper, we discuss a low-overhead and easy-to-implement technique of adding gradient noise which we find to be surprisingly effective when training these very deep architectures. The technique not only helps to avoid overfitting, but also can result in lower training loss. This method alone allows a fully-connected 20-layer deep network to be trained with standard gradient descent, even starting from a poor initialization. We see consistent improvements for many complex models, including a 72% relative reduction in error rate over a carefully-tuned baseline on a challenging question-answering task, and a doubling of the number of accurate binary multiplication models learned across 7,000 random restarts. We encourage further application of this technique to additional complex modern architectures.
Submitted 20 November, 2015;
originally announced November 2015.
-
Multi-task Sequence to Sequence Learning
Authors:
Minh-Thang Luong,
Quoc V. Le,
Ilya Sutskever,
Oriol Vinyals,
Lukasz Kaiser
Abstract:
Sequence to sequence learning has recently emerged as a new paradigm in supervised learning. To date, most of its applications focused on only one task and not much work explored this framework for multiple tasks. This paper examines three multi-task learning (MTL) settings for sequence to sequence models: (a) the one-to-many setting - where the encoder is shared between several tasks such as machine translation and syntactic parsing, (b) the many-to-one setting - useful when only the decoder can be shared, as in the case of translation and image caption generation, and (c) the many-to-many setting - where multiple encoders and decoders are shared, which is the case with unsupervised objectives and translation. Our results show that training on a small amount of parsing and image caption data can improve the translation quality between English and German by up to 1.5 BLEU points over strong single-task baselines on the WMT benchmarks. Furthermore, we have established a new state-of-the-art result in constituent parsing with 93.0 F1. Lastly, we reveal interesting properties of the two unsupervised learning objectives, autoencoder and skip-thought, in the MTL context: autoencoder helps less in terms of perplexities but more on BLEU scores compared to skip-thought.
Submitted 1 March, 2016; v1 submitted 19 November, 2015;
originally announced November 2015.
-
Low-frequency type II radio detections and coronagraph data to describe and forecast the propagation of 71 CMEs/shocks
Authors:
H. Cremades,
F. A. Iglesias,
O. C. St. Cyr,
H. Xie,
M. L. Kaiser,
N. Gopalswamy
Abstract:
The vulnerability of technology on which present society relies demands that a solar event, its time of arrival at Earth, and its degree of geoeffectiveness be promptly forecasted. Motivated by improving predictions of arrival times at Earth of shocks driven by coronal mass ejections (CMEs), we have analyzed 71 Earth-directed events in different stages of their propagation. The study is primarily based on approximated locations of interplanetary (IP) shocks derived from type II radio emissions detected by the Wind/WAVES experiment during 1997-2007. Distance-time diagrams resulting from the combination of white-light corona, IP type II radio, and in situ data lead to the formulation of descriptive profiles of each CME's journey toward Earth. Furthermore, two different methods to track and predict the location of CME-driven IP shocks are presented. The linear method, solely based on Wind/WAVES data, arises after key modifications to a pre-existing technique that linearly projects the drifting low-frequency type II emissions to 1 AU. This upgraded method improves forecasts of shock arrival time by almost 50%. The second predictive method is proposed on the basis of information derived from the descriptive profiles, and relies on a single CME height-time point and on low-frequency type II radio emissions to obtain an approximate value of the shock arrival time at Earth. In addition, we discuss results on CME-radio emission associations, characteristics of IP propagation, and the relative success of the forecasting methods.
Submitted 7 May, 2015;
originally announced May 2015.
-
Grammar as a Foreign Language
Authors:
Oriol Vinyals,
Lukasz Kaiser,
Terry Koo,
Slav Petrov,
Ilya Sutskever,
Geoffrey Hinton
Abstract:
Syntactic constituency parsing is a fundamental problem in natural language processing and has been the subject of intensive research and engineering for decades. As a result, the most accurate parsers are domain specific, complex, and inefficient. In this paper we show that the domain agnostic attention-enhanced sequence-to-sequence model achieves state-of-the-art results on the most widely used syntactic constituency parsing dataset, when trained on a large synthetic corpus that was annotated using existing parsers. It also matches the performance of standard parsers when trained only on a small human-annotated dataset, which shows that this model is highly data-efficient, in contrast to sequence-to-sequence models without the attention mechanism. Our parser is also fast, processing over a hundred sentences per second with an unoptimized CPU implementation.
Submitted 9 June, 2015; v1 submitted 23 December, 2014;
originally announced December 2014.
-
Directed Width Measures and Monotonicity of Directed Graph Searching
Authors:
Łukasz Kaiser,
Stephan Kreutzer,
Roman Rabinovich,
Sebastian Siebertz
Abstract:
We consider generalisations of tree width to directed graphs, that attracted much attention in the last fifteen years. About their relative strength with respect to "bounded width in one measure implies bounded width in the other" many problems remain unsolved. Only some results separating directed width measures are known. We give an almost complete picture of this relation. For this, we consider the cops and robber games characterising DAG-width and directed tree width (up to a constant factor). For DAG-width games, it is an open question whether the robber-monotonicity cost (the difference between the minimal numbers of cops capturing the robber in the general and in the monotone case) can be bounded by any function. Examples show that this function (if it exists) is at least $f(k) > 4k/3$ (Kreutzer, Ordyniak 2008). We approach a solution by defining weak monotonicity and showing that if $k$ cops win weakly monotonically, then $O(k^2)$ cops win monotonically. It follows that bounded Kelly-width implies bounded DAG-width, which has been open since the definition of Kelly-width by Hunter and Kreutzer in 2008. For directed tree width games we show that, unexpectedly, the cop-monotonicity cost (no cop revisits any vertex) is not bounded by any function. This separates directed tree width from D-width defined by Safari in 2005, refuting his conjecture.
Submitted 20 August, 2014;
originally announced August 2014.
-
Model Checking the Quantitative mu-Calculus on Linear Hybrid Systems
Authors:
Diana Fischer,
Lukasz Kaiser
Abstract:
We study the model-checking problem for a quantitative extension of the modal mu-calculus on a class of hybrid systems. Qualitative model checking has been proved decidable and implemented for several classes of systems, but this is not the case for quantitative questions that arise naturally in this context. Recently, quantitative formalisms that subsume classical temporal logics and allow the measurement of interesting quantitative phenomena were introduced. We show how a powerful quantitative logic, the quantitative mu-calculus, can be model checked with arbitrary precision on initialised linear hybrid systems. To this end, we develop new techniques for the discretisation of continuous state spaces based on a special class of strategies in model-checking games and present a reduction to a class of counter parity games.
Submitted 19 September, 2012; v1 submitted 8 September, 2012;
originally announced September 2012.
-
Degrees of Lookahead in Regular Infinite Games
Authors:
Michael Holtmann,
Lukasz Kaiser,
Wolfgang Thomas
Abstract:
We study variants of regular infinite games where the strict alternation of moves between the two players is subject to modifications. The second player may postpone a move for a finite number of steps, or, in other words, exploit in his strategy some lookahead on the moves of the opponent. This captures situations in distributed systems, e.g. when buffers are present in communication or when signal transmission between components is deferred. We distinguish strategies with different degrees of lookahead, among them being the continuous and the bounded lookahead strategies. In the first case the lookahead is of finite possibly unbounded size, whereas in the second case it is of bounded size. We show that for regular infinite games the solvability by continuous strategies is decidable, and that a continuous strategy can always be reduced to one of bounded lookahead. Moreover, this lookahead is at most doubly exponential in the size of a given parity automaton recognizing the winning condition. We also show that the result fails for non-regular games where the winning condition is given by a context-free omega-language.
Submitted 25 September, 2012; v1 submitted 4 September, 2012;
originally announced September 2012.
-
Radio-loud CMEs from the disk center lacking shocks at 1 AU
Authors:
N. Gopalswamy,
P. Makela,
S. Akiyama,
S. Yashiro,
H. Xie,
R. J. MacDowall,
M. L. Kaiser
Abstract:
A coronal mass ejection (CME) associated with a type II burst and originating close to the center of the solar disk typically results in a shock at Earth in 2-3 days and hence can be used to predict shock arrival at Earth. However, a significant fraction (about 28%) of such CMEs producing type II bursts were not associated with shocks at Earth. We examined a set of 21 type II bursts observed by the Wind/WAVES experiment at decameter-hectometric (DH) wavelengths that had CME sources very close to the disk center (within a central meridian distance of 30 degrees), but did not have a shock at Earth. We find that the near-Sun speeds of these CMEs average to ~644 km/s, only slightly higher than the average speed of CMEs associated with radio-quiet shocks. However, the fraction of halo CMEs is only ~30%, compared to 54% for the radio-quiet shocks and 91% for all radio-loud shocks. We conclude that the disk-center radio-loud CMEs with no shocks at 1 AU are generally of lower energy and they drive shocks only close to the Sun and dissipate before arriving at Earth. There is also evidence for other possible processes that lead to the lack of shock at 1 AU: (i) overtaking CME shocks merge and one observes a single shock at Earth, and (ii) deflection by nearby coronal holes can push the shocks away from the Sun-Earth line, such that Earth misses these shocks. The probability of observing a shock at 1 AU increases rapidly above 60% when the CME speed exceeds 1000 km/s and when the type II bursts propagate to frequencies below 1 MHz.
Submitted 29 June, 2012;
originally announced July 2012.
-
Interplanetary shocks lacking type II radio bursts
Authors:
N. Gopalswamy,
H. Xie,
P. Makela,
S. Akiyama,
S. Yashiro,
M. L. Kaiser,
R. A. Howard,
J. -L. Bougeret
Abstract:
We report on the radio-emission characteristics of 222 interplanetary (IP) shocks. A surprisingly large fraction of the IP shocks (~34%) is radio quiet (i.e., the shocks lacked type II radio bursts). The CMEs associated with the RQ shocks are generally slow (average speed ~535 km/s) and only ~40% of the CMEs were halos. The corresponding numbers for CMEs associated with radio loud (RL) shocks are 1237 km/s and 72%, respectively. The RQ shocks are also accompanied by lower peak soft X-ray flux. CMEs associated with RQ (RL) shocks are generally accelerating (decelerating). The kinematics of CMEs associated with the km type II bursts is similar to those of RQ shocks, except that the former are slightly more energetic. Comparison of the shock The RQ shocks seem to be mostly subcritical and quasi-perpendicular. The radio-quietness is predominant in the rise phase and decreases through the maximum and declining phases of solar cycle 23. The solar sources of the shock-driving CMEs follow the sunspot butterfly diagram, consistent with the higher-energy requirement for driving shocks.
Submitted 15 January, 2010; v1 submitted 23 December, 2009;
originally announced December 2009.
-
Dust detection by the wave instrument on STEREO: nanoparticles picked up by the solar wind?
Authors:
N. Meyer-Vernet,
M. Maksimovic,
A. Czechowski,
I. Mann,
I. Zouganelis,
K. Goetz,
M. L. Kaiser,
O. C. St. Cyr,
J. L. Bougeret,
S. D. Bale
Abstract:
The STEREO/WAVES instrument has detected a very large number of intense voltage pulses. We suggest that these events are produced by impact ionisation of nanoparticles striking the spacecraft at a velocity of the order of magnitude of the solar wind speed. Nanoparticles, which are half-way between micron-sized dust and atomic ions, have such a large charge-to-mass ratio that the electric field induced by the solar wind magnetic field accelerates them very efficiently. Since the voltage produced by dust impacts increases very fast with speed, such nanoparticles produce signals as high as do much larger grains of smaller speeds. The flux of 10-nm radius grains inferred in this way is compatible with the interplanetary dust flux model. The present results may represent the first detection of fast nanoparticles in interplanetary space near Earth orbit.
Submitted 4 April, 2009; v1 submitted 24 March, 2009;
originally announced March 2009.