Showing 1–50 of 52 results for author: Kaiser, L

  1. arXiv:2403.05713  [pdf, other]

    cs.LG

    tsGT: Stochastic Time Series Modeling With Transformer

    Authors: Łukasz Kuciński, Witold Drzewakowski, Mateusz Olko, Piotr Kozakowski, Łukasz Maziarka, Marta Emilia Nowakowska, Łukasz Kaiser, Piotr Miłoś

    Abstract: Time series methods are of fundamental importance in virtually any field of science that deals with temporally structured data. Recently, there has been a surge of deterministic transformer models with time series-specific architectural biases. In this paper, we go in a different direction by introducing tsGT, a stochastic time series model built on a general-purpose transformer architecture. We f…

    Submitted 3 April, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  2. arXiv:2402.02304  [pdf, other]

    math.AP cs.LG

    Efficient Numerical Wave Propagation Enhanced By An End-to-End Deep Learning Model

    Authors: Luis Kaiser, Richard Tsai, Christian Klingenberg

    Abstract: Recent advances in wave modeling use sufficiently accurate fine solver outputs to train a neural network that enhances the accuracy of a fast but inaccurate coarse solver. In this paper we build upon the work of Nguyen and Tsai (2023) and present a novel unified system that integrates a numerical solver with a deep learning component into an end-to-end framework. In the proposed setting, we invest…

    Submitted 13 February, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

  3. arXiv:2309.02860  [pdf, other]

    astro-ph.HE astro-ph.GA

    Stochastic modelling of cosmic ray sources for diffuse high-energy gamma-rays and neutrinos

    Authors: Anton Stall, Leonard Kaiser, Philipp Mertsch

    Abstract: Cosmic rays of energies up to a few PeV are believed to be of galactic origin, yet individual sources have still not been firmly identified. Due to inelastic collisions with the interstellar gas, cosmic-ray nuclei produce a diffuse flux of high-energy gamma-rays and neutrinos. Fermi-LAT has provided maps of galactic gamma-rays at GeV energies which can be produced by both hadronic and leptonic pro…

    Submitted 6 September, 2023; originally announced September 2023.

    Comments: 8 pages, 4 figures, Presented at the 38th International Cosmic Ray Conference (ICRC2023)

    Journal ref: PoS ICRC2023 (2023) 687

  4. arXiv:2303.08774  [pdf, other]

    cs.CL cs.AI

    GPT-4 Technical Report

    Authors: OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, et al. (256 additional authors not shown)

    Abstract: We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based mo…

    Submitted 4 March, 2024; v1 submitted 15 March, 2023; originally announced March 2023.

    Comments: 100 pages; updated authors list; fixed author names and added citation

  5. arXiv:2211.13888  [pdf, other]

    physics.flu-dyn

    Modelling the response of a turbulent jet flame to acoustic forcing in a linearized framework using an active flame approach

    Authors: Thomas Ludwig Kaiser, Gregoire Varillon, Wolfgang Polifke, Feichi Zhang, Thorsten Zirwes, Henning Bockhorn, Kilian Oberleithner

    Abstract: This study performs a linear analysis of a turbulent reacting methane-air jet flame, with the goal of predicting the response of the reacting flow to upstream acoustic actuation. Accounting for heat release fluctuations is a vital component when investigating thermoacoustic instabilities and flame noise in a linearized framework. Unlike previous studies this work develops and applies an active fla…

    Submitted 1 December, 2022; v1 submitted 24 November, 2022; originally announced November 2022.

    MSC Class: 80A32 (Primary) 80A25; 80A19; 76F25; 76F80 (Secondary)

  6. arXiv:2208.03109  [pdf, other]

    physics.flu-dyn

    Mean flow data assimilation based on physics-informed neural networks

    Authors: Jakob G. R. von Saldern, Johann Moritz Reumschüssel, Thomas L. Kaiser, Moritz Sieber, Kilian Oberleithner

    Abstract: Physics-informed neural networks (PINNs) can be used to solve partial differential equations (PDEs) and identify hidden variables by incorporating the governing equations into neural network training. In this study, we apply PINNs to the assimilation of turbulent mean flow data and investigate the method's ability to identify inaccessible variables and closure terms from sparse data. Using high-fi…

    Submitted 8 December, 2022; v1 submitted 5 August, 2022; originally announced August 2022.

  7. arXiv:2111.12763  [pdf, other]

    cs.LG cs.CL

    Sparse is Enough in Scaling Transformers

    Authors: Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva

    Abstract: Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study sparse variants for all layers in the Transformer and propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to sca…

    Submitted 24 November, 2021; originally announced November 2021.

    Comments: NeurIPS 2021

  8. arXiv:2111.03728  [pdf]

    cs.AI

    Shared Model of Sense-making for Human-Machine Collaboration

    Authors: Gheorghe Tecuci, Dorin Marcu, Louis Kaiser, Mihai Boicu

    Abstract: We present a model of sense-making that greatly facilitates the collaboration between an intelligent analyst and a knowledge-based agent. It is a general model grounded in the science of evidence and the scientific method of hypothesis generation and testing, where sense-making hypotheses that explain an observation are generated, relevant evidence is then discovered, and the hypotheses are tested…

    Submitted 5 November, 2021; originally announced November 2021.

    Comments: Presented at AAAI FSS-21: Artificial Intelligence in Government and Public Sector, Washington, DC, USA

  9. arXiv:2110.14168  [pdf, other]

    cs.LG cs.CL

    Training Verifiers to Solve Math Word Problems

    Authors: Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman

    Abstract: State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high tes…

    Submitted 17 November, 2021; v1 submitted 27 October, 2021; originally announced October 2021.

  10. arXiv:2110.13711  [pdf, other]

    cs.LG cs.CL

    Hierarchical Transformers Are More Efficient Language Models

    Authors: Piotr Nawrot, Szymon Tworkowski, Michał Tyrolski, Łukasz Kaiser, Yuhuai Wu, Christian Szegedy, Henryk Michalewski

    Abstract: Transformer models yield impressive results on many NLP and sequence modeling tasks. Remarkably, Transformers can handle long sequences which allows them to produce long coherent outputs: full paragraphs produced by GPT-3 or well-structured images produced by DALL-E. These large language models are impressive but also very inefficient and costly, which limits their applications and accessibility.…

    Submitted 16 April, 2022; v1 submitted 26 October, 2021; originally announced October 2021.

  11. Measuring the photoelectron emission delay in the molecular frame

    Authors: Jonas Rist, Kim Klyssek, Nikolay M. Novikovskiy, Max Kircher, Isabel Vela-Pérez, Daniel Trabert, Sven Grundmann, Dimitrios Tsitsonis, Juliane Siebert, Angelina Geyer, Niklas Melzer, Christian Schwarz, Nils Anders, Leon Kaiser, Kilian Fehre, Alexander Hartung, Sebastian Eckart, Lothar Ph. H. Schmidt, Markus S. Schöffler, Vernon T. Davis, Joshua B. Williams, Florian Trinter, Reinhard Dörner, Philipp V. Demekhin, Till Jahnke

    Abstract: If matter absorbs a photon of sufficient energy it emits an electron. The question of the duration of the emission process has intrigued scientists for decades. With the advent of attosecond metrology, experiments addressing such ultrashort intervals became possible. While these types of studies require attosecond experimental precision, we present here a novel measurement approach that avoids tho…

    Submitted 13 July, 2021; originally announced July 2021.

    Journal ref: Nat Commun 12, 6657 (2021)

  12. arXiv:2107.03374  [pdf, other]

    cs.LG

    Evaluating Large Language Models Trained on Code

    Authors: Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, et al. (33 additional authors not shown)

    Abstract: We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J sol…

    Submitted 14 July, 2021; v1 submitted 7 July, 2021; originally announced July 2021.

    Comments: corrected typos, added references, added authors, added acknowledgements

  13. arXiv:2102.06782  [pdf, other]

    cs.LG

    Q-Value Weighted Regression: Reinforcement Learning with Limited Data

    Authors: Piotr Kozakowski, Łukasz Kaiser, Henryk Michalewski, Afroz Mohiuddin, Katarzyna Kańska

    Abstract: Sample efficiency and performance in the offline setting have emerged as significant challenges of deep reinforcement learning. We introduce Q-Value Weighted Regression (QWR), a simple RL algorithm that excels in these aspects. QWR is an extension of Advantage Weighted Regression (AWR), an off-policy actor-critic algorithm that performs very well on continuous control tasks, also in the offline se…

    Submitted 12 February, 2021; originally announced February 2021.

  14. Zeptosecond Birth Time Delay in Molecular Photoionization

    Authors: Sven Grundmann, Daniel Trabert, Kilian Fehre, Nico Strenger, Andreas Pier, Leon Kaiser, Max Kircher, Miriam Weller, Sebastian Eckart, Lothar Ph. H. Schmidt, Florian Trinter, Till Jahnke, Markus S. Schöffler, Reinhard Dörner

    Abstract: Photoionization is one of the fundamental light-matter interaction processes in which the absorption of a photon launches the escape of an electron. The time scale of the process poses many open questions. Experiments found time delays in the attosecond ($10^{-18}$ s) domain between electron ejection from different orbitals, electronic bands, or in different directions. Here, we demonstrate that a…

    Submitted 16 October, 2020; originally announced October 2020.

    Journal ref: Science 16 Oct 2020: Vol. 370, Issue 6514, pp. 339-341

  15. arXiv:2009.14794  [pdf, other]

    cs.LG cs.CL stat.ML

    Rethinking Attention with Performers

    Authors: Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller

    Abstract: We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random featu…

    Submitted 19 November, 2022; v1 submitted 30 September, 2020; originally announced September 2020.

    Comments: Published as a conference paper + oral presentation at ICLR 2021. 38 pages. See https://github.com/google-research/google-research/tree/master/protein_lm for protein language model code, and https://github.com/google-research/google-research/tree/master/performer for Performer code. See https://ai.googleblog.com/2020/10/rethinking-attention-with-performers.html for Google AI Blog

  16. Revealing the Two-Electron Cusp in the Ground States of He and H2 via Quasifree Double Photoionization

    Authors: S. Grundmann, V. Serov, F. Trinter, K. Fehre, N. Strenger, A. Pier, M. Kircher, D. Trabert, M. Weller, J. Rist, L. Kaiser, A. W. Bray, L. Ph. H. Schmidt, J. B. Williams, T. Jahnke, R. Dörner, M. S. Schöffler, A. S. Kheifets

    Abstract: We report on kinematically complete measurements and ab initio non-perturbative calculations of double ionization of He and H2 by a single 800 eV circularly polarized photon. We confirm the quasifree mechanism of photoionization for H2 and show how it originates from the two-electron cusp in the ground state of a two-electron target. Our approach establishes a new method for mapping electrons rela…

    Submitted 1 July, 2020; v1 submitted 21 January, 2020; originally announced January 2020.

    Comments: 7 pages, 4 figures

    Journal ref: Phys. Rev. Research 2, 033080 (2020)

  17. arXiv:2001.04451  [pdf, other]

    cs.LG cs.CL stat.ML

    Reformer: The Efficient Transformer

    Authors: Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya

    Abstract: Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O($L^2$) to O($L\log L$), where $L$ is…

    Submitted 18 February, 2020; v1 submitted 13 January, 2020; originally announced January 2020.

    Comments: ICLR 2020

  18. arXiv:1906.04331  [pdf, other]

    cs.CL cs.LG

    Parallel Scheduled Sampling

    Authors: Daniel Duckworth, Arvind Neelakantan, Ben Goodrich, Lukasz Kaiser, Samy Bengio

    Abstract: Auto-regressive models are widely used in sequence generation problems. The output sequence is typically generated in a predetermined order, one discrete unit (pixel or word or character) at a time. The models are trained by teacher-forcing where ground-truth history is fed to the model as input, which at test time is replaced by the model prediction. Scheduled Sampling aims to mitigate this discr…

    Submitted 21 October, 2019; v1 submitted 10 June, 2019; originally announced June 2019.

    Comments: 2nd submission

  19. arXiv:1905.08836  [pdf, other]

    cs.CL

    Sample Efficient Text Summarization Using a Single Pre-Trained Transformer

    Authors: Urvashi Khandelwal, Kevin Clark, Dan Jurafsky, Lukasz Kaiser

    Abstract: Language model (LM) pre-training has resulted in impressive performance and sample efficiency on a variety of language understanding tasks. However, it remains unclear how to best use pre-trained LMs for generation tasks such as abstractive summarization, particularly to enhance sample efficiency. In these sequence-to-sequence settings, prior work has experimented with loading pre-trained weights…

    Submitted 21 May, 2019; originally announced May 2019.

  20. arXiv:1903.00374  [pdf, other]

    cs.LG stat.ML

    Model-Based Reinforcement Learning for Atari

    Authors: Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, Henryk Michalewski

    Abstract: Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction -- substantially more, in fact, than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and…

    Submitted 3 April, 2024; v1 submitted 1 March, 2019; originally announced March 2019.

  21. arXiv:1810.10126  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    Area Attention

    Authors: Yang Li, Lukasz Kaiser, Samy Bengio, Si Si

    Abstract: Existing attention mechanisms are trained to attend to individual items in a collection (the memory) with a predefined, fixed granularity, e.g., a word token or an image grid. We propose area attention: a way to attend to areas in the memory, where each area contains a group of items that are structurally adjacent, e.g., spatially for a 2D memory such as images, or temporally for a 1D memory such…

    Submitted 7 May, 2020; v1 submitted 23 October, 2018; originally announced October 2018.

    Comments: @InProceedings{pmlr-v97-li19e, title = {Area Attention}, author = {Li, Yang and Kaiser, Lukasz and Bengio, Samy and Si, Si}, booktitle = {Proceedings of the 36th International Conference on Machine Learning}, pages = {3846--3855}, year = {2019}, volume = {97}, series = {Proceedings of Machine Learning Research}, publisher = {PMLR} }

    Journal ref: ICML 2019

  22. arXiv:1810.01541  [pdf]

    cs.AI

    Co-Arg: Cogent Argumentation with Crowd Elicitation

    Authors: Mihai Boicu, Dorin Marcu, Gheorghe Tecuci, Lou Kaiser, Chirag Uttamsingh, Navya Kalale

    Abstract: This paper presents Co-Arg, a new type of cognitive assistant to an intelligence analyst that enables the synergistic integration of analyst imagination and expertise, computer knowledge and critical reasoning, and crowd wisdom, to draw defensible and persuasive conclusions from masses of evidence of all types, in a world that is changing all the time. Co-Arg's goal is to improve the quality of th…

    Submitted 2 October, 2018; originally announced October 2018.

    Comments: Presented at AAAI FSS-18: Artificial Intelligence in Government and Public Sector, Arlington, Virginia, USA

  23. arXiv:1807.03819  [pdf, other]

    cs.CL cs.LG stat.ML

    Universal Transformers

    Authors: Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Łukasz Kaiser

    Abstract: Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. However, their inherently sequential computation makes them slow to train. Feed-forward and convolutional architectures have recently been shown to achieve superior results on some sequence modeling tasks such as machine tr…

    Submitted 5 March, 2019; v1 submitted 10 July, 2018; originally announced July 2018.

    Comments: Published at ICLR2019

  24. arXiv:1803.07416  [pdf, other]

    cs.LG cs.CL stat.ML

    Tensor2Tensor for Neural Machine Translation

    Authors: Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, Jakob Uszkoreit

    Abstract: Tensor2Tensor is a library for deep learning models that is well-suited for neural machine translation and includes the reference implementation of the state-of-the-art Transformer model.

    Submitted 16 March, 2018; originally announced March 2018.

    Comments: arXiv admin note: text overlap with arXiv:1706.03762

  25. arXiv:1803.03382  [pdf, other]

    cs.LG

    Fast Decoding in Sequence Models using Discrete Latent Variables

    Authors: Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, Noam Shazeer

    Abstract: Autoregressive sequence models based on deep neural networks, such as RNNs, WaveNet and the Transformer attain state-of-the-art results on many tasks. However, they are difficult to parallelize and are thus slow at processing long sequences. RNNs lack parallelism both during training and decoding, while architectures like WaveNet and Transformer are much more parallelizable during training, yet st…

    Submitted 7 June, 2018; v1 submitted 8 March, 2018; originally announced March 2018.

    Comments: ICML 2018

  26. arXiv:1802.05751  [pdf, other]

    cs.CV

    Image Transformer

    Authors: Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, Dustin Tran

    Abstract: Image generation has been successfully cast as an autoregressive sequence generation or transformation problem. Recent work has shown that self-attention is an effective way of modeling textual sequences. In this work, we generalize a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood. By…

    Submitted 15 June, 2018; v1 submitted 15 February, 2018; originally announced February 2018.

    Comments: Appears in International Conference on Machine Learning, 2018. Code available at https://github.com/tensorflow/tensor2tensor

  27. arXiv:1801.10198  [pdf, other]

    cs.CL

    Generating Wikipedia by Summarizing Long Sequences

    Authors: Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam Shazeer

    Abstract: We show that generating English Wikipedia articles can be approached as a multi-document summarization of source documents. We use extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. For the abstractive model, we introduce a decoder-only architecture that can scalably attend to very long sequences, much longer than typical enco…

    Submitted 30 January, 2018; originally announced January 2018.

    Comments: Published as a conference paper at ICLR 2018

  28. arXiv:1801.09797  [pdf, ps, other]

    cs.LG stat.ML

    Discrete Autoencoders for Sequence Models

    Authors: Łukasz Kaiser, Samy Bengio

    Abstract: Recurrent models for sequences have been recently successful at many tasks, especially for language modeling and machine translation. Nevertheless, it remains challenging to extract good representations from these models. For instance, even though language has a clear hierarchical structure going from characters through words to sentences, it is not apparent in current language models. We propose…

    Submitted 29 January, 2018; originally announced January 2018.

  29. arXiv:1801.04883  [pdf, other]

    cs.LG

    Unsupervised Cipher Cracking Using Discrete GANs

    Authors: Aidan N. Gomez, Sicong Huang, Ivan Zhang, Bryan M. Li, Muhammad Osama, Lukasz Kaiser

    Abstract: This work details CipherGAN, an architecture inspired by CycleGAN used for inferring the underlying cipher mapping given banks of unpaired ciphertext and plaintext. We demonstrate that CipherGAN is capable of cracking language data enciphered using shift and Vigenere ciphers to a high degree of fidelity and for vocabularies much larger than previously achieved. We present how CycleGAN can be made…

    Submitted 15 January, 2018; originally announced January 2018.

  30. arXiv:1706.05137  [pdf, other]

    cs.LG stat.ML

    One Model To Learn Them All

    Authors: Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, Jakob Uszkoreit

    Abstract: Deep learning yields great results across many fields, from speech recognition, image classification, to translation. But for each problem, getting a deep model to work well involves research into the architecture and a long period of tuning. We present a single model that yields good results on a number of problems spanning multiple domains. In particular, this single model is trained concurrentl…

    Submitted 15 June, 2017; originally announced June 2017.

  31. arXiv:1706.03762  [pdf, other]

    cs.CL cs.LG

    Attention Is All You Need

    Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

    Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experi…

    Submitted 1 August, 2023; v1 submitted 12 June, 2017; originally announced June 2017.

    Comments: 15 pages, 5 figures

  32. arXiv:1706.03059  [pdf, other]

    cs.CL cs.LG

    Depthwise Separable Convolutions for Neural Machine Translation

    Authors: Lukasz Kaiser, Aidan N. Gomez, Francois Chollet

    Abstract: Depthwise separable convolutions reduce the number of parameters and computation used in convolutional operations while increasing representational efficiency. They have been shown to be successful in image classification models, both in obtaining better models than previously possible for a given parameter count (the Xception architecture) and considerably reducing the number of parameters requir…

    Submitted 15 June, 2017; v1 submitted 9 June, 2017; originally announced June 2017.

  33. arXiv:1703.03129  [pdf, other]

    cs.LG

    Learning to Remember Rare Events

    Authors: Łukasz Kaiser, Ofir Nachum, Aurko Roy, Samy Bengio

    Abstract: Despite recent advances, memory-augmented deep neural networks are still limited when it comes to life-long and one-shot learning, especially in remembering rare events. We present a large-scale life-long memory module for use in deep learning. The module exploits fast nearest-neighbor algorithms for efficiency and thus scales to large memory sizes. Except for the nearest-neighbor query, the modul…

    Submitted 8 March, 2017; originally announced March 2017.

    Comments: Conference paper accepted for ICLR'17

  34. arXiv:1702.01252  [pdf, other]

    q-bio.QM nlin.PS physics.bio-ph physics.soc-ph

    Random Spatial Networks: Small Worlds without Clustering, Traveling Waves, and Hop-and-Spread Disease Dynamics

    Authors: John Lang, Hans De Sterck, Jamieson L. Kaiser, Joel C. Miller

    Abstract: Random network models play a prominent role in modeling, analyzing and understanding complex phenomena on real-life networks. However, a key property of networks is often neglected: many real-world networks exhibit spatial structure, the tendency of a node to select neighbors with a probability depending on physical distance. Here, we introduce a class of random spatial networks (RSNs) which gener…

    Submitted 4 February, 2017; originally announced February 2017.

  35. arXiv:1701.06548  [pdf, other]

    cs.NE cs.LG

    Regularizing Neural Networks by Penalizing Confident Output Distributions

    Authors: Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, Geoffrey Hinton

    Abstract: We systematically explore regularizing neural networks by penalizing low entropy output distributions. We show that penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning. Furthermore, we connect a maximum entropy based confidence penalty to label smoothing through the direction of the…

    Submitted 23 January, 2017; originally announced January 2017.

    Comments: Submitted to ICLR 2017

  36. arXiv:1610.08613  [pdf, ps, other]

    cs.LG cs.CL

    Can Active Memory Replace Attention?

    Authors: Łukasz Kaiser, Samy Bengio

    Abstract: Several mechanisms to focus attention of a neural network on selected parts of its input or memory have been used successfully in deep learning models in recent years. Attention has improved image classification, image captioning, speech recognition, generative models, and learning algorithmic tasks, but it had probably the largest impact on neural machine translation. Recently, similar improvem…

    Submitted 6 March, 2017; v1 submitted 27 October, 2016; originally announced October 2016.

  37. arXiv:1609.08144  [pdf, other]

    cs.CL cs.AI cs.LG

    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

    Authors: Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, et al. (6 additional authors not shown)

    Abstract: Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. Unfortunately, NMT systems are known to be computationally expensive both in training and in translation inference. Also, most NMT systems have difficulty with rare words. These issues have hindered NM…

    Submitted 8 October, 2016; v1 submitted 26 September, 2016; originally announced September 2016.

  38. arXiv:1609.02664  [pdf, ps, other]

    cs.LG cs.LO

    Machine Learning with Guarantees using Descriptive Complexity and SMT Solvers

    Authors: Charles Jordan, Łukasz Kaiser

    Abstract: Machine learning is a thriving part of computer science. There are many efficient approaches to machine learning that do not provide strong theoretical guarantees, and a beautiful general learning theory. Unfortunately, machine learning approaches that give strong theoretical guarantees have not been efficient enough to be applicable. In this paper we introduce a logical approach to machine learni…

    Submitted 9 September, 2016; originally announced September 2016.

  39. arXiv:1603.04467  [pdf, other]

    cs.DC cs.LG

    TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

    Authors: Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, et al. (15 additional authors not shown)

    Abstract: TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational de…

    Submitted 16 March, 2016; v1 submitted 14 March, 2016; originally announced March 2016.

    Comments: Version 2 updates only the metadata, to correct the formatting of Martín Abadi's name

  40. arXiv:1511.08228  [pdf, ps, other]

    cs.LG cs.NE

    Neural GPUs Learn Algorithms

    Authors: Łukasz Kaiser, Ilya Sutskever

    Abstract: Learning an algorithm from examples is a fundamental problem that has been widely studied. Recently it has been addressed using neural networks, in particular by Neural Turing Machines (NTMs). These are fully differentiable computers that use backpropagation to learn their own programming. Despite their appeal NTMs have a weakness that is caused by their sequential nature: they are not parallel an…

    Submitted 14 March, 2016; v1 submitted 25 November, 2015; originally announced November 2015.

  41. arXiv:1511.06807  [pdf, other]

    stat.ML cs.LG

    Adding Gradient Noise Improves Learning for Very Deep Networks

    Authors: Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, James Martens

    Abstract: Deep feedforward and recurrent networks have achieved impressive results in many perception and language processing applications. This success is partially attributed to architectural innovations such as convolutional and long short-term memory networks. The main motivation for these architectural innovations is that they capture better domain knowledge, and importantly are easier to optimize than…

    Submitted 20 November, 2015; originally announced November 2015.

  42. arXiv:1511.06114  [pdf, ps, other]

    cs.LG cs.CL stat.ML

    Multi-task Sequence to Sequence Learning

    Authors: Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, Lukasz Kaiser

    Abstract: Sequence to sequence learning has recently emerged as a new paradigm in supervised learning. To date, most of its applications focused on only one task and not much work explored this framework for multiple tasks. This paper examines three multi-task learning (MTL) settings for sequence to sequence models: (a) the one-to-many setting - where the encoder is shared between several tasks such as machi…

    Submitted 1 March, 2016; v1 submitted 19 November, 2015; originally announced November 2015.

    Comments: 10 pages, 4 figures, ICLR 2016 camera-ready, added parsing SOTA results

  43. Low-frequency type II radio detections and coronagraph data to describe and forecast the propagation of 71 CMEs/shocks

    Authors: H. Cremades, F. A. Iglesias, O. C. St. Cyr, H. Xie, M. L. Kaiser, N. Gopalswamy

    Abstract: The vulnerability of technology on which present society relies demands that a solar event, its time of arrival at Earth, and its degree of geoeffectiveness be promptly forecasted. Motivated by improving predictions of arrival times at Earth of shocks driven by coronal mass ejections (CMEs), we have analyzed 71 Earth-directed events in different stages of their propagation. The study is primarily…

    Submitted 7 May, 2015; originally announced May 2015.

    Comments: Solar Physics; Accepted for publication 2015-Apr-21

  44. arXiv:1412.7449  [pdf, other]

    cs.CL cs.LG stat.ML

    Grammar as a Foreign Language

    Authors: Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, Geoffrey Hinton

    Abstract: Syntactic constituency parsing is a fundamental problem in natural language processing and has been the subject of intensive research and engineering for decades. As a result, the most accurate parsers are domain specific, complex, and inefficient. In this paper we show that the domain agnostic attention-enhanced sequence-to-sequence model achieves state-of-the-art results on the most widely used…

    Submitted 9 June, 2015; v1 submitted 23 December, 2014; originally announced December 2014.

  45. arXiv:1408.4745  [pdf, ps, other]

    cs.DM math.CO

    Directed Width Measures and Monotonicity of Directed Graph Searching

    Authors: Łukasz Kaiser, Stephan Kreutzer, Roman Rabinovich, Sebastian Siebertz

    Abstract: We consider generalisations of tree width to directed graphs, that attracted much attention in the last fifteen years. About their relative strength with respect to "bounded width in one measure implies bounded width in the other" many problems remain unsolved. Only some results separating directed width measures are known. We give an almost complete picture of this relation. For this, we consider…

    Submitted 20 August, 2014; originally announced August 2014.

    MSC Class: 68R10

  46. Model Checking the Quantitative mu-Calculus on Linear Hybrid Systems

    Authors: Diana Fischer, Lukasz Kaiser

    Abstract: We study the model-checking problem for a quantitative extension of the modal mu-calculus on a class of hybrid systems. Qualitative model checking has been proved decidable and implemented for several classes of systems, but this is not the case for quantitative questions that arise naturally in this context. Recently, quantitative formalisms that subsume classical temporal logics and allow the m…

    Submitted 19 September, 2012; v1 submitted 8 September, 2012; originally announced September 2012.

    Comments: LMCS submission

    ACM Class: D.2.4, F.4.1

    Journal ref: Logical Methods in Computer Science, Volume 8, Issue 3 (September 20, 2012) lmcs:760

  47. Degrees of Lookahead in Regular Infinite Games

    Authors: Michael Holtmann, Lukasz Kaiser, Wolfgang Thomas

    Abstract: We study variants of regular infinite games where the strict alternation of moves between the two players is subject to modifications. The second player may postpone a move for a finite number of steps, or, in other words, exploit in his strategy some lookahead on the moves of the opponent. This captures situations in distributed systems, e.g. when buffers are present in communication or when sig…

    Submitted 25 September, 2012; v1 submitted 4 September, 2012; originally announced September 2012.

    Comments: LMCS submission

    ACM Class: D.2.4

    Journal ref: Logical Methods in Computer Science, Volume 8, Issue 3 (September 27, 2012) lmcs:922

  48. Radio-loud CMEs from the disk center lacking shocks at 1 AU

    Authors: N. Gopalswamy, P. Makela, S. Akiyama, S. Yashiro, H. Xie, R. J. MacDowall, M. L. Kaiser

    Abstract: A coronal mass ejection (CME) associated with a type II burst and originating close to the center of the solar disk typically results in a shock at Earth in 2-3 days and hence can be used to predict shock arrival at Earth. However, a significant fraction (about 28%) of such CMEs producing type II bursts were not associated with shocks at Earth. We examined a set of 21 type II bursts observed by th…

    Submitted 29 June, 2012; originally announced July 2012.

    Comments: 33 pages, 11 figures, 2 tables

    Report number: 2012ja017610R

  49. Interplanetary shocks lacking type II radio bursts

    Authors: N. Gopalswamy, H. Xie, P. Makela, S. Akiyama, S. Yashiro, M. L. Kaiser, R. A. Howard, J. -L. Bougeret

    Abstract: We report on the radio-emission characteristics of 222 interplanetary (IP) shocks. A surprisingly large fraction of the IP shocks (~34%) is radio quiet (i.e., the shocks lacked type II radio bursts). The CMEs associated with the RQ shocks are generally slow (average speed ~535 km/s) and only ~40% of the CMEs were halos. The corresponding numbers for CMEs associated with radio loud (RL) shocks ar…

    Submitted 15 January, 2010; v1 submitted 23 December, 2009; originally announced December 2009.

    Journal ref: Astrophys.J.710:1111-1126,2010

  50. arXiv:0903.4141  [pdf, ps, other]

    astro-ph.SR astro-ph.EP

    Dust detection by the wave instrument on STEREO: nanoparticles picked up by the solar wind?

    Authors: N. Meyer-Vernet, M. Maksimovic, A. Czechowski, I. Mann, I. Zouganelis, K. Goetz, M. L. Kaiser, O. C. St. Cyr, J. L. Bougeret, S. D. Bale

    Abstract: The STEREO/WAVES instrument has detected a very large number of intense voltage pulses. We suggest that these events are produced by impact ionisation of nanoparticles striking the spacecraft at a velocity of the order of magnitude of the solar wind speed. Nanoparticles, which are half-way between micron-sized dust and atomic ions, have such a large charge-to-mass ratio that the electric field i…

    Submitted 4 April, 2009; v1 submitted 24 March, 2009; originally announced March 2009.

    Comments: In press in Solar Physics, 13 pages, 5 figures