Search | arXiv e-print repository

Flamingo: a Visual Language Model for Few-Shot Learning

Authors: Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals , et al. (2 additional authors not shown)

Abstract: Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily i… ▽ More Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data. △ Less

Submitted 15 November, 2022; v1 submitted 29 April, 2022; originally announced April 2022.

Comments: 54 pages. In Proceedings of Neural Information Processing Systems (NeurIPS) 2022

arXiv:2112.06749 [pdf, other]

Step-unrolled Denoising Autoencoders for Text Generation

Authors: Nikolay Savinov, Junyoung Chung, Mikolaj Binkowski, Erich Elsen, Aaron van den Oord

Abstract: In this paper we propose a new generative model of text, Step-unrolled Denoising Autoencoder (SUNDAE), that does not rely on autoregressive models. Similarly to denoising diffusion techniques, SUNDAE is repeatedly applied on a sequence of tokens, starting from random inputs and improving them each time until convergence. We present a simple new improvement operator that converges in fewer iteratio… ▽ More In this paper we propose a new generative model of text, Step-unrolled Denoising Autoencoder (SUNDAE), that does not rely on autoregressive models. Similarly to denoising diffusion techniques, SUNDAE is repeatedly applied on a sequence of tokens, starting from random inputs and improving them each time until convergence. We present a simple new improvement operator that converges in fewer iterations than diffusion methods, while qualitatively producing better samples on natural language datasets. SUNDAE achieves state-of-the-art results (among non-autoregressive methods) on the WMT'14 English-to-German translation task and good qualitative results on unconditional language modeling on the Colossal Cleaned Common Crawl dataset and a dataset of Python code from GitHub. The non-autoregressive nature of SUNDAE opens up possibilities beyond left-to-right prompted generation, by filling in arbitrary blank patterns in a template. △ Less

Submitted 19 April, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

Comments: Accepted to ICLR 2022

arXiv:2102.05182 [pdf, other]

A Deep Learning Approach for Characterizing Major Galaxy Mergers

Authors: Skanda Koppula, Victor Bapst, Marc Huertas-Company, Sam Blackwell, Agnieszka Grabska-Barwinska, Sander Dieleman, Andrea Huber, Natasha Antropova, Mikolaj Binkowski, Hannah Openshaw, Adria Recasens, Fernando Caro, Avishai Deke, Yohan Dubois, Jesus Vega Ferrero, David C. Koo, Joel R. Primack, Trevor Back

Abstract: Fine-grained estimation of galaxy merger stages from observations is a key problem useful for validation of our current theoretical understanding of galaxy formation. To this end, we demonstrate a CNN-based regression model that is able to predict, for the first time, using a single image, the merger stage relative to the first perigee passage with a median error of 38.3 million years (Myrs) over… ▽ More Fine-grained estimation of galaxy merger stages from observations is a key problem useful for validation of our current theoretical understanding of galaxy formation. To this end, we demonstrate a CNN-based regression model that is able to predict, for the first time, using a single image, the merger stage relative to the first perigee passage with a median error of 38.3 million years (Myrs) over a period of 400 Myrs. This model uses no specific dynamical modeling and learns only from simulated merger events. We show that our model provides reasonable estimates on real observations, approximately matching prior estimates provided by detailed dynamical modeling. We provide a preliminary interpretability analysis of our models, and demonstrate first steps toward calibrated uncertainty estimation. △ Less

Submitted 9 February, 2021; originally announced February 2021.

Comments: Third Workshop on Machine Learning and the Physical Sciences (NeurIPS 2020), Vancouver, Canada

arXiv:2006.03575 [pdf, other]

End-to-End Adversarial Text-to-Speech

Authors: Jeff Donahue, Sander Dieleman, Mikołaj Bińkowski, Erich Elsen, Karen Simonyan

Abstract: Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audi… ▽ More Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference, using a differentiable alignment scheme based on token length prediction. It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses constraining the generated audio to roughly match the ground truth in terms of its total duration and mel-spectrogram. To allow the model to capture temporal variation in the generated audio, we employ soft dynamic time war** in the spectrogram-based prediction loss. The resulting model achieves a mean opinion score exceeding 4 on a 5 point scale, which is comparable to the state-of-the-art models relying on multi-stage training and additional supervision. △ Less

Submitted 17 March, 2021; v1 submitted 5 June, 2020; originally announced June 2020.

Comments: 23 pages. In proceedings of ICLR 2021

arXiv:1909.11646 [pdf, other]

High Fidelity Speech Synthesis with Adversarial Networks

Authors: Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, Karen Simonyan

Abstract: Generative adversarial networks have seen rapid development in recent years and have led to remarkable improvements in generative modelling of images. However, their application in the audio domain has received limited attention, and autoregressive models, such as WaveNet, remain the state of the art in generative modelling of audio signals such as human speech. To address this paucity, we introdu… ▽ More Generative adversarial networks have seen rapid development in recent years and have led to remarkable improvements in generative modelling of images. However, their application in the audio domain has received limited attention, and autoregressive models, such as WaveNet, remain the state of the art in generative modelling of audio signals such as human speech. To address this paucity, we introduce GAN-TTS, a Generative Adversarial Network for Text-to-Speech. Our architecture is composed of a conditional feed-forward generator producing raw speech audio, and an ensemble of discriminators which operate on random windows of different sizes. The discriminators analyse the audio both in terms of general realism, as well as how well the audio corresponds to the utterance that should be pronounced. To measure the performance of GAN-TTS, we employ both subjective human evaluation (MOS - Mean Opinion Score), as well as novel quantitative metrics (Fréchet DeepSpeech Distance and Kernel DeepSpeech Distance), which we find to be well correlated with MOS. We show that GAN-TTS is capable of generating high-fidelity speech with naturalness comparable to the state-of-the-art models, and unlike autoregressive models, it is highly parallelisable thanks to an efficient feed-forward generator. Listen to GAN-TTS reading this abstract at https://storage.googleapis.com/deepmind-media/research/abstract.wav. △ Less

Submitted 26 September, 2019; v1 submitted 25 September, 2019; originally announced September 2019.

arXiv:1905.12760 [pdf, other]

Batch weight for domain adaptation with mass shift

Authors: Mikołaj Bińkowski, R Devon Hjelm, Aaron Courville

Abstract: Unsupervised domain transfer is the task of transferring or translating samples from a source distribution to a different target distribution. Current solutions unsupervised domain transfer often operate on data on which the modes of the distribution are well-matched, for instance have the same frequencies of classes between source and target distributions. However, these models do not perform wel… ▽ More Unsupervised domain transfer is the task of transferring or translating samples from a source distribution to a different target distribution. Current solutions unsupervised domain transfer often operate on data on which the modes of the distribution are well-matched, for instance have the same frequencies of classes between source and target distributions. However, these models do not perform well when the modes are not well-matched, as would be the case when samples are drawn independently from two different, but related, domains. This mode imbalance is problematic as generative adversarial networks (GANs), a successful approach in this setting, are sensitive to mode frequency, which results in a mismatch of semantics between source samples and generated samples of the target distribution. We propose a principled method of re-weighting training samples to correct for such mass shift between the transferred distributions, which we call batch-weight. We also provide rigorous probabilistic setting for domain transfer and new simplified objective for training transfer networks, an alternative to complex, multi-component loss functions used in the current state-of-the art image-to-image translation models. The new objective stems from the discrimination of joint distributions and enforces cycle-consistency in an abstract, high-level, rather than pixel-wise, sense. Lastly, we experimentally show the effectiveness of the proposed methods in several image-to-image translation tasks. △ Less

Submitted 29 May, 2019; originally announced May 2019.

arXiv:1811.03766 [pdf, other]

Endogeneous Dynamics of Intraday Liquidity

Authors: Mikołaj Bińkowski, Charles-Albert Lehalle

Abstract: In this paper we investigate the endogenous information contained in four liquidity variables at a five minutes time scale on equity markets around the world: the traded volume, the bid-ask spread, the volatility and the volume at first limits of the orderbook. In the spirit of Granger causality, we measure the level of information by the level of accuracy of linear autoregressive models. This emp… ▽ More In this paper we investigate the endogenous information contained in four liquidity variables at a five minutes time scale on equity markets around the world: the traded volume, the bid-ask spread, the volatility and the volume at first limits of the orderbook. In the spirit of Granger causality, we measure the level of information by the level of accuracy of linear autoregressive models. This empirical study is carried out on a dataset of more than 300 stocks from four different markets (US, UK, Japan and Hong Kong) from a period of over five years. We discuss the obtained performances of autoregressive (AR) models on stationarized versions of the variables, focusing on explaining the observed differences between stocks. Since empirical studies are often conducted at this time scale, we believe it is of paramount importance to document endogenous dynamics in a simple framework with no addition of supplemental information. Our study can hence be used as a benchmark to identify exogenous effects. On the other hand, most optimal trading frameworks (like the celebrated Almgren and Chriss one), focus on computing an optimal trading speed at a frequency close to the one we consider. Such frameworks very often take i.i.d. assumptions on liquidity variables; this paper document the auto-correlations emerging from real data, opening the door to new developments in optimal trading. △ Less

Submitted 8 November, 2018; originally announced November 2018.

arXiv:1805.11565 [pdf, other]

On gradient regularizers for MMD GANs

Authors: Michael Arbel, Danica J. Sutherland, Mikołaj Bińkowski, Arthur Gretton

Abstract: We propose a principled method for gradient-based regularization of the critic of GAN-like models trained by adversarially optimizing the kernel of a Maximum Mean Discrepancy (MMD). We show that controlling the gradient of the critic is vital to having a sensible loss function, and devise a method to enforce exact, analytical gradient constraints at no additional cost compared to existing approxim… ▽ More We propose a principled method for gradient-based regularization of the critic of GAN-like models trained by adversarially optimizing the kernel of a Maximum Mean Discrepancy (MMD). We show that controlling the gradient of the critic is vital to having a sensible loss function, and devise a method to enforce exact, analytical gradient constraints at no additional cost compared to existing approximate techniques based on additive regularizers. The new loss function is provably continuous, and experiments show that it stabilizes and accelerates training, giving image generation models that outperform state-of-the art methods on $160 \times 160$ CelebA and $64 \times 64$ unconditional ImageNet. △ Less

Submitted 14 January, 2021; v1 submitted 29 May, 2018; originally announced May 2018.

Comments: Code at https://github.com/MichaelArbel/Scaled-MMD-GAN

Journal ref: Advances in Neural Information Processing Systems 31 (NeurIPS 2018), 6700-6710

arXiv:1801.01401 [pdf, other]

Demystifying MMD GANs

Authors: Mikołaj Bińkowski, Danica J. Sutherland, Michael Arbel, Arthur Gretton

Abstract: We investigate the training and performance of generative adversarial networks using the Maximum Mean Discrepancy (MMD) as critic, termed MMD GANs. As our main theoretical contribution, we clarify the situation with bias in GAN loss functions raised by recent work: we show that gradient estimators used in the optimization process for both MMD GANs and Wasserstein GANs are unbiased, but learning a… ▽ More We investigate the training and performance of generative adversarial networks using the Maximum Mean Discrepancy (MMD) as critic, termed MMD GANs. As our main theoretical contribution, we clarify the situation with bias in GAN loss functions raised by recent work: we show that gradient estimators used in the optimization process for both MMD GANs and Wasserstein GANs are unbiased, but learning a discriminator based on samples leads to biased gradients for the generator parameters. We also discuss the issue of kernel choice for the MMD critic, and characterize the kernel corresponding to the energy distance used for the Cramer GAN critic. Being an integral probability metric, the MMD benefits from training strategies recently developed for Wasserstein GANs. In experiments, the MMD GAN is able to employ a smaller critic network than the Wasserstein GAN, resulting in a simpler and faster-training algorithm with matching performance. We also propose an improved measure of GAN convergence, the Kernel Inception Distance, and show how to use it to dynamically adapt learning rates during GAN training. △ Less

Submitted 14 January, 2021; v1 submitted 4 January, 2018; originally announced January 2018.

Comments: Published at ICLR 2018: https://openreview.net/forum?id=r1lUOzWCW

arXiv:1703.04122 [pdf, other]

Autoregressive Convolutional Neural Networks for Asynchronous Time Series

Authors: Mikołaj Bińkowski, Gautier Marti, Philippe Donnat

Abstract: We propose Significance-Offset Convolutional Neural Network, a deep convolutional network architecture for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive (AR) models and gating mechanisms used in recurrent neural networks. It involves an AR-like weighting system, where the final predictor is obtained as a weighted sum of adjusted regressors, w… ▽ More We propose Significance-Offset Convolutional Neural Network, a deep convolutional network architecture for regression of multivariate asynchronous time series. The model is inspired by standard autoregressive (AR) models and gating mechanisms used in recurrent neural networks. It involves an AR-like weighting system, where the final predictor is obtained as a weighted sum of adjusted regressors, while the weights are datadependent functions learnt through a convolutional network. The architecture was designed for applications on asynchronous time series and is evaluated on such datasets: a hedge fund proprietary dataset of over 2 million quotes for a credit derivative index, an artificially generated noisy autoregressive series and UCI household electricity consumption dataset. The proposed architecture achieves promising results as compared to convolutional and recurrent neural networks. △ Less

Submitted 12 June, 2018; v1 submitted 12 March, 2017; originally announced March 2017.

Comments: Proceedings of The 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 2018, to appear

arXiv:1703.00485 [pdf, ps, other]

doi 10.1007/978-3-030-65459-7

A review of two decades of correlations, hierarchies, networks and clustering in financial markets

Authors: Gautier Marti, Frank Nielsen, Mikołaj Bińkowski, Philippe Donnat

Abstract: We review the state of the art of clustering financial time series and the study of their correlations alongside other interaction networks. The aim of this review is to gather in one place the relevant material from different fields, e.g. machine learning, information geometry, econophysics, statistical physics, econometrics, behavioral finance. We hope it will help researchers to use more effect… ▽ More We review the state of the art of clustering financial time series and the study of their correlations alongside other interaction networks. The aim of this review is to gather in one place the relevant material from different fields, e.g. machine learning, information geometry, econophysics, statistical physics, econometrics, behavioral finance. We hope it will help researchers to use more effectively this alternative modeling of the financial time series. Decision makers and quantitative researchers may also be able to leverage its insights. Finally, we also hope that this review will form the basis of an open toolbox to study correlations, hierarchies, networks and clustering in financial markets. △ Less

Submitted 3 November, 2020; v1 submitted 1 March, 2017; originally announced March 2017.

Journal ref: Chapter in Progress in Information Geometry: Theory and Applications, 245-274, 2021

Showing 1–11 of 11 results for author: Bińkowski, M