
GIVT: Generative Infinite-Vocabulary Transformers

Authors: Michael Tschannen, Cian Eastwood, Fabian Mentzer

Abstract: We introduce generative infinite-vocabulary transformers (GIVT) which generate vector sequences with real-valued entries, instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a $β$-VAE. In class-conditional image generation GIVT outperforms VQ-GAN (and improved variants thereof) as well as MaskGIT, and achieves performance competitive with recent latent diffusion models. Finally, we obtain strong results outside of image generation when applying GIVT to panoptic segmentation and depth estimation with a VAE variant of the UViM framework.

Submitted 21 March, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

Comments: v2: add related NLP work, loss details. v3: Improved GMM formulation, added adapter module, larger models, better image generation results. Code and model checkpoints are available at: https://github.com/google-research/big_vision
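
Illustration (not from the paper or its released code): the abstract above describes replacing the categorical output with the parameters of a Gaussian mixture model. The sketch below shows how a negative log-likelihood could be computed from such parameters for one real-valued target vector. The diagonal covariance, the softmax over mixture logits, the log-scale parameterization, and all function and argument names are assumptions made for this example only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    top = max(logits)
    lse = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - lse for x in logits]


def gmm_nll(target, mix_logits, means, log_scales):
    """Negative log-likelihood of `target` under a diagonal-covariance GMM.

    target:     list of d floats (one real-valued "token")
    mix_logits: list of k unnormalized mixture weights (assumed parameterization)
    means:      k lists of d floats
    log_scales: k lists of d floats (log standard deviations, assumed)
    """
    log_weights = log_softmax(mix_logits)
    component_log_probs = []
    for log_w, mu, log_sigma in zip(log_weights, means, log_scales):
        log_p = log_w
        for x, m, s in zip(target, mu, log_sigma):
            z = (x - m) / math.exp(s)
            # Log-density of a univariate Gaussian, summed over dimensions.
            log_p += -0.5 * z * z - s - 0.5 * math.log(2.0 * math.pi)
        component_log_probs.append(log_p)
    # Log-sum-exp over components gives the mixture log-density.
    top = max(component_log_probs)
    log_mix = top + math.log(sum(math.exp(c - top) for c in component_log_probs))
    return -log_mix


# Hypothetical example: two components in two dimensions.
example_nll = gmm_nll(
    target=[0.1, -0.2],
    mix_logits=[0.0, 1.0],
    means=[[0.0, 0.0], [1.0, 1.0]],
    log_scales=[[0.0, 0.0], [-0.5, -0.5]],
)
```

With a single standard-normal component and a target at the mean, the function reduces to the familiar constant (d/2)·log(2π), which is a quick sanity check on the implementation.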

arXiv:2311.08815 [pdf, other]

Self-Supervised Disentanglement by Leveraging Structure in Data Augmentations

Authors: Cian Eastwood, Julius von Kügelgen, Linus Ericsson, Diane Bouchacourt, Pascal Vincent, Bernhard Schölkopf, Mark Ibrahim

Abstract: Self-supervised representation learning often uses data augmentations to induce some invariance to "style" attributes of the data. However, with downstream tasks generally unknown at training time, it is difficult to deduce a priori which attributes of the data are indeed "style" and can be safely discarded. To address this, we introduce a more principled approach that seeks to disentangle style features rather than discard them. The key idea is to add multiple style embedding spaces where: (i) each is invariant to all-but-one augmentation; and (ii) joint entropy is maximized. We formalize our structured data-augmentation procedure from a causal latent-variable-model perspective, and prove identifiability of both content and (multiple blocks of) style variables. We empirically demonstrate the benefits of our approach on synthetic datasets and then present promising but limited results on ImageNet.

Submitted 15 November, 2023; originally announced November 2023.

arXiv:2307.09933 [pdf, other]

Spuriosity Didn't Kill the Classifier: Using Invariant Predictions to Harness Spurious Features

Authors: Cian Eastwood, Shashank Singh, Andrei Liviu Nicolicioiu, Marin Vlastelica, Julius von Kügelgen, Bernhard Schölkopf

Abstract: To avoid failures on out-of-distribution data, recent works have sought to extract features that have an invariant or stable relationship with the label across domains, discarding "spurious" or unstable features whose relationship with the label changes across domains. However, unstable features often carry complementary information that could boost performance if used correctly in the test domain. In this work, we show how this can be done without test-domain labels. In particular, we prove that pseudo-labels based on stable features provide sufficient guidance for doing so, provided that stable and unstable features are conditionally independent given the label. Based on this theoretical insight, we propose Stable Feature Boosting (SFB), an algorithm for: (i) learning a predictor that separates stable and conditionally-independent unstable features; and (ii) using the stable-feature predictions to adapt the unstable-feature predictions in the test domain. Theoretically, we prove that SFB can learn an asymptotically-optimal predictor without test-domain labels. Empirically, we demonstrate the effectiveness of SFB on real and synthetic data.

Submitted 8 November, 2023; v1 submitted 19 July, 2023; originally announced July 2023.

Comments: NeurIPS 2023 Camera-Ready

arXiv:2210.00364 [pdf, other]

DCI-ES: An Extended Disentanglement Framework with Connections to Identifiability

Authors: Cian Eastwood, Andrei Liviu Nicolicioiu, Julius von Kügelgen, Armin Kekić, Frederik Träuble, Andrea Dittadi, Bernhard Schölkopf

Abstract: In representation learning, a common approach is to seek representations which disentangle the underlying factors of variation. Eastwood & Williams (2018) proposed three metrics for quantifying the quality of such disentangled representations: disentanglement (D), completeness (C) and informativeness (I). In this work, we first connect this DCI framework to two common notions of linear and nonlinear identifiability, thereby establishing a formal link between disentanglement and the closely-related field of independent component analysis. We then propose an extended DCI-ES framework with two new measures of representation quality - explicitness (E) and size (S) - and point out how D and C can be computed for black-box predictors. Our main idea is that the functional capacity required to use a representation is an important but thus-far neglected aspect of representation quality, which we quantify using explicitness or ease-of-use (E). We illustrate the relevance of our extensions on the MPI3D and Cars3D datasets.

Submitted 16 February, 2023; v1 submitted 1 October, 2022; originally announced October 2022.

Comments: Accepted to ICLR 2023

arXiv:2207.09944 [pdf, other]

Probable Domain Generalization via Quantile Risk Minimization

Authors: Cian Eastwood, Alexander Robey, Shashank Singh, Julius von Kügelgen, Hamed Hassani, George J. Pappas, Bernhard Schölkopf

Abstract: Domain generalization (DG) seeks predictors which perform well on unseen test distributions by leveraging data drawn from multiple related training distributions or domains. To achieve this, DG is commonly formulated as an average- or worst-case problem over the set of possible domains. However, predictors that perform well on average lack robustness while predictors that perform well in the worst case tend to be overly-conservative. To address this, we propose a new probabilistic framework for DG where the goal is to learn predictors that perform well with high probability. Our key idea is that distribution shifts seen during training should inform us of probable shifts at test time, which we realize by explicitly relating training and test domains as draws from the same underlying meta-distribution. To achieve probable DG, we propose a new optimization problem called Quantile Risk Minimization (QRM). By minimizing the $α$-quantile of predictor's risk distribution over domains, QRM seeks predictors that perform well with probability $α$. To solve QRM in practice, we propose the Empirical QRM (EQRM) algorithm and provide: (i) a generalization bound for EQRM; and (ii) the conditions under which EQRM recovers the causal predictor as $α\to 1$. In our experiments, we introduce a more holistic quantile-focused evaluation protocol for DG and demonstrate that EQRM outperforms state-of-the-art baselines on datasets from WILDS and DomainBed.

Submitted 22 August, 2023; v1 submitted 20 July, 2022; originally announced July 2022.

Comments: NeurIPS 2022 camera-ready (+ minor corrections)
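
Illustration (not the paper's EQRM algorithm): the abstract above contrasts average-case, worst-case, and $α$-quantile objectives over domains. The sketch below computes those three summaries for one predictor from a list of per-domain risks, using a plain nearest-rank empirical quantile. The risk values and function names are hypothetical and chosen only to make the contrast concrete.

```python
import math


def mean_risk(risks):
    """Average-case objective: mean risk over training domains."""
    return sum(risks) / len(risks)


def quantile_risk(risks, alpha):
    """Nearest-rank empirical alpha-quantile of per-domain risks.

    This is a simple stand-in for the quantile of the risk distribution;
    it is an assumption of this sketch, not the estimator used in the paper.
    """
    if not risks:
        raise ValueError("risks must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(risks)
    rank = math.ceil(alpha * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


# Hypothetical per-domain risks for a single predictor.
domain_risks = [0.10, 0.12, 0.15, 0.40]

average_case = mean_risk(domain_risks)
median_case = quantile_risk(domain_risks, alpha=0.5)
worst_case = quantile_risk(domain_risks, alpha=1.0)
```

In this sketch, setting alpha to 1 returns the largest observed risk, so the quantile objective coincides with the worst-case objective over the observed domains, while smaller values of alpha ignore the most extreme domains.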

arXiv:2203.04694 [pdf, other]

Align-Deform-Subtract: An Interventional Framework for Explaining Object Differences

Authors: Cian Eastwood, Li Nanbo, Christopher K. I. Williams

Abstract: Given two object images, how can we explain their differences in terms of the underlying object properties? To address this question, we propose Align-Deform-Subtract (ADS) -- an interventional framework for explaining object differences. By leveraging semantic alignments in image-space as counterfactual interventions on the underlying object properties, ADS iteratively quantifies and removes differences in object properties. The result is a set of "disentangled" error measures which explain object differences in terms of the underlying properties. Experiments on real and synthetic data illustrate the efficacy of the framework.

Submitted 20 July, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

Comments: ICLR 2022 Workshop on Objects, Structure and Causality

arXiv:2111.07117 [pdf, other]

Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views

Authors: Li Nanbo, Cian Eastwood, Robert B. Fisher

Abstract: Learning object-centric representations of multi-object scenes is a promising approach towards machine intelligence, facilitating high-level reasoning and control from visual sensory data. However, current approaches for unsupervised object-centric scene representation are incapable of aggregating information from multiple observations of a scene. As a result, these "single-view" methods form their representations of a 3D scene based only on a single 2D observation (view). Naturally, this leads to several inaccuracies, with these methods falling victim to single-view spatial ambiguities. To address this, we propose The Multi-View and Multi-Object Network (MulMON) -- a method for learning accurate, object-centric representations of multi-object scenes by leveraging multiple views. In order to sidestep the main technical difficulty of the multi-object-multi-view scenario -- maintaining object correspondences across views -- MulMON iteratively updates the latent object representations for a scene over multiple views. To ensure that these iterative updates do indeed aggregate spatial information to form a complete 3D scene understanding, MulMON is asked to predict the appearance of the scene from novel viewpoints during training. Through experiments, we show that MulMON better-resolves spatial ambiguities than single-view methods -- learning more accurate and disentangled object representations -- and also achieves new functionality in predicting object segmentations for novel viewpoints.

Submitted 13 November, 2021; originally announced November 2021.

Comments: Accepted at NeurIPS 2020 (Spotlight)

arXiv:2107.05446 [pdf, other]

Source-Free Adaptation to Measurement Shift via Bottom-Up Feature Restoration

Authors: Cian Eastwood, Ian Mason, Christopher K. I. Williams, Bernhard Schölkopf

Abstract: Source-free domain adaptation (SFDA) aims to adapt a model trained on labelled data in a source domain to unlabelled data in a target domain without access to the source-domain data during adaptation. Existing methods for SFDA leverage entropy-minimization techniques which: (i) apply only to classification; (ii) destroy model calibration; and (iii) rely on the source model achieving a good level of feature-space class-separation in the target domain. We address these issues for a particularly pervasive type of domain shift called measurement shift which can be resolved by restoring the source features rather than extracting new ones. In particular, we propose Feature Restoration (FR) wherein we: (i) store a lightweight and flexible approximation of the feature distribution under the source data; and (ii) adapt the feature-extractor such that the approximate feature distribution under the target data realigns with that saved on the source. We additionally propose a bottom-up training scheme which boosts performance, which we call Bottom-Up Feature Restoration (BUFR). On real and synthetic data, we demonstrate that BUFR outperforms existing SFDA methods in terms of accuracy, calibration, and data efficiency, while being less reliant on the performance of the source model in the target domain.

Submitted 17 March, 2022; v1 submitted 12 July, 2021; originally announced July 2021.

Comments: ICLR 2022 (Spotlight)

arXiv:2007.12989 [pdf, other]

Information Fusion on Belief Networks

Authors: Shawn C. Eastwood, Svetlana N. Yanushkevich

Abstract: This paper will focus on the process of 'fusing' several observations or models of uncertainty into a single resultant model. Many existing approaches to fusion use subjective quantities such as 'strengths of belief' and process these quantities with heuristic algorithms. This paper argues in favor of quantities that can be objectively measured, as opposed to the subjective 'strength of belief' values. This paper will focus on probability distributions, and more importantly, structures that denote sets of probability distributions known as 'credal sets'. The novel aspect of this paper will be a taxonomy of models of fusion that use specific types of credal sets, namely probability interval distributions and Dempster-Shafer models. An objective requirement for information fusion algorithms is provided, and is satisfied by all models of fusion presented in this paper. Dempster's rule of combination is shown to not satisfy this requirement. This paper will also assess the computational challenges involved for the proposed fusion approaches.

Submitted 25 July, 2020; originally announced July 2020.

Comments: 25 pages, pages of Appendix, 3 figures

Showing 1–9 of 9 results for author: Eastwood, C