Poly-View Contrastive Learning

Authors: Amitis Shidani, Devon Hjelm, Jason Ramapuram, Russ Webb, Eeshan Gunesh Dhekane, Dan Busbridge

Abstract: Contrastive learning typically matches pairs of related views among a number of unrelated negative views. Views can be generated (e.g. by augmentations) or be observed. We investigate matching when there are more than two related views which we call poly-view tasks, and derive new representation learning objectives using information maximization and sufficient statistics. We show that with unlimited computation, one should maximize the number of related views, and with a fixed compute budget, it is beneficial to decrease the number of unique samples whilst increasing the number of views of those samples. In particular, poly-view contrastive models trained for 128 epochs with batch size 256 outperform SimCLR trained for 1024 epochs at batch size 4096 on ImageNet1k, challenging the belief that contrastive models require large batch sizes and many training epochs.

Submitted 8 March, 2024; originally announced March 2024.

Comments: Accepted to ICLR 2024. 42 pages, 7 figures, 3 tables, loss pseudo-code included in appendix

arXiv:2312.03213 [pdf, other]

Bootstrap Your Own Variance

Authors: Polina Turishcheva, Jason Ramapuram, Sinead Williamson, Dan Busbridge, Eeshan Dhekane, Russ Webb

Abstract: Understanding model uncertainty is important for many applications. We propose Bootstrap Your Own Variance (BYOV), combining Bootstrap Your Own Latent (BYOL), a negative-free Self-Supervised Learning (SSL) algorithm, with Bayes by Backprop (BBB), a Bayesian method for estimating model posteriors. We find that the learned predictive std of BYOV vs. a supervised BBB model is well captured by a Gaussian distribution, providing preliminary evidence that the learned parameter posterior is useful for label free uncertainty estimation. BYOV improves upon the deterministic BYOL baseline (+2.83% test ECE, +1.03% test Brier) and presents better calibration and reliability when tested with various augmentations (eg: +2.4% test ECE, +1.2% test Brier for Salt & Pepper noise).

Submitted 5 December, 2023; originally announced December 2023.

Journal ref: NeurIPS 2023 Workshop: Self-Supervised Learning - Theory and Practice

arXiv:2307.13813 [pdf, other]

How to Scale Your EMA

Authors: Dan Busbridge, Jason Ramapuram, Pierre Ablin, Tatiana Likhomanenko, Eeshan Gunesh Dhekane, Xavier Suau, Russ Webb

Abstract: Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule, for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functional copy of a target model, whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum hyperparameter. This model EMA can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior works have not considered the optimization of the model EMA when performing scaling, leading to different training dynamics across batch sizes and lower model performance. In this work, we provide a scaling rule for optimization in the presence of a model EMA and demonstrate the rule's validity across a range of architectures, optimizers, and data modalities. We also show the rule's validity where the model EMA contributes to the optimization of the target model, enabling us to train EMA-based pseudo-labeling and SSL methods at small and large batch sizes. For SSL, we enable training of BYOL up to batch size 24,576 without sacrificing performance, a 6$\times$ wall-clock time reduction under idealized hardware settings.

Submitted 7 November, 2023; v1 submitted 25 July, 2023; originally announced July 2023.

Comments: Spotlight at NeurIPS 2023, 53 pages, 32 figures, 17 tables

arXiv:2110.00528 [pdf, other]

Do Self-Supervised and Supervised Methods Learn Similar Visual Representations?

Authors: Tom George Grigg, Dan Busbridge, Jason Ramapuram, Russ Webb

Abstract: Despite the success of a number of recent techniques for visual self-supervised deep learning, there has been limited investigation into the representations that are ultimately learned. By leveraging recent advances in the comparison of neural representations, we explore in this direction by comparing a contrastive self-supervised algorithm to supervision for simple image data in a common architecture. We find that the methods learn similar intermediate representations through dissimilar means, and that the representations diverge rapidly in the final few layers. We investigate this divergence, finding that these layers strongly fit to their distinct learning objectives. We also find that the contrastive objective implicitly fits the supervised objective in intermediate layers, but that the reverse is not true. Our work particularly highlights the importance of the learned intermediate representations, and raises critical questions for auxiliary task design.

Submitted 2 December, 2021; v1 submitted 1 October, 2021; originally announced October 2021.

Comments: Accepted to 2nd Workshop on Self-Supervised Learning: Theory and Practice (NeurIPS 2021), Sydney, Australia. Fixed typos, added acknowledgements. 5 pages + 2 pages of appendices, 5 figures, 1 table

arXiv:1912.08444 [pdf, other]

Relational Mimic for Visual Adversarial Imitation Learning

Authors: Lionel Blondé, Yichuan Charlie Tang, Jian Zhang, Russ Webb

Abstract: In this work, we introduce a new method for imitation learning from video demonstrations. Our method, Relational Mimic (RM), improves on previous visual imitation learning methods by combining generative adversarial networks and relational learning. RM is flexible and can be used in conjunction with other recent advances in generative adversarial imitation learning to better address the need for more robust and sample-efficient approaches. In addition, we introduce a new neural network architecture that improves upon the previous state-of-the-art in reinforcement learning and illustrate how increasing the relational reasoning capabilities of the agent enables the latter to achieve increasingly higher performance in a challenging locomotion task with pixel inputs. Finally, we study the effects and contributions of relational learning in policy evaluation, policy improvement and reward learning through ablation studies.

Submitted 18 December, 2019; originally announced December 2019.

arXiv:1905.03658 [pdf, other]

Improving Discrete Latent Representations With Differentiable Approximation Bridges

Authors: Jason Ramapuram, Russ Webb

Abstract: Modern neural network training relies on piece-wise (sub-)differentiable functions in order to use backpropagation to update model parameters. In this work, we introduce a novel method to allow simple non-differentiable functions at intermediary layers of deep neural networks. We do so by training with a differentiable approximation bridge (DAB) neural network which approximates the non-differentiable forward function and provides gradient updates during backpropagation. We present strong empirical results (performing over 600 experiments) in four different domains: unsupervised (image) representation learning, variational (image) density estimation, image classification, and sequence sorting to demonstrate that our proposed method improves state of the art performance. We demonstrate that training with DAB aided discrete non-differentiable functions improves image reconstruction quality and posterior linear separability by 10% against the Gumbel-Softmax relaxed estimator [37, 26] as well as providing a 9% improvement in the test variational lower bound in comparison to the state of the art RELAX [16] discrete estimator. We also observe an accuracy improvement of 77% in neural sequence sorting and a 25% improvement against the straight-through estimator [5] in an image classification setting. The DAB network is not used for inference and expands the class of functions that are usable in neural networks.

Submitted 25 October, 2019; v1 submitted 9 May, 2019; originally announced May 2019.

arXiv:1812.03170 [pdf, other]

Variational Saccading: Efficient Inference for Large Resolution Images

Authors: Jason Ramapuram, Maurits Diephuis, Frantzeska Lavda, Russ Webb, Alexandros Kalousis

Abstract: Image classification with deep neural networks is typically restricted to images of small dimensionality such as 224 x 244 in Resnet models [24]. This limitation excludes the 4000 x 3000 dimensional images that are taken by modern smartphone cameras and smart devices. In this work, we aim to mitigate the prohibitive inferential and memory costs of operating in such large dimensional spaces. To sample from the high-resolution original input distribution, we propose using a smaller proxy distribution to learn the co-ordinates that correspond to regions of interest in the high-dimensional space. We introduce a new principled variational lower bound that captures the relationship of the proxy distribution's posterior and the original image's co-ordinate space in a way that maximizes the conditional classification likelihood. We empirically demonstrate on one synthetic benchmark and one real world large resolution DSLR camera image dataset that our method produces comparable results with ~10x faster inference and lower memory consumption than a model that utilizes the entire original input distribution. Finally, we experiment with a more complex setting using mini-maps from Starcraft II [56] to infer the number of characters in a complex 3d-rendered scene. Even in such complicated scenes our model provides strong localization: a feature missing from traditional classification models.

Submitted 6 September, 2019; v1 submitted 8 December, 2018; originally announced December 2018.

Comments: Published BMVC 2019 & NIPS 2018 Bayesian Deep Learning Workshop

arXiv:1807.00126 [pdf, other]

A New Benchmark and Progress Toward Improved Weakly Supervised Learning

Authors: Jason Ramapuram, Russ Webb

Abstract: Knowledge Matters: Importance of Prior Information for Optimization [7], by Gulcehre et. al., sought to establish the limits of current black-box, deep learning techniques by posing problems which are difficult to learn without engineering knowledge into the model or training procedure. In our work, we completely solve the previous Knowledge Matters problem using a generic model, pose a more difficult and scalable problem, All-Pairs, and advance this new problem by introducing a new learned, spatially-varying histogram model called TypeNet which outperforms conventional models on the problem. We present results on All-Pairs where our model achieves 100% test accuracy while the best ResNet models achieve 79% accuracy. In addition, our model is more than an order of magnitude smaller than Resnet-34. The challenge of solving larger-scale All-Pairs problems with high accuracy is presented to the community for investigation.

Submitted 18 September, 2018; v1 submitted 30 June, 2018; originally announced July 2018.

Showing 1–8 of 8 results for author: Webb, R