
Showing 1–23 of 23 results for author: Gritsenko, A

  1. arXiv:2402.02887  [pdf, other]

    cs.CV cs.LG

    Time-, Memory- and Parameter-Efficient Visual Adaptation

    Authors: Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab

    Abstract: As foundation models become more popular, there is a growing need to efficiently finetune them for downstream tasks. Although numerous adaptation methods have been proposed, they are designed to be efficient only in terms of how many parameters are trained. They, however, typically still require backpropagating gradients throughout the model, meaning that their training-time and -memory cost does…

    Submitted 5 February, 2024; originally announced February 2024.

  2. arXiv:2401.09865  [pdf, other]

    cs.CV cs.AI cs.LG

    Improving fine-grained understanding in image-text pre-training

    Authors: Ioana Bica, Anastasija Ilić, Matthias Bauer, Goker Erdogan, Matko Bošnjak, Christos Kaplanis, Alexey A. Gritsenko, Matthias Minderer, Charles Blundell, Razvan Pascanu, Jovana Mitrović

    Abstract: We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs. Given that multiple image patches often correspond to single words, we propose to learn a grouping of image patches for every token in the caption. To achieve this, we use a sparse similarity metric between image patches and language to…

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: 26 pages

  3. arXiv:2308.11093  [pdf, other]

    cs.CV cs.AI cs.LG

    Video OWL-ViT: Temporally-consistent open-world localization in video

    Authors: Georg Heigold, Matthias Minderer, Alexey Gritsenko, Alex Bewley, Daniel Keysers, Mario Lučić, Fisher Yu, Thomas Kipf

    Abstract: We present an architecture and a training recipe that adapts pre-trained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tas…

    Submitted 21 August, 2023; originally announced August 2023.

    Comments: ICCV 2023

  4. arXiv:2307.06304  [pdf, other]

    cs.CV cs.AI cs.LG

    Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

    Authors: Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Gritsenko, Mario Lučić, Neil Houlsby

    Abstract: The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence…

    Submitted 12 July, 2023; originally announced July 2023.

  5. arXiv:2306.09683  [pdf, other]

    cs.CV

    Scaling Open-Vocabulary Object Detection

    Authors: Matthias Minderer, Alexey Gritsenko, Neil Houlsby

    Abstract: Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses…

    Submitted 22 May, 2024; v1 submitted 16 June, 2023; originally announced June 2023.

  6. arXiv:2304.12160  [pdf, other]

    cs.CV

    End-to-End Spatio-Temporal Action Localisation with Video Transformers

    Authors: Alexey Gritsenko, Xuehan Xiong, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid, Anurag Arnab

    Abstract: The most performant spatio-temporal action localisation models use external person proposals and complex external memory banks. We propose a fully end-to-end, purely-transformer based model that directly ingests an input video, and outputs tubelets -- a sequence of bounding boxes and the action classes at each frame. Our flexible model can be trained with either sparse bounding-box supervision on…

    Submitted 24 April, 2023; originally announced April 2023.

  7. arXiv:2302.05442  [pdf, other]

    cs.CV cs.AI cs.LG

    Scaling Vision Transformers to 22 Billion Parameters

    Authors: Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, et al. (17 additional authors not shown)

    Abstract: The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al…

    Submitted 10 February, 2023; originally announced February 2023.

  8. arXiv:2210.02303  [pdf, other]

    cs.CV cs.LG

    Imagen Video: High Definition Video Generation with Diffusion Models

    Authors: Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, Tim Salimans

    Abstract: We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a high definition text-to-video model including design deci…

    Submitted 5 October, 2022; originally announced October 2022.

    Comments: See accompanying website: https://imagen.research.google/video/

  9. arXiv:2207.03807  [pdf, other]

    cs.CV

    Beyond Transfer Learning: Co-finetuning for Action Localisation

    Authors: Anurag Arnab, Xuehan Xiong, Alexey Gritsenko, Rob Romijnders, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid

    Abstract: Transfer learning is the predominant paradigm for training deep networks on small target datasets. Models are typically pretrained on large "upstream" datasets for classification, as such labels are easy to collect, and then finetuned on "downstream" tasks such as action localisation, which are smaller due to their finer-grained annotations. In this paper, we question this approach, and propos…

    Submitted 8 July, 2022; originally announced July 2022.

  10. arXiv:2205.06230  [pdf, other]

    cs.CV

    Simple Open-Vocabulary Object Detection with Vision Transformers

    Authors: Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby

    Abstract: Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary…

    Submitted 20 July, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

    Comments: ECCV 2022 camera-ready version

  11. arXiv:2204.03458  [pdf, other]

    cs.CV cs.AI cs.LG

    Video Diffusion Models

    Authors: Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, David J. Fleet

    Abstract: Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables jointly training from image and video data, which we find to…

    Submitted 22 June, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

  12. arXiv:2112.05692  [pdf, other]

    cs.CV cs.AI cs.HC cs.LG

    VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling

    Authors: Yang Li, Gang Li, Xin Zhou, Mostafa Dehghani, Alexey Gritsenko

    Abstract: User interface modeling is inherently multimodal, which involves several distinct types of data: images, structures and language. The tasks are also diverse, including object detection, language generation and grounding. In this paper, we present VUT, a Versatile UI Transformer that takes multimodal input and simultaneously accomplishes 5 distinct tasks with the same model. Our model consists of a…

    Submitted 10 December, 2021; originally announced December 2021.

  13. arXiv:2110.11403  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    SCENIC: A JAX Library for Computer Vision Research and Beyond

    Authors: Mostafa Dehghani, Alexey Gritsenko, Anurag Arnab, Matthias Minderer, Yi Tay

    Abstract: Scenic is an open-source JAX library with a focus on Transformer-based models for computer vision research and beyond. The goal of this toolkit is to facilitate rapid experimentation, prototyping, and research of new vision architectures and models. Scenic supports a diverse range of vision tasks (e.g., classification, segmentation, detection) and facilitates working on multi-modal problems, along…

    Submitted 18 October, 2021; originally announced October 2021.

  14. arXiv:2110.02037  [pdf, other]

    cs.LG stat.ML

    Autoregressive Diffusion Models

    Authors: Emiel Hoogeboom, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, Tim Salimans

    Abstract: We introduce Autoregressive Diffusion Models (ARDMs), a model class encompassing and generalizing order-agnostic autoregressive models (Uria et al., 2014) and absorbing discrete diffusion (Austin et al., 2021), which we show are special cases of ARDMs under mild assumptions. ARDMs are simple to implement and easy to train. Unlike standard ARMs, they do not require causal masking of model represent…

    Submitted 1 February, 2022; v1 submitted 5 October, 2021; originally announced October 2021.

    Comments: Published as a conference paper at International Conference on Learning Representations (ICLR) 2022

  15. arXiv:2107.07002  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV cs.IR

    The Benchmark Lottery

    Authors: Mostafa Dehghani, Yi Tay, Alexey A. Gritsenko, Zhe Zhao, Neil Houlsby, Fernando Diaz, Donald Metzler, Oriol Vinyals

    Abstract: The world of empirical machine learning (ML) strongly relies on benchmarks in order to determine the relative effectiveness of different algorithms and methods. This paper proposes the notion of "a benchmark lottery" that describes the overall fragility of the ML benchmarking process. The benchmark lottery postulates that many factors, other than fundamental algorithmic superiority, may lead to a…

    Submitted 14 July, 2021; originally announced July 2021.

  16. arXiv:2012.06957  [pdf, other]

    cs.LG

    Open-World Class Discovery with Kernel Networks

    Authors: Zifeng Wang, Batool Salehi, Andrey Gritsenko, Kaushik Chowdhury, Stratis Ioannidis, Jennifer Dy

    Abstract: We study an Open-World Class Discovery problem in which, given labeled training samples from old classes, we need to discover new classes from unlabeled test samples. There are two critical challenges to addressing this paradigm: (a) transferring knowledge from old to new classes, and (b) incorporating knowledge learned from new classes back to the original model. We propose Class Discovery Kernel…

    Submitted 12 December, 2020; originally announced December 2020.

    Comments: Accepted to the IEEE International Conference on Data Mining 2020 (ICDM'20); Best paper candidate

  17. arXiv:2008.01160  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    A Spectral Energy Distance for Parallel Speech Synthesis

    Authors: Alexey A. Gritsenko, Tim Salimans, Rianne van den Berg, Jasper Snoek, Nal Kalchbrenner

    Abstract: Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited fo…

    Submitted 23 October, 2020; v1 submitted 3 August, 2020; originally announced August 2020.

  18. arXiv:2006.12459  [pdf, other]

    cs.LG stat.ML

    IDF++: Analyzing and Improving Integer Discrete Flows for Lossless Compression

    Authors: Rianne van den Berg, Alexey A. Gritsenko, Mostafa Dehghani, Casper Kaae Sønderby, Tim Salimans

    Abstract: In this paper we analyse and improve integer discrete flows for lossless compression. Integer discrete flows are a recently proposed class of models that learn invertible transformations for integer-valued random variables. Their discrete nature makes them particularly suitable for lossless compression with entropy coding schemes. We start by investigating a recent theoretical claim that states th…

    Submitted 23 March, 2021; v1 submitted 22 June, 2020; originally announced June 2020.

    Comments: Accepted as a conference paper at the Ninth International Conference on Learning Representations (ICLR) 2021

  19. arXiv:1912.08638  [pdf, other]

    cs.LG stat.ML

    Incremental ELMVIS for unsupervised learning

    Authors: Anton Akusok, Emil Eirola, Yoan Miche, Ian Oliver, Kaj-Mikael Björk, Andrey Gritsenko, Stephen Baek, Amaury Lendasse

    Abstract: An incremental version of the ELMVIS+ method is proposed in this paper. It iteratively selects a few best fitting data samples from a large pool, and adds them to the model. The method keeps high speed of ELMVIS+ while allowing for much larger possible sample pools due to lower memory requirements. The extension is useful for reaching a better local optimum with greedy optimization of ELMVIS, and…

    Submitted 18 December, 2019; originally announced December 2019.

    Journal ref: Proceedings of ELM-2016 (pp. 183-193). Springer, Cham

  20. arXiv:1812.06869  [pdf, other]

    cs.LG cs.CV stat.ML

    BriarPatches: Pixel-Space Interventions for Inducing Demographic Parity

    Authors: Alexey A. Gritsenko, Alex D'Amour, James Atwood, Yoni Halpern, D. Sculley

    Abstract: We introduce the BriarPatch, a pixel-space intervention that obscures sensitive attributes from representations encoded in pre-trained classifiers. The patches encourage internal model representations not to encode sensitive information, which has the effect of pushing downstream predictors towards exhibiting demographic parity with respect to the sensitive information. The net result is that thes…

    Submitted 17 December, 2018; originally announced December 2018.

    Comments: 6 pages, 5 figures, NeurIPS Workshop on Ethical, Social and Governance Issues in AI

  21. arXiv:1807.00244  [pdf, other]

    eess.IV cs.LG eess.SP

    Automatic Identification of Twin Zygosity in Resting-State Functional MRI

    Authors: Andrey Gritsenko, Martin A. Lindquist, Gregory R. Kirk, Moo K. Chung

    Abstract: A key strength of twin studies arises from the fact that there are two types of twins, monozygotic and dizygotic, that share differing amounts of genetic information. Accurate differentiation of twin types allows efficient inference on genetic influences in a population. However, identification of zygosity is often prone to errors without genotyping. In this study, we propose a novel pairwise featu…

    Submitted 26 October, 2018; v1 submitted 30 June, 2018; originally announced July 2018.

  22. arXiv:1710.06368  [pdf, other]

    cs.GR cs.CV

    Embedded Spectral Descriptors: Learning the point-wise correspondence metric via Siamese neural networks

    Authors: Zhiyu Sun, Yusen He, Andrey Gritsenko, Amaury Lendasse, Stephen Baek

    Abstract: A robust and informative local shape descriptor plays an important role in mesh registration. In this regard, spectral descriptors that are based on the spectrum of the Laplace-Beltrami operator have been a popular subject of research for the last decade due to their advantageous properties, such as isometry invariance. Despite such, however, spectral descriptors often fail to give a correct simil…

    Submitted 18 October, 2019; v1 submitted 17 October, 2017; originally announced October 2017.

  23. arXiv:1310.1553  [pdf]

    cs.DC

    A Workflow-Forecast Approach To The Task Scheduling Problem In Distributed Computing Systems

    Authors: Andrey Gritsenko

    Abstract: The aim of this paper is to provide a description of a deep-learning-based scheduling approach for academic-purpose high-performance computing systems. The share of academic-purpose distributed computing systems (DCS) reaches 17.4 percent among TOP500 supercomputer sites (15.6 percent in performance scale), which makes them a valuable object of research. The core of this approach is to predict the…

    Submitted 6 October, 2013; originally announced October 2013.

    Comments: 7 pages, 5 tables, 7 figures

    Journal ref: International Journal of Advanced Studies in Computer Science and Engineering, Volume 2, Special Issue 2, pp. 1-7. September 2013