
Showing 1–14 of 14 results for author: Birodkar, V

  1. arXiv:2312.14125  [pdf, other]

    cs.CV cs.AI

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, et al. (6 additional authors not shown)

    Abstract: We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and tas…

    Submitted 4 June, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: To appear at ICML 2024; Project page: http://sites.research.google/videopoet/

  2. arXiv:2311.14822  [pdf, other]

    cs.CV

    Text and Click inputs for unambiguous open vocabulary instance segmentation

    Authors: Nikolai Warner, Meera Hahn, Jonathan Huang, Irfan Essa, Vighnesh Birodkar

    Abstract: Segmentation localizes objects in an image on a fine-grained per-pixel scale. Segmentation benefits by humans-in-the-loop to provide additional input of objects to segment using a combination of foreground or background clicks. Tasks include photoediting or novel dataset annotation, where human annotators leverage an existing segmentation model instead of drawing raw pixel level annotations. We pr…

    Submitted 24 November, 2023; originally announced November 2023.

    Comments: 20 pages, 9 figures, 8 tables

  3. arXiv:2310.05737  [pdf, other]

    cs.CV cs.AI cs.MM

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Authors: Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang

    Abstract: While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer…

    Submitted 29 March, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  4. arXiv:2302.05442  [pdf, other]

    cs.CV cs.AI cs.LG

    Scaling Vision Transformers to 22 Billion Parameters

    Authors: Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, et al. (17 additional authors not shown)

    Abstract: The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al…

    Submitted 10 February, 2023; originally announced February 2023.

  5. arXiv:2212.10596  [pdf, other]

    cs.CV

    Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

    Authors: Vivek Rathod, Bryan Seybold, Sudheendra Vijayanarasimhan, Austin Myers, Xiuye Gu, Vighnesh Birodkar, David A. Ross

    Abstract: Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple, yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings. Despite being trained on static images rather than videos, we show that image-text co-embeddings enable open-vocabulary performance competitive with fully-supervised mod…

    Submitted 10 January, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

  6. arXiv:2204.00484  [pdf, other]

    cs.CV cs.LG

    Proper Reuse of Image Classification Features Improves Object Detection

    Authors: Cristina Vasconcelos, Vighnesh Birodkar, Vincent Dumoulin

    Abstract: A common practice in transfer learning is to initialize the downstream model weights by pre-training on a data-abundant upstream task. In object detection specifically, the feature backbone is typically initialized with Imagenet classifier weights and fine-tuned on the object detection task. Recent works show this is not strictly necessary under longer training regimes and provide recipes for trai…

    Submitted 27 June, 2022; v1 submitted 1 April, 2022; originally announced April 2022.

    Journal ref: CVPR 2022

  7. arXiv:2111.12872  [pdf, other]

    cs.CV cs.CL

    Less is More: Generating Grounded Navigation Instructions from Landmarks

    Authors: Su Wang, Ceslee Montgomery, Jordi Orbay, Vighnesh Birodkar, Aleksandra Faust, Izzeddin Gur, Natasha Jaques, Austin Waters, Jason Baldridge, Peter Anderson

    Abstract: We study the automatic generation of navigation instructions from 360-degree images captured on indoor routes. Existing generators suffer from poor visual grounding, causing them to rely on language priors and hallucinate objects. Our MARKY-MT5 system addresses this by focusing on visual landmarks; it comprises a first stage landmark detector and a second stage generator -- a multimodal, multiling…

    Submitted 4 April, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

    Comments: CVPR 2022 Camera-ready

  8. arXiv:2105.03494  [pdf, other]

    cs.CV

    The iWildCam 2021 Competition Dataset

    Authors: Sara Beery, Arushi Agarwal, Elijah Cole, Vighnesh Birodkar

    Abstract: Camera traps enable the automatic collection of large quantities of image data. Ecologists use camera traps to monitor animal populations all over the world. In order to estimate the abundance of a species from camera trap data, ecologists need to know not just which species were seen, but also how many individuals of each species were seen. Object detection techniques can be used to find the numb…

    Submitted 7 May, 2021; originally announced May 2021.

    Comments: FGVC8 Workshop at CVPR 2021. arXiv admin note: substantial text overlap with arXiv:2004.10340

  9. arXiv:2104.00613  [pdf, other]

    cs.CV

    The surprising impact of mask-head architecture on novel class segmentation

    Authors: Vighnesh Birodkar, Zhichao Lu, Siyang Li, Vivek Rathod, Jonathan Huang

    Abstract: Instance segmentation models today are very accurate when trained on large annotated datasets, but collecting mask annotations at scale is prohibitively expensive. We address the partially supervised instance segmentation problem in which one can train on (significantly cheaper) bounding boxes for all categories but use masks only for a subset of categories. In this work, we focus on a popular fam…

    Submitted 17 August, 2021; v1 submitted 1 April, 2021; originally announced April 2021.

  10. arXiv:1906.03808  [pdf, other]

    cs.LG stat.ML

    A Closed-Form Learned Pooling for Deep Classification Networks

    Authors: Vighnesh Birodkar, Hossein Mobahi, Dilip Krishnan, Samy Bengio

    Abstract: In modern computer vision tasks, convolutional neural networks (CNNs) are indispensable for image classification tasks due to their efficiency and effectiveness. Part of their superiority compared to other architectures, comes from the fact that a single, local filter is shared across the entire image. However, there are scenarios where we may need to treat spatial locations in non-uniform manner.…

    Submitted 10 June, 2019; originally announced June 2019.

  11. arXiv:1903.00586  [pdf, other]

    cs.CV

    Straight to the point: reinforcement learning for user guidance in ultrasound

    Authors: Fausto Milletari, Vighnesh Birodkar, Michal Sofka

    Abstract: Point of care ultrasound (POCUS) consists in the use of ultrasound imaging in critical or emergency situations to support clinical decisions by healthcare professionals and first responders. In this setting it is essential to be able to provide means to obtain diagnostic data to potentially inexperienced users who did not receive an extensive medical training. Interpretation and acquisition of ult…

    Submitted 1 March, 2019; originally announced March 2019.

  12. arXiv:1901.11409  [pdf, other]

    cs.CV cs.LG stat.ML

    Semantic Redundancies in Image-Classification Datasets: The 10% You Don't Need

    Authors: Vighnesh Birodkar, Hossein Mobahi, Samy Bengio

    Abstract: Large datasets have been crucial to the success of deep learning models in the recent years, which keep performing better as they are trained with more labelled data. While there have been sustained efforts to make these models more data-efficient, the potential benefit of understanding the data itself, is largely untapped. Specifically, focusing on object recognition tasks, we wonder if for commo…

    Submitted 29 January, 2019; originally announced January 2019.

  13. arXiv:1705.10915  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Unsupervised Learning of Disentangled Representations from Video

    Authors: Remi Denton, Vighnesh Birodkar

    Abstract: We present a new model DrNET that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation that factorizes each frame into a stationary part and a temporally varying component. The disentangled representation can be used for a range of tasks. For example, applying a standard LSTM to the tim…

    Submitted 30 May, 2017; originally announced May 2017.

  14. arXiv:1609.05257  [pdf, other]

    cs.CV

    A convolutional approach to reflection symmetry

    Authors: Marcelo Cicconet, Vighnesh Birodkar, Mads Lund, Michael Werman, Davi Geiger

    Abstract: We present a convolutional approach to reflection symmetry detection in 2D. Our model, built on the products of complex-valued wavelet convolutions, simplifies previous edge-based pairwise methods. Being parameter-centered, as opposed to feature-centered, it has certain computational advantages when the object sizes are known a priori, as demonstrated in an ellipse detection application. The metho…

    Submitted 16 September, 2016; originally announced September 2016.

    Comments: This paper is under consideration at Pattern Recognition Letters