Showing 1–15 of 15 results for author: Cascante-Bonilla, P

  1. arXiv:2403.16921 [pdf, other]

    cs.CV

    PropTest: Automatic Property Testing for Improved Visual Programming

    Authors: Jaywon Koo, Ziyan Yang, Paola Cascante-Bonilla, Baishakhi Ray, Vicente Ordonez

    Abstract: Visual Programming has emerged as an alternative to end-to-end black-box visual reasoning models. This type of methods leverage Large Language Models (LLMs) to decompose a problem and generate the source code for an executable computer program. This strategy has the advantage of offering an interpretable reasoning path and does not require finetuning a model with task-specific data. We propose Pro…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: Project Page: https://jaywonkoo17.github.io/PropTest/

  2. arXiv:2403.13804 [pdf, other]

    cs.CV cs.CL cs.LG

    Learning from Models and Data for Visual Grounding

    Authors: Ruozhen He, Paola Cascante-Bonilla, Ziyan Yang, Alexander C. Berg, Vicente Ordonez

    Abstract: We introduce SynGround, a novel framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models to enhance the visual grounding capabilities of a pretrained vision-and-language model. The knowledge transfer from the models initiates the generation of image descriptions through an image description generator. These descriptions serve dual purposes: the…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Project Page: https://catherine-r-he.github.io/SynGround/

  3. arXiv:2402.18695 [pdf, other]

    cs.CV cs.CL

    Grounding Language Models for Visual Entity Recognition

    Authors: Zilin Xiao, Ming Gong, Paola Cascante-Bonilla, Xingyao Zhang, Jie Wu, Vicente Ordonez

    Abstract: We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive Multi-modal Large Language Model by employing retrieval augmented constrained generation. It mitigates low performance on out-of-domain entities while excelling in queries that require visually-situated reasoning. Our method learns to distinguish similar entities within a vast label spa…

    Submitted 28 February, 2024; originally announced February 2024.

  4. arXiv:2312.04554 [pdf, other]

    cs.CV cs.CL cs.LG

    Improved Visual Grounding through Self-Consistent Explanations

    Authors: Ruozhen He, Paola Cascante-Bonilla, Ziyan Yang, Alexander C. Berg, Vicente Ordonez

    Abstract: Vision-and-language models trained to match images with text can be combined with visual explanation methods to point to the locations of specific objects in an image. Our work shows that the localization --"grounding"-- abilities of these models can be further improved by finetuning for self-consistent visual explanations. We propose a strategy for augmenting existing text-image datasets with par…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: Project Page: https://catherine-r-he.github.io/SelfEQ/

  5. arXiv:2305.19595 [pdf, other]

    cs.CV

    Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models

    Authors: Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-Bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky

    Abstract: Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the aligned image-text spaces learned by all the popular VL models are still suffering from the so-called `object bias' - their representations behave as `bags of no…

    Submitted 1 June, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

  6. arXiv:2303.17590 [pdf, other]

    cs.CV cs.CL

    Going Beyond Nouns With Vision & Language Models Using Synthetic Data

    Authors: Paola Cascante-Bonilla, Khaled Shehada, James Seale Smith, Sivan Doveh, Donghyun Kim, Rameswar Panda, Gül Varol, Aude Oliva, Vicente Ordonez, Rogerio Feris, Leonid Karlinsky

    Abstract: Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling replacing a fixed set of supported classes with zero-shot open vocabulary reasoning over (almost arbitrary) natural language prompts. However, recent works have uncovered a fundamental weakness of these models. For example, their difficulty to understand Visual Language Concepts (…

    Submitted 30 August, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

    Comments: Accepted to ICCV 2023. Project page: https://synthetic-vic.github.io/

  7. arXiv:2211.13218 [pdf, other]

    cs.CV cs.AI cs.LG

    CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning

    Authors: James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, Zsolt Kira

    Abstract: Computer vision models suffer from a phenomenon known as catastrophic forgetting when learning novel concepts from continuously shifting training data. Typical solutions for this continual learning problem require extensive rehearsal of previously seen data, which increases memory costs and may violate data privacy. Recently, the emergence of large-scale pre-trained vision transformer models has e…

    Submitted 30 March, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

    Comments: Accepted by the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023)

  8. arXiv:2211.12494 [pdf, other]

    cs.CV cs.LG

    On the Transferability of Visual Features in Generalized Zero-Shot Learning

    Authors: Paola Cascante-Bonilla, Leonid Karlinsky, James Seale Smith, Yanjun Qi, Vicente Ordonez

    Abstract: Generalized Zero-Shot Learning (GZSL) aims to train a classifier that can generalize to unseen classes, using a set of attributes as auxiliary information, and the visual features extracted from a pre-trained convolutional neural network. While recent GZSL methods have explored various techniques to leverage the capacity of these features, there has been an extensive growth of representation learn…

    Submitted 22 November, 2022; originally announced November 2022.

  9. arXiv:2211.09790 [pdf, other]

    cs.LG cs.AI cs.CV

    ConStruct-VL: Data-Free Continual Structured VL Concepts Learning

    Authors: James Seale Smith, Paola Cascante-Bonilla, Assaf Arbelle, Donghyun Kim, Rameswar Panda, David Cox, Diyi Yang, Zsolt Kira, Rogerio Feris, Leonid Karlinsky

    Abstract: Recently, large-scale pre-trained Vision-and-Language (VL) foundation models have demonstrated remarkable capabilities in many zero-shot downstream tasks, achieving competitive results for recognizing objects defined by as little as short text prompts. However, it has also been shown that VL models are still brittle in Structured VL Concept (SVLC) reasoning, such as the ability to recognize object…

    Submitted 30 March, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

    Comments: Accepted by the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023)

  10. arXiv:2203.17219 [pdf, other]

    cs.CV

    SimVQA: Exploring Simulated Environments for Visual Question Answering

    Authors: Paola Cascante-Bonilla, Hui Wu, Letao Wang, Rogerio Feris, Vicente Ordonez

    Abstract: Existing work on VQA explores data augmentation to achieve better generalization by perturbing the images in the dataset or modifying the existing questions and answers. While these methods exhibit good performance, the diversity of the questions and answers are constrained by the available image set. In this work we explore using synthetic computer-generated data to fully control the visual and l…

    Submitted 31 March, 2022; originally announced March 2022.

    Comments: Accepted to CVPR 2022. Camera-Ready version. Project page: https://simvqa.github.io/

  11. arXiv:2106.09011 [pdf, other]

    cs.CV cs.LG cs.NE

    Evolving Image Compositions for Feature Representation Learning

    Authors: Paola Cascante-Bonilla, Arshdeep Sekhon, Yanjun Qi, Vicente Ordonez

    Abstract: Convolutional neural networks for visual recognition require large amounts of training samples and usually benefit from data augmentation. This paper proposes PatchMix, a data augmentation method that creates new samples by composing patches from pairs of images in a grid-like pattern. These new samples are assigned label scores that are proportional to the number of patches borrowed from each ima…

    Submitted 31 March, 2022; v1 submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted to BMVC 2021. Camera-Ready version. Project page: https://paolacascante.com/patchmix/index.html

  12. arXiv:2001.06001 [pdf, other]

    cs.LG cs.CV stat.ML

    Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning

    Authors: Paola Cascante-Bonilla, Fuwen Tan, Yanjun Qi, Vicente Ordonez

    Abstract: In this paper we revisit the idea of pseudo-labeling in the context of semi-supervised learning where a learning algorithm has access to a small set of labeled samples and a large set of unlabeled samples. Pseudo-labeling works by applying pseudo-labels to samples in the unlabeled set by using a model trained on the combination of the labeled samples and any previously pseudo-labeled samples, and…

    Submitted 10 December, 2020; v1 submitted 15 January, 2020; originally announced January 2020.

    Comments: In the 35th AAAI Conference on Artificial Intelligence. AAAI 2021

  13. arXiv:1911.03826 [pdf, other]

    cs.CV

    Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries

    Authors: Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, Vicente Ordonez

    Abstract: This paper explores the task of interactive image retrieval using natural language queries, where a user progressively provides input queries to refine a set of retrieval results. Moreover, our work explores this problem in the context of complex image scenes containing multiple objects. We propose Drill-down, an effective framework for encoding multiple queries with an efficient compact state rep…

    Submitted 9 November, 2019; originally announced November 2019.

    Comments: 14 pages, 9 figures, NeurIPS 2019

  14. arXiv:1908.03180 [pdf, other]

    cs.CV

    Moviescope: Large-scale Analysis of Movies using Multiple Modalities

    Authors: Paola Cascante-Bonilla, Kalpathy Sitaraman, Mengjia Luo, Vicente Ordonez

    Abstract: Film media is a rich form of artistic expression. Unlike photography, and short videos, movies contain a storyline that is deliberately complex and intricate in order to engage its audience. In this paper we present a large scale study comparing the effectiveness of visual, audio, text, and metadata-based features for predicting high-level information about movies such as their genre or estimated…

    Submitted 8 August, 2019; originally announced August 2019.

  15. arXiv:1812.04081 [pdf, other]

    cs.CL cs.HC

    Chat-crowd: A Dialog-based Platform for Visual Layout Composition

    Authors: Paola Cascante-Bonilla, Xuwang Yin, Vicente Ordonez, Song Feng

    Abstract: In this paper we introduce Chat-crowd, an interactive environment for visual layout composition via conversational interactions. Chat-crowd supports multiple agents with two conversational roles: agents who play the role of a designer are in charge of placing objects in an editable canvas according to instructions or commands issued by agents with a director role. The system can be integrated with…

    Submitted 1 April, 2019; v1 submitted 10 December, 2018; originally announced December 2018.