Showing 1–44 of 44 results for author: Ordonez, V

  1. arXiv:2406.19388  [pdf, other]

    cs.SD cs.CL cs.CV cs.MM eess.AS

    Taming Data and Transformers for Audio Generation

    Authors: Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, Sergey Tulyakov, Vicente Ordonez

    Abstract: Generating ambient sounds and effects is a challenging problem due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle the problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. We show that by leveraging metadata available…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: Project Webpage: https://snap-research.github.io/GenAU/

  2. arXiv:2406.11262  [pdf, other]

    cs.CV

    Generative Visual Instruction Tuning

    Authors: Jefferson Hernandez, Ruben Villegas, Vicente Ordonez

    Abstract: We propose to use machine-generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new multimodal instruction-following set using GPT-4V and existing datasets for image generation and editing. Using this instruction set and the existing LLaVA-Finetune instruct…

    Submitted 17 June, 2024; originally announced June 2024.

  3. arXiv:2405.04834  [pdf, other]

    cs.CV

    FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation

    Authors: Xuehai He, Jian Zheng, Jacob Zhiyuan Fang, Robinson Piramuthu, Mohit Bansal, Vicente Ordonez, Gunnar A Sigurdsson, Nanyun Peng, Xin Eric Wang

    Abstract: Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexibl…

    Submitted 21 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

  4. arXiv:2403.16921  [pdf, other]

    cs.CV

    PropTest: Automatic Property Testing for Improved Visual Programming

    Authors: Jaywon Koo, Ziyan Yang, Paola Cascante-Bonilla, Baishakhi Ray, Vicente Ordonez

    Abstract: Visual Programming has emerged as an alternative to end-to-end black-box visual reasoning models. This type of method leverages Large Language Models (LLMs) to decompose a problem and generate the source code for an executable computer program. This strategy has the advantage of offering an interpretable reasoning path and does not require finetuning a model with task-specific data. We propose Pro…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: Project Page: https://jaywonkoo17.github.io/PropTest/

  5. arXiv:2403.13804  [pdf, other]

    cs.CV cs.CL cs.LG

    Learning from Models and Data for Visual Grounding

    Authors: Ruozhen He, Paola Cascante-Bonilla, Ziyan Yang, Alexander C. Berg, Vicente Ordonez

    Abstract: We introduce SynGround, a novel framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models to enhance the visual grounding capabilities of a pretrained vision-and-language model. The knowledge transfer from the models initiates the generation of image descriptions through an image description generator. These descriptions serve dual purposes: the…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Project Page: https://catherine-r-he.github.io/SynGround/

  6. arXiv:2402.18695  [pdf, other]

    cs.CV cs.CL

    Grounding Language Models for Visual Entity Recognition

    Authors: Zilin Xiao, Ming Gong, Paola Cascante-Bonilla, Xingyao Zhang, Jie Wu, Vicente Ordonez

    Abstract: We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive Multi-modal Large Language Model by employing retrieval augmented constrained generation. It mitigates low performance on out-of-domain entities while excelling in queries that require visually-situated reasoning. Our method learns to distinguish similar entities within a vast label spa…

    Submitted 28 February, 2024; originally announced February 2024.

  7. arXiv:2312.04554  [pdf, other]

    cs.CV cs.CL cs.LG

    Improved Visual Grounding through Self-Consistent Explanations

    Authors: Ruozhen He, Paola Cascante-Bonilla, Ziyan Yang, Alexander C. Berg, Vicente Ordonez

    Abstract: Vision-and-language models trained to match images with text can be combined with visual explanation methods to point to the locations of specific objects in an image. Our work shows that the localization --"grounding"-- abilities of these models can be further improved by finetuning for self-consistent visual explanations. We propose a strategy for augmenting existing text-image datasets with par…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: Project Page: https://catherine-r-he.github.io/SelfEQ/

  8. arXiv:2311.18822  [pdf, other]

    cs.CV

    ElasticDiffusion: Training-free Arbitrary Size Image Generation through Global-Local Content Separation

    Authors: Moayed Haji-Ali, Guha Balakrishnan, Vicente Ordonez

    Abstract: Diffusion models have revolutionized image generation in recent years, yet they are still limited to a few sizes and aspect ratios. We propose ElasticDiffusion, a novel training-free decoding method that enables pretrained text-to-image diffusion models to generate images with various sizes. ElasticDiffusion attempts to decouple the generation trajectory of a pretrained model into local and global…

    Submitted 31 March, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: Accepted at CVPR 2024. Project Page: https://elasticdiffusion.github.io/

  9. arXiv:2311.16311  [pdf, other]

    cs.CV

    Characterizing Video Question Answering with Sparsified Inputs

    Authors: Shiyuan Huang, Robinson Piramuthu, Vicente Ordonez, Shih-Fu Chang, Gunnar A. Sigurdsson

    Abstract: In Video Question Answering, videos are often processed as a full-length sequence of frames to ensure minimal loss of information. Recent works have demonstrated evidence that sparse video inputs are sufficient to maintain high performance. However, they usually discuss the case of single frame selection. In our work, we extend the setting to multiple number of inputs and other modalities. We char…

    Submitted 27 November, 2023; originally announced November 2023.

  10. arXiv:2308.12910  [pdf, other]

    cs.CV

    SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data

    Authors: Ziyan Yang, Kushal Kafle, Zhe Lin, Scott Cohen, Zhihong Ding, Vicente Ordonez

    Abstract: We propose Subject-Conditional Relation Detection (SCoRD), where conditioned on an input subject, the goal is to predict all its relations to other objects in a scene along with their locations. Based on the Open Images dataset, we propose a challenging OIv6-SCoRD benchmark such that the training and testing splits have a distribution shift in terms of the occurrence statistics of $\langle$subject,…

    Submitted 4 January, 2024; v1 submitted 24 August, 2023; originally announced August 2023.

    Comments: WACV 2024

  11. arXiv:2303.17590  [pdf, other]

    cs.CV cs.CL

    Going Beyond Nouns With Vision & Language Models Using Synthetic Data

    Authors: Paola Cascante-Bonilla, Khaled Shehada, James Seale Smith, Sivan Doveh, Donghyun Kim, Rameswar Panda, Gül Varol, Aude Oliva, Vicente Ordonez, Rogerio Feris, Leonid Karlinsky

    Abstract: Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling replacing a fixed set of supported classes with zero-shot open vocabulary reasoning over (almost arbitrary) natural language prompts. However, recent works have uncovered a fundamental weakness of these models. For example, their difficulty to understand Visual Language Concepts (…

    Submitted 30 August, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

    Comments: Accepted to ICCV 2023. Project page: https://synthetic-vic.github.io/

  12. arXiv:2303.12001  [pdf, other]

    cs.CV

    ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders

    Authors: Jefferson Hernandez, Ruben Villegas, Vicente Ordonez

    Abstract: We propose ViC-MAE, a model that combines both Masked AutoEncoders (MAE) and contrastive learning. ViC-MAE is trained using a global feature obtained by pooling the local representations learned under an MAE reconstruction loss and leveraging this representation under a contrastive objective across images and video frames. We show that visual representations learned under ViC-MAE generalize well…

    Submitted 30 November, 2023; v1 submitted 21 March, 2023; originally announced March 2023.

    Comments: More results on Video and Image datasets, ViC-MAE now supports training on videos and images

  13. arXiv:2303.07615  [pdf, other]

    cs.CV

    Variation of Gender Biases in Visual Recognition Models Before and After Finetuning

    Authors: Jaspreet Ranjit, Tianlu Wang, Baishakhi Ray, Vicente Ordonez

    Abstract: We introduce a framework to measure how biases change before and after fine-tuning a large scale visual recognition model for a downstream task. Deep learning models trained on increasing amounts of data are known to encode societal biases. Many computer vision systems today rely on models typically pretrained on large scale datasets. While bias mitigation techniques have been developed for tuning…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: 10 pages, 3 Figures

  14. arXiv:2211.12494  [pdf, other]

    cs.CV cs.LG

    On the Transferability of Visual Features in Generalized Zero-Shot Learning

    Authors: Paola Cascante-Bonilla, Leonid Karlinsky, James Seale Smith, Yanjun Qi, Vicente Ordonez

    Abstract: Generalized Zero-Shot Learning (GZSL) aims to train a classifier that can generalize to unseen classes, using a set of attributes as auxiliary information, and the visual features extracted from a pre-trained convolutional neural network. While recent GZSL methods have explored various techniques to leverage the capacity of these features, there has been an extensive growth of representation learn…

    Submitted 22 November, 2022; originally announced November 2022.

  15. arXiv:2206.15462  [pdf, other]

    cs.CV cs.CL cs.LG

    Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations

    Authors: Ziyan Yang, Kushal Kafle, Franck Dernoncourt, Vicente Ordonez

    Abstract: We propose a margin-based loss for tuning joint vision-language models so that their gradient-based explanations are consistent with region-level annotations provided by humans for relatively smaller grounding datasets. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces superior visual grounding results than previous methods that rely on using vision-la…

    Submitted 6 January, 2024; v1 submitted 30 June, 2022; originally announced June 2022.

    Comments: CVPR 2023. Fix ReferIt results. Code: https://github.com/uvavision/AMC-grounding Project Webpage: https://vislang.ai/amc

  16. arXiv:2205.09830  [pdf, ps, other]

    cs.CL

    Towards Understanding Gender-Seniority Compound Bias in Natural Language Generation

    Authors: Samhita Honnavalli, Aesha Parekh, Lily Ou, Sophie Groenwold, Sharon Levy, Vicente Ordonez, William Yang Wang

    Abstract: Women are often perceived as junior to their male counterparts, even within the same job titles. While there has been significant progress in the evaluation of gender bias in natural language processing (NLP), existing studies seldom investigate how biases toward gender groups change when compounded with other societal biases. In this work, we investigate how seniority impacts the degree of gender…

    Submitted 19 May, 2022; originally announced May 2022.

    Comments: 6 pages, LREC 2022

  17. arXiv:2203.17219  [pdf, other]

    cs.CV

    SimVQA: Exploring Simulated Environments for Visual Question Answering

    Authors: Paola Cascante-Bonilla, Hui Wu, Letao Wang, Rogerio Feris, Vicente Ordonez

    Abstract: Existing work on VQA explores data augmentation to achieve better generalization by perturbing the images in the dataset or modifying the existing questions and answers. While these methods exhibit good performance, the diversity of the questions and answers is constrained by the available image set. In this work we explore using synthetic computer-generated data to fully control the visual and l…

    Submitted 31 March, 2022; originally announced March 2022.

    Comments: Accepted to CVPR 2022. Camera-Ready version. Project page: https://simvqa.github.io/

  18. arXiv:2203.13612  [pdf, other]

    cs.LG cs.AI cs.CV cs.SE

    Repairing Group-Level Errors for DNNs Using Weighted Regularization

    Authors: Ziyuan Zhong, Yuchi Tian, Conor J. Sweeney, Vicente Ordonez, Baishakhi Ray

    Abstract: Deep Neural Networks (DNNs) have been widely used in software making decisions impacting people's lives. However, they have been found to exhibit severe erroneous behaviors that may lead to unfortunate outcomes. Previous work shows that such misbehaviors often occur due to class property violations rather than errors on a single image. Although methods for detecting such errors have been proposed,…

    Submitted 4 April, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

  19. arXiv:2112.07133  [pdf, other]

    cs.CV

    CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision

    Authors: Aman Shrivastava, Ramprasaath R. Selvaraju, Nikhil Naik, Vicente Ordonez

    Abstract: We propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information…

    Submitted 11 May, 2023; v1 submitted 13 December, 2021; originally announced December 2021.

  20. arXiv:2110.15946  [pdf, other]

    cs.CV cs.IT

    Estimating and Maximizing Mutual Information for Knowledge Distillation

    Authors: Aman Shrivastava, Yanjun Qi, Vicente Ordonez

    Abstract: In this work, we propose Mutual Information Maximization Knowledge Distillation (MIMKD). Our method uses a contrastive objective to simultaneously estimate and maximize a lower bound on the mutual information of local and global feature representations between a teacher and a student network. We demonstrate through extensive experiments that this can be used to improve the performance of low capac…

    Submitted 11 May, 2023; v1 submitted 29 October, 2021; originally announced October 2021.

  21. arXiv:2106.09011  [pdf, other]

    cs.CV cs.LG cs.NE

    Evolving Image Compositions for Feature Representation Learning

    Authors: Paola Cascante-Bonilla, Arshdeep Sekhon, Yanjun Qi, Vicente Ordonez

    Abstract: Convolutional neural networks for visual recognition require large amounts of training samples and usually benefit from data augmentation. This paper proposes PatchMix, a data augmentation method that creates new samples by composing patches from pairs of images in a grid-like pattern. These new samples are assigned label scores that are proportional to the number of patches borrowed from each ima…

    Submitted 31 March, 2022; v1 submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted to BMVC 2021. Camera-Ready version. Project page: https://paolacascante.com/patchmix/index.html
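
The grid-based composition described in the PatchMix abstract above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `patchmix`, the plain nested-list image format, and the independent 50/50 choice per grid cell are all assumptions made for the example (the title suggests the paper evolves the compositions rather than sampling them uniformly).

```python
import random


def patchmix(img_a, img_b, label_a, label_b, grid=2, rng=None):
    """Compose a new sample from two equally sized images on a grid x grid layout.

    Images are lists of rows (H x W); labels are one-hot lists.
    Returns the mixed image and a soft label proportional to the
    number of grid cells taken from each source image.
    """
    rng = rng or random.Random()
    h, w = len(img_a), len(img_a[0])
    if h % grid or w % grid:
        raise ValueError("image size must be divisible by the grid size")
    ph, pw = h // grid, w // grid  # patch height and width

    # Assumption: each grid cell is taken from img_b with probability 0.5.
    from_b = [[rng.random() < 0.5 for _ in range(grid)] for _ in range(grid)]

    # Copy every pixel from whichever image owns the grid cell it falls in.
    mixed = [
        [(img_b if from_b[r // ph][c // pw] else img_a)[r][c] for c in range(w)]
        for r in range(h)
    ]

    # Label scores are proportional to the number of patches from each image.
    frac_b = sum(map(sum, from_b)) / (grid * grid)
    label = [(1 - frac_b) * a + frac_b * b for a, b in zip(label_a, label_b)]
    return mixed, label
```

Passing a seeded `random.Random` as `rng` makes the composition reproducible, which is convenient when checking that the soft label matches the share of pixels taken from each image.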

  22. arXiv:2103.12236  [pdf, other]

    cs.CV

    Instance-level Image Retrieval using Reranking Transformers

    Authors: Fuwen Tan, Jiangbo Yuan, Vicente Ordonez

    Abstract: Instance-level image retrieval is the task of searching in a large database for images that match an object in a query image. To address this task, systems usually rely on a retrieval step that uses global image descriptors, and a subsequent step that performs domain-specific refinements or reranking by leveraging operations such as geometric verification based on local features. In this work, we…

    Submitted 4 June, 2022; v1 submitted 22 March, 2021; originally announced March 2021.

    Comments: ICCV 2021, Table-3 corrected

  23. arXiv:2012.01250  [pdf, other]

    cs.CV cs.LG

    Chair Segments: A Compact Benchmark for the Study of Object Segmentation

    Authors: Leticia Pinto-Alva, Ian K. Torres, Rosangel Garcia, Ziyan Yang, Vicente Ordonez

    Abstract: Over the years, datasets and benchmarks have had an outsized influence on the design of novel algorithms. In this paper, we introduce ChairSegments, a novel and compact semi-synthetic dataset for object segmentation. We also show empirical findings in transfer learning that mirror recent findings for image classification. We particularly show that models that are fine-tuned from a pretrained set o…

    Submitted 2 December, 2020; originally announced December 2020.

    Comments: 10 pages, 7 figures

  24. arXiv:2011.14027  [pdf, other]

    cs.CV cs.AI cs.LG

    General Multi-label Image Classification with Transformers

    Authors: Jack Lanchantin, Tianlu Wang, Vicente Ordonez, Yanjun Qi

    Abstract: Multi-label image classification is the task of predicting a set of labels corresponding to objects, attributes or other entities present in an image. In this work we propose the Classification Transformer (C-Tran), a general framework for multi-label image classification that leverages Transformers to exploit the complex dependencies among visual features and labels. Our approach consists of a Tr…

    Submitted 27 November, 2020; originally announced November 2020.

    Comments: 13 pages, 7 figures

  25. arXiv:2010.03743  [pdf, other]

    cs.CV

    Visual News: Benchmark and Challenges in News Image Captioning

    Authors: Fuxiao Liu, Yinghan Wang, Tianlu Wang, Vicente Ordonez

    Abstract: We propose Visual News Captioner, an entity-aware model for the task of news image captioning. We also introduce Visual News, a large-scale benchmark consisting of more than one million news images along with associated news articles, image captions, author information, and other metadata. Unlike the standard image captioning task, news images depict situations where people, locations, and events…

    Submitted 13 September, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

    Comments: 9 pages, 5 figures, accepted to EMNLP2021

  26. arXiv:2006.03204  [pdf, other]

    cs.CV cs.AI cs.LG

    Black-box Explanation of Object Detectors via Saliency Maps

    Authors: Vitali Petsiuk, Rajiv Jain, Varun Manjunatha, Vlad I. Morariu, Ashutosh Mehra, Vicente Ordonez, Kate Saenko

    Abstract: We propose D-RISE, a method for generating visual explanations for the predictions of object detectors. Utilizing the proposed similarity metric that accounts for both localization and categorization aspects of object detection allows our method to produce saliency maps that show image areas that most affect the prediction. D-RISE can be considered "black-box" in the software testing sense, as it…

    Submitted 10 June, 2021; v1 submitted 4 June, 2020; originally announced June 2020.

    Comments: CVPR 2021 (oral). Project page https://cs-people.bu.edu/vpetsiuk/drise/

  27. arXiv:2005.00965  [pdf, other]

    cs.CL cs.LG

    Double-Hard Debias: Tailoring Word Embeddings for Gender Bias Mitigation

    Authors: Tianlu Wang, Xi Victoria Lin, Nazneen Fatema Rajani, Bryan McCann, Vicente Ordonez, Caiming Xiong

    Abstract: Word embeddings derived from human-generated corpora inherit strong gender bias which can be further amplified by downstream models. Some commonly adopted debiasing approaches, including the seminal Hard Debias algorithm, apply post-processing procedures that project pre-trained word embeddings into a subspace orthogonal to an inferred gender subspace. We discover that semantic-agnostic corpus reg…

    Submitted 2 May, 2020; originally announced May 2020.

    Comments: Accepted to ACL 2020
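
The post-processing step mentioned in the abstract above, projecting an embedding into the subspace orthogonal to a gender subspace, can be illustrated with a short sketch. It assumes the simplest possible case: a one-dimensional gender direction taken as the difference of a single hypothetical word pair, with made-up three-dimensional toy vectors. It shows only the orthogonal projection, not Hard Debias in full and not the Double-Hard Debias method the paper proposes.

```python
def project_out(vec, direction):
    """Remove the component of vec that lies along direction."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        raise ValueError("direction must be non-zero")
    # Scalar projection coefficient: (vec . direction) / (direction . direction)
    scale = sum(v * d for v, d in zip(vec, direction)) / norm_sq
    return [v - scale * d for v, d in zip(vec, direction)]


# Hypothetical toy embeddings, for illustration only.
he = [1.0, 0.2, 0.0]
she = [-1.0, 0.2, 0.0]

# Assumption: the gender direction is the difference of one definitional pair.
gender_direction = [a - b for a, b in zip(he, she)]

word = [0.6, 0.5, 0.3]
debiased = project_out(word, gender_direction)

# After projection the vector has no component along the gender direction.
residual = sum(v * d for v, d in zip(debiased, gender_direction))
```

In this toy setup `residual` is zero up to floating-point error, which is the defining property of the orthogonal projection.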

  28. arXiv:2001.06001  [pdf, other]

    cs.LG cs.CV stat.ML

    Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning

    Authors: Paola Cascante-Bonilla, Fuwen Tan, Yanjun Qi, Vicente Ordonez

    Abstract: In this paper we revisit the idea of pseudo-labeling in the context of semi-supervised learning where a learning algorithm has access to a small set of labeled samples and a large set of unlabeled samples. Pseudo-labeling works by applying pseudo-labels to samples in the unlabeled set by using a model trained on the combination of the labeled samples and any previously pseudo-labeled samples, and…

    Submitted 10 December, 2020; v1 submitted 15 January, 2020; originally announced January 2020.

    Comments: In the 35th AAAI Conference on Artificial Intelligence. AAAI 2021

  29. arXiv:1912.07773  [pdf, other]

    cs.CV cs.HC cs.LG cs.RO

    MEDIRL: Predicting the Visual Attention of Drivers via Maximum Entropy Deep Inverse Reinforcement Learning

    Authors: Sonia Baee, Erfan Pakdamanian, Inki Kim, Lu Feng, Vicente Ordonez, Laura Barnes

    Abstract: Inspired by human visual attention, we propose a novel inverse reinforcement learning formulation using Maximum Entropy Deep Inverse Reinforcement Learning (MEDIRL) for predicting the visual attention of drivers in accident-prone situations. MEDIRL predicts fixation locations that lead to maximal rewards by learning a task-sensitive reward function from eye fixation patterns recorded from attentiv…

    Submitted 6 October, 2021; v1 submitted 16 December, 2019; originally announced December 2019.

    Comments: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

  30. arXiv:1911.03826  [pdf, other]

    cs.CV

    Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries

    Authors: Fuwen Tan, Paola Cascante-Bonilla, Xiaoxiao Guo, Hui Wu, Song Feng, Vicente Ordonez

    Abstract: This paper explores the task of interactive image retrieval using natural language queries, where a user progressively provides input queries to refine a set of retrieval results. Moreover, our work explores this problem in the context of complex image scenes containing multiple objects. We propose Drill-down, an effective framework for encoding multiple queries with an efficient compact state rep…

    Submitted 9 November, 2019; originally announced November 2019.

    Comments: 14 pages, 9 figures, NeurIPS 2019

  31. arXiv:1908.03180  [pdf, other]

    cs.CV

    Moviescope: Large-scale Analysis of Movies using Multiple Modalities

    Authors: Paola Cascante-Bonilla, Kalpathy Sitaraman, Mengjia Luo, Vicente Ordonez

    Abstract: Film media is a rich form of artistic expression. Unlike photography, and short videos, movies contain a storyline that is deliberately complex and intricate in order to engage its audience. In this paper we present a large scale study comparing the effectiveness of visual, audio, text, and metadata-based features for predicting high-level information about movies such as their genre or estimated…

    Submitted 8 August, 2019; originally announced August 2019.

  32. arXiv:1905.07831  [pdf, other]

    cs.SE cs.CV cs.LG

    Testing DNN Image Classifiers for Confusion & Bias Errors

    Authors: Yuchi Tian, Ziyuan Zhong, Vicente Ordonez, Gail Kaiser, Baishakhi Ray

    Abstract: Image classifiers are an important component of today's software, from consumer and business applications to safety-critical domains. The advent of Deep Neural Networks (DNNs) is the key catalyst behind such wide-spread success. However, wide adoption comes with serious concerns about the robustness of software systems dependent on DNNs for image classification, as several severe erroneous behavio…

    Submitted 11 February, 2020; v1 submitted 19 May, 2019; originally announced May 2019.

  33. arXiv:1904.03310  [pdf, other]

    cs.CL

    Gender Bias in Contextualized Word Embeddings

    Authors: Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cotterell, Vicente Ordonez, Kai-Wei Chang

    Abstract: In this paper, we quantify, analyze and mitigate gender bias exhibited in ELMo's contextualized word vectors. First, we conduct several intrinsic analyses and find that (1) training data for ELMo contains significantly more male than female entities, (2) the trained ELMo embeddings systematically encode gender information and (3) ELMo unequally encodes gender information about male and female enti…

    Submitted 5 April, 2019; originally announced April 2019.

  34. arXiv:1812.04081  [pdf, other]

    cs.CL cs.HC

    Chat-crowd: A Dialog-based Platform for Visual Layout Composition

    Authors: Paola Cascante-Bonilla, Xuwang Yin, Vicente Ordonez, Song Feng

    Abstract: In this paper we introduce Chat-crowd, an interactive environment for visual layout composition via conversational interactions. Chat-crowd supports multiple agents with two conversational roles: agents who play the role of a designer are in charge of placing objects in an editable canvas according to instructions or commands issued by agents with a director role. The system can be integrated with…

    Submitted 1 April, 2019; v1 submitted 10 December, 2018; originally announced December 2018.

  35. arXiv:1811.08489  [pdf, other]

    cs.CV

    Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in Deep Image Representations

    Authors: Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, Vicente Ordonez

    Abstract: In this work, we present a framework to measure and mitigate intrinsic biases with respect to protected variables --such as gender-- in visual recognition tasks. We show that trained models significantly amplify the association of target labels with gender beyond what one would expect from biased datasets. Surprisingly, we show that even when datasets are balanced such that each label co-occurs eq…

    Submitted 10 October, 2019; v1 submitted 20 November, 2018; originally announced November 2018.

    Comments: 10 pages, 7 figures, ICCV 2019

  36. arXiv:1809.01110  [pdf, other]

    cs.CV cs.CL

    Text2Scene: Generating Compositional Scenes from Textual Descriptions

    Authors: Fuwen Tan, Song Feng, Vicente Ordonez

    Abstract: In this paper, we propose Text2Scene, a model that generates various forms of compositional scene representations from natural language descriptions. Unlike recent works, our method does NOT use Generative Adversarial Networks (GANs). Text2Scene instead learns to sequentially generate objects and their attributes (location, size, appearance, etc.) at every time step by attending to different parts…

    Submitted 9 June, 2019; v1 submitted 4 September, 2018; originally announced September 2018.

    Comments: CVPR 2019

  37. arXiv:1805.08587  [pdf, other]

    cs.IR cs.CV

    Deep Feature Aggregation and Image Re-ranking with Heat Diffusion for Image Retrieval

    Authors: Shanmin Pang, ** Ma, Jianru Xue, Jihua Zhu, Vicente Ordonez

    Abstract: Image retrieval based on deep convolutional features has demonstrated state-of-the-art performance in popular benchmarks. In this paper, we present a unified solution to address deep convolutional feature aggregation and image re-ranking by simulating the dynamics of heat diffusion. A distinctive problem in image retrieval is that repetitive or bursty features tend to dominate final image r…

    Submitted 8 October, 2018; v1 submitted 22 May, 2018; originally announced May 2018.

    Comments: The paper has been accepted to IEEE Transactions on Multimedia

  38. arXiv:1804.06876  [pdf, other]

    cs.CL cs.AI

    Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods

    Authors: Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, Kai-Wei Chang

    Abstract: We introduce a new benchmark, WinoBias, for coreference resolution focused on gender bias. Our corpus contains Winograd-schema style sentences with entities corresponding to people referred by their occupation (e.g. the nurse, the doctor, the carpenter). We demonstrate that a rule-based, a feature-rich, and a neural coreference system all link gendered pronouns to pro-stereotypical entities with h…

    Submitted 18 April, 2018; originally announced April 2018.

    Comments: NAACL '18 Camera Ready

  39. arXiv:1710.08049  [pdf, other]

    cs.CV

    Feedback-prop: Convolutional Neural Network Inference under Partial Evidence

    Authors: Tianlu Wang, Kota Yamaguchi, Vicente Ordonez

    Abstract: We propose an inference procedure for deep convolutional neural networks (CNNs) when partial evidence is available. Our method consists of a general feedback-based propagation approach (feedback-prop) that boosts the prediction accuracy for an arbitrary set of unknown target labels when the values for a non-overlapping arbitrary set of target labels are known. We show that existing models trained…

    Submitted 29 March, 2018; v1 submitted 22 October, 2017; originally announced October 2017.

    Comments: Accepted to CVPR 2018

  40. arXiv:1707.09457  [pdf, other]

    cs.AI cs.CL cs.CV stat.ML

    Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints

    Authors: Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, Kai-Wei Chang

    Abstract: Language is increasingly being used to define rich visual recognition problems with supporting image collections sourced from the web. Structured prediction models are used in these tasks to take advantage of correlations between co-occurring labels and visual input but risk inadvertently encoding social biases found in web corpora. In this work, we study data and models associated with multilabel…

    Submitted 28 July, 2017; originally announced July 2017.

    Comments: 11 pages, published in EMNLP 2017

  41. arXiv:1707.07102  [pdf, other]

    cs.CV cs.CL

    OBJ2TEXT: Generating Visually Descriptive Language from Object Layouts

    Authors: Xuwang Yin, Vicente Ordonez

    Abstract: Generating captions for images is a task that has recently received considerable attention. In this work we focus on caption generation for abstract scenes, or object layouts where the only information provided is a set of objects and their locations. We propose OBJ2TEXT, a sequence-to-sequence model that encodes a set of objects and their locations as an input sequence using an LSTM network, and…

    Submitted 22 July, 2017; originally announced July 2017.

    Comments: Accepted at EMNLP 2017

  42. arXiv:1706.01021  [pdf, other]

    cs.GR cs.CV

    Where and Who? Automatic Semantic-Aware Person Composition

    Authors: Fuwen Tan, Crispin Bernier, Benjamin Cohen, Vicente Ordonez, Connelly Barnes

    Abstract: Image compositing is a method used to generate realistic yet fake imagery by inserting contents from one image to another. Previous work in compositing has focused on improving appearance compatibility of a user selected foreground segment and a background image (i.e. color and illumination consistency). In this work, we instead develop a fully automated compositing model that additionally learns…

    Submitted 2 December, 2017; v1 submitted 3 June, 2017; originally announced June 2017.

    Comments: 10 pages, 9 figures

  43. arXiv:1612.00901  [pdf, other]

    cs.CV cs.AI

    Commonly Uncommon: Semantic Sparsity in Situation Recognition

    Authors: Mark Yatskar, Vicente Ordonez, Luke Zettlemoyer, Ali Farhadi

    Abstract: Semantic sparsity is a common challenge in structured visual classification problems; when the output space is complex, the vast majority of the possible predictions are rarely, if ever, seen in the training set. This paper studies semantic sparsity in situation recognition, the task of producing structured summaries of what is happening in images, including activities, objects and the roles objec…

    Submitted 2 December, 2016; originally announced December 2016.

  44. arXiv:1603.05279  [pdf, other]

    cs.CV

    XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

    Authors: Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi

    Abstract: We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32x memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This resu…

    Submitted 2 August, 2016; v1 submitted 16 March, 2016; originally announced March 2016.
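
The binary filter approximation described in the XNOR-Net abstract above can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name `binarize_filter` is hypothetical, the filter is flattened to a plain list, and the per-filter scaling factor (the mean absolute weight) is an assumption that does not appear in the truncated abstract.

```python
def binarize_filter(weights):
    """Approximate a real-valued filter as alpha * sign(w).

    Returns the scalar alpha and the list of +1/-1 signs.
    """
    if not weights:
        raise ValueError("filter must be non-empty")
    # Assumption: alpha is the mean absolute value of the filter weights.
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


# Hypothetical flattened filter, for illustration only.
weights = [0.5, -1.5, 1.0, -1.0]
alpha, signs = binarize_filter(weights)

# Reconstructed approximation of the original filter.
approx = [alpha * s for s in signs]
```

Each sign needs one bit instead of a 32-bit float, which is where the 32x memory saving quoted in the abstract comes from, ignoring the single scalar stored per filter.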