-
TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation
Authors:
Jun Wang,
Mingfei Gao,
Yuqian Hu,
Ramprasaath R. Selvaraju,
Chetan Ramaiah,
Ran Xu,
Joseph F. JaJa,
Larry S. Davis
Abstract:
Text-VQA aims at answering questions that require understanding the textual cues in an image. Despite the great progress of existing Text-VQA methods, their performance suffers from insufficient human-labeled question-answer (QA) pairs. However, we observe that, in general, the scene text is not fully exploited in the existing datasets -- only a small portion of the text in each image participates in the annotated QA activities. This results in a huge waste of useful information. To address this deficiency, we develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image. Specifically, we propose TAG, a text-aware visual question-answer generation architecture that learns to produce meaningful and accurate QA samples using a multimodal transformer. The architecture exploits underexplored scene text information and enhances scene understanding of Text-VQA models by combining the generated QA pairs with the initial training data. Extensive experimental results on two well-known Text-VQA benchmarks (TextVQA and ST-VQA) demonstrate that our proposed TAG effectively enlarges the training data, which helps improve Text-VQA performance without extra labeling effort. Moreover, our model outperforms state-of-the-art approaches that are pre-trained with extra large-scale data. Code is available at https://github.com/HenryJunW/TAG.
Submitted 7 October, 2022; v1 submitted 2 August, 2022;
originally announced August 2022.
-
Use All The Labels: A Hierarchical Multi-Label Contrastive Learning Framework
Authors:
Shu Zhang,
Ran Xu,
Caiming Xiong,
Chetan Ramaiah
Abstract:
Current contrastive learning frameworks focus on leveraging a single supervisory signal to learn representations, which limits the efficacy on unseen data and downstream tasks. In this paper, we present a hierarchical multi-label representation learning framework that can leverage all available labels and preserve the hierarchical relationship between classes. We introduce novel hierarchy-preserving losses, which jointly apply a hierarchical penalty to the contrastive loss and enforce the hierarchy constraint. The loss function is data-driven and automatically adapts to arbitrary multi-label structures. Experiments on several datasets show that our relationship-preserving embedding performs well on a variety of tasks and outperforms the baseline supervised and self-supervised approaches. Code is available at https://github.com/salesforce/hierarchicalContrastiveLearning.
Submitted 27 April, 2022;
originally announced April 2022.
-
Value Retrieval with Arbitrary Queries for Form-like Documents
Authors:
Mingfei Gao,
Le Xue,
Chetan Ramaiah,
Chen Xing,
Ran Xu,
Caiming Xiong
Abstract:
We propose value retrieval with arbitrary queries for form-like documents to reduce the human effort of processing forms. Unlike previous methods that only address a fixed set of field items, our method predicts the target value for an arbitrary query based on an understanding of the layout and semantics of a form. To further boost model performance, we propose a simple document language modeling (SimpleDLM) strategy to improve document understanding on large-scale model pre-training. Experimental results show that our method outperforms previous designs significantly and that SimpleDLM further improves our performance on value retrieval by around 17% F1 score compared with the state-of-the-art pre-training method. Code is available at https://github.com/salesforce/QVR-SimpleDLM.
Submitted 15 April, 2022; v1 submitted 14 December, 2021;
originally announced December 2021.
-
Burn After Reading: Online Adaptation for Cross-domain Streaming Data
Authors:
Luyu Yang,
Mingfei Gao,
Zeyuan Chen,
Ran Xu,
Abhinav Shrivastava,
Chetan Ramaiah
Abstract:
In the context of online privacy, many methods propose complex privacy and security preserving measures to protect sensitive data. In this paper, we argue that not storing any sensitive data is the best form of security. Thus we propose an online framework that "burns after reading", i.e. each online sample is immediately deleted after it is processed. Meanwhile, we tackle the inevitable distribution shift between the labeled public data and unlabeled private data as a problem of unsupervised domain adaptation. Specifically, we propose a novel algorithm that aims at the most fundamental challenge of the online adaptation setting: the lack of diverse source-target data pairs. Therefore, we design a Cross-Domain Bootstrapping approach, called CroDoBo, to increase the combined diversity across domains. Further, to fully exploit the valuable discrepancies among the diverse combinations, we employ the training strategy of multiple learners with co-supervision. CroDoBo achieves state-of-the-art online performance on four domain adaptation benchmarks.
Submitted 8 December, 2021;
originally announced December 2021.
-
Proposal Learning for Semi-Supervised Object Detection
Authors:
Peng Tang,
Chetan Ramaiah,
Yan Wang,
Ran Xu,
Caiming Xiong
Abstract:
In this paper, we focus on semi-supervised object detection to boost the performance of proposal-based object detectors (a.k.a. two-stage object detectors) by training on both labeled and unlabeled data. However, it is non-trivial to train object detectors on unlabeled data due to the unavailability of ground truth labels. To address this problem, we present a proposal learning approach to learn proposal features and predictions from both labeled and unlabeled data. The approach consists of a self-supervised proposal learning module and a consistency-based proposal learning module. In the self-supervised proposal learning module, we present a proposal location loss and a contrastive loss to learn context-aware and noise-robust proposal features respectively. In the consistency-based proposal learning module, we apply consistency losses to both bounding box classification and regression predictions of proposals to learn noise-robust proposal features and predictions. Our approach enjoys the following benefits: 1) encouraging more context information to be delivered in the proposal learning procedure; 2) using noisy proposal features and enforcing consistency to allow noise-robust object detection; 3) building a general and high-performance semi-supervised object detection framework, which can be easily adapted to proposal-based object detectors with different backbone architectures. Experiments are conducted on the COCO dataset with all available labeled and unlabeled data. Results demonstrate that our approach consistently improves the performance of fully-supervised baselines. In particular, after combining with data distillation, our approach improves AP by about 2.0% and 0.9% on average compared to fully-supervised baselines and data distillation baselines respectively.
Submitted 23 April, 2020; v1 submitted 14 January, 2020;
originally announced January 2020.
-
Challenges in Representation Learning: A report on three machine learning contests
Authors:
Ian J. Goodfellow,
Dumitru Erhan,
Pierre Luc Carrier,
Aaron Courville,
Mehdi Mirza,
Ben Hamner,
Will Cukierski,
Yichuan Tang,
David Thaler,
Dong-Hyun Lee,
Yingbo Zhou,
Chetan Ramaiah,
Fangxiang Feng,
Ruifan Li,
Xiaojie Wang,
Dimitris Athanasakis,
John Shawe-Taylor,
Maxim Milakov,
John Park,
Radu Ionescu,
Marius Popescu,
Cristian Grozea,
James Bergstra,
Jingjing Xie,
Lukasz Romaszko,
et al. (3 additional authors not shown)
Abstract:
The ICML 2013 Workshop on Challenges in Representation Learning focused on three challenges: the black box learning challenge, the facial expression recognition challenge, and the multimodal learning challenge. We describe the datasets created for these challenges and summarize the results of the competitions. We provide suggestions for organizers of future challenges and some comments on what kind of knowledge can be gained from machine learning competitions.
Submitted 1 July, 2013;
originally announced July 2013.