Showing 1–26 of 26 results for author: Manmatha, R

  1. arXiv:2404.04469  [pdf, other]

    cs.CV

    Mixed-Query Transformer: A Unified Image Segmentation Architecture

    Authors: Pei Wang, Zhaowei Cai, Hao Yang, Ashwin Swaminathan, R. Manmatha, Stefano Soatto

    Abstract: Existing unified image segmentation models either employ a unified architecture across multiple tasks but use separate weights tailored to each dataset, or apply a single set of weights to multiple datasets but are limited to a single task. In this paper, we introduce the Mixed-Query Transformer (MQ-Former), a unified architecture for multi-task and multi-dataset image segmentation using a single…

    Submitted 5 April, 2024; originally announced April 2024.

  2. arXiv:2404.02883  [pdf, other]

    cs.CV cs.AI cs.LG

    On the Scalability of Diffusion-based Text-to-Image Generation

    Authors: Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R. Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, Stefano Soatto

    Abstract: Scaling up model and data size has been quite successful for the evolution of LLMs. However, the scaling law for the diffusion based text-to-image (T2I) models is not fully explored. It is also unclear how to efficiently scale the model for better performance at reduced cost. The different training settings and expensive training cost make a fair model comparison extremely difficult. In this work,…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: CVPR2024

  3. arXiv:2311.08623  [pdf, other]

    cs.CV cs.CL cs.LG

    DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models

    Authors: Peng Tang, Pengkai Zhu, Tian Li, Srikar Appalaraju, Vijay Mahadevan, R. Manmatha

    Abstract: Encoder-decoder transformer models have achieved great success on various vision-language (VL) tasks, but they suffer from high inference latency. Typically, the decoder takes up most of the latency because of the auto-regressive decoding. To accelerate the inference, we propose an approach of performing Dynamic Early Exit on Decoder (DEED). We build a multi-exit encoder-decoder transformer model…

    Submitted 14 November, 2023; originally announced November 2023.

  4. arXiv:2311.08622  [pdf, other]

    cs.CV cs.CL cs.LG

    Multiple-Question Multiple-Answer Text-VQA

    Authors: Peng Tang, Srikar Appalaraju, R. Manmatha, Yusheng Xie, Vijay Mahadevan

    Abstract: We present Multiple-Question Multiple-Answer (MQMA), a novel approach to do text-VQA in encoder-decoder transformer models. The text-VQA task requires a model to answer a question by understanding multi-modal content: text (typically from OCR) and an associated image. To the best of our knowledge, almost all previous approaches for text-VQA process a single question and its associated content to p…

    Submitted 14 November, 2023; originally announced November 2023.

  5. arXiv:2307.07929  [pdf, other]

    cs.CV

    DocTr: Document Transformer for Structured Information Extraction in Documents

    Authors: Haofu Liao, Aruni RoyChowdhury, Weijian Li, Ankan Bansal, Yuting Zhang, Zhuowen Tu, Ravi Kumar Satzoda, R. Manmatha, Vijay Mahadevan

    Abstract: We present a new formulation for structured information extraction (SIE) from visually rich documents. It aims to address the limitations of existing IOB tagging or graph-based formulations, which are either overly reliant on the correct ordering of input text or struggle with decoding a complex graph. Instead, motivated by anchor-based object detectors in vision, we represent an entity as an anch…

    Submitted 15 July, 2023; originally announced July 2023.

  6. arXiv:2306.01733  [pdf, other]

    cs.CV cs.CL cs.LG

    DocFormerv2: Local Features for Document Understanding

    Authors: Srikar Appalaraju, Peng Tang, Qi Dong, Nishant Sankaran, Yichu Zhou, R. Manmatha

    Abstract: We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU). The VDU domain entails understanding documents (beyond mere OCR predictions) e.g., extracting information from a form, VQA for documents and other tasks. VDU is challenging as it needs a model to make sense of multiple modalities (visual, language and spatial) to make a prediction. Our approach, termed DocFo…

    Submitted 2 June, 2023; originally announced June 2023.

  7. arXiv:2302.07387  [pdf, other]

    cs.CV

    PolyFormer: Referring Image Segmentation as Sequential Polygon Generation

    Authors: Jiang Liu, Hui Ding, Zhaowei Cai, Yuting Zhang, Ravi Kumar Satzoda, Vijay Mahadevan, R. Manmatha

    Abstract: In this work, instead of directly predicting the pixel-level segmentation masks, the problem of referring image segmentation is formulated as sequential polygon generation, and the predicted polygons can be later converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens…

    Submitted 27 March, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

    Comments: CVPR 2023. Project Page: https://polyformer.github.io/

  8. arXiv:2302.03432  [pdf, other]

    cs.CV

    SimCon Loss with Multiple Views for Text Supervised Semantic Segmentation

    Authors: Yash Patel, Yusheng Xie, Yi Zhu, Srikar Appalaraju, R. Manmatha

    Abstract: Learning to segment images purely by relying on the image-text alignment from web data can lead to sub-optimal performance due to noise in the data. The noise comes from the samples where the associated text does not correlate with the image's visual content. Instead of purely relying on the alignment from the noisy data, this paper proposes a novel loss function termed SimCon, which accounts for…

    Submitted 7 February, 2023; originally announced February 2023.

  9. arXiv:2211.07912  [pdf, other]

    cs.CV

    YORO -- Lightweight End to End Visual Grounding

    Authors: Chih-Hui Ho, Srikar Appalaraju, Bhavan Jasani, R. Manmatha, Nuno Vasconcelos

    Abstract: We present YORO - a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task. This task involves localizing, in an image, an object referred via natural language. Unlike the recent trend in the literature of using multi-stage approaches that sacrifice speed for accuracy, YORO seeks a better trade-off between speed and accuracy by embracing a single-stage design, without…

    Submitted 15 November, 2022; originally announced November 2022.

    Comments: Accepted to ECCVW on International Challenge on Compositional and Multimodal Perception

  10. arXiv:2208.03364  [pdf, other]

    cs.CV cs.AI

    GLASS: Global to Local Attention for Scene-Text Spotting

    Authors: Roi Ronen, Shahar Tsiper, Oron Anschel, Inbal Lavi, Amir Markovitz, R. Manmatha

    Abstract: In recent years, the dominant paradigm for text spotting is to combine the tasks of text detection and recognition into a single end-to-end framework. Under this paradigm, both tasks are accomplished by operating over a shared global feature map extracted from the input image. Among the main challenges that end-to-end approaches face is the performance degradation when recognizing text across scal…

    Submitted 5 August, 2022; originally announced August 2022.

    Comments: 23 pages, 9 figures, ECCV'22

  11. arXiv:2202.05508  [pdf, other]

    cs.CV cs.CL cs.LG

    Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer

    Authors: Yair Kittenplon, Inbal Lavi, Sharon Fogel, Yarin Bar, R. Manmatha, Pietro Perona

    Abstract: Text spotting end-to-end methods have recently gained attention in the literature due to the benefits of jointly optimizing the text detection and recognition components. Existing methods usually have a distinct separation between the detection and recognition branches, requiring exact annotations for the two tasks. We introduce TextTranSpotter (TTS), a transformer-based approach for text spotting…

    Submitted 14 February, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

  12. arXiv:2112.12494  [pdf, other]

    cs.CV

    LaTr: Layout-Aware Transformer for Scene-Text VQA

    Authors: Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, R. Manmatha

    Abstract: We propose a novel multimodal architecture for Scene Text Visual Question Answering (STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to reason over different modalities. Thus, we first investigate the impact of each modality, and reveal the importance of the language module, especially when enriched with layout information. Accounting for this, we propose a single…

    Submitted 24 December, 2021; v1 submitted 23 December, 2021; originally announced December 2021.

  13. arXiv:2106.11539  [pdf, other]

    cs.CV

    DocFormer: End-to-End Transformer for Document Understanding

    Authors: Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha

    Abstract: We present DocFormer -- a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses te…

    Submitted 20 September, 2021; v1 submitted 22 June, 2021; originally announced June 2021.

    Comments: Accepted to ICCV 2021 main conference

  14. arXiv:2012.12643  [pdf, other]

    cs.CV cs.LG

    On Calibration of Scene-Text Recognition Models

    Authors: Ron Slossberg, Oron Anschel, Amir Markovitz, Ron Litman, Aviad Aberdam, Shahar Tsiper, Shai Mazor, Jon Wu, R. Manmatha

    Abstract: In this work, we study the problem of word-level confidence calibration for scene-text recognition (STR). Although the topic of confidence calibration has been an active research area for the last several decades, the case of structured and sequence prediction calibration has been scarcely explored. We analyze several recent STR methods and show that they are consistently overconfident. We then fo…

    Submitted 23 December, 2020; originally announced December 2020.

  15. arXiv:2012.10873  [pdf, other]

    cs.CV

    Sequence-to-Sequence Contrastive Learning for Text Recognition

    Authors: Aviad Aberdam, Ron Litman, Shahar Tsiper, Oron Anschel, Ron Slossberg, Shai Mazor, R. Manmatha, Pietro Perona

    Abstract: We propose a framework for sequence-to-sequence contrastive learning (SeqCLR) of visual representations, which we apply to text recognition. To account for the sequence-to-sequence structure, each feature map is divided into different instances over which the contrastive loss is computed. This operation enables us to contrast in a sub-word level, where from each image we extract several positive p…

    Submitted 20 December, 2020; originally announced December 2020.

  16. arXiv:2012.06567  [pdf, other]

    cs.CV cs.MM

    A Comprehensive Study of Deep Video Action Recognition

    Authors: Yi Zhu, Xinyu Li, Chunhui Liu, Mohammadreza Zolfaghari, Yuanjun Xiong, Chongruo Wu, Zhi Zhang, Joseph Tighe, R. Manmatha, Mu Li

    Abstract: Video action recognition is one of the representative tasks for video understanding. Over the last decade, we have witnessed great advancements in video action recognition thanks to the emergence of deep learning. But we also encountered new challenges, including modeling long-range temporal information in videos, high computation costs, and incomparable results due to datasets and evaluation prot…

    Submitted 11 December, 2020; originally announced December 2020.

    Comments: Technical report. Code and model zoo can be found at https://cv.gluon.ai/model_zoo/action_recognition.html

  17. arXiv:2008.08899  [pdf, other]

    cs.CV cs.IR

    Document Visual Question Answering Challenge 2020

    Authors: Minesh Mathew, Ruben Tito, Dimosthenis Karatzas, R. Manmatha, C. V. Jawahar

    Abstract: This paper presents results of Document Visual Question Answering Challenge organized as part of "Text and Documents in the Deep Learning Era" workshop, in CVPR 2020. The challenge introduces a new problem - Visual Question Answering on document images. The challenge comprised two tasks. The first task concerns with asking questions on a single document image. On the other hand, the second task is…

    Submitted 17 July, 2021; v1 submitted 20 August, 2020; originally announced August 2020.

    Comments: to be published as a short paper in DAS 2020

  18. arXiv:2004.14960  [pdf, other]

    cs.CV

    Improving Semantic Segmentation via Self-Training

    Authors: Yi Zhu, Zhongyue Zhang, Chongruo Wu, Zhi Zhang, Tong He, Hang Zhang, R. Manmatha, Mu Li, Alexander Smola

    Abstract: Deep learning usually achieves the best results with complete supervision. In the case of semantic segmentation, this means that large amounts of pixelwise annotations are required to learn accurate models. In this paper, we show that we can obtain state-of-the-art results using a semi-supervised approach, specifically a self-training paradigm. We first train a teacher model on labeled data, and t…

    Submitted 6 May, 2020; v1 submitted 30 April, 2020; originally announced April 2020.

  19. arXiv:2004.08955  [pdf, other]

    cs.CV

    ResNeSt: Split-Attention Networks

    Authors: Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Haibin Lin, Zhi Zhang, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, Alexander Smola

    Abstract: It is well known that featuremap attention and multi-path representation are important for visual recognition. In this paper, we present a modularized architecture, which applies the channel-wise attention on different network branches to leverage their success in capturing cross-feature interactions and learning diverse representations. Our design results in a simple and unified computation block…

    Submitted 30 December, 2020; v1 submitted 19 April, 2020; originally announced April 2020.

  20. arXiv:2003.11288  [pdf, other]

    cs.CV

    SCATTER: Selective Context Attentional Scene Text Recognizer

    Authors: Ron Litman, Oron Anschel, Shahar Tsiper, Roee Litman, Shai Mazor, R. Manmatha

    Abstract: Scene Text Recognition (STR), the task of recognizing text against complex image backgrounds, is an active area of research. Current state-of-the-art (SOTA) methods still struggle to recognize text written in arbitrary shapes. In this paper, we introduce a novel architecture for STR, named Selective Context ATtentional Text Recognizer (SCATTER). SCATTER utilizes a stacked block architecture with i…

    Submitted 25 March, 2020; originally announced March 2020.

    Comments: In CVPR 2020

  21. arXiv:2002.04988  [pdf, other]

    eess.IV cs.CV

    Saliency Driven Perceptual Image Compression

    Authors: Yash Patel, Srikar Appalaraju, R. Manmatha

    Abstract: This paper proposes a new end-to-end trainable model for lossy image compression, which includes several novel components. The method incorporates 1) an adequate perceptual similarity metric; 2) saliency in the images; 3) a hierarchical auto-regressive model. This paper demonstrates that the popularly used evaluation metrics such as MS-SSIM and PSNR are inadequate for judging the performance of i…

    Submitted 8 November, 2020; v1 submitted 12 February, 2020; originally announced February 2020.

    Comments: WACV 2021 camera-ready version

  22. arXiv:1908.04187  [pdf, other]

    eess.IV cs.CV

    Human Perceptual Evaluations for Image Compression

    Authors: Yash Patel, Srikar Appalaraju, R. Manmatha

    Abstract: Recently, there has been much interest in deep learning techniques to do image compression and there have been claims that several of these produce better results than engineered compression schemes (such as JPEG, JPEG2000 or BPG). A standard way of comparing image compression schemes today is to use perceptual similarity metrics such as PSNR or MS-SSIM (multi-scale structural similarity). This ha…

    Submitted 9 August, 2019; originally announced August 2019.

    Comments: arXiv admin note: text overlap with arXiv:1907.08310

  23. arXiv:1907.08310  [pdf, other]

    eess.IV cs.CV

    Deep Perceptual Compression

    Authors: Yash Patel, Srikar Appalaraju, R. Manmatha

    Abstract: Several deep learned lossy compression techniques have been proposed in the recent literature. Most of these are optimized by using either MS-SSIM (multi-scale structural similarity) or MSE (mean squared error) as a loss function. Unfortunately, neither of these correlate well with human perception and this is clearly visible from the resulting compressed images. In several cases, the MS-SSIM for…

    Submitted 31 July, 2019; v1 submitted 18 July, 2019; originally announced July 2019.

  24. arXiv:1907.02244  [pdf, other]

    cs.CV eess.IV

    Searching for Apparel Products from Images in the Wild

    Authors: Son Tran, Ming Du, Sampath Chanda, R. Manmatha, Cj Taylor

    Abstract: In this age of social media, people often look at what others are wearing. In particular, Instagram and Twitter influencers often provide images of themselves wearing different outfits and their followers are often inspired to buy similar clothes. We propose a system to automatically find the closest visually similar clothes in the online Catalog (street-to-shop searching). The problem is challengi…

    Submitted 7 April, 2022; v1 submitted 4 July, 2019; originally announced July 2019.

    Comments: KDD2019, AI for Fashion Workshop

  25. arXiv:1712.00636  [pdf, other]

    cs.CV

    Compressed Video Action Recognition

    Authors: Chao-Yuan Wu, Manzil Zaheer, Hexiang Hu, R. Manmatha, Alexander J. Smola, Philipp Krähenbühl

    Abstract: Training robust deep video representations has proven to be much more challenging than learning deep image representations. This is in part due to the enormous size of raw video streams and the high temporal redundancy; the true and interesting signal is often drowned in too much irrelevant data. Motivated by that the superfluous information can be reduced by up to two orders of magnitude by video…

    Submitted 29 March, 2018; v1 submitted 2 December, 2017; originally announced December 2017.

    Comments: CVPR 2018 (Selected for spotlight presentation)

  26. arXiv:1706.07567  [pdf, other]

    cs.CV

    Sampling Matters in Deep Embedding Learning

    Authors: Chao-Yuan Wu, R. Manmatha, Alexander J. Smola, Philipp Krähenbühl

    Abstract: Deep embeddings answer one simple question: How similar are two images? Learning these embeddings is the bedrock of verification, zero-shot learning, and visual search. The most prominent approaches optimize a deep convolutional network with a suitable loss function, such as contrastive loss or triplet loss. While a rich line of work focuses solely on the loss functions, we show in this paper that…

    Submitted 16 January, 2018; v1 submitted 23 June, 2017; originally announced June 2017.

    Comments: Add supplementary material. Paper published in ICCV 2017