Showing 1–12 of 12 results for author: Mafla, A

Searching in archive cs.
  1. arXiv:2209.10474

    cs.CV

    Show, Interpret and Tell: Entity-aware Contextualised Image Captioning in Wikipedia

    Authors: Khanh Nguyen, Ali Furkan Biten, Andres Mafla, Lluis Gomez, Dimosthenis Karatzas

    Abstract: Humans exploit prior knowledge to describe images, and are able to adapt their explanation to specific contextual information, even to the extent of inventing plausible explanations when contextual information and images do not match. In this work, we propose the novel task of captioning Wikipedia images by integrating contextual knowledge. Specifically, we produce models that jointly reason over…

    Submitted 21 September, 2022; originally announced September 2022.

  2. arXiv:2209.06730

    cs.CV

    MUST-VQA: MUltilingual Scene-text VQA

    Authors: Emanuele Vivoli, Ali Furkan Biten, Andres Mafla, Dimosthenis Karatzas, Lluis Gomez

    Abstract: In this paper, we present a framework for Multilingual Scene Text Visual Question Answering that deals with new languages in a zero-shot fashion. Specifically, we consider the task of Scene Text Visual Question Answering (STVQA) in which the question can be asked in different languages and it is not necessarily aligned to the scene text language. Thus, we first introduce a natural step towards a m…

    Submitted 14 September, 2022; originally announced September 2022.

    Comments: To appear in the Text in Everything Workshop at ECCV 2022

  3. arXiv:2209.06717

    cs.CV

    Out-of-Vocabulary Challenge Report

    Authors: Sergi Garcia-Bordils, Andrés Mafla, Ali Furkan Biten, Oren Nuriel, Aviad Aberdam, Shai Mazor, Ron Litman, Dimosthenis Karatzas

    Abstract: This paper presents final results of the Out-Of-Vocabulary 2022 (OOV) challenge. The OOV contest introduces an important aspect that is not commonly studied by Optical Character Recognition (OCR) models, namely, the recognition of unseen scene text instances at training time. The competition compiles a collection of public scene text datasets comprising of 326,385 images with 4,864,405 scene text…

    Submitted 14 September, 2022; originally announced September 2022.

    Comments: To appear in the Text in Everything Workshop at ECCV 2022

  4. arXiv:2203.04814

    cs.CV

    Text-DIAE: A Self-Supervised Degradation Invariant Autoencoders for Text Recognition and Document Enhancement

    Authors: Mohamed Ali Souibgui, Sanket Biswas, Andres Mafla, Ali Furkan Biten, Alicia Fornés, Yousri Kessentini, Josep Lladós, Lluis Gomez, Dimosthenis Karatzas

    Abstract: In this paper, we propose a Text-Degradation Invariant Auto Encoder (Text-DIAE), a self-supervised model designed to tackle two tasks, text recognition (handwritten or scene-text) and document image enhancement. We start by employing a transformer-based architecture that incorporates three pretext tasks as learning objectives to be optimized during pre-training without the usage of labeled data. E…

    Submitted 18 August, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

    Comments: Preprint

  5. arXiv:2110.02623

    cs.CV cs.AI

    Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

    Authors: Ali Furkan Biten, Andres Mafla, Lluis Gomez, Dimosthenis Karatzas

    Abstract: The task of image-text matching aims to map representations from different modalities into a common joint visual-textual embedding. However, the most widely used datasets for this task, MSCOCO and Flickr30K, are actually image captioning datasets that offer a very limited set of relationships between images and sentences in their ground-truth annotations. This limited ground truth information forc…

    Submitted 6 October, 2021; originally announced October 2021.

    Comments: Accepted at WACV 2022

  6. arXiv:2012.04329

    cs.CV

    StacMR: Scene-Text Aware Cross-Modal Retrieval

    Authors: Andrés Mafla, Rafael Sampaio de Rezende, Lluís Gómez, Diane Larlus, Dimosthenis Karatzas

    Abstract: Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions to mention a few. This has resulted in an improved matching between the visual representation of an image and the textual representation of its caption. Yet, current visual representations overlook a key aspect: the text appearing in imag…

    Submitted 8 December, 2020; originally announced December 2020.

  7. arXiv:2009.09809

    cs.CV

    Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval

    Authors: Andres Mafla, Sounak Dey, Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas

    Abstract: Scene text instances found in natural images carry explicit semantic information that can provide important cues to solve a wide array of computer vision problems. In this paper, we focus on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval. First, we obtain the text instances from images by employing a text…

    Submitted 21 September, 2020; originally announced September 2020.

  8. arXiv:2006.00923

    cs.CV

    Multimodal grid features and cell pointers for Scene Text Visual Question Answering

    Authors: Lluís Gómez, Ali Furkan Biten, Rubèn Tito, Andrés Mafla, Marçal Rusiñol, Ernest Valveny, Dimosthenis Karatzas

    Abstract: This paper presents a new model for the task of scene text visual question answering, in which questions about a given image can only be answered by reading and understanding scene text that is present in it. The proposed model is based on an attention mechanism that attends to multi-modal features conditioned to the question, allowing it to reason jointly about the textual and visual modalities i…

    Submitted 25 June, 2020; v1 submitted 1 June, 2020; originally announced June 2020.

    Comments: This paper is under consideration at Pattern Recognition Letters

  9. arXiv:2001.04732

    cs.CV

    Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features

    Authors: Andres Mafla, Sounak Dey, Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas

    Abstract: Text contained in an image carries high-level semantics that can be exploited to achieve richer image understanding. In particular, the mere presence of text provides strong guiding content that should be employed to tackle a diversity of computer vision tasks such as image retrieval, fine-grained classification, and visual question answering. In this paper, we address the problem of fine-grained…

    Submitted 14 January, 2020; originally announced January 2020.

    Comments: Accepted at the Winter Conference on Applications of Computer Vision (WACV 2020)

  10. arXiv:1907.00490

    cs.CV

    ICDAR 2019 Competition on Scene Text Visual Question Answering

    Authors: Ali Furkan Biten, Rubèn Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Minesh Mathew, C. V. Jawahar, Ernest Valveny, Dimosthenis Karatzas

    Abstract: This paper presents final results of ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system up to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question/ans…

    Submitted 30 June, 2019; originally announced July 2019.

    Comments: 15th International Conference on Document Analysis and Recognition (ICDAR 2019)

  11. arXiv:1905.13648

    cs.CV

    Scene Text Visual Question Answering

    Authors: Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, C. V. Jawahar, Dimosthenis Karatzas

    Abstract: Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process. We use this dataset to define a series of tasks of increasing difficulty for which reading…

    Submitted 16 October, 2019; v1 submitted 31 May, 2019; originally announced May 2019.

    Comments: International Conference on Computer Vision (ICCV 2019)

  12. arXiv:1808.09044

    cs.CV

    Single Shot Scene Text Retrieval

    Authors: Lluís Gómez, Andrés Mafla, Marçal Rusiñol, Dimosthenis Karatzas

    Abstract: Textual information found in scene images provides high level semantic information about the image and its context and it can be leveraged for better scene understanding. In this paper we address the problem of scene text retrieval: given a text query, the system must return all images containing the queried text. The novelty of the proposed model consists in the usage of a single shot CNN archite…

    Submitted 27 August, 2018; originally announced August 2018.

    Comments: ECCV 2018