Showing 1–12 of 12 results for author: Tito, R

  1. arXiv:2404.19024  [pdf, other]

    cs.CV

    Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism

    Authors: Lei Kang, Rubèn Tito, Ernest Valveny, Dimosthenis Karatzas

    Abstract: Documents are 2-dimensional carriers of written communication, and as such their interpretation requires a multi-modal approach where textual and visual information are efficiently combined. Document Visual Question Answering (Document VQA), due to this multi-modal nature, has garnered significant interest from both the document understanding and natural language processing communities. The state-…

    Submitted 29 April, 2024; originally announced April 2024.

    Comments: Accepted to ICDAR2024

  2. arXiv:2312.10108  [pdf, other]

    cs.CV cs.AI cs.LG

    Privacy-Aware Document Visual Question Answering

    Authors: Rubèn Tito, Khanh Nguyen, Marlon Tobaben, Raouf Kerkouche, Mohamed Ali Souibgui, Kangsoo Jung, Lei Kang, Ernest Valveny, Antti Honkela, Mario Fritz, Dimosthenis Karatzas

    Abstract: Document Visual Question Answering (DocVQA) is a fast growing branch of document understanding. Despite the fact that documents contain sensitive or copyrighted information, none of the current DocVQA methods offers strong privacy guarantees. In this work, we explore privacy in the domain of DocVQA for the first time. We highlight privacy issues in state of the art multi-modal LLM models used fo…

    Submitted 15 December, 2023; originally announced December 2023.

  3. arXiv:2305.08455  [pdf, other]

    cs.CV cs.CL cs.LG

    Document Understanding Dataset and Evaluation (DUDE)

    Authors: Jordy Van Landeghem, Rubén Tito, Łukasz Borchmann, Michał Pietruszka, Paweł Józiak, Rafał Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Ackaert, Ernest Valveny, Matthew Blaschko, Sien Moens, Tomasz Stanisławek

    Abstract: We call on the Document AI (DocAI) community to reevaluate current methodologies and embrace the challenge of creating more practically-oriented benchmarks. Document Understanding Dataset and Evaluation (DUDE) seeks to remediate the halted research progress in understanding visually-rich documents (VRDs). We present a new dataset with novelties related to types of questions, answers, and document…

    Submitted 11 September, 2023; v1 submitted 15 May, 2023; originally announced May 2023.

    Comments: Accepted at ICCV 2023

  4. arXiv:2212.05935  [pdf, other]

    cs.CV cs.AI cs.CL

    Hierarchical multimodal transformers for Multi-Page DocVQA

    Authors: Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny

    Abstract: Document Visual Question Answering (DocVQA) refers to the task of answering questions from document images. Existing work on DocVQA only considers single-page documents. However, in real scenarios documents are mostly composed of multiple pages that should be processed altogether. In this work we extend DocVQA to the multi-page scenario. For that, we first create a new dataset, MP-DocVQA, where qu…

    Submitted 1 April, 2023; v1 submitted 7 December, 2022; originally announced December 2022.

  5. arXiv:2202.12985  [pdf, other]

    cs.CV cs.AI

    OCR-IDL: OCR Annotations for Industry Document Library Dataset

    Authors: Ali Furkan Biten, Rubèn Tito, Lluis Gomez, Ernest Valveny, Dimosthenis Karatzas

    Abstract: Pretraining has proven successful in Document Intelligence tasks, where a deluge of documents is used to pretrain the models, only later to be finetuned on downstream tasks. One of the problems of the pretraining approaches is the inconsistent usage of pretraining data with different OCR engines, leading to incomparable results between models. In other words, it is not obvious whether the performance…

    Submitted 25 February, 2022; originally announced February 2022.

  6. arXiv:2111.05547  [pdf, other]

    cs.CV cs.LG

    ICDAR 2021 Competition on Document Visual Question Answering

    Authors: Rubèn Tito, Minesh Mathew, C. V. Jawahar, Ernest Valveny, Dimosthenis Karatzas

    Abstract: In this report we present results of the ICDAR 2021 edition of the Document Visual Question Challenges. This edition complements the previous tasks on Single Document VQA and Document Collection VQA with a newly introduced one on Infographics VQA. Infographics VQA is based on a new dataset of more than 5,000 infographics images and 30,000 question-answer pairs. The winning methods have scored 0.6120 AN…

    Submitted 10 November, 2021; originally announced November 2021.

  7. Document Collection Visual Question Answering

    Authors: Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny

    Abstract: Current tasks and methods in Document Understanding aim to process documents as single elements. However, documents are usually organized in collections (historical records, purchase invoices) that provide context useful for their interpretation. To address this problem, we introduce Document Collection Visual Question Answering (DocCVQA), a new dataset and related task, where questions are posed…

    Submitted 8 June, 2021; v1 submitted 27 April, 2021; originally announced April 2021.

  8. arXiv:2104.12756  [pdf, other]

    cs.CV cs.CL

    InfographicVQA

    Authors: Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito, Dimosthenis Karatzas, Ernest Valveny, C. V Jawahar

    Abstract: Infographics are documents designed to effectively communicate information using a combination of textual, graphical and visual elements. In this work, we explore the automatic understanding of infographic images by using the Visual Question Answering technique. To this end, we present InfographicVQA, a new dataset that comprises a diverse collection of infographics along with natural language question…

    Submitted 22 August, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

  9. arXiv:2008.08899  [pdf, other]

    cs.CV cs.IR

    Document Visual Question Answering Challenge 2020

    Authors: Minesh Mathew, Ruben Tito, Dimosthenis Karatzas, R. Manmatha, C. V. Jawahar

    Abstract: This paper presents results of the Document Visual Question Answering Challenge, organized as part of the "Text and Documents in the Deep Learning Era" workshop at CVPR 2020. The challenge introduces a new problem: Visual Question Answering on document images. The challenge comprised two tasks. The first task concerns asking questions on a single document image. On the other hand, the second task is…

    Submitted 17 July, 2021; v1 submitted 20 August, 2020; originally announced August 2020.

    Comments: to be published as a short paper in DAS 2020

  10. arXiv:2006.00923  [pdf, other]

    cs.CV

    Multimodal grid features and cell pointers for Scene Text Visual Question Answering

    Authors: Lluís Gómez, Ali Furkan Biten, Rubèn Tito, Andrés Mafla, Marçal Rusiñol, Ernest Valveny, Dimosthenis Karatzas

    Abstract: This paper presents a new model for the task of scene text visual question answering, in which questions about a given image can only be answered by reading and understanding scene text that is present in it. The proposed model is based on an attention mechanism that attends to multi-modal features conditioned to the question, allowing it to reason jointly about the textual and visual modalities i…

    Submitted 25 June, 2020; v1 submitted 1 June, 2020; originally announced June 2020.

    Comments: This paper is under consideration at Pattern Recognition Letters

  11. arXiv:1907.00490  [pdf, other]

    cs.CV

    ICDAR 2019 Competition on Scene Text Visual Question Answering

    Authors: Ali Furkan Biten, Rubèn Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Minesh Mathew, C. V. Jawahar, Ernest Valveny, Dimosthenis Karatzas

    Abstract: This paper presents the final results of the ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question/ans…

    Submitted 30 June, 2019; originally announced July 2019.

    Comments: 15th International Conference on Document Analysis and Recognition (ICDAR 2019)

  12. arXiv:1905.13648  [pdf, other]

    cs.CV

    Scene Text Visual Question Answering

    Authors: Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, C. V. Jawahar, Dimosthenis Karatzas

    Abstract: Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process. We use this dataset to define a series of tasks of increasing difficulty for which reading…

    Submitted 16 October, 2019; v1 submitted 31 May, 2019; originally announced May 2019.

    Comments: International Conference on Computer Vision (ICCV 2019)