Search | arXiv e-print repository

LayeredDoc: Domain Adaptive Document Restoration with a Layer Separation Approach

Authors: Maria Pilligua, Nil Biescas, Javier Vazquez-Corral, Josep Lladós, Ernest Valveny, Sanket Biswas

Abstract: The rapid evolution of intelligent document processing systems demands robust solutions that adapt to diverse domains without extensive retraining. Traditional methods often falter with variable document types, leading to poor performance. To overcome these limitations, this paper introduces a text-graphic layer separation approach that enhances domain adaptability in document image restoration (D… ▽ More The rapid evolution of intelligent document processing systems demands robust solutions that adapt to diverse domains without extensive retraining. Traditional methods often falter with variable document types, leading to poor performance. To overcome these limitations, this paper introduces a text-graphic layer separation approach that enhances domain adaptability in document image restoration (DIR) systems. We propose LayeredDoc, which utilizes two layers of information: the first targets coarse-grained graphic components, while the second refines machine-printed textual content. This hierarchical DIR framework dynamically adjusts to the characteristics of the input document, facilitating effective domain adaptation. We evaluated our approach both qualitatively and quantitatively using a new real-world dataset, LayeredDocDB, developed for this study. Initially trained on a synthetically generated dataset, our model demonstrates strong generalization capabilities for the DIR task, offering a promising solution for handling variability in real-world data. Our code is accessible on GitHub. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: Accepted to ICDAR 2024 (Athens, Greece) Workshop on Automatically Domain-Adapted and Personalized Document Analysis (ADAPDA)

arXiv:2406.08354 [pdf, other]

DocSynthv2: A Practical Autoregressive Modeling for Document Generation

Authors: Sanket Biswas, Rajiv Jain, Vlad I. Morariu, Jiuxiang Gu, Puneet Mathur, Curtis Wigington, Tong Sun, Josep Lladós

Abstract: While the generation of document layouts has been extensively explored, comprehensive document generation encompassing both layout and content presents a more complex challenge. This paper delves into this advanced domain, proposing a novel approach called DocSynthv2 through the development of a simple yet effective autoregressive structured model. Our model, distinct in its integration of both la… ▽ More While the generation of document layouts has been extensively explored, comprehensive document generation encompassing both layout and content presents a more complex challenge. This paper delves into this advanced domain, proposing a novel approach called DocSynthv2 through the development of a simple yet effective autoregressive structured model. Our model, distinct in its integration of both layout and textual cues, marks a step beyond existing layout-generation approaches. By focusing on the relationship between the structural elements and the textual content within documents, we aim to generate cohesive and contextually relevant documents without any reliance on visual components. Through experimental studies on our curated benchmark for the new task, we demonstrate the ability of our model combining layout and textual information in enhancing the generation quality and relevance of documents, opening new pathways for research in document creation and automated design. Our findings emphasize the effectiveness of autoregressive models in handling complex document generation tasks. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: Spotlight (Oral) Acceptance to CVPR 2024 Workshop for Graphic Design Understanding and Generation (GDUG)

arXiv:2406.08226 [pdf, other]

DistilDoc: Knowledge Distillation for Visually-Rich Document Applications

Authors: Jordy Van Landeghem, Subhajit Maity, Ayan Banerjee, Matthew Blaschko, Marie-Francine Moens, Josep Lladós, Sanket Biswas

Abstract: This work explores knowledge distillation (KD) for visually-rich document (VRD) applications such as document layout analysis (DLA) and document image classification (DIC). While VRD research is dependent on increasingly sophisticated and cumbersome models, the field has neglected to study efficiency via model compression. Here, we design a KD experimentation methodology for more lean, performant… ▽ More This work explores knowledge distillation (KD) for visually-rich document (VRD) applications such as document layout analysis (DLA) and document image classification (DIC). While VRD research is dependent on increasingly sophisticated and cumbersome models, the field has neglected to study efficiency via model compression. Here, we design a KD experimentation methodology for more lean, performant models on document understanding (DU) tasks that are integral within larger task pipelines. We carefully selected KD strategies (response-based, feature-based) for distilling knowledge to and from backbones with different architectures (ResNet, ViT, DiT) and capacities (base, small, tiny). We study what affects the teacher-student knowledge gap and find that some methods (tuned vanilla KD, MSE, SimKD with an apt projector) can consistently outperform supervised student training. Furthermore, we design downstream task setups to evaluate covariate shift and the robustness of distilled DLA models on zero-shot layout-aware document visual question answering (DocVQA). DLA-KD experiments result in a large mAP knowledge gap, which unpredictably translates to downstream robustness, accentuating the need to further explore how to efficiently obtain more semantic document layout awareness. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: Accepted to ICDAR 2024 (Athens, Greece)

arXiv:2406.07315 [pdf, other]

Fetch-A-Set: A Large-Scale OCR-Free Benchmark for Historical Document Retrieval

Authors: Adrià Molina, Oriol Ramos Terrades, Josep Lladós

Abstract: This paper introduces Fetch-A-Set (FAS), a comprehensive benchmark tailored for legislative historical document analysis systems, addressing the challenges of large-scale document retrieval in historical contexts. The benchmark comprises a vast repository of documents dating back to the XVII century, serving both as a training resource and an evaluation benchmark for retrieval systems. It fills a… ▽ More This paper introduces Fetch-A-Set (FAS), a comprehensive benchmark tailored for legislative historical document analysis systems, addressing the challenges of large-scale document retrieval in historical contexts. The benchmark comprises a vast repository of documents dating back to the XVII century, serving both as a training resource and an evaluation benchmark for retrieval systems. It fills a critical gap in the literature by focusing on complex extractive tasks within the domain of cultural heritage. The proposed benchmark tackles the multifaceted problem of historical document analysis, including text-to-image retrieval for queries and image-to-text topic extraction from document fragments, all while accommodating varying levels of document legibility. This benchmark aims to spur advancements in the field by providing baselines and data for the development and evaluation of robust historical document retrieval systems, particularly in scenarios characterized by wide historical spectrum. △ Less

Submitted 16 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: Preprint for the manuscript accepted for publication in the DAS2024 LNCS proceedings

arXiv:2405.03104 [pdf, other]

GeoContrastNet: Contrastive Key-Value Edge Learning for Language-Agnostic Document Understanding

Authors: Nil Biescas, Carlos Boned, Josep Lladós, Sanket Biswas

Abstract: This paper presents GeoContrastNet, a language-agnostic framework to structured document understanding (DU) by integrating a contrastive learning objective with graph attention networks (GATs), emphasizing the significant role of geometric features. We propose a novel methodology that combines geometric edge features with visual features within an overall two-staged GAT-based framework, demonstrat… ▽ More This paper presents GeoContrastNet, a language-agnostic framework to structured document understanding (DU) by integrating a contrastive learning objective with graph attention networks (GATs), emphasizing the significant role of geometric features. We propose a novel methodology that combines geometric edge features with visual features within an overall two-staged GAT-based framework, demonstrating promising results in both link prediction and semantic entity recognition performance. Our findings reveal that combining both geometric and visual features could match the capabilities of large DU models that rely heavily on Optical Character Recognition (OCR) features in terms of performance accuracy and efficiency. This approach underscores the critical importance of relational layout information between the named text entities in a semi-structured layout of a page. Specifically, our results highlight the model's proficiency in identifying key-value relationships within the FUNSD dataset for forms and also discovering the spatial relationships in table-structured layouts for RVLCDIP business invoices. Our code and pretrained models will be accessible on our official GitHub. △ Less

Submitted 5 May, 2024; originally announced May 2024.

Comments: Accepted in ICDAR 2024 (Athens, Greece)

arXiv:2405.03099 [pdf, other]

SketchGPT: Autoregressive Modeling for Sketch Generation and Recognition

Authors: Adarsh Tiwari, Sanket Biswas, Josep Lladós

Abstract: We present SketchGPT, a flexible framework that employs a sequence-to-sequence autoregressive model for sketch generation, and completion, and an interpretation case study for sketch recognition. By map** complex sketches into simplified sequences of abstract primitives, our approach significantly streamlines the input for autoregressive modeling. SketchGPT leverages the next token prediction ob… ▽ More We present SketchGPT, a flexible framework that employs a sequence-to-sequence autoregressive model for sketch generation, and completion, and an interpretation case study for sketch recognition. By map** complex sketches into simplified sequences of abstract primitives, our approach significantly streamlines the input for autoregressive modeling. SketchGPT leverages the next token prediction objective strategy to understand sketch patterns, facilitating the creation and completion of drawings and also categorizing them accurately. This proposed sketch representation strategy aids in overcoming existing challenges of autoregressive modeling for continuous stroke data, enabling smoother model training and competitive performance. Our findings exhibit SketchGPT's capability to generate a diverse variety of drawings by adding both qualitative and quantitative comparisons with existing state-of-the-art, along with a comprehensive human evaluation study. The code and pretrained models will be released on our official GitHub. △ Less

Submitted 5 May, 2024; originally announced May 2024.

Comments: Accepted in ICDAR 2024

arXiv:2404.00412 [pdf, other]

SVGCraft: Beyond Single Object Text-to-SVG Synthesis with Comprehensive Canvas Layout

Authors: Ayan Banerjee, Nityanand Mathur, Josep Lladós, Umapada Pal, Anjan Dutta

Abstract: Generating VectorArt from text prompts is a challenging vision task, requiring diverse yet realistic depictions of the seen as well as unseen entities. However, existing research has been mostly limited to the generation of single objects, rather than comprehensive scenes comprising multiple elements. In response, this work introduces SVGCraft, a novel end-to-end framework for the creation of vect… ▽ More Generating VectorArt from text prompts is a challenging vision task, requiring diverse yet realistic depictions of the seen as well as unseen entities. However, existing research has been mostly limited to the generation of single objects, rather than comprehensive scenes comprising multiple elements. In response, this work introduces SVGCraft, a novel end-to-end framework for the creation of vector graphics depicting entire scenes from textual descriptions. Utilizing a pre-trained LLM for layout generation from text prompts, this framework introduces a technique for producing masked latents in specified bounding boxes for accurate object placement. It introduces a fusion mechanism for integrating attention maps and employs a diffusion U-Net for coherent composition, speeding up the drawing process. The resulting SVG is optimized using a pre-trained encoder and LPIPS loss with opacity modulation to maximize similarity. Additionally, this work explores the potential of primitive shapes in facilitating canvas completion in constrained environments. Through both qualitative and quantitative assessments, SVGCraft is demonstrated to surpass prior works in abstraction, recognizability, and detail, as evidenced by its performance metrics (CLIP-T: 0.4563, Cosine Similarity: 0.6342, Confusion: 0.66, Aesthetic: 6.7832). The code will be available at https://github.com/ayanban011/SVGCraft. △ Less

Submitted 30 March, 2024; originally announced April 2024.

arXiv:2402.11401 [pdf, other]

GraphKD: Exploring Knowledge Distillation Towards Document Object Detection with Structured Graph Creation

Authors: Ayan Banerjee, Sanket Biswas, Josep Lladós, Umapada Pal

Abstract: Object detection in documents is a key step to automate the structural elements identification process in a digital or scanned document through understanding the hierarchical structure and relationships between different elements. Large and complex models, while achieving high accuracy, can be computationally expensive and memory-intensive, making them impractical for deployment on resource constr… ▽ More Object detection in documents is a key step to automate the structural elements identification process in a digital or scanned document through understanding the hierarchical structure and relationships between different elements. Large and complex models, while achieving high accuracy, can be computationally expensive and memory-intensive, making them impractical for deployment on resource constrained devices. Knowledge distillation allows us to create small and more efficient models that retain much of the performance of their larger counterparts. Here we present a graph-based knowledge distillation framework to correctly identify and localize the document objects in a document image. Here, we design a structured graph with nodes containing proposal-level features and edges representing the relationship between the different proposal regions. Also, to reduce text bias an adaptive node sampling strategy is designed to prune the weight distribution and put more weightage on non-text nodes. We encode the complete graph as a knowledge representation and transfer it from the teacher to the student through the proposed distillation loss by effectively capturing both local and global information concurrently. Extensive experimentation on competitive benchmarks demonstrates that the proposed framework outperforms the current state-of-the-art approaches. The code will be available at: https://github.com/ayanban011/GraphKD. △ Less

Submitted 20 February, 2024; v1 submitted 17 February, 2024; originally announced February 2024.

arXiv:2312.05086 [pdf, other]

doi 10.1007/978-3-031-19745-1_20

I Can't Believe It's Not Better: In-air Movement For Alzheimer Handwriting Synthetic Generation

Authors: Asma Bensalah, Antonio Parziale, Giuseppe De Gregorio, Angelo Marcelli, Alicia Fornés, Lladós

Abstract: During recent years, there here has been a boom in terms of deep learning use for handwriting analysis and recognition. One main application for handwriting analysis is early detection and diagnosis in the health field. Unfortunately, most real case problems still suffer a scarcity of data, which makes difficult the use of deep learning-based models. To alleviate this problem, some works resort to… ▽ More During recent years, there here has been a boom in terms of deep learning use for handwriting analysis and recognition. One main application for handwriting analysis is early detection and diagnosis in the health field. Unfortunately, most real case problems still suffer a scarcity of data, which makes difficult the use of deep learning-based models. To alleviate this problem, some works resort to synthetic data generation. Lately, more works are directed towards guided data synthetic generation, a generation that uses the domain and data knowledge to generate realistic data that can be useful to train deep learning models. In this work, we combine the domain knowledge about the Alzheimer's disease for handwriting and use it for a more guided data generation. Concretely, we have explored the use of in-air movements for synthetic data generation. △ Less

Submitted 8 December, 2023; originally announced December 2023.

arXiv:2310.00917 [pdf, other]

Harnessing the Power of Multi-Lingual Datasets for Pre-training: Towards Enhancing Text Spotting Performance

Authors: Alloy Das, Sanket Biswas, Ayan Banerjee, Josep Lladós, Umapada Pal, Saumik Bhattacharya

Abstract: The adaptation capability to a wide range of domains is crucial for scene text spotting models when deployed to real-world conditions. However, existing state-of-the-art (SOTA) approaches usually incorporate scene text detection and recognition simply by pretraining on natural scene text datasets, which do not directly exploit the intermediate feature representations between multiple domains. Here… ▽ More The adaptation capability to a wide range of domains is crucial for scene text spotting models when deployed to real-world conditions. However, existing state-of-the-art (SOTA) approaches usually incorporate scene text detection and recognition simply by pretraining on natural scene text datasets, which do not directly exploit the intermediate feature representations between multiple domains. Here, we investigate the problem of domain-adaptive scene text spotting, i.e., training a model on multi-domain source data such that it can directly adapt to target domains rather than being specialized for a specific domain or scenario. Further, we investigate a transformer baseline called Swin-TESTR to focus on solving scene-text spotting for both regular and arbitrary-shaped scene text along with an exhaustive evaluation. The results clearly demonstrate the potential of intermediate representations to achieve significant performance on text spotting benchmarks across multiple domains (e.g. language, synth-to-real, and documents). both in terms of accuracy and efficiency. △ Less

Submitted 1 November, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: Accepted to the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2024)

arXiv:2310.00558 [pdf, other]

Diving into the Depths of Spotting Text in Multi-Domain Noisy Scenes

Authors: Alloy Das, Sanket Biswas, Umapada Pal, Josep Lladós

Abstract: When used in a real-world noisy environment, the capacity to generalize to multiple domains is essential for any autonomous scene text spotting system. However, existing state-of-the-art methods employ pretraining and fine-tuning strategies on natural scene datasets, which do not exploit the feature interaction across other complex domains. In this work, we explore and investigate the problem of d… ▽ More When used in a real-world noisy environment, the capacity to generalize to multiple domains is essential for any autonomous scene text spotting system. However, existing state-of-the-art methods employ pretraining and fine-tuning strategies on natural scene datasets, which do not exploit the feature interaction across other complex domains. In this work, we explore and investigate the problem of domain-agnostic scene text spotting, i.e., training a model on multi-domain source data such that it can directly generalize to target domains rather than being specialized for a specific domain or scenario. In this regard, we present the community a text spotting validation benchmark called Under-Water Text (UWT) for noisy underwater scenes to establish an important case study. Moreover, we also design an efficient super-resolution based end-to-end transformer baseline called DA-TextSpotter which achieves comparable or superior performance over existing text spotting architectures for both regular and arbitrary-shaped scene text spotting benchmarks in terms of both accuracy and model efficiency. The dataset, code and pre-trained models will be released upon acceptance. △ Less

Submitted 17 February, 2024; v1 submitted 30 September, 2023; originally announced October 2023.

Comments: Accepted to ICRA 2024

arXiv:2309.05756 [pdf, other]

TransferDoc: A Self-Supervised Transferable Document Representation Learning Model Unifying Vision and Language

Authors: Souhail Bakkali, Sanket Biswas, Zuheng Ming, Mickael Coustaty, Marçal Rusiñol, Oriol Ramos Terrades, Josep Lladós

Abstract: The field of visual document understanding has witnessed a rapid growth in emerging challenges and powerful multi-modal strategies. However, they rely on an extensive amount of document data to learn their pretext objectives in a ``pre-train-then-fine-tune'' paradigm and thus, suffer a significant performance drop in real-world online industrial settings. One major reason is the over-reliance on O… ▽ More The field of visual document understanding has witnessed a rapid growth in emerging challenges and powerful multi-modal strategies. However, they rely on an extensive amount of document data to learn their pretext objectives in a ``pre-train-then-fine-tune'' paradigm and thus, suffer a significant performance drop in real-world online industrial settings. One major reason is the over-reliance on OCR engines to extract local positional information within a document page. Therefore, this hinders the model's generalizability, flexibility and robustness due to the lack of capturing global information within a document image. We introduce TransferDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised fashion using three novel pretext objectives. TransferDoc learns richer semantic concepts by unifying language and visual representations, which enables the production of more transferable models. Besides, two novel downstream tasks have been introduced for a ``closer-to-real'' industrial evaluation scenario where TransferDoc outperforms other state-of-the-art approaches. △ Less

Submitted 11 September, 2023; originally announced September 2023.

Comments: Preprint to Pattern Recognition

arXiv:2305.04609 [pdf, other]

doi 10.1007/978-3-031-41676-7_18

SwinDocSegmenter: An End-to-End Unified Domain Adaptive Transformer for Document Instance Segmentation

Authors: Ayan Banerjee, Sanket Biswas, Josep Lladós, Umapada Pal

Abstract: Instance-level segmentation of documents consists in assigning a class-aware and instance-aware label to each pixel of the image. It is a key step in document parsing for their understanding. In this paper, we present a unified transformer encoder-decoder architecture for en-to-end instance segmentation of complex layouts in document images. The method adapts a contrastive training with a mixed qu… ▽ More Instance-level segmentation of documents consists in assigning a class-aware and instance-aware label to each pixel of the image. It is a key step in document parsing for their understanding. In this paper, we present a unified transformer encoder-decoder architecture for en-to-end instance segmentation of complex layouts in document images. The method adapts a contrastive training with a mixed query selection for anchor initialization in the decoder. Later on, it performs a dot product between the obtained query embeddings and the pixel embedding map (coming from the encoder) for semantic reasoning. Extensive experimentation on competitive benchmarks like PubLayNet, PRIMA, Historical Japanese (HJ), and TableBank demonstrate that our model with SwinL backbone achieves better segmentation performance than the existing state-of-the-art approaches with the average precision of \textbf{93.72}, \textbf{54.39}, \textbf{84.65} and \textbf{98.04} respectively under one billion parameters. The code is made publicly available at: \href{https://github.com/ayanban011/SwinDocSegmenter}{github.com/ayanban011/SwinDocSegmenter} △ Less

Submitted 8 May, 2023; originally announced May 2023.

Comments: Accepted to ICDAR 2023 (San Jose, California)

arXiv:2305.00795 [pdf, other]

doi 10.1007/978-3-031-41676-7_20

SelfDocSeg: A Self-Supervised vision-based Approach towards Document Segmentation

Authors: Subhajit Maity, Sanket Biswas, Siladittya Manna, Ayan Banerjee, Josep Lladós, Saumik Bhattacharya, Umapada Pal

Abstract: Document layout analysis is a known problem to the documents research community and has been vastly explored yielding a multitude of solutions ranging from text mining, and recognition to graph-based representation, visual feature extraction, etc. However, most of the existing works have ignored the crucial fact regarding the scarcity of labeled data. With growing internet connectivity to personal… ▽ More Document layout analysis is a known problem to the documents research community and has been vastly explored yielding a multitude of solutions ranging from text mining, and recognition to graph-based representation, visual feature extraction, etc. However, most of the existing works have ignored the crucial fact regarding the scarcity of labeled data. With growing internet connectivity to personal life, an enormous amount of documents had been available in the public domain and thus making data annotation a tedious task. We address this challenge using self-supervision and unlike, the few existing self-supervised document segmentation approaches which use text mining and textual labels, we use a complete vision-based approach in pre-training without any ground-truth label or its derivative. Instead, we generate pseudo-layouts from the document images to pre-train an image encoder to learn the document object representation and localization in a self-supervised framework before fine-tuning it with an object detection model. We show that our pipeline sets a new benchmark in this context and performs at par with the existing methods and the supervised counterparts, if not outperforms. The code is made publicly available at: https://github.com/MaitySubhajit/SelfDocSeg △ Less

Submitted 20 August, 2023; v1 submitted 1 May, 2023; originally announced May 2023.

Comments: Accepted at The 17th International Conference on Document Analysis and Recognition (ICDAR 2023)

Journal ref: ICDAR 2023 (International Conference on Document Analysis and Recognition) Lecture Notes in Computer Science, vol 14187, pp. 342-360. Springer Nature

arXiv:2212.14797 [pdf, other]

doi 10.1007/978-3-031-19745-1_25

Easing Automatic Neurorehabilitation via Classification and Smoothness Analysis

Authors: Asma Bensalah, Alicia Fornés, Cristina Carmona-Duarte, Josep Lladós

Abstract: Assessing the quality of movements for post-stroke patients during the rehabilitation phase is vital given that there is no standard stroke rehabilitation plan for all the patients. In fact, it depends basically on the patient's functional independence and its progress along the rehabilitation sessions. To tackle this challenge and make neurorehabilitation more agile, we propose an automatic asses… ▽ More Assessing the quality of movements for post-stroke patients during the rehabilitation phase is vital given that there is no standard stroke rehabilitation plan for all the patients. In fact, it depends basically on the patient's functional independence and its progress along the rehabilitation sessions. To tackle this challenge and make neurorehabilitation more agile, we propose an automatic assessment pipeline that starts by recognizing patients' movements by means of a shallow deep learning architecture, then measuring the movement quality using jerk measure and related measures. A particularity of this work is that the dataset used is clinically relevant, since it represents movements inspired from Fugl-Meyer a well common upper-limb clinical stroke assessment scale for stroke patients. We show that it is possible to detect the contrast between healthy and patients movements in terms of smoothness, besides achieving conclusions about the patients' progress during the rehabilitation sessions that correspond to the clinicians' findings about each case. △ Less

Submitted 9 December, 2022; originally announced December 2022.

arXiv:2212.05063 [pdf, other]

doi 10.1007/978-3-031-19745-1_16

The RPM3D project: 3D Kinematics for Remote Patient Monitoring

Authors: Alicia Fornés, Asma Bensalah, Cristina Carmona-Duarte, Jialuo Chen, Miguel A. Ferrer, Andreas Fischer, Josep Lladós, Cristina Martín, Eloy Opisso, Réjean Plamondon, Anna Scius-Bertrand, Josep Maria Tormos

Abstract: This project explores the feasibility of remote patient monitoring based on the analysis of 3D movements captured with smartwatches. We base our analysis on the Kinematic Theory of Rapid Human Movement. We have validated our research in a real case scenario for stroke rehabilitation at the Guttmann Institute5 (neurorehabilitation hospital), showing promising results. Our work could have a great im… ▽ More This project explores the feasibility of remote patient monitoring based on the analysis of 3D movements captured with smartwatches. We base our analysis on the Kinematic Theory of Rapid Human Movement. We have validated our research in a real case scenario for stroke rehabilitation at the Guttmann Institute5 (neurorehabilitation hospital), showing promising results. Our work could have a great impact in remote healthcare applications, improving the medical efficiency and reducing the healthcare costs. Future steps include more clinical validation, develo** multi-modal analysis architectures (analysing data from sensors, images, audio, etc.), and exploring the application of our technology to monitor other neurodegenerative diseases. △ Less

Submitted 9 December, 2022; originally announced December 2022.

arXiv:2212.05062 [pdf, other]

doi 10.1007/978-3-030-68763-2_36

Towards Stroke Patients' Upper-limb Automatic Motor Assessment Using Smartwatches

Authors: Asma Bensalah, Jialuo Chen, Alicia Fornés, Cristina Carmona-Duarte, Josep Lladós, Miguel A. Ferrer

Abstract: Assessing the physical condition in rehabilitation scenarios is a challenging problem, since it involves Human Activity Recognition (HAR) and kinematic analysis methods. In addition, the difficulties increase in unconstrained rehabilitation scenarios, which are much closer to the real use cases. In particular, our aim is to design an upper-limb assessment pipeline for stroke patients using smartwa… ▽ More Assessing the physical condition in rehabilitation scenarios is a challenging problem, since it involves Human Activity Recognition (HAR) and kinematic analysis methods. In addition, the difficulties increase in unconstrained rehabilitation scenarios, which are much closer to the real use cases. In particular, our aim is to design an upper-limb assessment pipeline for stroke patients using smartwatches. We focus on the HAR task, as it is the first part of the assessing pipeline. Our main target is to automatically detect and recognize four key movements inspired by the Fugl-Meyer assessment scale, which are performed in both constrained and unconstrained scenarios. In addition to the application protocol and dataset, we propose two detection and classification baseline methods. We believe that the proposed framework, dataset and baseline results will serve to foster this research field. △ Less

Submitted 9 December, 2022; originally announced December 2022.

arXiv:2212.00554 [pdf, other]

Early prediction of the risk of ICU mortality with Deep Federated Learning

Authors: Korbinian Randl, Núria Lladós Armengol, Lena Mondrejevski, Ioanna Miliou

Abstract: Intensive Care Units usually carry patients with a serious risk of mortality. Recent research has shown the ability of Machine Learning to indicate the patients' mortality risk and point physicians toward individuals with a heightened need for care. Nevertheless, healthcare data is often subject to privacy regulations and can therefore not be easily shared in order to build Centralized Machine Lea… ▽ More Intensive Care Units usually carry patients with a serious risk of mortality. Recent research has shown the ability of Machine Learning to indicate the patients' mortality risk and point physicians toward individuals with a heightened need for care. Nevertheless, healthcare data is often subject to privacy regulations and can therefore not be easily shared in order to build Centralized Machine Learning models that use the combined data of multiple hospitals. Federated Learning is a Machine Learning framework designed for data privacy that can be used to circumvent this problem. In this study, we evaluate the ability of deep Federated Learning to predict the risk of Intensive Care Unit mortality at an early stage. We compare the predictive performance of Federated, Centralized, and Local Machine Learning in terms of AUPRC, F1-score, and AUROC. Our results show that Federated Learning performs equally well as the centralized approach and is substantially better than the local approach, thus providing a viable solution for early Intensive Care Unit mortality prediction. In addition, we show that the prediction performance is higher when the patient history window is closer to discharge or death. Finally, we show that using the F1-score as an early stop** metric can stabilize and increase the performance of our approach for the task at hand. △ Less

Submitted 5 December, 2022; v1 submitted 1 December, 2022; originally announced December 2022.

Comments: 12 pages

arXiv:2209.10441 [pdf, other]

A Few Shot Multi-Representation Approach for N-gram Spotting in Historical Manuscripts

Authors: Giuseppe De Gregorio, Sanket Biswas, Mohamed Ali Souibgui, Asma Bensalah, Josep Lladós, Alicia Fornés, Angelo Marcelli

Abstract: Despite recent advances in automatic text recognition, the performance remains moderate when it comes to historical manuscripts. This is mainly because of the scarcity of available labelled data to train the data-hungry Handwritten Text Recognition (HTR) models. The Keyword Spotting System (KWS) provides a valid alternative to HTR due to the reduction in error rate, but it is usually limited to a… ▽ More Despite recent advances in automatic text recognition, the performance remains moderate when it comes to historical manuscripts. This is mainly because of the scarcity of available labelled data to train the data-hungry Handwritten Text Recognition (HTR) models. The Keyword Spotting System (KWS) provides a valid alternative to HTR due to the reduction in error rate, but it is usually limited to a closed reference vocabulary. In this paper, we propose a few-shot learning paradigm for spotting sequences of a few characters (N-gram) that requires a small amount of labelled training data. We exhibit that recognition of important n-grams could reduce the system's dependency on vocabulary. In this case, an out-of-vocabulary (OOV) word in an input handwritten line image could be a sequence of n-grams that belong to the lexicon. An extensive experimental evaluation of our proposed multi-representation approach was carried out on a subset of Bentham's historical manuscript collections to obtain some really promising results in this direction. △ Less

Submitted 21 September, 2022; originally announced September 2022.

Comments: Accepted in ICFHR 2022

arXiv:2208.11168 [pdf, other]

doi 10.1007/978-3-031-25069-9_22

Doc2Graph: a Task Agnostic Document Understanding Framework based on Graph Neural Networks

Authors: Andrea Gemelli, Sanket Biswas, Enrico Civitelli, Josep Lladós, Simone Marinai

Abstract: Geometric Deep Learning has recently attracted significant interest in a wide range of machine learning fields, including document analysis. The application of Graph Neural Networks (GNNs) has become crucial in various document-related tasks since they can unravel important structural patterns, fundamental in key information extraction processes. Previous works in the literature propose task-drive… ▽ More Geometric Deep Learning has recently attracted significant interest in a wide range of machine learning fields, including document analysis. The application of Graph Neural Networks (GNNs) has become crucial in various document-related tasks since they can unravel important structural patterns, fundamental in key information extraction processes. Previous works in the literature propose task-driven models and do not take into account the full power of graphs. We propose Doc2Graph, a task-agnostic document understanding framework based on a GNN model, to solve different tasks given different types of documents. We evaluated our approach on two challenging datasets for key information extraction in form understanding, invoice layout analysis and table detection. Our code is freely accessible on https://github.com/andreagemelli/doc2graph. △ Less

Submitted 23 August, 2022; originally announced August 2022.

Comments: TiE ECCV 2022

Journal ref: Computer Vision -- ECCV 2022 Workshops, vol. 13804, 2023, pp. 329-344

arXiv:2204.04028 [pdf, other]

A Generic Image Retrieval Method for Date Estimation of Historical Document Collections

Authors: Adrià Molina, Lluis Gomez, Oriol Ramos Terrades, Josep Lladós

Abstract: Date estimation of historical document images is a challenging problem, with several contributions in the literature that lack of the ability to generalize from one dataset to others. This paper presents a robust date estimation system based in a retrieval approach that generalizes well in front of heterogeneous collections. we use a ranking loss function named smooth-nDCG to train a Convolutional… ▽ More Date estimation of historical document images is a challenging problem, with several contributions in the literature that lack of the ability to generalize from one dataset to others. This paper presents a robust date estimation system based in a retrieval approach that generalizes well in front of heterogeneous collections. we use a ranking loss function named smooth-nDCG to train a Convolutional Neural Network that learns an ordination of documents for each problem. One of the main usages of the presented approach is as a tool for historical contextual retrieval. It means that scholars could perform comparative analysis of historical images from big datasets in terms of the period where they were produced. We provide experimental evaluation on different types of documents from real datasets of manuscript and newspaper images. △ Less

Submitted 8 April, 2022; originally announced April 2022.

Comments: Preprint of paper accepted at DAS2022

arXiv:2203.04814 [pdf, other]

Text-DIAE: A Self-Supervised Degradation Invariant Autoencoders for Text Recognition and Document Enhancement

Authors: Mohamed Ali Souibgui, Sanket Biswas, Andres Mafla, Ali Furkan Biten, Alicia Fornés, Yousri Kessentini, Josep Lladós, Lluis Gomez, Dimosthenis Karatzas

Abstract: In this paper, we propose a Text-Degradation Invariant Auto Encoder (Text-DIAE), a self-supervised model designed to tackle two tasks, text recognition (handwritten or scene-text) and document image enhancement. We start by employing a transformer-based architecture that incorporates three pretext tasks as learning objectives to be optimized during pre-training without the usage of labeled data. E… ▽ More In this paper, we propose a Text-Degradation Invariant Auto Encoder (Text-DIAE), a self-supervised model designed to tackle two tasks, text recognition (handwritten or scene-text) and document image enhancement. We start by employing a transformer-based architecture that incorporates three pretext tasks as learning objectives to be optimized during pre-training without the usage of labeled data. Each of the pretext objectives is specifically tailored for the final downstream tasks. We conduct several ablation experiments that confirm the design choice of the selected pretext tasks. Importantly, the proposed model does not exhibit limitations of previous state-of-the-art methods based on contrastive losses, while at the same time requiring substantially fewer data samples to converge. Finally, we demonstrate that our method surpasses the state-of-the-art in existing supervised and self-supervised settings in handwritten and scene text recognition and document image enhancement. Our code and trained models will be made publicly available at~\url{ http://Upon_Acceptance}. △ Less

Submitted 18 August, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

Comments: Preprint

arXiv:2201.11438 [pdf, other]

DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer

Authors: Sanket Biswas, Ayan Banerjee, Josep Lladós, Umapada Pal

Abstract: Understanding documents with rich layouts is an essential step towards information extraction. Business intelligence processes often require the extraction of useful semantic content from documents at a large scale for subsequent decision-making tasks. In this context, instance-level segmentation of different document objects (title, sections, figures etc.) has emerged as an interesting problem fo… ▽ More Understanding documents with rich layouts is an essential step towards information extraction. Business intelligence processes often require the extraction of useful semantic content from documents at a large scale for subsequent decision-making tasks. In this context, instance-level segmentation of different document objects (title, sections, figures etc.) has emerged as an interesting problem for the document analysis and understanding community. To advance the research in this direction, we present a transformer-based model called \emph{DocSegTr} for end-to-end instance segmentation of complex layouts in document images. The method adapts a twin attention module, for semantic reasoning, which helps to become highly computationally efficient compared with the state-of-the-art. To the best of our knowledge, this is the first work on transformer-based document segmentation. Extensive experimentation on competitive benchmarks like PubLayNet, PRIMA, Historical Japanese (HJ) and TableBank demonstrate that our model achieved comparable or better segmentation performance than the existing state-of-the-art approaches with the average precision of 89.4, 40.3, 83.4 and 93.3. This simple and flexible framework could serve as a promising baseline for instance-level recognition tasks in document images. △ Less

Submitted 21 September, 2022; v1 submitted 27 January, 2022; originally announced January 2022.

Comments: Preprint

arXiv:2201.10252 [pdf, other]

DocEnTr: An End-to-End Document Image Enhancement Transformer

Authors: Mohamed Ali Souibgui, Sanket Biswas, Sana Khamekhem Jemni, Yousri Kessentini, Alicia Fornés, Josep Lladós, Umapada Pal

Abstract: Document images can be affected by many degradation scenarios, which cause recognition and processing difficulties. In this age of digitization, it is important to denoise them for proper usage. To address this challenge, we present a new encoder-decoder architecture based on vision transformers to enhance both machine-printed and handwritten document images, in an end-to-end fashion. The encoder… ▽ More Document images can be affected by many degradation scenarios, which cause recognition and processing difficulties. In this age of digitization, it is important to denoise them for proper usage. To address this challenge, we present a new encoder-decoder architecture based on vision transformers to enhance both machine-printed and handwritten document images, in an end-to-end fashion. The encoder operates directly on the pixel patches with their positional information without the use of any convolutional layers, while the decoder reconstructs a clean image from the encoded patches. Conducted experiments show a superiority of the proposed model compared to the state-of the-art methods on several DIBCO benchmarks. Code and models will be publicly available at: \url{https://github.com/dali92002/DocEnTR}. △ Less

Submitted 25 January, 2022; originally announced January 2022.

Comments: submitted to ICPR 2022

arXiv:2109.11874 [pdf, other]

Localizing Infinity-shaped fishes: Sketch-guided object localization in the wild

Authors: Pau Riba, Sounak Dey, Ali Furkan Biten, Josep Llados

Abstract: This work investigates the problem of sketch-guided object localization (SGOL), where human sketches are used as queries to conduct the object localization in natural images. In this cross-modal setting, we first contribute with a tough-to-beat baseline that without any specific SGOL training is able to outperform the previous works on a fixed set of classes. The baseline is useful to analyze the… ▽ More This work investigates the problem of sketch-guided object localization (SGOL), where human sketches are used as queries to conduct the object localization in natural images. In this cross-modal setting, we first contribute with a tough-to-beat baseline that without any specific SGOL training is able to outperform the previous works on a fixed set of classes. The baseline is useful to analyze the performance of SGOL approaches based on available simple yet powerful methods. We advance prior arts by proposing a sketch-conditioned DETR (DEtection TRansformer) architecture which avoids a hard classification and alleviates the domain gap between sketches and images to localize object instances. Although the main goal of SGOL is focused on object detection, we explored its natural extension to sketch-guided instance segmentation. This novel task allows to move towards identifying the objects at pixel level, which is of key importance in several applications. We experimentally demonstrate that our model and its variants significantly advance over previous state-of-the-art results. All training and testing code of our model will be released to facilitate future research{https://github.com/priba/sgol_wild}. △ Less

Submitted 24 September, 2021; originally announced September 2021.

Comments: Under Review

arXiv:2107.04357 [pdf, other]

Graph-based Deep Generative Modelling for Document Layout Generation

Authors: Sanket Biswas, Pau Riba, Josep Lladós, Umapada Pal

Abstract: One of the major prerequisites for any deep learning approach is the availability of large-scale training data. When dealing with scanned document images in real world scenarios, the principal information of its content is stored in the layout itself. In this work, we have proposed an automated deep generative model using Graph Neural Networks (GNNs) to generate synthetic data with highly variable… ▽ More One of the major prerequisites for any deep learning approach is the availability of large-scale training data. When dealing with scanned document images in real world scenarios, the principal information of its content is stored in the layout itself. In this work, we have proposed an automated deep generative model using Graph Neural Networks (GNNs) to generate synthetic data with highly variable and plausible document layouts that can be used to train document interpretation systems, in this case, specially in digital mailroom applications. It is also the first graph-based approach for document layout generation task experimented on administrative document images, in this case, invoices. △ Less

Submitted 9 July, 2021; originally announced July 2021.

Comments: Accepted by ICDAR Workshops-GLESDO 2021

arXiv:2107.02638 [pdf, other]

DocSynth: A Layout Guided Approach for Controllable Document Image Synthesis

Authors: Sanket Biswas, Pau Riba, Josep Lladós, Umapada Pal

Abstract: Despite significant progress on current state-of-the-art image generation models, synthesis of document images containing multiple and complex object layouts is a challenging task. This paper presents a novel approach, called DocSynth, to automatically synthesize document images based on a given layout. In this work, given a spatial layout (bounding boxes with object categories) as a reference by… ▽ More Despite significant progress on current state-of-the-art image generation models, synthesis of document images containing multiple and complex object layouts is a challenging task. This paper presents a novel approach, called DocSynth, to automatically synthesize document images based on a given layout. In this work, given a spatial layout (bounding boxes with object categories) as a reference by the user, our proposed DocSynth model learns to generate a set of realistic document images consistent with the defined layout. Also, this framework has been adapted to this work as a superior baseline model for creating synthetic document image datasets for augmenting real data during training for document layout analysis tasks. Different sets of learning objectives have been also used to improve the model performance. Quantitatively, we also compare the generated results of our model with real data using standard evaluation metrics. The results highlight that our model can successfully generate realistic and diverse document images with multiple objects. We also present a comprehensive qualitative analysis summary of the different scopes of synthetic image generation tasks. Lastly, to our knowledge this is the first work of its kind. △ Less

Submitted 6 July, 2021; originally announced July 2021.

Comments: Accepted by ICDAR 2021

arXiv:2106.05618 [pdf, other]

Date Estimation in the Wild of Scanned Historical Photos: An Image Retrieval Approach

Authors: Adrià Molina, Pau Riba, Lluis Gomez, Oriol Ramos-Terrades, Josep Lladós

Abstract: This paper presents a novel method for date estimation of historical photographs from archival sources. The main contribution is to formulate the date estimation as a retrieval task, where given a query, the retrieved images are ranked in terms of the estimated date similarity. The closer are their embedded representations the closer are their dates. Contrary to the traditional models that design… ▽ More This paper presents a novel method for date estimation of historical photographs from archival sources. The main contribution is to formulate the date estimation as a retrieval task, where given a query, the retrieved images are ranked in terms of the estimated date similarity. The closer are their embedded representations the closer are their dates. Contrary to the traditional models that design a neural network that learns a classifier or a regressor, we propose a learning objective based on the nDCG ranking metric. We have experimentally evaluated the performance of the method in two different tasks: date estimation and date-sensitive image retrieval, using the DEW public database, overcoming the baseline methods. △ Less

Submitted 10 June, 2021; originally announced June 2021.

Comments: Accepted at ICDAR 2021

arXiv:2106.05144 [pdf, other]

Learning to Rank Words: Optimizing Ranking Metrics for Word Spotting

Authors: Pau Riba, Adrià Molina, Lluis Gomez, Oriol Ramos-Terrades, Josep Lladós

Abstract: In this paper, we explore and evaluate the use of ranking-based objective functions for learning simultaneously a word string and a word image encoder. We consider retrieval frameworks in which the user expects a retrieval list ranked according to a defined relevance score. In the context of a word spotting problem, the relevance score has been set according to the string edit distance from the qu… ▽ More In this paper, we explore and evaluate the use of ranking-based objective functions for learning simultaneously a word string and a word image encoder. We consider retrieval frameworks in which the user expects a retrieval list ranked according to a defined relevance score. In the context of a word spotting problem, the relevance score has been set according to the string edit distance from the query string. We experimentally demonstrate the competitive performance of the proposed model on query-by-string word spotting for both, handwritten and real scene word images. We also provide the results for query-by-example word spotting, although it is not the main focus of this work. △ Less

Submitted 9 June, 2021; originally announced June 2021.

Comments: Accepted at ICDAR 2021

arXiv:2105.05300 [pdf, other]

One-shot Compositional Data Generation for Low Resource Handwritten Text Recognition

Authors: Mohamed Ali Souibgui, Ali Furkan Biten, Sounak Dey, Alicia Fornés, Yousri Kessentini, Lluis Gomez, Dimosthenis Karatzas, Josep Lladós

Abstract: Low resource Handwritten Text Recognition (HTR) is a hard problem due to the scarce annotated data and the very limited linguistic information (dictionaries and language models). For example, in the case of historical ciphered manuscripts, which are usually written with invented alphabets to hide the message contents. Thus, in this paper we address this problem through a data generation technique… ▽ More Low resource Handwritten Text Recognition (HTR) is a hard problem due to the scarce annotated data and the very limited linguistic information (dictionaries and language models). For example, in the case of historical ciphered manuscripts, which are usually written with invented alphabets to hide the message contents. Thus, in this paper we address this problem through a data generation technique based on Bayesian Program Learning (BPL). Contrary to traditional generation approaches, which require a huge amount of annotated images, our method is able to generate human-like handwriting using only one sample of each symbol in the alphabet. After generating symbols, we create synthetic lines to train state-of-the-art HTR architectures in a segmentation free fashion. Quantitative and qualitative analyses were carried out and confirm the effectiveness of the proposed method. △ Less

Submitted 5 October, 2021; v1 submitted 11 May, 2021; originally announced May 2021.

Comments: Accepted in WACV 2022

arXiv:2008.07641 [pdf, other]

Learning Graph Edit Distance by Graph Neural Networks

Authors: Pau Riba, Andreas Fischer, Josep Lladós, Alicia Fornés

Abstract: The emergence of geometric deep learning as a novel framework to deal with graph-based representations has faded away traditional approaches in favor of completely new methodologies. In this paper, we propose a new framework able to combine the advances on deep metric learning with traditional approximations of the graph edit distance. Hence, we propose an efficient graph distance based on the nov… ▽ More The emergence of geometric deep learning as a novel framework to deal with graph-based representations has faded away traditional approaches in favor of completely new methodologies. In this paper, we propose a new framework able to combine the advances on deep metric learning with traditional approximations of the graph edit distance. Hence, we propose an efficient graph distance based on the novel field of geometric deep learning. Our method employs a message passing neural network to capture the graph structure, and thus, leveraging this information for its use on a distance computation. The performance of the proposed graph distance is validated on two different scenarios. On the one hand, in a graph retrieval of handwritten words~\ie~keyword spotting, showing its superior performance when compared with (approximate) graph edit distance benchmarks. On the other hand, demonstrating competitive results for graph similarity learning when compared with the current state-of-the-art on a recent benchmark dataset. △ Less

Submitted 17 August, 2020; originally announced August 2020.

arXiv:1912.10016 [pdf, other]

A Neural Model for Text Localization, Transcription and Named Entity Recognition in Full Pages

Authors: Manuel Carbonell, Alicia Fornés, Mauricio Villegas, Josep Lladós

Abstract: In the last years, the consolidation of deep neural network architectures for information extraction in document images has brought big improvements in the performance of each of the tasks involved in this process, consisting of text localization, transcription, and named entity recognition. However, this process is traditionally performed with separate methods for each task. In this work we propo… ▽ More In the last years, the consolidation of deep neural network architectures for information extraction in document images has brought big improvements in the performance of each of the tasks involved in this process, consisting of text localization, transcription, and named entity recognition. However, this process is traditionally performed with separate methods for each task. In this work we propose an end-to-end model that combines a one stage object detection network with branches for the recognition of text and named entities respectively in a way that shared features can be learned simultaneously from the training error of each of the tasks. By doing so the model jointly performs handwritten text detection, transcription, and named entity recognition at page level with a single feed forward step. We exhaustively evaluate our approach on different datasets, discussing its advantages and limitations compared to sequential approaches. The results show that the model is capable of benefiting from shared features for simultaneously solving interdependent tasks. △ Less

Submitted 4 May, 2020; v1 submitted 20 December, 2019; originally announced December 2019.

Comments: To be published in Pattern Recognition Letters

arXiv:1910.08993 [pdf, other]

Identity Document and banknote security forensics: a survey

Authors: Albert Berenguel Centeno, Oriol Ramos Terrades, Josep Lladós Canet, Cristina Cañero Morales

Abstract: Counterfeiting and piracy are a form of theft that has been steadily growing in recent years. Banknotes and identity documents are two common objects of counterfeiting. Aiming to detect these counterfeits, the present survey covers a wide range of anti-counterfeiting security features, categorizing them into three components: security substrate, security inks and security printing. respectively. F… ▽ More Counterfeiting and piracy are a form of theft that has been steadily growing in recent years. Banknotes and identity documents are two common objects of counterfeiting. Aiming to detect these counterfeits, the present survey covers a wide range of anti-counterfeiting security features, categorizing them into three components: security substrate, security inks and security printing. respectively. From the computer vision perspective, we present works in the literature covering these three categories. Other topics, such as history of counterfeiting, effects on society and document experts, counterfeiter types of attacks, trends among others are covered. Therefore, from non-experienced to professionals in security documents, can be introduced or deepen its knowledge in anti-counterfeiting measures. △ Less

Submitted 20 October, 2019; originally announced October 2019.

Comments: 35 pages, 5 figures, 14 tables

arXiv:1904.03451 [pdf, other]

doi 10.1109/CVPR.2019.00228

Doodle to Search: Practical Zero-Shot Sketch-based Image Retrieval

Authors: Sounak Dey, Pau Riba, Anjan Dutta, Josep Llados, Yi-Zhe Song

Abstract: In this paper, we investigate the problem of zero-shot sketch-based image retrieval (ZS-SBIR), where human sketches are used as queries to conduct retrieval of photos from unseen categories. We importantly advance prior arts by proposing a novel ZS-SBIR scenario that represents a firm step forward in its practical application. The new setting uniquely recognizes two important yet often neglected c… ▽ More In this paper, we investigate the problem of zero-shot sketch-based image retrieval (ZS-SBIR), where human sketches are used as queries to conduct retrieval of photos from unseen categories. We importantly advance prior arts by proposing a novel ZS-SBIR scenario that represents a firm step forward in its practical application. The new setting uniquely recognizes two important yet often neglected challenges of practical ZS-SBIR, (i) the large domain gap between amateur sketch and photo, and (ii) the necessity for moving towards large-scale retrieval. We first contribute to the community a novel ZS-SBIR dataset, QuickDraw-Extended, that consists of 330,000 sketches and 204,000 photos spanning across 110 categories. Highly abstract amateur human sketches are purposefully sourced to maximize the domain gap, instead of ones included in existing datasets that can often be semi-photorealistic. We then formulate a ZS-SBIR framework to jointly model sketches and photos into a common embedding space. A novel strategy to mine the mutual information among domains is specifically engineered to alleviate the domain gap. External semantic knowledge is further embedded to aid semantic transfer. We show that, rather surprisingly, retrieval performance significantly outperforms that of state-of-the-art on existing datasets that can already be achieved using a reduced version of our model. We further demonstrate the superior performance of our full model by comparing with a number of alternatives on the newly proposed dataset. The new dataset, plus all training and testing code of our model, will be publicly released to facilitate future research △ Less

Submitted 19 August, 2020; v1 submitted 6 April, 2019; originally announced April 2019.

Comments: Oral paper in CVPR 2019

Journal ref: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

arXiv:1807.02839 [pdf, other]

doi 10.1007/s00521-019-04642-7

Hierarchical stochastic graphlet embedding for graph-based pattern recognition

Authors: Anjan Dutta, Pau Riba, Josep Lladós, Alicia Fornés

Abstract: Despite being very successful within the pattern recognition and machine learning community, graph-based methods are often unusable because of the lack of mathematical operations defined in graph domain. Graph embedding, which maps graphs to a vectorial space, has been proposed as a way to tackle these difficulties enabling the use of standard machine learning techniques. However, it is well known… ▽ More Despite being very successful within the pattern recognition and machine learning community, graph-based methods are often unusable because of the lack of mathematical operations defined in graph domain. Graph embedding, which maps graphs to a vectorial space, has been proposed as a way to tackle these difficulties enabling the use of standard machine learning techniques. However, it is well known that graph embedding functions usually suffer from the loss of structural information. In this paper, we consider the hierarchical structure of a graph as a way to mitigate this loss of information. The hierarchical structure is constructed by topologically clustering the graph nodes, and considering each cluster as a node in the upper hierarchical level. Once this hierarchical structure is constructed, we consider several configurations to define the map** into a vector space given a classical graph embedding, in particular, we propose to make use of the Stochastic Graphlet Embedding (SGE). Broadly speaking, SGE produces a distribution of uniformly sampled low to high order graphlets as a way to embed graphs into the vector space. In what follows, the coarse-to-fine structure of a graph hierarchy and the statistics fetched by the SGE complements each other and includes important structural information with varied contexts. Altogether, these two techniques substantially cope with the usual information loss involved in graph embedding techniques, obtaining a more robust graph representation. This fact has been corroborated through a detailed experimental evaluation on various benchmark graph datasets, where we outperform the state-of-the-art methods. △ Less

Submitted 4 January, 2020; v1 submitted 8 July, 2018; originally announced July 2018.

Comments: In Neural Computing and Applications (17 pages, 5 figures, 6 tables)

arXiv:1804.10819 [pdf, other]

Learning Cross-Modal Deep Embeddings for Multi-Object Image Retrieval using Text and Sketch

Authors: Sounak Dey, Anjan Dutta, Suman K. Ghosh, Ernest Valveny, Josep Lladós, Umapada Pal

Abstract: In this work we introduce a cross modal image retrieval system that allows both text and sketch as input modalities for the query. A cross-modal deep network architecture is formulated to jointly model the sketch and text input modalities as well as the the image output modality, learning a common embedding between text and images and between sketches and images. In addition, an attention model is… ▽ More In this work we introduce a cross modal image retrieval system that allows both text and sketch as input modalities for the query. A cross-modal deep network architecture is formulated to jointly model the sketch and text input modalities as well as the the image output modality, learning a common embedding between text and images and between sketches and images. In addition, an attention model is used to selectively focus the attention on the different objects of the image, allowing for retrieval with multiple objects in the query. Experiments show that the proposed method performs the best in both single and multiple object image retrieval in standard datasets. △ Less

Submitted 28 April, 2018; originally announced April 2018.

Comments: Accepted at ICPR 2018

arXiv:1803.06252 [pdf, other]

Joint Recognition of Handwritten Text and Named Entities with a Neural End-to-end Model

Authors: Manuel Carbonell, Mauricio Villegas, Alicia Fornés, Josep Lladós

Abstract: When extracting information from handwritten documents, text transcription and named entity recognition are usually faced as separate subsequent tasks. This has the disadvantage that errors in the first module affect heavily the performance of the second module. In this work we propose to do both tasks jointly, using a single neural network with a common architecture used for plain text recognitio… ▽ More When extracting information from handwritten documents, text transcription and named entity recognition are usually faced as separate subsequent tasks. This has the disadvantage that errors in the first module affect heavily the performance of the second module. In this work we propose to do both tasks jointly, using a single neural network with a common architecture used for plain text recognition. Experimentally, the work has been tested on a collection of historical marriage records. Results of experiments are presented to show the effect on the performance for different configurations: different ways of encoding the information, doing or not transfer learning and processing at text line or multi-line region level. The results are comparable to state of the art reported in the ICDAR 2017 Information Extraction competition, even though the proposed technique does not use any dictionaries, language modeling or post processing. △ Less

Submitted 22 March, 2018; v1 submitted 16 March, 2018; originally announced March 2018.

Comments: To appear in IAPR International Workshop on Document Analysis Systems 2018 (DAS 2018)

arXiv:1801.07138 [pdf]

Wikipedia in academia as a teaching tool: from averse to proactive faculty profiles

Authors: Julià Minguillón, Eduard Aibar, Maura Lerga, Josep Lladós, Antoni Meseguer-Artola

Abstract: This study concerned the active use of Wikipedia as a teaching tool in the classroom in higher education, trying to identify different usage profiles and their characterization. A questionnaire survey was administrated to all full-time and part-time teachers at the Universitat Oberta de Catalunya and the Universitat Pompeu Fabra, both in Barcelona, Spain. The questionnaire was designed using the T… ▽ More This study concerned the active use of Wikipedia as a teaching tool in the classroom in higher education, trying to identify different usage profiles and their characterization. A questionnaire survey was administrated to all full-time and part-time teachers at the Universitat Oberta de Catalunya and the Universitat Pompeu Fabra, both in Barcelona, Spain. The questionnaire was designed using the Technology Acceptance Model as a reference, including items about teachers web 2.0 profile, Wikipedia usage, expertise, perceived usefulness, easiness of use, visibility and quality, as well as Wikipedia status among colleagues and incentives to use it more actively. Clustering and statistical analysis were carried out using the k-medoids algorithm and differences between clusters were assessed by means of contingency tables and generalized linear models (logit). The respondents were classified in four clusters, from less to more likely to adopt and use Wikipedia in the classroom, namely averse (25.4%), reluctant (17.9%), open (29.5%) and proactive (27.2%). Proactive faculty are mostly men teaching part-time in STEM fields, mainly engineering, while averse faculty are mostly women teaching full-time in non-STEM fields. Nevertheless, questionnaire items related to visibility, quality, image, usefulness and expertise determine the main differences between clusters, rather than age, gender or domain. Clusters involving a positive view of Wikipedia and at least some frequency of use clearly outnumber those with a strictly negative stance. This goes against the common view that faculty members are mostly sceptical about Wikipedia. Environmental factors such as academic culture and colleagues opinion are more important than faculty personal characteristics, especially with respect to what they think about Wikipedia quality. △ Less

Submitted 22 January, 2018; originally announced January 2018.

Comments: 16 pages, 1 figure, 8 tables

ACM Class: K.3.1; H.3.5

arXiv:1708.06126 [pdf, other]

e-Counterfeit: a mobile-server platform for document counterfeit detection

Authors: Albert Berenguel, Oriol Ramos Terrades, Josep Lladós, Cristina Cañero

Abstract: This paper presents a novel application to detect counterfeit identity documents forged by a scan-printing operation. Texture analysis approaches are proposed to extract validation features from security background that is usually printed in documents as IDs or banknotes. The main contribution of this work is the end-to-end mobile-server architecture, which provides a service for non-expert users… ▽ More This paper presents a novel application to detect counterfeit identity documents forged by a scan-printing operation. Texture analysis approaches are proposed to extract validation features from security background that is usually printed in documents as IDs or banknotes. The main contribution of this work is the end-to-end mobile-server architecture, which provides a service for non-expert users and therefore can be used in several scenarios. The system also provides a crowdsourcing mode so labeled images can be gathered, generating databases for incremental training of the algorithms. △ Less

Submitted 21 August, 2017; originally announced August 2017.

Comments: 6 pages, 5 figures

arXiv:1707.02131 [pdf, other]

SigNet: Convolutional Siamese Network for Writer Independent Offline Signature Verification

Authors: Sounak Dey, Anjan Dutta, J. Ignacio Toledo, Suman K. Ghosh, Josep Llados, Umapada Pal

Abstract: Offline signature verification is one of the most challenging tasks in biometrics and document forensics. Unlike other verification problems, it needs to model minute but critical details between genuine and forged signatures, because a skilled falsification might often resembles the real signature with small deformation. This verification task is even harder in writer independent scenarios which… ▽ More Offline signature verification is one of the most challenging tasks in biometrics and document forensics. Unlike other verification problems, it needs to model minute but critical details between genuine and forged signatures, because a skilled falsification might often resembles the real signature with small deformation. This verification task is even harder in writer independent scenarios which is undeniably fiscal for realistic cases. In this paper, we model an offline writer independent signature verification task with a convolutional Siamese network. Siamese networks are twin networks with shared weights, which can be trained to learn a feature space where similar observations are placed in proximity. This is achieved by exposing the network to a pair of similar and dissimilar observations and minimizing the Euclidean distance between similar pairs while simultaneously maximizing it between dissimilar pairs. Experiments conducted on cross-domain datasets emphasize the capability of our network to model forgery in different languages (scripts) and handwriting styles. Moreover, our designed Siamese network, named SigNet, exceeds the state-of-the-art results on most of the benchmark signature datasets, which paves the way for further research in this direction. △ Less

Submitted 30 September, 2017; v1 submitted 7 July, 2017; originally announced July 2017.

arXiv:1702.00391 [pdf, other]

Product Graph-based Higher Order Contextual Similarities for Inexact Subgraph Matching

Authors: Anjan Dutta, Josep Lladós, Horst Bunke, Umapada Pal

Abstract: Many algorithms formulate graph matching as an optimization of an objective function of pairwise quantification of nodes and edges of two graphs to be matched. Pairwise measurements usually consider local attributes but disregard contextual information involved in graph structures. We address this issue by proposing contextual similarities between pairs of nodes. This is done by considering the te… ▽ More Many algorithms formulate graph matching as an optimization of an objective function of pairwise quantification of nodes and edges of two graphs to be matched. Pairwise measurements usually consider local attributes but disregard contextual information involved in graph structures. We address this issue by proposing contextual similarities between pairs of nodes. This is done by considering the tensor product graph (TPG) of two graphs to be matched, where each node is an ordered pair of nodes of the operand graphs. Contextual similarities between a pair of nodes are computed by accumulating weighted walks (normalized pairwise similarities) terminating at the corresponding paired node in TPG. Once the contextual similarities are obtained, we formulate subgraph matching as a node and edge selection problem in TPG. We use contextual similarities to construct an objective function and optimize it with a linear programming approach. Since random walk formulation through TPG takes into account higher order information, it is not a surprise that we obtain more reliable similarities and better discrimination among the nodes and edges. Experimental results shown on synthetic as well as real benchmarks illustrate that higher order contextual similarities add discriminating power and allow one to find approximate solutions to the subgraph matching problem. △ Less

Submitted 1 February, 2017; originally announced February 2017.

arXiv:1604.06243 [pdf, other]

Evaluation of the Effect of Improper Segmentation on Word Spotting

Authors: Sounak Dey, Anguelos Nicolaou, Josep Llados, Umapada Pal

Abstract: Word spotting is an important recognition task in historical document analysis. In most cases methods are developed and evaluated assuming perfect word segmentations. In this paper we propose an experimental framework to quantify the effect of goodness of word segmentation has on the performance achieved by word spotting methods in identical unbiased conditions. The framework consists of generatin… ▽ More Word spotting is an important recognition task in historical document analysis. In most cases methods are developed and evaluated assuming perfect word segmentations. In this paper we propose an experimental framework to quantify the effect of goodness of word segmentation has on the performance achieved by word spotting methods in identical unbiased conditions. The framework consists of generating systematic distortions on segmentation and retrieving the original queries from the distorted dataset. We apply the framework on the George Washington and Barcelona Marriage Dataset and on several established and state-of-the-art methods. The experiments allow for an estimate of the end-to-end performance of word spotting methods. △ Less

Submitted 21 April, 2016; originally announced April 2016.

arXiv:1604.05907 [pdf, other]

Local Binary Pattern for Word Spotting in Handwritten Historical Document

Authors: Sounak Dey, Anguelos Nicolaou, Josep Llados, Umapada Pal

Abstract: Digital libraries store images which can be highly degraded and to index this kind of images we resort to word spot- ting as our information retrieval system. Information retrieval for handwritten document images is more challenging due to the difficulties in complex layout analysis, large variations of writing styles, and degradation or low quality of historical manuscripts. This paper presents a… ▽ More Digital libraries store images which can be highly degraded and to index this kind of images we resort to word spot- ting as our information retrieval system. Information retrieval for handwritten document images is more challenging due to the difficulties in complex layout analysis, large variations of writing styles, and degradation or low quality of historical manuscripts. This paper presents a simple innovative learning-free method for word spotting from large scale historical documents combining Local Binary Pattern (LBP) and spatial sampling. This method offers three advantages: firstly, it operates in completely learning free paradigm which is very different from unsupervised learning methods, secondly, the computational time is significantly low because of the LBP features which are very fast to compute, and thirdly, the method can be used in scenarios where annotations are not available. Finally we compare the results of our proposed retrieval method with the other methods in the literature. △ Less

Submitted 21 April, 2016; v1 submitted 20 April, 2016; originally announced April 2016.

arXiv:1004.5427 [pdf]

Employing fuzzy intervals and loop-based methodology for designing structural signature: an application to symbol recognition

Authors: Muhammad Muzzamil Luqman, Mathieu Delalandre, Thierry Brouard, Jean-Yves Ramel, Josep Lladós

Abstract: Motivation of our work is to present a new methodology for symbol recognition. We support structural methods for representing visual associations in graphic documents. The proposed method employs a structural approach for symbol representation and a statistical classifier for recognition. We vectorize a graphic symbol, encode its topological and geometrical information by an ARG and compute a sign… ▽ More Motivation of our work is to present a new methodology for symbol recognition. We support structural methods for representing visual associations in graphic documents. The proposed method employs a structural approach for symbol representation and a statistical classifier for recognition. We vectorize a graphic symbol, encode its topological and geometrical information by an ARG and compute a signature from this structural graph. To address the sensitivity of structural representations to deformations and degradations, we use data adapted fuzzy intervals while computing structural signature. The joint probability distribution of signatures is encoded by a Bayesian network. This network in fact serves as a mechanism for pruning irrelevant features and choosing a subset of interesting features from structural signatures, for underlying symbol set. Finally we deploy the Bayesian network in supervised learning scenario for recognizing query symbols. We have evaluated the robustness of our method against noise, on synthetically deformed and degraded images of pre-segmented 2D architectural and electronic symbols from GREC databases and have obtained encouraging recognition rates. A second set of experimentation was carried out for evaluating the performance of our method against context noise i.e. symbols cropped from complete documents. The results support the use of our signature by a symbol spotting system. △ Less

Submitted 29 April, 2010; originally announced April 2010.

Comments: 10 pages, Eighth IAPR International Workshop on Graphics RECognition (GREC), 2009, volume 8, 22-31

Showing 1–44 of 44 results for author: Lladós