Search | arXiv e-print repository

Divide, Ensemble and Conquer: The Last Mile on Unsupervised Domain Adaptation for On-Board Semantic Segmentation

Authors: Tao Lian, Jose L. Gómez, Antonio M. López

Abstract: The last mile of unsupervised domain adaptation (UDA) for semantic segmentation is the challenge of solving the syn-to-real domain gap. Recent UDA methods have progressed significantly, yet they often rely on strategies customized for synthetic single-source datasets (e.g., GTA5), which limits their generalisation to multi-source datasets. Conversely, synthetic multi-source datasets hold promise for advancing the last mile of UDA but remain underutilized in current research. Thus, we propose DEC, a flexible UDA framework for multi-source datasets. Following a divide-and-conquer strategy, DEC simplifies the task by categorizing semantic classes, training models for each category, and fusing their outputs by an ensemble model trained exclusively on synthetic datasets to obtain the final segmentation mask. DEC can integrate with existing UDA methods, achieving state-of-the-art performance on Cityscapes, BDD100K, and Mapillary Vistas, significantly narrowing the syn-to-real domain gap.

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.14343 [pdf, other]

IWISDM: Assessing instruction following in multimodal models at scale

Authors: Xiaoxuan Lei, Lucas Gomez, Hao Yuan Bai, Pouya Bashivan

Abstract: The ability to perform complex tasks from detailed instructions is a key to many remarkable achievements of our species. As humans, we are not only capable of performing a wide variety of tasks but also very complex ones that may entail hundreds or thousands of steps to complete. Large language models and their more recent multimodal counterparts that integrate textual and visual inputs have achieved unprecedented success in performing complex tasks. Yet, most existing benchmarks are largely confined to single-modality inputs (either text or vision), narrowing the scope of multimodal assessments, particularly for instruction-following in multimodal contexts. To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment engineered to generate a limitless array of vision-language tasks of varying complexity. Using iWISDM, we compiled three distinct benchmarks of instruction-following visual tasks across varying complexity levels and evaluated several newly developed multimodal models on these benchmarks. Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models and highlight a large gap between these models' ability to precisely follow instructions and that of humans. The code of iWISDM is available on GitHub at https://github.com/BashivanLab/iWISDM.

Submitted 3 July, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.02380 [pdf, other]

EUFCC-340K: A Faceted Hierarchical Dataset for Metadata Annotation in GLAM Collections

Authors: Francesc Net, Marc Folia, Pep Casals, Andrew D. Bagdanov, Lluis Gomez

Abstract: In this paper, we address the challenges of automatic metadata annotation in the domain of Galleries, Libraries, Archives, and Museums (GLAMs) by introducing a novel dataset, EUFCC340K, collected from the Europeana portal. Comprising over 340,000 images, the EUFCC340K dataset is organized across multiple facets: Materials, Object Types, Disciplines, and Subjects, following a hierarchical structure based on the Art & Architecture Thesaurus (AAT). We developed several baseline models, incorporating multiple heads on a ConvNeXT backbone for multi-label image tagging on these facets, and fine-tuning a CLIP model with our image-text pairs. Our experiments to evaluate model robustness and generalization capabilities in two different test scenarios demonstrate the utility of the dataset in improving multi-label classification tools that have the potential to alleviate cataloging tasks in the cultural heritage sector.

Submitted 4 June, 2024; originally announced June 2024.

Comments: 23 pages, 13 figures

ACM Class: I.4.9

arXiv:2404.19031 [pdf, other]

Machine Unlearning for Document Classification

Authors: Lei Kang, Mohamed Ali Souibgui, Fei Yang, Lluis Gomez, Ernest Valveny, Dimosthenis Karatzas

Abstract: Document understanding models have recently demonstrated remarkable performance by leveraging extensive collections of user documents. However, since documents often contain large amounts of personal data, their usage can pose a threat to user privacy and weaken the bonds of trust between humans and AI services. In response to these concerns, legislation advocating "the right to be forgotten" has recently been proposed, allowing users to request the removal of private information from computer systems and neural network models. A novel approach, known as machine unlearning, has emerged to make AI models forget about a particular class of data. In our research, we explore machine unlearning for document classification problems, representing, to the best of our knowledge, the first investigation into this area. Specifically, we consider a realistic scenario where a remote server houses a well-trained model and possesses only a small portion of training data. This setup is designed for efficient forgetting manipulation. This work represents a pioneering step towards the development of machine unlearning methods aimed at addressing privacy concerns in document analysis applications. Our code is publicly available at https://github.com/leitro/MachineUnlearning-DocClassification.

Submitted 29 April, 2024; originally announced April 2024.

Comments: Accepted to ICDAR2024

arXiv:2403.13103 [pdf]

IEEE-GDL CCD Smart Buildings Introduction

Authors: Victor Manuel Larios, José Guadalupe Robledo, Leopoldo Gómez, R. Rincón

Abstract: As part of the activities of the IEEE-GDL CCD working group on physical infrastructure, this whitepaper is intended to be an initial guide to understanding the layers, taxonomy of services, and best practices for the development of smart buildings. Open standards are advocated in order to increase interoperability between layers and services. Moreover, two buildings in Guadalajara city, one new and another to be renewed, are described as a proof of concept under development and as part of the strategy to develop the smart city infrastructure based on a master plan. As its contribution, this document offers a discussion to identify the areas of innovation and opportunities for smart buildings.

Submitted 19 March, 2024; originally announced March 2024.

Comments: 5 pages, 4 figures, 4 Tables

ACM Class: H.4.m

arXiv:2312.12176 [pdf, other]

All for One, and One for All: UrbanSyn Dataset, the third Musketeer of Synthetic Driving Scenes

Authors: Jose L. Gómez, Manuel Silva, Antonio Seoane, Agnès Borrás, Mario Noriega, Germán Ros, Jose A. Iglesias-Guitian, Antonio M. López

Abstract: We introduce UrbanSyn, a photorealistic dataset acquired through semi-procedurally generated synthetic urban driving scenarios. Developed using high-quality geometry and materials, UrbanSyn provides pixel-level ground truth, including depth, semantic segmentation, and instance segmentation with object bounding boxes and occlusion degree. It complements GTAV and Synscapes datasets to form what we coin as the 'Three Musketeers'. We demonstrate the value of the Three Musketeers in unsupervised domain adaptation for image semantic segmentation. Results on real-world datasets, Cityscapes, Mapillary Vistas, and BDD100K, establish new benchmarks, largely attributed to UrbanSyn. We make UrbanSyn openly and freely accessible (www.urbansyn.org).

Submitted 19 December, 2023; originally announced December 2023.

Comments: The UrbanSyn Dataset is available at http://urbansyn.org/

arXiv:2310.02140 [pdf, other]

doi 10.1109/COMPSAC57700.2023.00258

PAD-Phys: Exploiting Physiology for Presentation Attack Detection in Face Biometrics

Authors: Luis F. Gomez, Julian Fierrez, Aythami Morales, Mahdi Ghafourian, Ruben Tolosana, Imanol Solano, Alejandro Garcia, Francisco Zamora-Martinez

Abstract: Presentation Attack Detection (PAD) is a crucial stage in facial recognition systems to avoid leakage of personal information or spoofing of identity to entities. Recently, pulse detection based on remote photoplethysmography (rPPG) has been shown to be effective in face presentation attack detection. This work presents three different approaches to presentation attack detection based on rPPG: (i) the physiological domain, a domain using rPPG-based models; (ii) the Deepfakes domain, a domain where models were retrained from the physiological domain to specific Deepfakes detection tasks; and (iii) a new Presentation Attack domain, trained by applying transfer learning from the two previous domains to improve the capability to differentiate between bona fides and attacks. The results show the efficiency of the rPPG-based models for presentation attack detection, evidencing a 21.70% decrease in average classification error rate (ACER) (from 41.03% to 19.32%) when the presentation attack domain is compared to the physiological and Deepfakes domains. Our experiments highlight the efficiency of transfer learning in rPPG-based models, which perform well in presentation attack detection against instruments that do not allow copying of this physiological feature.

Submitted 3 October, 2023; originally announced October 2023.

Comments: Preprint of the paper presented to the Workshop on IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC, 2023)

arXiv:2308.03554 [pdf, other]

TemporalFED: Detecting Cyberattacks in Industrial Time-Series Data Using Decentralized Federated Learning

Authors: Ángel Luis Perales Gómez, Enrique Tomás Martínez Beltrán, Pedro Miguel Sánchez Sánchez, Alberto Huertas Celdrán

Abstract: Industry 4.0 has brought numerous advantages, such as increasing productivity through automation. However, it also presents major cybersecurity issues such as cyberattacks affecting industrial processes. Federated Learning (FL) combined with time-series analysis is a promising cyberattack detection mechanism proposed in the literature. However, its single point of failure and network bottleneck are critical challenges that need to be tackled. Thus, this article explores the benefits of Decentralized Federated Learning (DFL) in terms of cyberattack detection and resource consumption. The work presents TemporalFED, a software module for detecting anomalies in industrial environments using FL paradigms and time series. TemporalFED incorporates three components: Time Series Conversion, Feature Engineering, and Time Series Stationary Conversion. To evaluate TemporalFED, it was deployed on Fedstellar, a DFL framework. Then, a pool of experiments measured the detection performance and resource consumption in a chemical gas industrial environment with different time-series configurations, FL paradigms, and topologies. The results showcase the superiority of the configuration utilizing DFL and Semi-Decentralized Federated Learning (SDFL) paradigms, along with a fully connected topology, which achieved the best performance in anomaly detection. Regarding resource consumption, the configuration without feature engineering employed less bandwidth, CPU, and RAM than other configurations.

Submitted 7 August, 2023; originally announced August 2023.

arXiv:2306.09750 [pdf, other]

doi 10.1016/j.eswa.2023.122861

Fedstellar: A Platform for Decentralized Federated Learning

Authors: Enrique Tomás Martínez Beltrán, Ángel Luis Perales Gómez, Chao Feng, Pedro Miguel Sánchez Sánchez, Sergio López Bernal, Gérôme Bovet, Manuel Gil Pérez, Gregorio Martínez Pérez, Alberto Huertas Celdrán

Abstract: In 2016, Google proposed Federated Learning (FL) as a novel paradigm to train Machine Learning (ML) models across the participants of a federation while preserving data privacy. Since its birth, Centralized FL (CFL) has been the most used approach, where a central entity aggregates participants' models to create a global one. However, CFL presents limitations such as communication bottlenecks, single point of failure, and reliance on a central server. Decentralized Federated Learning (DFL) addresses these issues by enabling decentralized model aggregation and minimizing dependency on a central entity. Despite these advances, current platforms training DFL models struggle with key issues such as managing heterogeneous federation network topologies. To overcome these challenges, this paper presents Fedstellar, a platform extended from the p2pfl library and designed to train FL models in a decentralized, semi-decentralized, and centralized fashion across diverse federations of physical or virtualized devices. The Fedstellar implementation encompasses a web application with an interactive graphical interface, a controller for deploying federations of nodes using physical or virtual devices, and a core deployed on each device which provides the logic needed to train, aggregate, and communicate in the network. The effectiveness of the platform has been demonstrated in two scenarios: a physical deployment involving single-board devices such as Raspberry Pis for detecting cyberattacks, and a virtualized deployment comparing various FL approaches in a controlled environment using MNIST and CIFAR-10 datasets. In both scenarios, Fedstellar demonstrated consistent performance and adaptability, achieving F1 scores of 91%, 98%, and 91.2% using DFL for detecting cyberattacks and classifying MNIST and CIFAR-10, respectively, reducing training time by 32% compared to centralized approaches.

Submitted 8 April, 2024; v1 submitted 16 June, 2023; originally announced June 2023.

arXiv:2305.16809 [pdf]

GenQ: Automated Question Generation to Support Caregivers While Reading Stories with Children

Authors: Arun Balajiee Lekshmi Narayanan, Ligia E. Gomez, Martha Michelle Soto Fernandez, Tri Nguyen, Chris Blais, M. Adelaida Restrepo, Art Glenberg

Abstract: When caregivers ask open-ended questions to motivate dialogue with children, it facilitates the child's reading comprehension skills. Although there is scope for use of technological tools, referred to here as "intelligent tutoring systems", to scaffold this process, it is currently unclear whether existing intelligent systems that generate human-language-like questions are beneficial. Additionally, training data used in the development of these automated question generation systems is typically sourced without attention to demographics, but people with different cultural backgrounds may ask different questions. As a part of a broader project to design an intelligent reading support app for Latinx children, we crowdsourced questions from Latinx caregivers and noncaregivers as well as caregivers and noncaregivers from other demographics. We examine variations in question-asking within this dataset mediated by individual, cultural, and contextual factors. We then design a system that automatically extracts templates from this data to generate open-ended questions that are representative of those asked by Latinx caregivers.

Submitted 25 September, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

arXiv:2302.03657 [pdf, other]

Toward Face Biometric De-identification using Adversarial Examples

Authors: Mahdi Ghafourian, Julian Fierrez, Luis Felipe Gomez, Ruben Vera-Rodriguez, Aythami Morales, Zohra Rezgui, Raymond Veldhuis

Abstract: The remarkable success of face recognition (FR) has endangered the privacy of internet users, particularly in social media. Recently, researchers have turned to adversarial examples as a countermeasure. In this paper, we assess the effectiveness of using two widely known adversarial methods (BIM and ILLC) for de-identifying personal images. We discovered, unlike previous claims in the literature, that it is not easy to get a high protection success rate (suppressing identification rate) with adversarial perturbation that is imperceptible to the human visual system. Finally, we found that the transferability of adversarial examples is highly affected by the training parameters of the network with which they are generated.

Submitted 7 February, 2023; originally announced February 2023.

Comments: Accepted at the AAAI-23 workshop on Artificial Intelligence for Cyber Security (AICS)

arXiv:2301.09174 [pdf, other]

MATT: Multimodal Attention Level Estimation for e-learning Platforms

Authors: Roberto Daza, Luis F. Gomez, Aythami Morales, Julian Fierrez, Ruben Tolosana, Ruth Cobos, Javier Ortega-Garcia

Abstract: This work presents a new multimodal system for remote attention level estimation based on multimodal face analysis. Our multimodal approach uses different parameters and signals obtained from the behavior and physiological processes that have been related to modeling cognitive load, such as facial gestures (e.g., blink rate, facial action units) and user actions (e.g., head pose, distance to the camera). The multimodal system uses the following modules based on Convolutional Neural Networks (CNNs): eye blink detection, head pose estimation, facial landmark detection, and facial expression features. First, we individually evaluate the proposed modules in the task of estimating the student's attention level captured during online e-learning sessions. To that end, we trained binary classifiers (high or low attention) based on Support Vector Machines (SVM) for each module. Secondly, we find out to what extent multimodal score level fusion improves the attention level estimation. The mEBAL database is used in the experimental framework, a public multi-modal database for attention level estimation obtained in an e-learning environment that contains data from 38 users while conducting several e-learning tasks of variable difficulty (creating changes in student cognitive loads).

Submitted 22 January, 2023; originally announced January 2023.

Comments: Preprint of the paper presented to the Workshop on Artificial Intelligence for Education (AI4EDU) of AAAI 2023

arXiv:2301.02668 [pdf, other]

A Framework for Large Scale Particle Filters Validated with Data Assimilation for Weather Simulation

Authors: Sebastian Friedemann, Kai Keller, Yen-Sen Lu, Bruno Raffin, Leonardo Bautista Gomez

Abstract: Particle filters are a group of algorithms to solve inverse problems through statistical Bayesian methods when the model does not comply with the linear and Gaussian hypothesis. Particle filters are used in domains like data assimilation, probabilistic programming, neural network optimization, localization and navigation. Particle filters estimate the probability distribution of model states by running a large number of model instances, the so-called particles. The ability to handle a very large number of particles is critical for high dimensional models. This paper proposes a novel paradigm to run very large ensembles of parallel model instances on supercomputers. The approach combines an elastic and fault tolerant runner/server model minimizing data movements while enabling dynamic load balancing. Particle weights are computed locally on each runner and transmitted when available to a server that normalizes them, resamples new particles based on their weight, and dynamically redistributes the work to runners to react to load imbalance. Our approach relies on an asynchronously managed distributed particle cache permitting particles to move from one runner to another in the background while particle propagation goes on. This also enables the number of runners to vary during the execution either in reaction to failures and restarts, or to adapt to changing resource availability dictated by external decision processes. The approach is experimented with the Weather Research and Forecasting (WRF) model, to assess its performance for probabilistic weather forecasting. Up to 2555 particles on 20442 compute cores are used to assimilate cloud cover observations into short-range weather forecasts over Europe.

Submitted 6 January, 2023; originally announced January 2023.
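
The server-side step described in the abstract above (normalize the weights received from runners, resample particles by weight, hand the work back to runners) can be sketched as follows. This is a minimal illustration and not the paper's implementation: the function names `normalize`, `resample`, and `assign_round_robin` are hypothetical, multinomial resampling is an assumed choice, and the round-robin split is a stand-in for the dynamic load balancing the abstract mentions.

```python
import random
from typing import List, Sequence


def normalize(weights: Sequence[float]) -> List[float]:
    """Scale raw particle weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def resample(particles: Sequence, weights: Sequence[float],
             rng: random.Random) -> list:
    """Multinomial resampling (an assumption): draw len(particles)
    particles with replacement, with probability proportional to weight."""
    probs = normalize(weights)
    return rng.choices(list(particles), weights=probs, k=len(particles))


def assign_round_robin(particles: Sequence, n_runners: int) -> List[list]:
    """Hypothetical static split of particles across runners; the paper
    describes dynamic redistribution, which is not modeled here."""
    return [list(particles[i::n_runners]) for i in range(n_runners)]


if __name__ == "__main__":
    rng = random.Random(0)
    particles = ["p0", "p1", "p2", "p3"]
    raw_weights = [0.1, 0.4, 0.4, 0.1]
    survivors = resample(particles, raw_weights, rng)
    print(assign_round_robin(survivors, n_runners=2))
```

Heavily weighted particles tend to be duplicated and lightly weighted ones dropped, so the resampled set concentrates on the states that best match the observations.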

arXiv:2211.09210 [pdf, other]

edBB-Demo: Biometrics and Behavior Analysis for Online Educational Platforms

Authors: Roberto Daza, Aythami Morales, Ruben Tolosana, Luis F. Gomez, Julian Fierrez, Javier Ortega-Garcia

Abstract: We present edBB-Demo, a demonstrator of an AI-powered research platform for student monitoring in remote education. The edBB platform aims to study the challenges associated with user recognition and behavior understanding in digital platforms. This platform has been developed for data collection, acquiring signals from a variety of sensors including keyboard, mouse, webcam, microphone, smartwatch, and an Electroencephalography band. The information captured from the sensors during the student sessions is modelled in a multimodal learning framework. The demonstrator includes: i) Biometric user authentication in an unsupervised environment; ii) Human action recognition based on remote video analysis; iii) Heart rate estimation from webcam video; and iv) Attention level estimation from facial expression analysis.

Submitted 5 December, 2022; v1 submitted 16 November, 2022; originally announced November 2022.

Comments: Accepted in "AAAI-23 Conference on Artificial Intelligence (Demonstration Program)"

arXiv:2209.10474 [pdf, other]

Show, Interpret and Tell: Entity-aware Contextualised Image Captioning in Wikipedia

Authors: Khanh Nguyen, Ali Furkan Biten, Andres Mafla, Lluis Gomez, Dimosthenis Karatzas

Abstract: Humans exploit prior knowledge to describe images, and are able to adapt their explanation to specific contextual information, even to the extent of inventing plausible explanations when contextual information and images do not match. In this work, we propose the novel task of captioning Wikipedia images by integrating contextual knowledge. Specifically, we produce models that jointly reason over Wikipedia articles, Wikimedia images and their associated descriptions to produce contextualized captions. Particularly, a similar Wikimedia image can be used to illustrate different articles, and the produced caption needs to be adapted to a specific context, therefore allowing us to explore the limits of a model to adjust captions to different contextual information. A particularly challenging task in this domain is dealing with out-of-dictionary words and Named Entities. To address this, we propose a pre-training objective, Masked Named Entity Modeling (MNEM), and show that this pretext task yields an improvement compared to baseline models. Furthermore, we verify that a model pre-trained with the MNEM objective in Wikipedia generalizes well to a News Captioning dataset. Additionally, we define two different test splits according to the difficulty of the captioning task. We offer insights on the role and the importance of each modality and highlight the limitations of our model. The code, models and data splits are publicly available at Upon acceptance.

Submitted 21 September, 2022; originally announced September 2022.

arXiv:2209.06730 [pdf, other]

MUST-VQA: MUltilingual Scene-text VQA

Authors: Emanuele Vivoli, Ali Furkan Biten, Andres Mafla, Dimosthenis Karatzas, Lluis Gomez

Abstract: In this paper, we present a framework for Multilingual Scene Text Visual Question Answering that deals with new languages in a zero-shot fashion. Specifically, we consider the task of Scene Text Visual Question Answering (STVQA) in which the question can be asked in different languages and is not necessarily aligned to the scene text language. Thus, we first introduce a natural step towards a more generalized version of STVQA: MUST-VQA. Accounting for this, we discuss two evaluation scenarios in the constrained setting, namely IID and zero-shot, and we demonstrate that the models can perform on par in a zero-shot setting. We further provide extensive experimentation and show the effectiveness of adapting multilingual language models into STVQA tasks.

Submitted 14 September, 2022; originally announced September 2022.

Comments: To appear in the Text In Everything Workshop at ECCV 2022

arXiv:2206.10343 [pdf, other]

Building an Endangered Language Resource in the Classroom: Universal Dependencies for Kakataibo

Authors: Roberto Zariquiey, Claudia Alvarado, Ximena Echevarria, Luisa Gomez, Rosa Gonzales, Mariana Illescas, Sabina Oporto, Frederic Blum, Arturo Oncevay, Javier Vera

Abstract: In this paper, we launch a new Universal Dependencies treebank for an endangered language from Amazonia: Kakataibo, a Panoan language spoken in Peru. We first discuss the collaborative methodology implemented, which proved effective to create a treebank in the context of a Computational Linguistics course for undergraduates. Then, we describe the general details of the treebank and the language-specific considerations implemented for the proposed annotation. We finally conduct some experiments on part-of-speech tagging and syntactic dependency parsing. We focus on monolingual and transfer learning settings, where we study the impact of a Shipibo-Konibo treebank, another Panoan language resource.

Submitted 21 June, 2022; originally announced June 2022.

Comments: Accepted to LREC 2022

arXiv:2205.15781 [pdf, other]

doi 10.3390/s23020621

Co-Training for Unsupervised Domain Adaptation of Semantic Segmentation Models

Authors: Jose L. Gómez, Gabriel Villalonga, Antonio M. López

Abstract: Semantic image segmentation is a central and challenging task in autonomous driving, addressed by training deep models. Since this training draws to a curse of human-based image labeling, using synthetic images with automatically generated labels together with unlabeled real-world images is a promising alternative. This implies to address an unsupervised domain adaptation (UDA) problem. In this paper, we propose a new co-training procedure for synth-to-real UDA of semantic segmentation models. It consists of a self-training stage, which provides two domain-adapted models, and a model collaboration loop for the mutual improvement of these two models. These models are then used to provide the final semantic segmentation labels (pseudo-labels) for the real-world images. The overall procedure treats the deep models as black boxes and drives their collaboration at the level of pseudo-labeled target images, i.e., neither modifying loss functions is required, nor explicit feature alignment. We test our proposal on standard synthetic and real-world datasets for on-board semantic segmentation. Our procedure shows improvements ranging from ~13 to ~26 mIoU points over baselines, so establishing new state-of-the-art results.

Submitted 30 January, 2023; v1 submitted 31 May, 2022; originally announced May 2022.

Comments: Code available at https://github.com/JoseLGomez/Co-training_SemSeg_UDA. Paper accepted on Sensors at https://www.mdpi.com/1424-8220/23/2/621

Journal ref: Sensors, Special Issue Machine Learning for Autonomous Driving Perception and Prediction (2023)

arXiv:2205.01858 [pdf]

DeeptDCS: Deep Learning-Based Estimation of Currents Induced During Transcranial Direct Current Stimulation

Authors: Xiaofan Jia, Sadeed Bin Sayed, Nahian Ibn Hasan, Luis J. Gomez, Guang-Bin Huang, Abdulkadir C. Yucel

Abstract: Objective: Transcranial direct current stimulation (tDCS) is a non-invasive brain stimulation technique used to generate conduction currents in the head and disrupt brain functions. To rapidly evaluate the tDCS-induced current density in near real-time, this paper proposes a deep learning-based emulator, named DeeptDCS. Methods: The emulator leverages Attention U-net taking the volume conductor models (VCMs) of head tissues as inputs and outputting the three-dimensional current density distribution across the entire head. The electrode configurations are also incorporated into VCMs without increasing the number of input channels; this enables the straightforward incorporation of the non-parametric features of electrodes (e.g., thickness, shape, size, and position) in the training and testing of the proposed emulator. Results: Attention U-net outperforms standard U-net and its other three variants (Residual U-net, Attention Residual U-net, and Multi-scale Residual U-net) in terms of accuracy. The generalization ability of DeeptDCS to non-trained electrode configurations can be greatly enhanced through fine-tuning the model. The computational time required by one emulation via DeeptDCS is a fraction of a second. Conclusion: DeeptDCS is at least two orders of magnitudes faster than a physics-based open-source simulator, while providing satisfactorily accurate results. Significance: The high computational efficiency permits the use of DeeptDCS in applications requiring its repetitive execution, such as uncertainty quantification and optimization studies of tDCS.

Submitted 6 October, 2022; v1 submitted 3 May, 2022; originally announced May 2022.

arXiv:2204.04028 [pdf, other]

A Generic Image Retrieval Method for Date Estimation of Historical Document Collections

Authors: Adrià Molina, Lluis Gomez, Oriol Ramos Terrades, Josep Lladós

Abstract: Date estimation of historical document images is a challenging problem, with several contributions in the literature that lack the ability to generalize from one dataset to others. This paper presents a robust date estimation system based on a retrieval approach that generalizes well in front of heterogeneous collections. We use a ranking loss function named smooth-nDCG to train a Convolutional Neural Network that learns an ordination of documents for each problem. One of the main usages of the presented approach is as a tool for historical contextual retrieval. It means that scholars could perform comparative analysis of historical images from big datasets in terms of the period where they were produced. We provide experimental evaluation on different types of documents from real datasets of manuscript and newspaper images.

Submitted 8 April, 2022; originally announced April 2022.

Comments: Preprint of paper accepted at DAS2022

arXiv:2203.11361 [pdf, ps, other]

Complexity of limit cycles with block-sequential update schedules in conjunctive networks

Authors: Julio Aracena, Florian Bridoux, Luis Gómez, Lilian Salinas

Abstract: In this paper, we deal with the following decision problem: given a conjunctive Boolean network defined by its interaction digraph, does it have a limit cycle of a given length k? We prove that this problem is NP-complete in general if k is a parameter of the problem and in P if the interaction digraph is strongly connected. The case where k is a constant, but the interaction digraph is not strongly connected remains open. Furthermore, we study the variation of the decision problem: given a conjunctive Boolean network, does there exist a block-sequential (resp. sequential) update schedule such that there exists a limit cycle of length k? We prove that this problem is NP-complete for any constant k >= 2.

Submitted 21 March, 2022; originally announced March 2022.

arXiv:2203.04814 [pdf, other]

Text-DIAE: A Self-Supervised Degradation Invariant Autoencoders for Text Recognition and Document Enhancement

Authors: Mohamed Ali Souibgui, Sanket Biswas, Andres Mafla, Ali Furkan Biten, Alicia Fornés, Yousri Kessentini, Josep Lladós, Lluis Gomez, Dimosthenis Karatzas

Abstract: In this paper, we propose a Text-Degradation Invariant Auto Encoder (Text-DIAE), a self-supervised model designed to tackle two tasks, text recognition (handwritten or scene-text) and document image enhancement. We start by employing a transformer-based architecture that incorporates three pretext tasks as learning objectives to be optimized during pre-training without the usage of labeled data. Each of the pretext objectives is specifically tailored for the final downstream tasks. We conduct several ablation experiments that confirm the design choice of the selected pretext tasks. Importantly, the proposed model does not exhibit limitations of previous state-of-the-art methods based on contrastive losses, while at the same time requiring substantially fewer data samples to converge. Finally, we demonstrate that our method surpasses the state-of-the-art in existing supervised and self-supervised settings in handwritten and scene text recognition and document image enhancement. Our code and trained models will be made publicly available at http://Upon_Acceptance.

Submitted 18 August, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

Comments: Preprint

arXiv:2202.12985 [pdf, other]

OCR-IDL: OCR Annotations for Industry Document Library Dataset

Authors: Ali Furkan Biten, Rubèn Tito, Lluis Gomez, Ernest Valveny, Dimosthenis Karatzas

Abstract: Pretraining has proven successful in Document Intelligence tasks where a deluge of documents is used to pretrain the models only later to be finetuned on downstream tasks. One of the problems of the pretraining approaches is the inconsistent usage of pretraining data with different OCR engines leading to incomparable results between models. In other words, it is not obvious whether the performance gain is coming from diverse usage of amount of data and distinct OCR engines or from the proposed models. To remedy the problem, we make public the OCR annotations for IDL documents using a commercial OCR engine given their superior performance over open source OCR models. The contributed dataset (OCR-IDL) has an estimated monetary value over 20K US$. It is our hope that OCR-IDL can be a starting point for future works on Document Intelligence. All of our data and its collection process with the annotations can be found in https://github.com/furkanbiten/idl_data.

Submitted 25 February, 2022; originally announced February 2022.

arXiv:2111.02078 [pdf, ps, other]

FaceQvec: Vector Quality Assessment for Face Biometrics based on ISO Compliance

Authors: Javier Hernandez-Ortega, Julian Fierrez, Luis F. Gomez, Aythami Morales, Jose Luis Gonzalez-de-Suso, Francisco Zamora-Martinez

Abstract: In this paper we develop FaceQvec, a software component for estimating the conformity of facial images with each of the points contemplated in the ISO/IEC 19794-5, a quality standard that defines general quality guidelines for face images that would make them acceptable or unacceptable for use in official documents such as passports or ID cards. This type of tool for quality assessment can help to improve the accuracy of face recognition, as well as to identify which factors are affecting the quality of a given face image and to take actions to eliminate or reduce those factors, e.g., with postprocessing techniques or re-acquisition of the image. FaceQvec consists of the automation of 25 individual tests related to different points contemplated in the aforementioned standard, as well as other characteristics of the images that have been considered to be related to facial quality. We first include the results of the quality tests evaluated on a development dataset captured under realistic conditions. We used those results to adjust the decision threshold of each test. Then we checked again their accuracy on an evaluation database that contains new face images not seen during development. The evaluation results demonstrate the accuracy of the individual tests for checking compliance with ISO/IEC 19794-5. FaceQvec is available online (https://github.com/uam-biometrics/FaceQvec).

Submitted 3 November, 2021; originally announced November 2021.

arXiv:2111.01730 [pdf, other]

doi 10.1109/TAP.2021.3137193

A Butterfly-Accelerated Volume Integral Equation Solver for Broad Permittivity and Large-Scale Electromagnetic Analysis

Authors: Sadeed B. Sayed, Yang Liu, Luis J. Gomez, Abdulkadir C. Yucel

Abstract: A butterfly-accelerated volume integral equation (VIE) solver is proposed for fast and accurate electromagnetic (EM) analysis of scattering from heterogeneous objects. The proposed solver leverages the hierarchical off-diagonal butterfly (HOD-BF) scheme to construct the system matrix and obtain its approximate inverse, used as a preconditioner. Complexity analysis and numerical experiments validate the $O(N\log^2N)$ construction cost of the HOD-BF-compressed system matrix and $O(N^{1.5}\log N)$ inversion cost for the preconditioner, where $N$ is the number of unknowns in the high-frequency EM scattering problem. For many practical scenarios, the proposed VIE solver requires less memory and computational time to construct the system matrix and obtain its approximate inverse compared to a $\mathcal{H}$ matrix-accelerated VIE solver. The accuracy and efficiency of the proposed solver have been demonstrated via its application to the EM analysis of large-scale canonical and real-world structures comprising broad permittivity values and involving millions of unknowns.

Submitted 2 November, 2021; originally announced November 2021.

arXiv:2110.02623 [pdf, other]

Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

Authors: Ali Furkan Biten, Andres Mafla, Lluis Gomez, Dimosthenis Karatzas

Abstract: The task of image-text matching aims to map representations from different modalities into a common joint visual-textual embedding. However, the most widely used datasets for this task, MSCOCO and Flickr30K, are actually image captioning datasets that offer a very limited set of relationships between images and sentences in their ground-truth annotations. This limited ground truth information forces us to use evaluation metrics based on binary relevance: given a sentence query we consider only one image as relevant. However, many other relevant images or captions may be present in the dataset. In this work, we propose two metrics that evaluate the degree of semantic relevance of retrieved items, independently of their annotated binary relevance. Additionally, we incorporate a novel strategy that uses an image captioning metric, CIDEr, to define a Semantic Adaptive Margin (SAM) to be optimized in a standard triplet loss. By incorporating our formulation to existing models, a large improvement is obtained in scenarios where available training data is limited. We also demonstrate that the performance on the annotated image-caption pairs is maintained while improving on other non-annotated relevant items when employing the full training set. Code with our metrics and adaptive margin formulation will be made public.

Submitted 6 October, 2021; originally announced October 2021.

Comments: Accepted WACV 2022

arXiv:2110.01705 [pdf, other]

Let there be a clock on the beach: Reducing Object Hallucination in Image Captioning

Authors: Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas

Abstract: Explaining an image with missing or non-existent objects is known as object bias (hallucination) in image captioning. This behaviour is quite common in the state-of-the-art captioning models which is not desirable by humans. To decrease the object hallucination in captioning, we propose three simple yet efficient training augmentation methods for sentences which require no new training data or increase in the model size. By extensive analysis, we show that the proposed methods can significantly diminish our models' object bias on hallucination metrics. Moreover, we experimentally demonstrate that our methods decrease the dependency on the visual features. All of our code, configuration files and model weights will be made public.

Submitted 2 November, 2021; v1 submitted 4 October, 2021; originally announced October 2021.

Comments: Accepted to WACV 2022

arXiv:2110.00711 [pdf, other]

doi 10.1007/s10032-021-00383-3

Asking questions on handwritten document collections

Authors: Minesh Mathew, Lluis Gomez, Dimosthenis Karatzas, CV Jawahar

Abstract: This work addresses the problem of Question Answering (QA) on handwritten document collections. Unlike typical QA and Visual Question Answering (VQA) formulations where the answer is a short text, we aim to locate a document snippet where the answer lies. The proposed approach works without recognizing the text in the documents. We argue that the recognition-free approach is suitable for handwritten documents and historical collections where robust text recognition is often difficult. At the same time, for human users, document image snippets containing answers act as a valid alternative to textual answers. The proposed approach uses an off-the-shelf deep embedding network which can project both textual words and word images into a common sub-space. This embedding bridges the textual and visual domains and helps us retrieve document snippets that potentially answer a question. We evaluate results of the proposed approach on two new datasets: (i) HW-SQuAD: a synthetic, handwritten document image counterpart of SQuAD1.0 dataset and (ii) BenthamQA: a smaller set of QA pairs defined on documents from the popular Bentham manuscripts collection. We also present a thorough analysis of the proposed recognition-free approach compared to a recognition-based approach which uses text recognized from the images using an OCR. Datasets presented in this work are available to download at docvqa.org

Submitted 1 October, 2021; originally announced October 2021.

Comments: pre-print version

Journal ref: Int. J. Document Anal. Recognit. 24(3), 235-249 (2021)

arXiv:2106.05618 [pdf, other]

Date Estimation in the Wild of Scanned Historical Photos: An Image Retrieval Approach

Authors: Adrià Molina, Pau Riba, Lluis Gomez, Oriol Ramos-Terrades, Josep Lladós

Abstract: This paper presents a novel method for date estimation of historical photographs from archival sources. The main contribution is to formulate the date estimation as a retrieval task, where given a query, the retrieved images are ranked in terms of the estimated date similarity. The closer their embedded representations are, the closer their dates. Contrary to the traditional models that design a neural network that learns a classifier or a regressor, we propose a learning objective based on the nDCG ranking metric. We have experimentally evaluated the performance of the method in two different tasks: date estimation and date-sensitive image retrieval, using the DEW public database, outperforming the baseline methods.

Submitted 10 June, 2021; originally announced June 2021.

Comments: Accepted at ICDAR 2021
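
As a rough illustration of the nDCG ranking metric mentioned in the abstract above, the sketch below scores a retrieved list by how close each retrieved date is to the query date. The linear relevance decay, the `max_gap` parameter, and all function names are assumptions made for illustration; they are not taken from the paper, and the sketch computes the plain metric only, not a differentiable training objective.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance scores."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def date_relevance(query_year, retrieved_year, max_gap=20):
    """Hypothetical relevance: 1.0 for the same year, decaying linearly to 0 at max_gap years."""
    return max(0.0, 1.0 - abs(query_year - retrieved_year) / max_gap)


def ndcg_for_dates(query_year, retrieved_years):
    """nDCG of a retrieved list, using date proximity as the relevance score."""
    return ndcg([date_relevance(query_year, y) for y in retrieved_years])
```

A list already ordered from closest to farthest date scores 1.0, and any other ordering scores lower, which is the property a ranking-based objective tries to push the embedding towards.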

arXiv:2106.05144 [pdf, other]

Learning to Rank Words: Optimizing Ranking Metrics for Word Spotting

Authors: Pau Riba, Adrià Molina, Lluis Gomez, Oriol Ramos-Terrades, Josep Lladós

Abstract: In this paper, we explore and evaluate the use of ranking-based objective functions for learning simultaneously a word string and a word image encoder. We consider retrieval frameworks in which the user expects a retrieval list ranked according to a defined relevance score. In the context of a word spotting problem, the relevance score has been set according to the string edit distance from the query string. We experimentally demonstrate the competitive performance of the proposed model on query-by-string word spotting for both handwritten and real scene word images. We also provide the results for query-by-example word spotting, although it is not the main focus of this work.

Submitted 9 June, 2021; originally announced June 2021.

Comments: Accepted at ICDAR 2021
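
The abstract above sets the relevance score from the string edit distance to the query string. A minimal sketch of that idea follows, assuming the standard Levenshtein distance and a plain sort; the function names are hypothetical, and how the paper maps distances to relevance scores is not stated in the abstract.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming, rolling row)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting i characters of a yields the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rank_by_edit_distance(query, candidates):
    """Order candidate transcriptions from most to least relevant to the query string."""
    return sorted(candidates, key=lambda word: edit_distance(query, word))
```

Because `sorted` is stable, candidates at the same distance keep their input order, so ties would need an explicit rule in a real evaluation.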

arXiv:2105.08627 [pdf]

doi 10.1109/TASC.2021.3105715

SuperVoxHenry Tucker-Enhanced and FFT-Accelerated Inductance Extraction for Voxelized Superconducting Structures

Authors: Mingyu Wang, Cheng Qian, Enrico Di Lorenzo, Luis J. Gomez, Vladimir Okhmatovski, Abdulkadir C. Yucel

Abstract: This paper introduces SuperVoxHenry, an inductance extraction simulator for analyzing voxelized superconducting structures. SuperVoxHenry extends the capabilities of the inductance extractor VoxHenry for analyzing the superconducting structures by incorporating the following enhancements. 1. SuperVoxHenry utilizes a two-fluid model to account for normal currents and supercurrents. 2. SuperVoxHenry introduces the Tucker decompositions to reduce the memory requirement of circulant tensors as well as the setup time of the simulator. 3. SuperVoxHenry incorporates an aggregation-based algebraic multigrid technique to obtain the sparse preconditioner.

Submitted 18 August, 2021; v1 submitted 26 April, 2021; originally announced May 2021.

arXiv:2105.05300 [pdf, other]

One-shot Compositional Data Generation for Low Resource Handwritten Text Recognition

Authors: Mohamed Ali Souibgui, Ali Furkan Biten, Sounak Dey, Alicia Fornés, Yousri Kessentini, Lluis Gomez, Dimosthenis Karatzas, Josep Lladós

Abstract: Low resource Handwritten Text Recognition (HTR) is a hard problem due to the scarce annotated data and the very limited linguistic information (dictionaries and language models). For example, in the case of historical ciphered manuscripts, which are usually written with invented alphabets to hide the message contents. Thus, in this paper we address this problem through a data generation technique based on Bayesian Program Learning (BPL). Contrary to traditional generation approaches, which require a huge amount of annotated images, our method is able to generate human-like handwriting using only one sample of each symbol in the alphabet. After generating symbols, we create synthetic lines to train state-of-the-art HTR architectures in a segmentation free fashion. Quantitative and qualitative analyses were carried out and confirm the effectiveness of the proposed method.

Submitted 5 October, 2021; v1 submitted 11 May, 2021; originally announced May 2021.

Comments: Accepted in WACV 2022

arXiv:2104.11619 [pdf, other]

doi 10.3390/s21093185

Co-training for Deep Object Detection: Comparing Single-modal and Multi-modal Approaches

Authors: Jose L. Gómez, Gabriel Villalonga, Antonio M. López

Abstract: Top-performing computer vision models are powered by convolutional neural networks (CNNs). Training an accurate CNN highly depends on both the raw sensor data and their associated ground truth (GT). Collecting such GT is usually done through human labeling, which is time-consuming and does not scale as we wish. This data labeling bottleneck may be intensified due to domain shifts among image sensors, which could force per-sensor data labeling. In this paper, we focus on the use of co-training, a semi-supervised learning (SSL) method, for obtaining self-labeled object bounding boxes (BBs), i.e., the GT to train deep object detectors. In particular, we assess the goodness of multi-modal co-training by relying on two different views of an image, namely, appearance (RGB) and estimated depth (D). Moreover, we compare appearance-based single-modal co-training with multi-modal. Our results suggest that in a standard SSL setting (no domain shift, a few human-labeled data) and under virtual-to-real domain shift (many virtual-world labeled data, no human-labeled data) multi-modal co-training outperforms single-modal. In the latter case, by performing GAN-based domain translation both co-training modalities are on par; at least, when using an off-the-shelf depth estimation model not specifically trained on the translated images.

Submitted 23 April, 2021; originally announced April 2021.

Report number: sensors-1185064

Journal ref: special issue of Sensors (ISSN 1424-8220) "Feature Papers in Physical Sensors Section 2020"

arXiv:2104.09075 [pdf, ps, other]

doi 10.1145/3431379.3460644

An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks

Authors: Albert Njoroge Kahira, Truong Thao Nguyen, Leonardo Bautista Gomez, Ryousei Takano, Rosa M Badia, Mohamed Wahib

Abstract: Deep Neural Network (DNN) frameworks use distributed training to enable faster time to convergence and alleviate memory capacity limitations when training large models and/or using high dimension inputs. With the steady increase in datasets and model sizes, model/hybrid parallelism is deemed to have an important role in the future of distributed training of DNNs. We analyze the compute, communication, and memory requirements of Convolutional Neural Networks (CNNs) to understand the trade-offs between different parallelism approaches on performance and scalability. We leverage our model-driven analysis to be the basis for an oracle utility which can help in detecting the limitations and bottlenecks of different parallelism approaches at scale. We evaluate the oracle on six parallelization strategies, with four CNN models and multiple datasets (2D and 3D), on up to 1024 GPUs. The results demonstrate that the oracle has an average accuracy of about 86.74% when compared to empirical results, and as high as 97.57% for data parallelism.

Submitted 19 April, 2021; originally announced April 2021.

Comments: The International ACM Symposium on High-Performance Parallel and Distributed Computing 2021 (HPDC'21)

arXiv:2012.04329 [pdf, other]

StacMR: Scene-Text Aware Cross-Modal Retrieval

Authors: Andrés Mafla, Rafael Sampaio de Rezende, Lluís Gómez, Diane Larlus, Dimosthenis Karatzas

Abstract: Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions to mention a few. This has resulted in an improved matching between the visual representation of an image and the textual representation of its caption. Yet, current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval. In this paper, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances. Then, armed with this dataset, we describe several approaches which leverage scene text, including a better scene-text aware cross-modal retrieval method which uses specialized representations for text from the captions and text from the visual scene, and reconcile them in a common embedding space. Extensive experiments confirm that cross-modal retrieval approaches benefit from scene text and highlight interesting research questions worth exploring further. Dataset and code are available at http://europe.naverlabs.com/stacmr

Submitted 8 December, 2020; originally announced December 2020.

arXiv:2012.00825 [pdf, other]

A Study of Checkpointing in Large Scale Training of Deep Neural Networks

Authors: Elvis Rojas, Albert Njoroge Kahira, Esteban Meneses, Leonardo Bautista Gomez, Rosa M Badia

Abstract: Deep learning (DL) applications are increasingly being deployed on HPC systems, to leverage the massive parallelism and computing power of those systems for DL model training. While significant effort has been put into facilitating distributed training by DL frameworks, fault tolerance has been largely ignored. In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC. We also provide takeaway points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.

Submitted 29 March, 2021; v1 submitted 1 December, 2020; originally announced December 2020.

Journal ref: 2020 International Conference on High Performance Computing & Simulation (HPCS20)

arXiv:2009.09809 [pdf, other]

Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval

Authors: Andres Mafla, Sounak Dey, Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas

Abstract: Scene text instances found in natural images carry explicit semantic information that can provide important cues to solve a wide array of computer vision problems. In this paper, we focus on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval. First, we obtain the text instances from images by employing a text reading system. Then, we combine textual features with salient image regions to exploit the complementary information carried by the two sources. Specifically, we employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image. By obtaining an enhanced set of visual and textual features, the proposed model greatly outperforms the previous state-of-the-art in two different tasks, fine-grained classification and image retrieval in the Con-Text and Drink Bottle datasets.

Submitted 21 September, 2020; originally announced September 2020.

arXiv:2007.03375 [pdf, other]

Location Sensitive Image Retrieval and Tagging

Authors: Raul Gomez, Jaume Gibert, Lluis Gomez, Dimosthenis Karatzas

Abstract: People from different parts of the globe describe objects and concepts in distinct manners. Visual appearance can thus vary across different geographic locations, which makes location relevant contextual information when analysing visual data. In this work, we address the task of image retrieval related to a given tag conditioned on a certain location on Earth. We present LocSens, a model that learns to rank triplets of images, tags and coordinates by plausibility, and two training strategies to balance the location influence in the final ranking. LocSens learns to fuse textual and location information of multimodal queries to retrieve related images at different levels of location granularity, and successfully utilizes location information to improve image tagging.

Submitted 7 July, 2020; originally announced July 2020.

MSC Class: 68T07 ACM Class: I.2.10

Journal ref: ECCV 2020

arXiv:2007.03098 [pdf, other]

Text Recognition -- Real World Data and Where to Find Them

Authors: Klára Janoušková, Jiri Matas, Lluis Gomez, Dimosthenis Karatzas

Abstract: We present a method for exploiting weakly annotated images to improve text extraction pipelines. The approach uses an arbitrary end-to-end text recognition system to obtain text region proposals and their, possibly erroneous, transcriptions. The proposed method includes matching of imprecise transcription to weak annotations and edit distance guided neighbourhood search. It produces nearly error-free, localised instances of scene text, which we treat as "pseudo ground truth" (PGT). We apply the method to two weakly-annotated datasets. Training with the extracted PGT consistently improves the accuracy of a state of the art recognition model, by 3.7% on average, across different benchmark datasets (image domains) and 24.5% on one of the weakly annotated datasets.

Submitted 17 July, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

Comments: 10 pages

arXiv:2006.00923 [pdf, other]

Multimodal grid features and cell pointers for Scene Text Visual Question Answering

Authors: Lluís Gómez, Ali Furkan Biten, Rubèn Tito, Andrés Mafla, Marçal Rusiñol, Ernest Valveny, Dimosthenis Karatzas

Abstract: This paper presents a new model for the task of scene text visual question answering, in which questions about a given image can only be answered by reading and understanding scene text that is present in it. The proposed model is based on an attention mechanism that attends to multi-modal features conditioned to the question, allowing it to reason jointly about the textual and visual modalities in the scene. The output weights of this attention module over the grid of multi-modal spatial features are interpreted as the probability that a certain spatial location of the image contains the answer text to the given question. Our experiments demonstrate competitive performance in two standard datasets. Furthermore, this paper provides a novel analysis of the ST-VQA dataset based on a human performance study.

Submitted 25 June, 2020; v1 submitted 1 June, 2020; originally announced June 2020.

Comments: This paper is under consideration at Pattern Recognition Letters

arXiv:2005.09496 [pdf, other]

RoadText-1K: Text Detection & Recognition Dataset for Driving Videos

Authors: Sangeeth Reddy, Minesh Mathew, Lluis Gomez, Marcal Rusinol, Dimosthenis Karatzas, C. V. Jawahar

Abstract: Perceiving text is crucial to understand semantics of outdoor scenes and hence is a critical requirement to build intelligent systems for driver assistance and self-driving. Most of the existing datasets for text detection and recognition comprise still images and are mostly compiled keeping text in mind. This paper introduces a new "RoadText-1K" dataset for text in driving videos. The dataset is 20 times larger than the existing largest dataset for text in videos. Our dataset comprises 1000 video clips of driving without any bias towards text and with annotations for text bounding boxes and transcriptions in every frame. State of the art methods for text detection, recognition and tracking are evaluated on the new dataset and the results signify the challenges in unconstrained driving videos compared to existing datasets. This suggests that RoadText-1K is suited for research and development of reading systems, robust enough to be incorporated into more complex downstream tasks like driver assistance and self-driving. The dataset can be found at http://cvit.iiit.ac.in/research/projects/cvit-projects/roadtext-1k

Submitted 19 May, 2020; originally announced May 2020.

Comments: to be published in ICRA 2020

arXiv:2001.04732 [pdf, other]

Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features

Authors: Andres Mafla, Sounak Dey, Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas

Abstract: Text contained in an image carries high-level semantics that can be exploited to achieve richer image understanding. In particular, the mere presence of text provides strong guiding content that should be employed to tackle a diversity of computer vision tasks such as image retrieval, fine-grained classification, and visual question answering. In this paper, we address the problem of fine-grained classification and image retrieval by leveraging textual information along with visual cues to comprehend the existing intrinsic relation between the two modalities. The novelty of the proposed model consists of the usage of a PHOC descriptor to construct a bag of textual words along with a Fisher Vector Encoding that captures the morphology of text. This approach provides a stronger multimodal representation for this task and as our experiments demonstrate, it achieves state-of-the-art results on two different tasks, fine-grained classification and image retrieval.

Submitted 14 January, 2020; originally announced January 2020.

Comments: Winter Conference on Applications of Computer Vision (WACV 2020) Accepted paper

arXiv:1912.00154 [pdf, other]

Hardware Versus Software Fault Injection of Modern Undervolted SRAMs

Authors: Muhammet Abdullah Soyturk, Konstantinos Parasyris, Behzad Salami, Osman Unsal, Gulay Yalcin, Leonardo Bautista Gomez

Abstract: To improve power efficiency, researchers are experimenting with dynamically adjusting the supply voltage of systems below the nominal operating points. However, production systems are typically not allowed to function on voltage settings that are below the reliable limit. Consequently, existing software fault tolerance studies are based on fault models, which inject faults on random fault locations using fault injection techniques. In this work we study whether random fault injection accurately simulates the behavior of undervolted SRAMs. Our study extends the Gem5 simulator to support fault injection on the caches of the simulated system. The fault injection framework uses fault maps, which describe the faulty bits of SRAMs, as inputs. To compare random fault injection and hardware guided fault injection, we use two types of fault maps. The first type of maps are created through undervolting real SRAMs and observing the location of the erroneous bits, whereas the second type of maps are created by corrupting random bits of the SRAMs. During our study we corrupt the L1-Dcache of the simulated system and we monitor the behavior of the two types of fault maps on the resiliency of six benchmarks. The difference in the resiliency of a benchmark when tested with the different fault maps can be up to 24%.

Submitted 30 November, 2019; originally announced December 2019.

arXiv:1910.03814 [pdf, other]

Exploring Hate Speech Detection in Multimodal Publications

Authors: Raul Gomez, Jaume Gibert, Lluis Gomez, Dimosthenis Karatzas

Abstract: In this work we target the problem of hate speech detection in multimodal publications formed by a text and an image. We gather and annotate a large scale dataset from Twitter, MMHS150K, and propose different models that jointly analyze textual and visual information for hate speech detection, comparing them with unimodal detection. We provide quantitative and qualitative results and analyze the challenges of the proposed task. We find that, even though images are useful for the hate speech detection task, current multimodal models cannot outperform models analyzing only text. We discuss why and open the field and the dataset for further research.

Submitted 9 October, 2019; originally announced October 2019.

arXiv:1909.01216 [pdf, other]

Online Analytical Processing on Graph Data

Authors: Leticia Gómez, Bart Kuijpers, Alejandro Vaisman

Abstract: Online Analytical Processing (OLAP) comprises tools and algorithms that allow querying multidimensional databases. It is based on the multidimensional model, where data can be seen as a cube such that each cell contains one or more measures that can be aggregated along dimensions. In a Big Data scenario, traditional data warehousing and OLAP operations are clearly not sufficient to address current data analysis requirements, for example, social network analysis. Furthermore, OLAP operations and models can expand the possibilities of graph analysis beyond the traditional graph-based computation. Nevertheless, there is not much work on the problem of taking OLAP analysis to the graph data model. This paper proposes a formal multidimensional model for graph analysis, that considers the basic graph data, and also background information in the form of dimension hierarchies. The graphs in this model are node- and edge-labelled directed multi-hypergraphs, called graphoids, which can be defined at several different levels of granularity using the dimensions associated with them. Operations analogous to the ones used in typical OLAP over cubes are defined over graphoids. The paper presents a formal definition of the graphoid model for OLAP, proves that the typical OLAP operations on cubes can be expressed over the graphoid model, and shows that the classic data cube model is a particular case of the graphoid data model. Finally, a case study supports the claim that, for many kinds of OLAP-like analysis on graphs, the graphoid model works better than the typical relational OLAP alternative, and for the classic OLAP queries, it remains competitive.

Submitted 3 September, 2019; originally announced September 2019.

Comments: This is a draft version of the work that will appear in Volume 24(2) of the Intelligent Data Analysis Journal, in early 2020

arXiv:1907.04246 [pdf, other]

Security for Distributed Deep Neural Networks Towards Data Confidentiality & Intellectual Property Protection

Authors: Laurent Gomez, Marcus Wilhelm, José Márquez, Patrick Duverger

Abstract: Current developments in Enterprise Systems observe a paradigm shift, moving the needle from the backend to the edge sectors of those; by distributing data, decentralizing applications and integrating novel components seamlessly to the central systems. Distributively deployed AI capabilities will thrust this transition. Several non-functional requirements arise along with these developments, security being at the center of the discussions. Bearing those requirements in mind, hereby we propose an approach to holistically protect distributed Deep Neural Network (DNN) based/enhanced software assets, i.e. confidentiality of their input & output data streams as well as safeguarding their Intellectual Property. Making use of Fully Homomorphic Encryption (FHE), our approach enables the protection of Distributed Neural Networks, while processing encrypted data. In that respect we evaluate the feasibility of this solution on a Convolutional Neural Network (CNN) for image classification deployed on distributed infrastructures.

Submitted 9 July, 2019; originally announced July 2019.

Journal ref: Proceedings of the 16th International Joint Conference on e-Business and Telecommunications, ICETE 2019

arXiv:1907.03343 [pdf, other]

Fast and Provable ADMM for Learning with Generative Priors

Authors: Fabian Latorre Gómez, Armin Eftekhari, Volkan Cevher

Abstract: In this work, we propose a (linearized) Alternating Direction Method-of-Multipliers (ADMM) algorithm for minimizing a convex function subject to a nonconvex constraint. We focus on the special case where such constraint arises from the specification that a variable should lie in the range of a neural network. This is motivated by recent successful applications of Generative Adversarial Networks (GANs) in tasks like compressive sensing, denoising and robustness against adversarial examples. The derived rates for our algorithm are characterized in terms of certain geometric properties of the generator network, which we show hold for feedforward architectures, under mild assumptions. Unlike gradient descent (GD), it can efficiently handle non-smooth objectives as well as exploit efficient partial minimization procedures, thus being faster in many practical scenarios.

Submitted 7 July, 2019; originally announced July 2019.

arXiv:1907.00490 [pdf, other]

ICDAR 2019 Competition on Scene Text Visual Question Answering

Authors: Ali Furkan Biten, Rubèn Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Minesh Mathew, C. V. Jawahar, Ernest Valveny, Dimosthenis Karatzas

Abstract: This paper presents final results of ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system up to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question/answer pairs where the answer is always grounded on text instances present in the image. The images are taken from 7 different public computer vision datasets, covering a wide range of scenarios. The competition was structured in three tasks of increasing difficulty, that require reading the text in a scene and understanding it in the context of the scene, to correctly answer a given question. A novel evaluation metric is presented, which elegantly assesses both key capabilities expected from an optimal model: text recognition and image understanding. A detailed analysis of results from different participants is showcased, which provides insight into the current capabilities of VQA systems that can read. We firmly believe the dataset proposed in this challenge will be an important milestone to consider towards a path of more robust and general models that can exploit scene text to achieve holistic image understanding.

Submitted 30 June, 2019; originally announced July 2019.

Comments: 15th International Conference on Document Analysis and Recognition (ICDAR 2019)

arXiv:1906.05038 [pdf, other]

Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets

Authors: Kai Keller, Leonardo Bautista Gomez

Abstract: High-performance computing (HPC) requires resilience techniques such as checkpointing in order to tolerate failures in supercomputers. As the number of nodes and memory in supercomputers keeps on increasing, the size of checkpoint data also increases dramatically, sometimes causing an I/O bottleneck. Differential checkpointing (dCP) aims to minimize the checkpointing overhead by only writing data differences. This is typically implemented at the memory page level, sometimes complemented with hashing algorithms. However, such a technique is unable to cope with dynamic-size datasets. In this work, we present a novel dCP implementation with a new file format that allows fragmentation of protected datasets in order to support dynamic sizes. We identify dirty data blocks using hash algorithms. In order to evaluate the dCP performance, we ported the HPC applications xPic, LULESH 2.0 and Heat2D and analyze them regarding their potential of reducing I/O with dCP and how this data reduction influences the checkpoint performance. In our experiments, we achieve reductions of up to 62% of the checkpoint time.

Submitted 12 June, 2019; originally announced June 2019.

Comments: This project has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) and the Horizon 2020 (H2020) funding framework under grant agreement no. H2020-FETHPC-754304 (DEEP-EST); and the LEGaTO Project (legato-project.eu), grant agreement No 780681

arXiv:1906.01466 [pdf, other]

Selective Style Transfer for Text

Authors: Raul Gomez, Ali Furkan Biten, Lluis Gomez, Jaume Gibert, Marçal Rusiñol, Dimosthenis Karatzas

Abstract: This paper explores the possibilities of image style transfer applied to text while maintaining the original transcriptions. Results on different text domains (scene text, machine printed text and handwritten text) and cross modal results demonstrate that this is feasible, and open different research lines. Furthermore, two architectures for selective style transfer, which means transferring style to only desired image pixels, are proposed. Finally, scene text selective style transfer is evaluated as a data augmentation technique to expand scene text detection datasets, resulting in a boost in text detector performance. Our implementation of the described models is publicly available.

Submitted 4 June, 2019; originally announced June 2019.

Comments: Accepted in ICDAR 2019

Showing 1–50 of 75 results for author: Gómez, L