LumberChunker: Long-Form Narrative Document Segmentation

Authors: André V. Duarte, João Marques, Miguel Graça, Miguel Freire, Lei Li, Arlindo L. Oliveira

Abstract: Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size such that a content's semantic independence is better captured. We propose LumberChunker, a method leveraging an LLM to dynamically segment documents, which iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 "needle in a haystack" type of question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20) but also that, when integrated into a RAG pipeline, LumberChunker proves to be more effective than other chunking methods and competitive baselines, such as the Gemini 1.5M Pro. Our Code and Data are available at https://github.com/joaodsmarques/LumberChunker

Submitted 25 June, 2024; originally announced June 2024.

ACM Class: I.2
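
Note: the iterative segmentation loop described in the LumberChunker abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `find_shift` is a hypothetical stand-in for the LLM prompt, and the word budget `max_words` and the newline join are assumptions made for the example.

```python
from typing import Callable, List


def lumber_style_chunks(
    paragraphs: List[str],
    find_shift: Callable[[List[str]], int],
    max_words: int = 550,  # assumed budget for one group of passages
) -> List[str]:
    """Split paragraphs into variable-size chunks.

    find_shift is a hypothetical stand-in for the LLM call: given a group of
    consecutive paragraphs, it returns the index (within the group) of the
    first paragraph where the content begins to shift.
    """
    chunks: List[str] = []
    start = 0
    n = len(paragraphs)
    while start < n:
        # Collect sequential paragraphs until the word budget is reached.
        end = start
        words = 0
        while end < n and (
            end == start or words + len(paragraphs[end].split()) <= max_words
        ):
            words += len(paragraphs[end].split())
            end += 1
        group = paragraphs[start:end]

        # Ask the (stubbed) model where the content shifts, then clamp the
        # answer so that every iteration consumes at least one paragraph.
        shift = find_shift(group) if len(group) > 1 else 1
        shift = min(max(shift, 1), len(group))

        chunks.append("\n".join(group[:shift]))
        start += shift  # the next group begins at the detected shift point
    return chunks
```

In practice `find_shift` would wrap a prompt to an LLM; a constant function such as `lambda group: 2` is enough to exercise the loop.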

arXiv:2406.02748 [pdf, other]

Story Generation from Visual Inputs: Techniques, Related Tasks, and Challenges

Authors: Daniel A. P. Oliveira, Eugénio Ribeiro, David Martins de Matos

Abstract: Creating engaging narratives from visual data is crucial for automated digital media consumption, assistive technologies, and interactive entertainment. This survey covers methodologies used in the generation of these narratives, focusing on their principles, strengths, and limitations. The survey also covers tasks related to automatic story generation, such as image and video captioning, and visual question answering, as well as story generation without visual inputs. These tasks share common challenges with visual story generation and have served as inspiration for the techniques used in the field. We analyze the main datasets and evaluation metrics, providing a critical perspective on their limitations.

Submitted 4 June, 2024; originally announced June 2024.

ACM Class: I.2.7; I.2.10

arXiv:2405.18435 [pdf, other]

QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation Challenge

Authors: Hongwei Bran Li, Fernando Navarro, Ivan Ezhov, Amirhossein Bayat, Dhritiman Das, Florian Kofler, Suprosanna Shit, Diana Waldmannstetter, Johannes C. Paetzold, Xiaobin Hu, Benedikt Wiestler, Lucas Zimmer, Tamaz Amiranashvili, Chinmay Prabhakar, Christoph Berger, Jonas Weidner, Michelle Alonso-Basant, Arif Rashid, Ujjwal Baid, Wesam Adel, Deniz Ali, Bhakti Baheti, Yingbin Bai, Ishaan Bhatt, Sabri Can Cetindag, et al. (55 additional authors not shown)

Abstract: Uncertainty in medical image segmentation tasks, especially inter-rater variability, arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the development and evaluation of automated segmentation algorithms. Accurately modeling and quantifying this variability is essential for enhancing the robustness and clinical applicability of these algorithms. We report the set-up and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ), which was organized in conjunction with International Conferences on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2020 and 2021. The challenge focuses on the uncertainty quantification of medical image segmentation which considers the omnipresence of inter-rater variability in imaging datasets. The large collection of images with multi-rater annotations features various modalities such as MRI and CT; various organs such as the brain, prostate, kidney, and pancreas; and different image dimensions (2D vs. 3D). A total of 24 teams submitted different solutions to the problem, combining various baseline models, Bayesian neural networks, and ensemble model techniques. The obtained results indicate the importance of the ensemble models, as well as the need for further research to develop efficient methods for uncertainty quantification in 3D segmentation tasks.

Submitted 24 June, 2024; v1 submitted 19 March, 2024; originally announced May 2024.

Comments: initial technical report

arXiv:2405.17202 [pdf, other]

Efficient multi-prompt evaluation of LLMs

Authors: Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, Mikhail Yurochkin

Abstract: Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs' abilities and can affect the reproducibility of results on leaderboards. Many recent works empirically verify prompt sensitivity and advocate for changes in LLM evaluation. In this paper, we consider the problem of estimating the performance distribution across many prompt variants instead of finding a single prompt to evaluate with. We introduce PromptEval, a method for estimating performance across a large set of prompts borrowing strength across prompts and examples to produce accurate estimates under practical evaluation budgets. The resulting distribution can be used to obtain performance quantiles to construct various robust performance metrics (e.g., top 95% quantile or median). We prove that PromptEval consistently estimates the performance distribution and demonstrate its efficacy empirically on three prominent LLM benchmarks: MMLU, BIG-bench Hard, and LMentry. For example, PromptEval can accurately estimate performance quantiles across 100 prompt templates on MMLU with a budget equivalent to two single-prompt evaluations. Our code and data can be found at https://github.com/felipemaiapolo/prompt-eval.

Submitted 7 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.15645 [pdf, other]

An Online Probabilistic Distributed Tracing System

Authors: M. Toslali, S. Qasim, S. Parthasarathy, F. A. Oliveira, H. Huang, G. Stringhini, Z. Liu, A. K. Coskun

Abstract: Distributed tracing has become a fundamental tool for diagnosing performance issues in the cloud by recording causally ordered, end-to-end workflows of request executions. However, tracing in production workloads can introduce significant overheads due to the extensive instrumentation needed for identifying performance variations. This paper addresses the trade-off between the cost of tracing and the utility of the "spans" within that trace through Astraea, an online probabilistic distributed tracing system. Astraea is based on our technique that combines online Bayesian learning and multi-armed bandit frameworks. This formulation enables Astraea to effectively steer tracing towards the useful instrumentation needed for accurate performance diagnosis. Astraea localizes performance variations using only 10-28% of available instrumentation, markedly reducing tracing overhead, storage, compute costs, and trace analysis time.

Submitted 24 May, 2024; originally announced May 2024.
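
Note: the abstract above does not spell out how Bayesian learning and bandits are combined. As a generic illustration of that family of techniques (not Astraea's actual algorithm), the sketch below uses Beta-Bernoulli Thompson sampling to decide which instrumentation point to enable next. The function name, the `true_utility` probabilities, and the binary "useful span" reward are all assumptions made for the example.

```python
import random
from typing import List


def thompson_tracing(
    true_utility: List[float], rounds: int = 500, seed: int = 0
) -> List[int]:
    """Return how often each instrumentation point was enabled.

    true_utility[i] is a hypothetical probability that enabling point i
    yields a span that helps explain a performance variation.
    """
    rng = random.Random(seed)
    n = len(true_utility)
    alpha = [1.0] * n  # Beta posterior: 1 + number of useful observations
    beta = [1.0] * n   # Beta posterior: 1 + number of useless observations
    pulls = [0] * n

    for _ in range(rounds):
        # Draw one plausible utility per point from its posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n)]
        arm = max(range(n), key=samples.__getitem__)

        # Simulated feedback: was the recorded span useful?
        if rng.random() < true_utility[arm]:
            alpha[arm] += 1.0
        else:
            beta[arm] += 1.0
        pulls[arm] += 1
    return pulls


pulls = thompson_tracing([0.9, 0.1, 0.1])
```

Over time the sampler concentrates on the point that most often produces useful spans, which is the general mechanism by which a bandit can reduce the share of instrumentation that stays enabled.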

arXiv:2404.16049 [pdf, other]

Exploring the limitations of blood pressure estimation using the photoplethysmography signal

Authors: Felipe M. Dias, Diego A. C. Cardenas, Marcelo A. F. Toledo, Filipe A. C. Oliveira, Estela Ribeiro, Jose E. Krieger, Marco A. Gutierrez

Abstract: Hypertension, a leading contributor to cardiovascular morbidity, underscores the need for accurate and continuous blood pressure (BP) monitoring. Photoplethysmography (PPG) presents a promising approach to this end. However, the precision of BP estimates derived from PPG signals has been the subject of ongoing debate, necessitating a comprehensive evaluation of their effectiveness and constraints. We developed a calibration-based Siamese ResNet model for BP estimation, using a signal input paired with a reference BP reading. We compared the use of normalized PPG (N-PPG) against the normalized Invasive Arterial Blood Pressure (N-IABP) signals as input. The N-IABP signals do not directly present systolic and diastolic values but theoretically provide a more accurate BP measure than PPG signals since they come from a direct pressure sensor inside the body. Our strategy establishes a critical benchmark for PPG performance, realistically calibrating expectations for PPG's BP estimation capabilities. We also compared the performance of our models using different signal-filtering conditions to evaluate the impact of filtering on the results. We evaluated our method using the AAMI and the BHS standards employing the VitalDB dataset. The N-IABP signals meet AAMI standards for both Systolic Blood Pressure (SBP) and Diastolic Blood Pressure (DBP), with errors of 1.29±6.33 mmHg and 1.17±5.78 mmHg for systolic and diastolic pressure, respectively, for the raw N-IABP signal. In contrast, N-PPG signals, in their best setup, performed worse than N-IABP, presenting 1.49±11.82 mmHg and 0.89±7.27 mmHg for systolic and diastolic pressure, respectively. Our findings highlight the potential and limitations of employing PPG for BP estimation, showing that these signals contain information correlated to BP but may not be sufficient for predicting it accurately.

Submitted 9 April, 2024; originally announced April 2024.

Comments: 17 pages, 7 figures, 3 tables
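
Note: the AAMI comparison in the abstract above can be checked with a few lines of arithmetic. The sketch below assumes the commonly cited AAMI criterion (mean error within ±5 mmHg and standard deviation of at most 8 mmHg) and reads the reported figures as mean ± standard deviation, with the first N-IABP value taken as systolic and the second as diastolic. The function and dictionary names are illustrative.

```python
# Assumed AAMI acceptance limits, in mmHg.
AAMI_MAX_ABS_MEAN_ERROR = 5.0
AAMI_MAX_STD_ERROR = 8.0


def meets_aami(mean_error: float, std_error: float) -> bool:
    """True if an error distribution satisfies the assumed AAMI limits."""
    return (
        abs(mean_error) <= AAMI_MAX_ABS_MEAN_ERROR
        and std_error <= AAMI_MAX_STD_ERROR
    )


# (mean error, standard deviation) in mmHg, as reported in the abstract.
reported = {
    ("N-IABP", "SBP"): (1.29, 6.33),
    ("N-IABP", "DBP"): (1.17, 5.78),
    ("N-PPG", "SBP"): (1.49, 11.82),
    ("N-PPG", "DBP"): (0.89, 7.27),
}

verdicts = {key: meets_aami(*value) for key, value in reported.items()}
```

Under these assumptions only the N-PPG systolic estimate fails, because its standard deviation of 11.82 mmHg exceeds the 8 mmHg limit; the mean errors are all well inside ±5 mmHg.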

arXiv:2404.06389 [pdf, other]

doi 10.1016/j.simpa.2024.100657

Raster Forge: Interactive Raster Manipulation Library and GUI for Python

Authors: Afonso Oliveira, Nuno Fachada, João P. Matos-Carvalho

Abstract: Raster Forge is a Python library and graphical user interface for raster data manipulation and analysis. The tool is focused on remote sensing applications, particularly in wildfire management. It allows users to import, visualize, and process raster layers for tasks such as image compositing or topographical analysis. For wildfire management, it generates fuel maps using predefined models. Its impact extends from disaster management to hydrological modeling, agriculture, and environmental monitoring. Raster Forge can be a valuable asset for geoscientists and researchers who rely on raster data analysis, enhancing geospatial data processing and visualization across various disciplines.

Submitted 19 May, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

ACM Class: I.4; I.5; J.2; D.2; H.5.2

Journal ref: Software Impacts, 20, 100657, 2024

arXiv:2404.04385 [pdf, other]

Reconfigurable and Scalable Honeynet for Cyber-Physical Systems

Authors: Luís Sousa, José Cecílio, Pedro Ferreira, Alan Oliveira

Abstract: Industrial Control Systems (ICS) constitute the backbone of contemporary industrial operations, ranging from modest heating, ventilation, and air conditioning systems to expansive national power grids. Given their pivotal role in critical infrastructure, there has been a concerted effort to enhance security measures and deepen our comprehension of potential cyber threats within this domain. To address these challenges, numerous implementations of Honeypots and Honeynets intended to detect and understand attacks have been employed for ICS. This approach diverges from conventional methods by focusing on making a scalable and reconfigurable honeynet for cyber-physical systems. It will also automatically generate attacks on the honeynet to test and validate it. With the development of a scalable and reconfigurable Honeynet and automatic attack generation tools, it is also expected that the system will serve as a basis for producing datasets for training algorithms for detecting and classifying attacks in cyber-physical honeynets.

Submitted 5 April, 2024; originally announced April 2024.

arXiv:2404.03754 [pdf, other]

Data Science for Geographic Information Systems

Authors: Afonso Oliveira, Nuno Fachada, João P. Matos-Carvalho

Abstract: The integration of data science into Geographic Information Systems (GIS) has facilitated the evolution of these tools into complete spatial analysis platforms. The adoption of machine learning and big data techniques has equipped these platforms with the capacity to handle larger amounts of increasingly complex data, transcending the limitations of more traditional approaches. This work traces the historical and technical evolution of data science and GIS as fields of study, highlighting the critical points of convergence between domains, and underlining the many sectors that rely on this integration. A GIS application is presented as a case study in the disaster management sector where we utilize aerial data from Tróia, Portugal, to emphasize the process of insight extraction from raw data. We conclude by outlining prospects for future research in integration of these fields in general, and the developed application in particular.

Submitted 4 April, 2024; originally announced April 2024.

ACM Class: I.2.10; I.4; I.5; J.2

arXiv:2404.02659 [pdf, other]

A Satellite Band Selection Framework for Amazon Forest Deforestation Detection Task

Authors: Eduardo Neto, Fabio A. Faria, Amanda A. S. de Oliveira, Álvaro L. Fazenda

Abstract: The conservation of tropical forests is a topic of significant social and ecological relevance due to their crucial role in the global ecosystem. Unfortunately, deforestation and degradation impact millions of hectares annually, necessitating government or private initiatives for effective forest monitoring. This study introduces a novel framework that employs the Univariate Marginal Distribution Algorithm (UMDA) to select spectral bands from the Landsat-8 satellite, optimizing the representation of deforested areas. This selection guides a semantic segmentation architecture, DeepLabv3+, enhancing its performance. Experimental results revealed several band compositions that achieved superior balanced accuracy compared to commonly adopted combinations for deforestation detection, utilizing segment classification via a Support Vector Machine (SVM). Moreover, the optimal band compositions identified by the UMDA-based approach improved the performance of the DeepLabv3+ architecture, surpassing state-of-the-art approaches compared in this study. The observation that a few selected bands outperform the full set contradicts the data-driven paradigm prevalent in the deep learning field. Therefore, this suggests an exception to the conventional wisdom that 'more is always better'.

Submitted 3 April, 2024; originally announced April 2024.

Comments: 9 pages, 4 figures, paper accepted for presentation at GECCO 2024
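
Note: as a rough illustration of how a Univariate Marginal Distribution Algorithm searches over band subsets, the sketch below runs a textbook UMDA on binary masks. It is not the paper's framework: in the abstract the score comes from the balanced accuracy of an SVM, whereas `toy_fitness`, the seven-band mask, and all hyperparameters here are hypothetical placeholders.

```python
import random
from typing import Callable, List, Tuple

USEFUL_BANDS = {1, 3, 4}  # hypothetical "informative" band indices


def toy_fitness(mask: List[int]) -> int:
    """Placeholder score: +1 per useful band selected, -1 per other band."""
    return sum(1 if i in USEFUL_BANDS else -1 for i, bit in enumerate(mask) if bit)


def umda_select(
    n_bits: int,
    fitness: Callable[[List[int]], float],
    pop_size: int = 40,
    n_keep: int = 10,
    generations: int = 30,
    seed: int = 0,
) -> Tuple[List[int], float]:
    """Textbook UMDA over binary masks; returns the best mask and its score."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits  # independent marginal P(bit i = 1)
    best: List[int] = [0] * n_bits
    best_fit = fitness(best)

    for _ in range(generations):
        # Sample a population from the current marginals.
        population = [
            [1 if rng.random() < p else 0 for p in probs] for _ in range(pop_size)
        ]
        # Keep the highest-scoring individuals.
        population.sort(key=fitness, reverse=True)
        elite = population[:n_keep]
        if fitness(elite[0]) > best_fit:
            best, best_fit = elite[0], fitness(elite[0])
        # Re-estimate each marginal from the elite, clamped to keep diversity.
        probs = [
            min(0.95, max(0.05, sum(ind[i] for ind in elite) / n_keep))
            for i in range(n_bits)
        ]
    return best, best_fit


best_mask, best_score = umda_select(7, toy_fitness)
```

Replacing `toy_fitness` with a function that trains and scores a classifier on the selected bands would turn the same loop into a wrapper-style band selector.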

arXiv:2404.01446 [pdf, other]

Finding Regions of Interest in Whole Slide Images Using Multiple Instance Learning

Authors: Martim Afonso, Praphulla M. S. Bhawsar, Monjoy Saha, Jonas S. Almeida, Arlindo L. Oliveira

Abstract: Whole Slide Images (WSI), obtained by high-resolution digital scanning of microscope slides at multiple scales, are the cornerstone of modern Digital Pathology. However, they represent a particular challenge to AI-based/AI-mediated analysis because pathology labeling is typically done at slide-level, instead of tile-level. Not only are medical diagnostics recorded at the specimen level; the detection of oncogene mutation is also experimentally obtained, and recorded by initiatives like The Cancer Genome Atlas (TCGA), at the slide level. This configures a dual challenge: a) accurately predicting the overall cancer phenotype and b) finding out what cellular morphologies are associated with it at the tile level. To address these challenges, a weakly supervised Multiple Instance Learning (MIL) approach was explored for two prevalent cancer types, Invasive Breast Carcinoma (TCGA-BRCA) and Lung Squamous Cell Carcinoma (TCGA-LUSC). This approach was explored for tumor detection at low magnification levels and TP53 mutations at various levels. Our results show that a novel additive implementation of MIL matched the performance of the reference implementation (AUC 0.96), and was only slightly outperformed by Attention MIL (AUC 0.97). More interestingly from the perspective of the molecular pathologist, these different AI architectures identify distinct sensitivities to morphological features (through the detection of Regions of Interest, RoI) at different amplification levels. Tellingly, TP53 mutation was most sensitive to features at the higher magnifications where cellular morphology is resolved.

Submitted 11 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

arXiv:2403.08795 [pdf]

Ontologia para monitorar a deficiência mental em seus déficts no processamento da informação por declínio cognitivo e evitar agressões psicológicas e físicas em ambientes educacionais com ajuda da I.A*

Authors: Bruna Araújo de Castro Oliveira

Abstract: The intention of this article is to propose the use of artificial intelligence to detect through analysis by UFO ontology the emergence of verbal and physical aggression related to psychosocial deficiencies and their provoking agents, in an attempt to prevent catastrophic consequences within school environments.

Submitted 31 January, 2024; originally announced March 2024.

Comments: in Portuguese. My turn to talk about reality

arXiv:2402.18511 [pdf]

Leveraging Compliant Tactile Perception for Haptic Blind Surface Reconstruction

Authors: Laurent Yves Emile Ramos Cheret, Vinicius Prado da Fonseca, Thiago Eustaquio Alves de Oliveira

Abstract: Non-flat surfaces pose difficulties for robots operating in unstructured environments. Reconstructions of uneven surfaces may only be partially possible due to non-compliant end-effectors and limitations on vision systems such as transparency, reflections, and occlusions. This study achieves blind surface reconstruction by harnessing the robotic manipulator's kinematic data and a compliant tactile sensing module, which incorporates inertial, magnetic, and pressure sensors. The module's flexibility enables us to estimate contact positions and surface normals by analyzing its deformation during interactions with unknown objects. While previous works collect only positional information, we include the local normals in a geometrical approach to estimate curvatures between adjacent contact points. These parameters then guide a spline-based patch generation, which allows us to recreate larger surfaces without an increase in complexity while reducing the time-consuming step of probing the surface. Experimental validation demonstrates that this approach outperforms an off-the-shelf vision system in estimation accuracy. Moreover, this compliant haptic method works effectively even when the manipulator's approach angle is not aligned with the surface normals, which is ideal for unknown non-flat surfaces.

Submitted 28 February, 2024; originally announced February 2024.

Comments: 7 pages, 9 figures, 2024 IEEE International Conference on Robotics and Automation (ICRA 2024)

arXiv:2402.10889 [pdf, other]

Evaluation of EAP Usage for Authenticating Eduroam Users in 5G Networks

Authors: Leonardo Azalim de Oliveira, Edelberto Franco Silva

Abstract: The fifth generation of the telecommunication networks (5G) established the service-oriented paradigm on the mobile networks. In this new context, the 5G Core component has become so flexible that, in addition to serving mobile networks, it can also be used to connect devices from the so-called non-3GPP networks, which include technologies such as WiFi. The implementation of this connectivity requires specific protocols to ensure authentication and reliability. Given these characteristics and the possibility of convergence, it is necessary to carefully choose the encryption algorithms and authentication methods used by non-3GPP user equipment. In light of the above, this paper highlights key findings resulting from an analysis on the subject conducted through a test environment which could be used in the context of the Eduroam federation.

Submitted 16 February, 2024; originally announced February 2024.

ACM Class: C.2.0

arXiv:2402.09910 [pdf, other]

DE-COP: Detecting Copyrighted Content in Language Models Training Data

Authors: André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li

Abstract: How can we detect if copyrighted content was used in the training process of a language model, considering that the training data is typically undisclosed? We are motivated by the premise that a language model is likely to identify verbatim excerpts from its training text. We propose DE-COP, a method to determine whether a piece of copyrighted content was included in training. DE-COP's core approach is to probe an LLM with multiple-choice questions, whose options include both verbatim text and their paraphrases. We construct BookTection, a benchmark with excerpts from 165 books published prior and subsequent to a model's training cutoff, along with their paraphrases. Our experiments show that DE-COP surpasses the prior best method by 9.6% in detection performance (AUC) on models with logits available. Moreover, DE-COP also achieves an average accuracy of 72% for detecting suspect books on fully black-box models where prior methods give approximately 4% accuracy. The code and datasets are available at https://github.com/LeiLiLab/DE-COP.

Submitted 25 June, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

ACM Class: I.2

arXiv:2402.06653 [pdf, other]

Using remotely sensed data for air pollution assessment

Authors: Teresa Bernardino, Maria Alexandra Oliveira, João Nuno Silva

Abstract: Air pollution constitutes a global problem of paramount importance that affects not only human health, but also the environment. The existence of spatial and temporal data regarding the concentrations of pollutants is crucial for performing air pollution studies and monitoring emissions. However, although observation data presents great temporal coverage, the number of stations is very limited and they are usually built in more populated areas. The main objective of this work is to create models capable of inferring pollutant concentrations in locations where no observation data exists. A machine learning model, more specifically the random forest model, was developed for predicting concentrations in the Iberian Peninsula in 2019 for five selected pollutants: $NO_2$, $O_3$, $SO_2$, $PM10$, and $PM2.5$. Model features include satellite measurements, meteorological variables, land use classification, temporal variables (month, day of year), and spatial variables (latitude, longitude, altitude). The models were evaluated using various methods, including station 10-fold cross-validation, in which in each fold observations from 10% of the stations are used as testing data and the rest as training data. The $R^2$, RMSE and mean bias were determined for each model. The $NO_2$ and $O_3$ models presented good values of $R^2$, 0.5524 and 0.7462, respectively. However, the $SO_2$, $PM10$, and $PM2.5$ models performed very poorly in this regard, with $R^2$ values of -0.0231, 0.3722, and 0.3303, respectively. All models slightly overestimated the ground concentrations, except the $O_3$ model. All models presented acceptable cross-validation RMSE, except the $O_3$ and $PM10$ models where the mean value was a little higher (12.5934 $μg/m^3$ and 10.4737 $μg/m^3$, respectively).

Submitted 4 February, 2024; originally announced February 2024.
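
Note: the station 10-fold cross-validation mentioned in the abstract above splits by monitoring station rather than by individual observation, so a station never contributes to both training and testing in the same fold. A minimal standard-library sketch of that idea follows; the function name and the station identifiers are hypothetical, and the authors' exact procedure may differ.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def station_kfold(
    station_ids: Sequence[str], k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) with whole stations held out.

    station_ids[i] is the station that produced observation i.
    """
    stations = sorted(set(station_ids))
    random.Random(seed).shuffle(stations)
    # Deal the shuffled stations into k roughly equal groups.
    folds = [stations[i::k] for i in range(k)]

    for held_out in folds:
        held = set(held_out)
        test_idx = [i for i, s in enumerate(station_ids) if s in held]
        train_idx = [i for i, s in enumerate(station_ids) if s not in held]
        yield train_idx, test_idx


# Example: 100 observations spread over 20 hypothetical stations.
example_ids = [f"station_{i % 20}" for i in range(100)]
example_splits = list(station_kfold(example_ids, k=10))
```

Holding out entire stations gives a more honest estimate of how a model predicts concentrations at locations without any observations, which is the stated goal of the work.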

arXiv:2402.04884 [pdf, other]

Topological relations in water quality monitoring

Authors: Bruno Chaves Figueiredo, Maria Alexandra Oliveira, João Nuno Silva

Abstract: The Alqueva Multi-Purpose Project (EFMA) is a massive abduction and storage infrastructure system in the Alentejo, which has a water quality monitoring network with almost thousands of water quality stations distributed across three subsystems: Alqueva, Pedrogão, and Ardila. Identification of pollution sources in complex infrastructure systems, such as the EFMA, requires recognition of water flow direction and delimitation of areas being drained to specific sampling points. The transfer channels in the EFMA infrastructure artificially connect several water bodies that do not share drainage basins, which further complicates the interpretation of water quality data because the water does not flow exclusively downstream and is not restricted to specific basins. The existing user-friendly GIS tools do not facilitate the exploration and visualisation of water quality data in spatial-temporal dimensions, such as defining temporal relationships between monitoring campaigns, nor do they allow the establishment of topological and hydrological relationships between different sampling points. This thesis work proposes a framework capable of aggregating many types of information in a GIS environment and visualising large water quality-related datasets, together with a graph data model to integrate and relate water quality between monitoring stations and land use. The graph model makes it possible to exploit the relationship between water quality in a watercourse and reservoirs associated with infrastructures. The graph data model and the developed framework demonstrated encouraging results and have proven to be preferable to relational databases.

Submitted 7 February, 2024; originally announced February 2024.

arXiv:2402.02582 [pdf, other]

On the development of an application for the compilation of global sea level changes

Authors: Mihir Odhavji, Maria Alexandra Oliveira, João Nuno Silva

Abstract: There is a lot of data about mean sea level variation from studies conducted around the globe. This data is dispersed, lacks organization and standardization, and in most cases is not available online. When it is available, it is often presented in impractical ways and in different formats. Analyzing it would be inefficient and very time-consuming. In addition to all of that, to successfully process spatial-temporal data, the user has to be equipped with particular skills and tools used for geographic data like PostGIS, PostgreSQL and GeoAlchemy. The presented solution is to develop a web application that solves some of the issues faced by researchers. The web application allows the user to add data, be it through forms in a browser or automated with the help of an API. The application also assists with data querying, processing and visualization by making tables, showing maps and drawing graphs. Comparing data points from different areas and publications is also made possible. The implemented web application permits the query and storage of spatial-temporal data about mean sea level variation in a simplified, easily accessible and user-friendly manner. It will also allow the realization of more global studies.

Submitted 4 February, 2024; originally announced February 2024.

arXiv:2401.12980 [pdf, other]

Identifying Risk Patterns in Brazilian Police Reports Preceding Femicides: A Long Short Term Memory (LSTM) Based Analysis

Authors: Vinicius Lima, Jaque Almeida de Oliveira

Abstract: Femicide refers to the killing of a female victim, often perpetrated by an intimate partner or family member, and is also associated with gender-based violence. Studies have shown that there is a pattern of escalating violence leading up to these killings, highlighting the potential for prevention if the level of danger to the victim can be assessed. Machine learning offers a promising approach to address this challenge by predicting risk levels based on textual descriptions of the violence. In this study, we employed the Long Short Term Memory (LSTM) technique to identify patterns of behavior in Brazilian police reports preceding femicides. Our first objective was to classify the content of these reports as indicating either a lower or higher risk of the victim being murdered, achieving an accuracy of 66%. In the second approach, we developed a model to predict the next action a victim might experience within a sequence of patterned events. Both approaches contribute to the understanding and assessment of the risks associated with domestic violence, providing authorities with valuable insights to protect women and prevent situations from escalating.

Submitted 4 January, 2024; originally announced January 2024.

Comments: IEEE Global Humanitarian Technology Conference (GHTC) 2023

arXiv:2401.05891 [pdf, other]

LiDAR data acquisition and processing for ecology applications

Authors: Ion Ciobotari, Adriana Príncipe, Maria Alexandra Oliveira, João Nuno Silva

Abstract: The collection of ecological data in the field is essential to diagnose, monitor and manage ecosystems in a sustainable way. Since acquiring this information through traditional methods is generally time-consuming, automated data acquisition, which can record large volumes of data in short time periods, is a growing trend. Terrestrial laser scanners (TLS), particularly LiDAR sensors, have been used in ecology to reconstruct the 3D structure of vegetation and thus infer ecosystem characteristics based on the spatial variation of the density of points. However, the low amount of information obtained per beam, the lack of data analysis tools and the high cost of the equipment limit their use. Thus, a low-cost TLS (<10k$) was developed along with data acquisition and processing mechanisms applicable in two case studies: an urban garden and a target area for ecological restoration. The orientation of the LiDAR was modified to make observations in the vertical plane and a motor was integrated for its rotation, enabling the acquisition of 360 degree data with high resolution. Motion and location sensors were also integrated for automatic error correction and georeferencing. From the data generated, histograms of point density variation along the vegetation height were created, in which the shrub stratum was easily distinguishable from the tree stratum, and maximum tree height and shrub cover were calculated. These results agreed with the field data, showing that the developed TLS is effective in calculating metrics of structural complexity of vegetation.

Submitted 11 January, 2024; originally announced January 2024.

arXiv:2401.03005 [pdf, other]

Evolution of urban areas and land surface temperature

Authors: Sudipan Saha, Tushar Verma, Dario Augusto Borges Oliveira

Abstract: With the global population on the rise, our cities have been expanding to accommodate the growing number of people. The expansion of cities generally leads to the engulfment of peripheral areas. However, such expansion of urban areas is likely to cause an increase in areas with elevated land surface temperature (LST). By considering each summer as a data point, we form multi-year LST time series and cluster them to obtain spatio-temporal patterns. We observe several interesting phenomena from these patterns, e.g., some clusters show reasonable similarity to the built-up area, whereas the locations with high temporal variation are seen more in the peripheral areas. Furthermore, the LST center of mass shifts over the years for cities with development activities tilted towards one direction. We conduct the above-mentioned studies for three different cities in three different continents.

Submitted 5 January, 2024; originally announced January 2024.

arXiv:2312.09358 [pdf, other]

Echo chamber formation sharpened by priority users

Authors: Henrique F. de Arruda, Kleber A. Oliveira, Yamir Moreno

Abstract: Priority users (e.g., verified profiles on Twitter) are social media users whose content is promoted by recommendation algorithms. However, the impact of this heterogeneous user influence on opinion dynamics, such as polarization phenomena, is unknown. We conduct a computational mechanistic investigation of such consequences in a stylized setting. First, we allow priority users, whose content has greater reach (similar to algorithmic boosting), into an opinion model on adaptive networks. Then, to exploit this gain in influence, we incorporate stubborn user behavior, i.e., zealot users who remain committed to opinions throughout the dynamics. Using a novel measure of echo chamber formation, we find that prioritizing users can inadvertently reduce polarization if they post according to the same rule but sharpen echo chamber formation if they behave heterogeneously. Moreover, we show that a minority of extremist ideologues (i.e., users who are both stubborn and priority) can push the system into a transition from consensus to polarization with echo chambers. Our findings imply that the implementation of the platform's prioritization policy should be carefully monitored in order to ensure there is no abuse of users with extra influence.

Submitted 14 December, 2023; originally announced December 2023.

arXiv:2311.08547 [pdf, other]

DeepThought: An Architecture for Autonomous Self-motivated Systems

Authors: Arlindo L. Oliveira, Tiago Domingos, Mário Figueiredo, Pedro U. Lima

Abstract: The ability of large language models (LLMs) to engage in credible dialogues with humans, taking into account the training data and the context of the conversation, has raised discussions about their ability to exhibit intrinsic motivations, agency, or even some degree of consciousness. We argue that the internal architecture of LLMs and their finite and volatile state cannot support any of these properties. By combining insights from complementary learning systems, global neuronal workspace, and attention schema theories, we propose to integrate LLMs and other deep learning systems into an architecture for cognitive language agents able to exhibit properties akin to agency, self-motivation, even some features of meta-cognition.

Submitted 14 November, 2023; originally announced November 2023.

ACM Class: I.2

arXiv:2311.02082 [pdf]

Semantic Modelling of Organizational Knowledge as a Basis for Enterprise Data Governance 4.0 -- Application to a Unified Clinical Data Model

Authors: Miguel AP Oliveira, Stephane Manara, Bruno Molé, Thomas Muller, Aurélien Guillouche, Lysann Hesske, Bruce Jordan, Gilles Hubert, Chinmay Kulkarni, Pralipta Jagdev, Cedric R. Berger

Abstract: Individuals and organizations cope with an always-growing amount of data, which is heterogeneous in its contents and formats. An adequate data management process yielding data quality and control over its lifecycle is a prerequisite to getting value out of this data and minimizing inherent risks related to multiple usages. Common data governance frameworks rely on people, policies, and processes that fall short of the overwhelming complexity of data. Yet, harnessing this complexity is necessary to achieve high-quality standards. The latter will condition any downstream data usage outcome, including generative artificial intelligence trained on this data. In this paper, we report our concrete experience establishing a simple, cost-efficient framework that enables metadata-driven, agile and (semi-)automated data governance (i.e. Data Governance 4.0). We explain how we implement and use this framework to integrate 25 years of clinical study data at an enterprise scale in a fully productive environment. The framework encompasses both methodologies and technologies leveraging semantic web principles. We built a knowledge graph describing avatars of data assets in their business context, including governance principles. Multiple ontologies articulated by an enterprise upper ontology enable key governance actions such as FAIRification, lifecycle management, definition of roles and responsibilities, lineage across transformations and provenance from source systems. This metadata model is the keystone to data governance 4.0: a semi-automatised data management process that considers the business context in an agile manner to adapt governance constraints to each use case and dynamically tune it based on business changes.

Submitted 23 November, 2023; v1 submitted 20 October, 2023; originally announced November 2023.

arXiv:2310.12112 [pdf, other]

A Cautionary Tale: On the Role of Reference Data in Empirical Privacy Defenses

Authors: Caelin G. Kaplan, Chuan Xu, Othmane Marfoq, Giovanni Neglia, Anderson Santana de Oliveira

Abstract: Within the realm of privacy-preserving machine learning, empirical privacy defenses have been proposed as a solution to achieve satisfactory levels of training data privacy without a significant drop in model utility. Most existing defenses against membership inference attacks assume access to reference data, defined as an additional dataset coming from the same (or a similar) underlying distribution as training data. Despite the common use of reference data, previous works are notably reticent about defining and evaluating reference data privacy. As gains in model utility and/or training data privacy may come at the expense of reference data privacy, it is essential that all three aspects are duly considered. In this paper, we first examine the availability of reference data and its privacy treatment in previous works and demonstrate its necessity for fairly comparing defenses. Second, we propose a baseline defense that enables the utility-privacy tradeoff with respect to both training and reference data to be easily understood. Our method is formulated as an empirical risk minimization with a constraint on the generalization error, which, in practice, can be evaluated as a weighted empirical risk minimization (WERM) over the training and reference datasets. Although we conceived of WERM as a simple baseline, our experiments show that, surprisingly, it outperforms the most well-studied and current state-of-the-art empirical privacy defenses using reference data for nearly all relative privacy levels of reference and training data. Our investigation also reveals that these existing methods are unable to effectively trade off reference data privacy for model utility and/or training data privacy. Overall, our work highlights the need for a proper evaluation of the triad model utility / training data privacy / reference data privacy when comparing privacy defenses.

Submitted 18 October, 2023; originally announced October 2023.

arXiv:2310.10575 [pdf, other]

Matching the Neuronal Representations of V1 is Necessary to Improve Robustness in CNNs with V1-like Front-ends

Authors: Ruxandra Barbulescu, Tiago Marques, Arlindo L. Oliveira

Abstract: While some convolutional neural networks (CNNs) have achieved great success in object recognition, they struggle to identify objects in images corrupted with different types of common noise patterns. Recently, it was shown that simulating computations in early visual areas at the front of CNNs leads to improvements in robustness to image corruptions. Here, we further explore this result and show that the neuronal representations that emerge from precisely matching the distribution of RF properties found in primate V1 are key for this improvement in robustness. We built two variants of a model with a front-end modeling the primate primary visual cortex (V1): one sampling RF properties uniformly and the other sampling from empirical biological distributions. The model with the biological sampling has a considerably higher robustness to image corruptions than the uniform variant (relative difference of 8.72%). While similar neuronal sub-populations across the two variants have similar response properties and learn similar downstream weights, the impact on downstream processing is strikingly different. This result sheds light on the origin of the improvements in robustness observed in some biologically-inspired models, pointing to the need to precisely mimic the neuronal representations found in the primate brain.

Submitted 16 October, 2023; originally announced October 2023.

arXiv:2309.01751 [pdf, other]

Multispectral Indices for Wildfire Management

Authors: Afonso Oliveira, João P. Matos-Carvalho, Filipe Moutinho, Nuno Fachada

Abstract: This paper highlights and summarizes the most important multispectral indices and associated methodologies for fire management. Various fields of study are examined where multispectral indices align with wildfire prevention and management, including vegetation and soil attribute extraction, water feature mapping, artificial structure identification, and post-fire burnt area estimation. The versatility and effectiveness of multispectral indices in addressing specific issues in wildfire management are emphasized. Fundamental insights for optimizing data extraction are presented. Concrete indices for each task, including the NDVI and the NDWI, are suggested. Moreover, to enhance accuracy and address inherent limitations of individual index applications, the integration of complementary processing solutions and additional data sources like high-resolution imagery and ground-based measurements is recommended. This paper aims to be an immediate and comprehensive reference for researchers and stakeholders working on multispectral indices related to the prevention and management of fires.

Submitted 4 September, 2023; originally announced September 2023.

ACM Class: I.2.10; I.4; I.5; J.2

arXiv:2308.16323 [pdf, other]

Software multiplataforma para a segmentação de vasos sanguíneos em imagens da retina

Authors: João Henrique Pereira Machado, Gilson Adamczuk Oliveira, Érick Oliveira Rodrigues

Abstract: In this work, we utilize image segmentation to visually identify blood vessels in retinal examination images. This process is typically carried out manually. However, we can employ heuristic methods and machine learning to automate or at least expedite the process. In this context, we propose a cross-platform, open-source, and responsive software that allows users to manually segment a retinal image. The purpose is to use the user-segmented image to retrain machine learning algorithms, thereby enhancing future automated segmentation results. Moreover, the software also incorporates and applies certain image filters established in the literature to improve vessel visualization. We propose the first solution of this kind in the literature. This is the inaugural integrated software that embodies the aforementioned attributes: open-source, responsive, and cross-platform. It offers a comprehensive solution encompassing manual vessel segmentation, as well as the automated execution of classification algorithms to refine predictive models.

Submitted 30 August, 2023; originally announced August 2023.

Comments: in Portuguese language. International Conference on Production Research - Americas 2022. https://www.even3.com.br/anais/foreigners_subscription_icpr_americas22/664603-software-multiplataforma-para-a-segmentacao-de-vasos-sanguineos-em-imagens-da-retina/

arXiv:2308.05759 [pdf, ps, other]

A machine-learning sleep-wake classification model using a reduced number of features derived from photoplethysmography and activity signals

Authors: Douglas A. Almeida, Felipe M. Dias, Marcelo A. F. Toledo, Diego A. C. Cardenas, Filipe A. C. Oliveira, Estela Ribeiro, Jose E. Krieger, Marco A. Gutierrez

Abstract: Sleep is a crucial aspect of our overall health and well-being. It plays a vital role in regulating our mental and physical health, impacting everything from our mood, memory, and cognitive function to our physical resilience and immune system. The classification of sleep stages is a mandatory step to assess sleep quality, providing the metrics to estimate the quality of sleep and how well our body is functioning during this essential period of rest. Photoplethysmography (PPG) has been demonstrated to be an effective signal for sleep stage inference, meaning it can be used on its own or in combination with other signals to determine sleep stage. This information is valuable in identifying potential sleep issues and developing strategies to improve sleep quality and overall health. In this work, we present a machine learning sleep-wake classification model based on the eXtreme Gradient Boosting (XGBoost) algorithm and features extracted from PPG signal and activity counts. The performance of our method was comparable to current state-of-the-art methods with a Sensitivity of 91.15 $\pm$ 1.16%, Specificity of 53.66 $\pm$ 1.12%, F1-score of 83.88 $\pm$ 0.56%, and Kappa of 48.0 $\pm$ 0.86%. Our method offers a significant improvement over other approaches as it uses a reduced number of features, making it suitable for implementation in wearable devices that have limited computational power.

Submitted 7 August, 2023; originally announced August 2023.

Comments: 8 pages, 3 figures

arXiv:2308.03584 [pdf, other]

A Polystore Architecture Using Knowledge Graphs to Support Queries on Heterogeneous Data Stores

Authors: Leonardo Guerreiro Azevedo, Renan Francisco Santos Souza, Elton F. de S. Soares, Raphael M. Thiago, Julio Cesar Cardoso Tesolin, Ann C. Oliveira, Marcio Ferreira Moreno

Abstract: Modern applications commonly need to manage dataset types composed of heterogeneous data and schemas, making it difficult to access them in an integrated way. A single data store to manage heterogeneous data using a common data model is not effective in such a scenario, which results in the domain data being fragmented in the data stores that best fit their storage and access requirements (e.g., NoSQL, relational DBMS, or HDFS). Besides, organization workflows independently consume these fragments, and usually, there is no explicit link among the fragments that would be useful to support an integrated view. The research challenge tackled by this work is to provide the means to query heterogeneous data residing on distinct data repositories that are not explicitly connected. We propose a federated database architecture by providing a single abstract global conceptual schema to users, allowing them to write their queries, encapsulating data heterogeneity, location, and linkage by employing: (i) meta-models to represent the global conceptual schema, the remote data local conceptual schemas, and mappings among them; (ii) provenance to create explicit links among the consumed and generated data residing in separate datasets. We evaluated the architecture through its implementation as a polystore service, following a microservice architecture approach, in a scenario that simulates a real case in the Oil & Gas industry. Also, we compared the proposed architecture to a relational multidatabase system based on foreign data wrappers, measuring the user's cognitive load to write a query (or query complexity) and the query processing time. The results demonstrated that the proposed architecture allows query writing two times less complex than the one written for the relational multidatabase system, adding an excess of no more than 30% in query processing time.

Submitted 15 March, 2024; v1 submitted 7 August, 2023; originally announced August 2023.

Comments: Reference the paper as L. G. Azevedo, R. Souza, E. F. de S. Soares, R. M. Thiago, J. C. D. Tesolin, A. C. Oliveira, M. F. Moreno, A Polystore Architecture Using Knowledge Graphs to Support Queries on Heterogeneous Data Stores. Proceedings of 20th Brazilian Symposium in Information Systems, 2024 (to be published)

arXiv:2308.01930 [pdf, other]

Machine Learning-Based Diabetes Detection Using Photoplethysmography Signal Features

Authors: Filipe A. C. Oliveira, Felipe M. Dias, Marcelo A. F. Toledo, Diego A. C. Cardenas, Douglas A. Almeida, Estela Ribeiro, Jose E. Krieger, Marco A. Gutierrez

Abstract: Diabetes is a prevalent chronic condition that compromises the health of millions of people worldwide. Minimally invasive methods are needed to prevent and control diabetes but most devices for measuring glucose levels are invasive and not amenable to continuous monitoring. Here, we present an alternative method to overcome these shortcomings based on non-invasive optical photoplethysmography (PPG) for detecting diabetes. We classify non-Diabetic and Diabetic patients using the PPG signal and metadata for training Logistic Regression (LR) and eXtreme Gradient Boosting (XGBoost) algorithms. We used PPG signals from a publicly available dataset. To prevent overfitting, we divided the data into five folds for cross-validation. By ensuring that patients in the training set are not in the testing set, the model's performance can be evaluated on unseen subjects' data, providing a more accurate assessment of its generalization. Our model achieved an F1-Score and AUC of $58.8\pm20.0\%$ and $79.2\pm15.0\%$ for LR and $51.7\pm16.5\%$ and $73.6\pm17.0\%$ for XGBoost, respectively. Feature analysis suggested that PPG morphological features contain diabetes-related information alongside metadata. Our findings are within the same range reported in the literature, indicating that machine learning methods are promising for developing remote, non-invasive, and continuous measurement devices for detecting and preventing diabetes.

Submitted 2 August, 2023; originally announced August 2023.

Comments: 11 pages, 6 figures

arXiv:2307.10018 [pdf, other]

RobôCIn Small Size League Extended Team Description Paper for RoboCup 2023

Authors: Aline Lima de Oliveira, Cauê Addae da Silva Gomes, Cecília Virginia Santos da Silva, Charles Matheus de Sousa Alves, Danilo Andrade Martins de Souza, Driele Pires Ferreira Araújo Xavier, Edgleyson Pereira da Silva, Felipe Bezerra Martins, Lucas Henrique Cavalcanti Santos, Lucas Dias Maciel, Matheus Paixão Gumercindo dos Santos, Matheus Lafayette Vasconcelos, Matheus Vinícius Teotonio do Nascimento Andrade, João Guilherme Oliveira Carvalho de Melo, João Pedro Souza Pereira de Moura, José Ronald da Silva, José Victor Silva Cruz, Pedro Henrique Santana de Morais, Pedro Paulo Salman de Oliveira, Riei Joaquim Matos Rodrigues, Roberto Costa Fernandes, Ryan Vinicius Santos Morais, Tamara Mayara Ramos Teobaldo, Washington Igor dos Santos Silva, Edna Natividade Silva Barros

Abstract: RobôCIn has participated in RoboCup Small Size League since 2019, won its first world title in 2022 (Division B), and is currently a three-time Latin-American champion. This paper presents our improvements to defend the Small Size League (SSL) division B title in RoboCup 2023 in Bordeaux, France. This paper aims to share some of the academic research that our team developed over the past year. Our team has successfully published 2 articles related to SSL at two high-impact conferences: the 25th RoboCup International Symposium and the 19th IEEE Latin American Robotics Symposium (LARS 2022). Over the last year, we have been continuously migrating from our past codebase to Unification. We will describe the new architecture implemented and some points of software and AI refactoring. In addition, we discuss the process of integrating machined components into the mechanical system, our development for participating in the vision blackout challenge last year and what we are preparing for this year.

Submitted 19 July, 2023; originally announced July 2023.

arXiv:2307.08766 [pdf, other]

Quality Assessment of Photoplethysmography Signals For Cardiovascular Biomarkers Monitoring Using Wearable Devices

Authors: Felipe M. Dias, Marcelo A. F. Toledo, Diego A. C. Cardenas, Douglas A. Almeida, Filipe A. C. Oliveira, Estela Ribeiro, Jose E. Krieger, Marco A. Gutierrez

Abstract: Photoplethysmography (PPG) is a non-invasive technology that measures changes in blood volume in the microvascular bed of tissue. It is commonly used in medical devices such as pulse oximeters and wrist-worn heart rate monitors to monitor cardiovascular hemodynamics. PPG allows for the assessment of parameters (e.g., heart rate, pulse waveform, and peripheral perfusion) that can indicate conditions such as vasoconstriction or vasodilation, and provides information about microvascular blood flow, making it a valuable tool for monitoring cardiovascular health. However, PPG is subject to a number of sources of variation that can impact its accuracy and reliability, especially when using a wearable device for continuous monitoring, such as motion artifacts, skin pigmentation, and vasomotion. In this study, we extracted 27 statistical features from the PPG signal for training machine-learning models based on gradient boosting (XGBoost and CatBoost) and Random Forest (RF) algorithms to assess the quality of PPG signals that were labeled as good or poor quality. We used the PPG time series from a publicly available dataset and evaluated the algorithms' performance using Sensitivity (Se), Positive Predictive Value (PPV), and F1-score (F1) metrics. Our model achieved Se, PPV, and F1-score of 94.4, 95.6, and 95.0 for XGBoost, 94.7, 95.9, and 95.3 for CatBoost, and 93.7, 91.3 and 92.5 for RF, respectively. Our findings are comparable to state-of-the-art reported in the literature but using a much simpler model, indicating that ML models are promising for developing remote, non-invasive, and continuous measurement devices.
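
Note: the F1 values quoted in this abstract can be cross-checked from the reported Se and PPV, since F1 is the harmonic mean of the two. The following is a minimal sketch of that arithmetic only; the function and variable names are illustrative and are not taken from the paper's code.

```python
def f1_score(se: float, ppv: float) -> float:
    """Harmonic mean of sensitivity (recall) and positive predictive value (precision)."""
    return 2 * se * ppv / (se + ppv)


# (Se, PPV) pairs as quoted in the abstract, in percent.
reported = {
    "XGBoost": (94.4, 95.6),
    "CatBoost": (94.7, 95.9),
    "RF": (93.7, 91.3),
}

# Recompute F1 for each model and round to one decimal, as in the abstract.
f1_by_model = {
    name: round(f1_score(se, ppv), 1) for name, (se, ppv) in reported.items()
}
print(f1_by_model)
```

The recomputed values agree with the F1-scores stated in the abstract (95.0, 95.3, and 92.5).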

Submitted 17 July, 2023; originally announced July 2023.

Comments: 9 pages

arXiv:2307.02300 [pdf, other]

Improving Address Matching using Siamese Transformer Networks

Authors: André V. Duarte, Arlindo L. Oliveira

Abstract: Matching addresses is a critical task for companies and post offices involved in the processing and delivery of packages. The ramifications of incorrectly delivering a package to the wrong recipient are numerous, ranging from harm to the company's reputation to economic and environmental costs. This research introduces a deep learning-based model designed to increase the efficiency of address matching for Portuguese addresses. The model comprises two parts: (i) a bi-encoder, which is fine-tuned to create meaningful embeddings of Portuguese postal addresses, utilized to retrieve the top 10 likely matches of the un-normalized target address from a normalized database, and (ii) a cross-encoder, which is fine-tuned to accurately rerank the 10 addresses obtained by the bi-encoder. The model has been tested on a real-case scenario of Portuguese addresses and exhibits a high degree of accuracy, exceeding 95% at the door level. When utilized with GPU computations, the inference speed is about 4.5 times quicker than other traditional approaches such as BM25. An implementation of this system in a real-world scenario would substantially increase the effectiveness of the distribution process. Such an implementation is currently under investigation.

Submitted 5 July, 2023; originally announced July 2023.

Comments: To be published in the 22nd EPIA Conference on Artificial Intelligence, EPIA 2023, Faial Island - Azores, Portugal, 5-8 September 2023, Proceedings

ACM Class: I.2

arXiv:2306.06834 [pdf, other]

Motivational models for validating agile requirements in Software Engineering subjects

Authors: Eduardo A. Oliveira, Leon Sterling

Abstract: This paper describes how motivational models can be used to cross-check agile requirements artifacts to improve consistency and completeness of software requirements. Motivational models provide a high-level understanding of the purposes of a software system. They complement personas and user stories, which focus more on user needs than on system features. We present an exploratory case study that sought to understand how software engineering students could use motivational models to create better requirements artifacts so they are understandable to non-technical users, easily understood by developers, and consistent with each other. Nine consistency principles were created as an outcome of our study and are now successfully adopted by software engineering students at the University of Melbourne to ensure consistency between motivational models, personas, and user stories in requirements engineering.

Submitted 11 June, 2023; originally announced June 2023.

Comments: 9 pages, 2 figures, SERP'21 - The 19th International Conference on Software Engineering Research and Practice

arXiv:2305.09904 [pdf, ps, other]

On the ISS Property of the Gradient Flow for Single Hidden-Layer Neural Networks with Linear Activations

Authors: Arthur Castello B. de Oliveira, Milad Siami, Eduardo D. Sontag

Abstract: Recent research in neural networks and machine learning suggests that using many more parameters than strictly required by the initial complexity of a regression problem can result in more accurate or faster-converging models -- contrary to classical statistical belief. This phenomenon, sometimes known as "benign overfitting", raises questions regarding in what other ways overparameterization might affect the properties of a learning problem. In this work, we investigate the effects of overfitting on the robustness of gradient-descent training when subject to uncertainty on the gradient estimation. This uncertainty arises naturally if the gradient is estimated from noisy data or directly measured. Our object of study is a linear neural network with a single, arbitrarily wide, hidden layer and an arbitrary number of inputs and outputs. In this paper we solve the problem for the case where the input and output of our neural network are one-dimensional, deriving sufficient conditions for robustness of our system based on necessary and sufficient conditions for convergence in the undisturbed case. We then show that the general overparametrized formulation introduces a set of spurious equilibria which lie outside the set where the loss function is minimized, and discuss directions of future work that might extend our current results for more general formulations.

Submitted 16 May, 2023; originally announced May 2023.

Comments: 10 pages, 1 figure, extended conference version

arXiv:2305.06129 [pdf, other]

Do code refactorings influence the merge effort?

Authors: Andre Oliveira, Vania Neves, Alexandre Plastino, Ana Carla Bibiano, Alessandro Garcia, Leonardo Murta

Abstract: In collaborative software development, multiple contributors frequently change the source code in parallel to implement new features, fix bugs, refactor existing code, and make other changes. These simultaneous changes need to be merged into the same version of the source code. However, the merge operation can fail, and developer intervention is required to resolve the conflicts. Studies in the literature show that 10 to 20 percent of all merge attempts result in conflicts, which require manual developer intervention to complete the process. In this paper, we are concerned with a specific type of change that affects the structure of the source code and has the potential to increase the merge effort: code refactorings. We analyze the relationship between the occurrence of refactorings and the merge effort. To do so, we applied a data mining technique called association rule extraction to find patterns of behavior that allow us to analyze the influence of refactorings on the merge effort. Our experiments extracted association rules from 40,248 merge commits that occurred in 28 popular open-source projects. The results indicate that: (i) the occurrence of refactorings increases the chances of having merge effort; (ii) the more refactorings, the greater the chances of effort; (iii) the more refactorings, the greater the effort; and (iv) parallel refactorings increase even more the chances of having effort, as well as the intensity of it. The results obtained may suggest behavioral changes in the way refactorings are implemented by developer teams. In addition, they can indicate possible ways to improve tools that support code merging and those that recommend refactorings, considering the number of refactorings and merge effort attributes.

Submitted 10 May, 2023; originally announced May 2023.

Comments: 11 pages + 2 for citations, 7 figures, 3 tables. Preprint of a paper that will be published in the IEEE/ACM 45th International Conference on Software Engineering (ICSE 2023) - Authors' version of the work

arXiv:2304.12226 [pdf, other]

Algebraic and Geometric Characterizations Related to the Quantization Problem of the $C_{2,8}$ Channel

Authors: Anderson José de Oliveira, Giuliano Gadioli La Guardia, Reginaldo Palazzo Jr., Clarice Dias de Albuquerque, Cátia Regina de Oliveira Quilles Queiroz, Leandro Bezerra de Lima, Vandenberg Lopes Vieira

Abstract: In this paper, we consider the steps to be followed in the analysis and interpretation of the quantization problem related to the $C_{2,8}$ channel, where the Fuchsian differential equations, the generators of the Fuchsian groups, and the tessellations associated with the cases $g=2$ and $g=3$, related to the hyperbolic case, are determined. In order to obtain these results, it is necessary to determine the genus $g$ of each surface on which this channel may be embedded. After that, the procedure is to determine the algebraic structure (Fuchsian group generators) associated with the fundamental region of each surface. To achieve this goal, an associated linear second-order Fuchsian differential equation whose linearly independent solutions provide the generators of this Fuchsian group is devised. In addition, the tessellations associated with each analyzed case are identified. These structures are identified in four situations, divided into two cases ($g=2$ and $g=3$), obtaining, therefore, both algebraic and geometric characterizations associated with quantizing the $C_{2,8}$ channel.

Submitted 20 April, 2023; originally announced April 2023.

Comments: 31 pages, 9 figures

arXiv:2303.08572 [pdf, other]

Distinguishing Cause from Effect on Categorical Data: The Uniform Channel Model

Authors: Mário A. T. Figueiredo, Catarina A. Oliveira

Abstract: Distinguishing cause from effect using observations of a pair of random variables is a core problem in causal discovery. Most approaches proposed for this task, namely additive noise models (ANM), are only adequate for quantitative data. We propose a criterion to address the cause-effect problem with categorical variables (living in sets with no meaningful order), inspired by seeing a conditional probability mass function (pmf) as a discrete memoryless channel. We select as the most likely causal direction the one in which the conditional pmf is closer to a uniform channel (UC). The rationale is that, in a UC, as in an ANM, the conditional entropy (of the effect given the cause) is independent of the cause distribution, in agreement with the principle of independence of cause and mechanism. Our approach, which we call the uniform channel model (UCM), thus extends the ANM rationale to categorical variables. To assess how close a conditional pmf (estimated from data) is to a UC, we use statistical testing, supported by a closed-form estimate of a UC channel. On the theoretical front, we prove identifiability of the UCM and show its equivalence with a structural causal model with a low-cardinality exogenous variable. Finally, the proposed method compares favorably with recent state-of-the-art alternatives in experiments on synthetic, benchmark, and real data.
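
Note: the entropy argument in this abstract can be illustrated numerically. The sketch below assumes that a uniform channel is one in which every row of the conditional pmf is a permutation of the same probability vector; the abstract does not define the term, so this definition, the example matrices, and all names are assumptions made for illustration. It is not the paper's statistical test.

```python
import math


def row_entropy(row):
    """Shannon entropy (in bits) of one row of a conditional pmf."""
    return -sum(p * math.log2(p) for p in row if p > 0)


def conditional_entropy(p_x, channel):
    """H(Y|X) = sum over x of p(x) * H(Y | X = x)."""
    return sum(px * row_entropy(row) for px, row in zip(p_x, channel))


# Assumed uniform channel: each row permutes the same vector (0.7, 0.2, 0.1).
uniform_channel = [
    [0.7, 0.2, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.1, 0.7],
]

# A channel whose rows have different entropies, for contrast.
skewed_channel = [
    [0.9, 0.05, 0.05],
    [1 / 3, 1 / 3, 1 / 3],
    [0.5, 0.5, 0.0],
]

# Two different hypothetical cause distributions.
p_a = [0.8, 0.1, 0.1]
p_b = [0.2, 0.3, 0.5]

for name, channel in [("uniform", uniform_channel), ("skewed", skewed_channel)]:
    h_a = conditional_entropy(p_a, channel)
    h_b = conditional_entropy(p_b, channel)
    print(f"{name}: H(Y|X) under p_a = {h_a:.4f}, under p_b = {h_b:.4f}")
```

Under this assumed definition, the conditional entropy of the first channel is the same for both cause distributions because every row has the same entropy, while the second channel's conditional entropy changes with the cause distribution.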

Submitted 14 March, 2023; originally announced March 2023.

Comments: 20 pages, 2 appendices

MSC Class: 62D20

arXiv:2303.07975 [pdf, other]

Software-based security approach for networked embedded devices

Authors: José Ferreira, Alan Oliveira, André Souto, José Cecílio

Abstract: As the Internet of Things (IoT) continues to expand, data security has become increasingly important for ensuring privacy and safety, especially given the sensitive and, sometimes, critical nature of the data handled by IoT devices. There exist hardware-based trusted execution environments used to protect data, but they are not compatible with low-cost devices that lack hardware-assisted security features. The research in this paper presents software-based protection and encryption mechanisms explicitly designed for embedded devices. The proposed architecture is designed to work with low-cost, low-end devices without requiring the usual changes on the underlying hardware. It protects against hardware attacks and supports runtime updates, enabling devices to write data in protected memory. The proposed solution is an alternative data security approach for low-cost IoT devices without compromising performance or functionality. Our work underscores the importance of developing secure and cost-effective solutions for protecting data in the context of IoT.

Submitted 14 March, 2023; originally announced March 2023.

Comments: 4

arXiv:2302.02910 [pdf, other]

An Empirical Analysis of Fairness Notions under Differential Privacy

Authors: Anderson Santana de Oliveira, Caelin Kaplan, Khawla Mallat, Tanmay Chakraborty

Abstract: Recent works have shown that selecting an optimal model architecture suited to the differential privacy setting is necessary to achieve the best possible utility for a given privacy budget using differentially private stochastic gradient descent (DP-SGD) (Tramer and Boneh 2020; Cheng et al. 2022). In light of these findings, we empirically analyse how different fairness notions, belonging to distinct classes of statistical fairness criteria (independence, separation and sufficiency), are impacted when one selects a model architecture suitable for DP-SGD, optimized for utility. Using standard datasets from ML fairness literature, we show, using a rigorous experimental protocol, that by selecting the optimal model architecture for DP-SGD, the differences across groups concerning the relevant fairness metrics (demographic parity, equalized odds and predictive parity) more often decrease or are negligibly impacted, compared to the non-private baseline, for which optimal model architecture has also been selected to maximize utility. These findings challenge the understanding that differential privacy will necessarily exacerbate unfairness in deep learning models trained on biased datasets.

Submitted 6 February, 2023; originally announced February 2023.

Comments: Accepted for oral presentation at the Fourth AAAI Workshop on Privacy-Preserving Artificial Intelligence (PPAI-23) https://aaai-ppai23.github.io/#accepted_papers

arXiv:2301.10608 [pdf, other]

Connecting metrics for shape-texture knowledge in computer vision

Authors: Tiago Oliveira, Tiago Marques, Arlindo L. Oliveira

Abstract: Modern artificial neural networks, including convolutional neural networks and vision transformers, have mastered several computer vision tasks, including object recognition. However, there are many significant differences between the behavior and robustness of these systems and of the human visual system. Deep neural networks remain brittle and susceptible to many changes in the image that do not cause humans to misclassify images. Part of this different behavior may be explained by the type of features humans and deep neural networks use in vision tasks. Humans tend to classify objects according to their shape while deep neural networks seem to rely mostly on texture. Exploring this question is relevant, since it may lead to better performing neural network architectures and to a better understanding of the workings of the vision system of primates. In this work, we advance the state of the art in our understanding of this phenomenon, by extending previous analyses to a much larger set of deep neural network architectures. We found that the performance of models in image classification tasks is highly correlated with their shape bias measured at the output and penultimate layer. Furthermore, our results showed that the number of neurons that represent shape and texture are strongly anti-correlated, thus providing evidence that there is competition between these two types of features. Finally, we observed that while in general there is a correlation between performance and shape bias, there are significant variations between architecture families.

Submitted 25 January, 2023; originally announced January 2023.

Comments: 7 pages, 3 figures

arXiv:2212.08568 [pdf, other]

Biomedical image analysis competitions: The state of current participation practice

Authors: Matthias Eisenmann, Annika Reinke, Vivienn Weru, Minu Dietlinde Tizabi, Fabian Isensee, Tim J. Adler, Patrick Godau, Veronika Cheplygina, Michal Kozubek, Sharib Ali, Anubha Gupta, Jan Kybic, Alison Noble, Carlos Ortiz de Solórzano, Samiksha Pachade, Caroline Petitjean, Daniel Sage, Donglai Wei, Elizabeth Wilden, Deepak Alapatt, Vincent Andrearczyk, Ujjwal Baid, Spyridon Bakas, Niranjan Balu, Sophia Bano, et al. (331 additional authors not shown)

Abstract: The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.

Submitted 12 September, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

arXiv:2211.02627 [pdf]

An IoT Cloud and Big Data Architecture for the Maintenance of Home Appliances

Authors: Pedro Chaves, Tiago Fonseca, Luis Lino Ferreira, Bernardo Cabral, Orlando Sousa, Andre Oliveira, Jorge Landeck

Abstract: Billions of interconnected Internet of Things (IoT) sensors and devices collect tremendous amounts of data from real-world scenarios. Big data is generating increasing interest in a wide range of industries. Once data is analyzed through compute-intensive Machine Learning (ML) methods, it can derive critical business value for organizations. Powerful platforms are essential to handle and process such massive collections of information cost-effectively and conveniently. This work introduces a distributed and scalable platform architecture that can be deployed for efficient real-world big data collection and analytics. The proposed system was tested with a case study for Predictive Maintenance of Home Appliances, where current and vibration sensors with high acquisition frequency were connected to washing machines and refrigerators. The introduced platform was used to collect, store, and analyze the data. The experimental results demonstrated that the presented system could be advantageous for tackling real-world IoT scenarios in a cost-effective and local approach.

Submitted 25 October, 2022; originally announced November 2022.

Comments: 6 pages, 6 figures, IECON 2022

arXiv:2210.13167 [pdf, other]

Exploring Self-Attention for Crop-type Classification Explainability

Authors: Ivica Obadic, Ribana Roscher, Dario Augusto Borges Oliveira, Xiao Xiang Zhu

Abstract: Automated crop-type classification using Sentinel-2 satellite time series is essential to support agriculture monitoring. Recently, deep learning models based on transformer encoders became a promising approach for crop-type classification. Using explainable machine learning to reveal the inner workings of these models is an important step towards improving stakeholders' trust and efficient agriculture monitoring. In this paper, we introduce a novel explainability framework that aims to shed light on the essential crop disambiguation patterns learned by a state-of-the-art transformer encoder model. More specifically, we process the attention weights of a trained transformer encoder to reveal the critical dates for crop disambiguation and use domain knowledge to uncover the phenological events that support the model performance. We also present a sensitivity analysis approach to better understand the attention capability for revealing crop-specific phenological events. We report compelling results showing that attention patterns strongly relate to key dates, and consequently, to the critical phenological events for crop-type classification. These findings might be relevant for improving stakeholder trust and optimizing agriculture monitoring processes. Additionally, our sensitivity analysis demonstrates the limitation of attention weights for identifying the important events in the crop phenology as we empirically show that the unveiled phenological events depend on the other crops in the data considered during training.

Submitted 24 October, 2022; originally announced October 2022.

arXiv:2210.11327 [pdf, other]

Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees

Authors: Moacir Antonelli Ponti, Lucas de Angelis Oliveira, Mathias Esteban, Valentina Garcia, Juan Martín Román, Luis Argerich

Abstract: Real-world datasets contain incorrectly labeled instances that hamper the performance of the model and, in particular, the ability to generalize out of distribution. Also, each example might have a different contribution towards learning. This motivates studies toward a better understanding of the role of data instances with respect to their contribution to good metrics in models. In this paper we propose a method based on metrics computed from training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. We focus on datasets containing mostly tabular or structured data, for which Decision Tree ensembles are still the state-of-the-art in terms of performance. Our methods achieved the best results overall when compared with confident learning, direct heuristics and a robust boosting algorithm. We show results on detecting noisy labels in order to clean datasets, improving models' metrics in synthetic and real public datasets, as well as on an industry case in which we deployed a model based on the proposed solution.

Submitted 22 February, 2024; v1 submitted 20 October, 2022; originally announced October 2022.

arXiv:2209.10901 [pdf, other]

Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning

Authors: Manuel Goulão, Arlindo L. Oliveira

Abstract: The Vision Transformer architecture has been shown to be competitive in the computer vision (CV) space where it has dethroned convolution-based networks in several benchmarks. Nevertheless, convolutional neural networks (CNN) remain the preferential architecture for the representation module in reinforcement learning. In this work, we study pretraining a Vision Transformer using several state-of-the-art self-supervised methods and assess the quality of the learned representations. To show the importance of the temporal dimension in this context we propose an extension of VICReg to better capture temporal relations between observations by adding a temporal order verification task. Our results show that all methods are effective in learning useful representations and avoiding representational collapse for observations from the Arcade Learning Environment (ALE), which leads to improvements in data efficiency when evaluated in reinforcement learning (RL). Moreover, the encoder pretrained with the temporal order verification task shows the best results across all experiments, with richer representations, more focused attention maps and sparser representation vectors throughout the layers of the encoder, which shows the importance of exploring such similarity dimension. With this work, we hope to provide some insights into the representations learned by ViT during a self-supervised pretraining with observations from RL environments and which properties arise in the representations that lead to the best-performing agents. The source code will be available at: https://github.com/mgoulao/TOV-VICReg

Submitted 18 July, 2023; v1 submitted 22 September, 2022; originally announced September 2022.

arXiv:2209.07928 [pdf, other]

The BLue Amazon Brain (BLAB): A Modular Architecture of Services about the Brazilian Maritime Territory

Authors: Paulo Pirozelli, Ais B. R. Castro, Ana Luiza C. de Oliveira, André S. Oliveira, Flávio N. Cação, Igor C. Silveira, João G. M. Campos, Laura C. Motheo, Leticia F. Figueiredo, Lucas F. A. O. Pellicer, Marcelo A. José, Marcos M. José, Pedro de M. Ligabue, Ricardo S. Grava, Rodrigo M. Tavares, Vinícius B. Matos, Yan V. Sym, Anna H. R. Costa, Anarosa A. F. Brandão, Denis D. Mauá, Fabio G. Cozman, Sarajane M. Peres

Abstract: We describe the first steps in the development of an artificial agent focused on the Brazilian maritime territory, a large region within the South Atlantic also known as the Blue Amazon. The "BLue Amazon Brain" (BLAB) integrates a number of services aimed at disseminating information about this region and its importance, functioning as a tool for environmental awareness. The main service provided by BLAB is a conversational facility that deals with complex questions about the Blue Amazon, called BLAB-Chat; its central component is a controller that manages several task-oriented natural language processing modules (e.g., question answering and summarizer systems). These modules have access to an internal data lake as well as to third-party databases. A news reporter (BLAB-Reporter) and a purposely-developed wiki (BLAB-Wiki) are also part of the BLAB service architecture. In this paper, we describe our current version of BLAB's architecture (interface, backend, web services, NLP modules, and resources) and comment on the challenges we have faced so far, such as the lack of training data and the scattered state of domain information. Solving these issues presents a considerable challenge in the development of artificial intelligence for technical domains.

Submitted 6 September, 2022; originally announced September 2022.

Journal ref: AI: Modeling Oceans and Climate Change (IJCAI-ECAI), 2022

arXiv:2209.06932 [pdf, other]

Optimizing Connectivity through Network Gradients for Restricted Boltzmann Machines

Authors: A. C. N. de Oliveira, D. R. Figueiredo

Abstract: Leveraging sparse networks to connect successive layers in deep neural networks has recently been shown to provide benefits to large scale state-of-the-art models. However, network connectivity also plays a significant role on the learning performance of shallow networks, such as the classic Restricted Boltzmann Machines (RBM). Efficiently finding sparse connectivity patterns that improve the learning performance of shallow networks is a fundamental problem. While recent principled approaches explicitly include network connections as model parameters that must be optimized, they often rely on explicit penalization or have network sparsity as a hyperparameter. This work presents the Network Connectivity Gradients (NCG), a method to find optimal connectivity patterns for RBMs based on the idea of network gradients: computing the gradient of every possible connection, given a specific connection pattern, and using the gradient to drive a continuous connection strength parameter that in turn is used to determine the connection pattern. Thus, learning RBM parameters and learning network connections is truly jointly performed, albeit with different learning rates, and without changes to the model's classic objective function. The method is applied to the MNIST and other data sets showing that better RBM models are found for the benchmark tasks of sample generation and input classification. Results also show that NCG is robust to network initialization, both adding and removing network connections while learning.

Submitted 3 December, 2022; v1 submitted 14 September, 2022; originally announced September 2022.

arXiv:2208.11607 [pdf, other]

Learning crop type mapping from regional label proportions in large-scale SAR and optical imagery

Authors: Laura E. C. La Rosa, Dario A. B. Oliveira, Pedram Ghamisi

Abstract: The application of deep learning algorithms to Earth observation (EO) in recent years has enabled substantial progress in fields that rely on remotely sensed data. However, given the data scale in EO, creating large datasets with pixel-level annotations by experts is expensive and highly time-consuming. In this context, priors are seen as an attractive way to alleviate the burden of manual labeling when training deep learning methods for EO. For some applications, those priors are readily available. Motivated by the great success of contrastive-learning methods for self-supervised feature representation learning in many computer-vision tasks, this study proposes an online deep clustering method using crop label proportions as priors to learn a sample-level classifier based on government crop-proportion data for a whole agricultural region. We evaluate the method using two large datasets from two different agricultural regions in Brazil. Extensive experiments demonstrate that the method is robust to different data types (synthetic-aperture radar and optical images), reporting higher accuracy values considering the major crop types in the target regions. Thus, it can alleviate the burden of large-scale image annotation in EO applications.

Submitted 24 August, 2022; originally announced August 2022.
