Search | arXiv e-print repository

doi 10.1007/s43681-024-00469-8

Crossing the principle-practice gap in AI ethics with ethical problem-solving

Authors: Nicholas Kluge Corrêa, James William Santos, Camila Galvão, Marcelo Pasetti, Dieine Schiavon, Faizah Naqvi, Robayet Hossain, Nythamar De Oliveira

Abstract: The past years have presented a surge in artificial intelligence (AI) development, fueled by breakthroughs in deep learning, increased computational power, and substantial investments in the field. Given the generative capabilities of more recent AI systems, the era of large-scale AI models has transformed various domains that intersect our daily lives. However, this progress raises concerns about the balance between technological advancement, ethical considerations, safety measures, and financial interests. Moreover, using such systems in sensitive areas amplifies our general ethical awareness, prompting a reemergence of debates on governance, regulation, and human values. However, amidst this landscape, how to bridge the principle-practice gap separating ethical discourse from the technical side of AI development remains an open problem. In response to this challenge, the present work proposes a framework to help shorten this gap: ethical problem-solving (EPS). EPS is a methodology promoting responsible, human-centric, and value-oriented AI development. The framework's core resides in translating principles into practical implementations using impact assessment surveys and a differential recommendation methodology. We utilize EPS as a blueprint to propose the implementation of Ethics as a Service Platform, which is currently available as a simple demonstration. We released all framework components openly and with a permissive license, hoping the community would adopt and extend our efforts into other contexts. Available at https://github.com/Nkluge-correa/ethical-problem-solving

Submitted 16 April, 2024; originally announced June 2024.

arXiv:2404.16992 [pdf, other]

doi 10.1145/3661167.3661225

A Catalog of Transformations to Remove Smells From Natural Language Tests

Authors: Manoel Aranda, Naelson Oliveira, Elvys Soares, Márcio Ribeiro, Davi Romão, Ullyanne Patriota, Rohit Gheyi, Emerson Souza, Ivan Machado

Abstract: Test smells can pose difficulties during testing activities, such as poor maintainability, non-deterministic behavior, and incomplete verification. Existing research has extensively addressed test smells in automated software tests but little attention has been given to smells in natural language tests. While some research has identified and catalogued such smells, there is a lack of systematic approaches for their removal. Consequently, there is also a lack of tools to automatically identify and remove natural language test smells. This paper introduces a catalog of transformations designed to remove seven natural language test smells and a companion tool implemented using Natural Language Processing (NLP) techniques. Our work aims to enhance the quality and reliability of natural language tests during software development. The research employs a two-fold empirical strategy to evaluate its contributions. First, a survey involving 15 software testing professionals assesses the acceptance and usefulness of the catalog's transformations. Second, an empirical study evaluates our tool to remove natural language test smells by analyzing a sample of real-practice tests from the Ubuntu OS. The results indicate that software testing professionals find the transformations valuable. Additionally, the automated tool demonstrates a good level of precision, as evidenced by an F-Measure rate of 83.70%.

Submitted 25 April, 2024; originally announced April 2024.

Comments: Distinguished Paper Award at International Conference on Evaluation and Assessment in Software Engineering (EASE), 2024 edition

ACM Class: D.2.5

arXiv:2401.16640 [pdf, other]

doi 10.1016/j.mlwa.2024.100558

TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese

Authors: Nicholas Kluge Corrêa, Sophia Falk, Shiza Fatimah, Aniket Sen, Nythamar de Oliveira

Abstract: Large language models (LLMs) have significantly advanced natural language processing, but their progress has yet to be equal across languages. While most LLMs are trained in high-resource languages like English, multilingual models generally underperform monolingual ones. Additionally, aspects of their multilingual foundation sometimes restrict the byproducts they produce, like computational demands and licensing regimes. In this study, we document the development of open-foundation models tailored for use in low-resource settings, their limitations, and their benefits. This is the TeenyTinyLlama pair: two compact models for Brazilian Portuguese text generation. We release them under the permissive Apache 2.0 license on GitHub and Hugging Face for community use and further development. See https://github.com/Nkluge-correa/TeenyTinyLlama

Submitted 17 May, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: 21 pages, 5 figures

Journal ref: Machine Learning With Applications, 16, 100558

arXiv:2312.17479 [pdf, other]

Culturally-Attuned Moral Machines: Implicit Learning of Human Value Systems by AI through Inverse Reinforcement Learning

Authors: Nigini Oliveira, Jasmine Li, Koosha Khalvati, Rodolfo Cortes Barragan, Katharina Reinecke, Andrew N. Meltzoff, Rajesh P. N. Rao

Abstract: Constructing a universal moral code for artificial intelligence (AI) is difficult or even impossible, given that different human cultures have different definitions of morality and different societal norms. We therefore argue that the value system of an AI should be culturally attuned: just as a child raised in a particular culture learns the specific values and norms of that culture, we propose that an AI agent operating in a particular human community should acquire that community's moral, ethical, and cultural codes. How AI systems might acquire such codes from human observation and interaction has remained an open question. Here, we propose using inverse reinforcement learning (IRL) as a method for AI agents to acquire a culturally-attuned value system implicitly. We test our approach using an experimental paradigm in which AI agents use IRL to learn different reward functions, which govern the agents' moral values, by observing the behavior of different cultural groups in an online virtual world requiring real-time decision making. We show that an AI agent learning from the average behavior of a particular cultural group can acquire altruistic characteristics reflective of that group's behavior, and this learned value system can generalize to new scenarios requiring altruistic judgments. Our results provide, to our knowledge, the first demonstration that AI agents could potentially be endowed with the ability to continually learn their values and norms from observing and interacting with humans, thereby becoming attuned to the culture they are operating in.

Submitted 29 December, 2023; originally announced December 2023.

arXiv:2308.01386 [pdf, other]

Manual Tests Do Smell! Cataloging and Identifying Natural Language Test Smells

Authors: Elvys Soares, Manoel Aranda, Naelson Oliveira, Márcio Ribeiro, Rohit Gheyi, Emerson Souza, Ivan Machado, André Santos, Baldoino Fonseca, Rodrigo Bonifácio

Abstract: Background: Test smells indicate potential problems in the design and implementation of automated software tests that may negatively impact test code maintainability, coverage, and reliability. When poorly described, manual tests written in natural language may suffer from related problems, which enable their analysis from the point of view of test smells. Despite the possible harm to manually tested software products, little is known about test smells in manual tests, which results in many open questions regarding their types, frequency, and harm to tests written in natural language. Aims: Therefore, this study aims to contribute to a catalog of test smells for manual tests. Method: We perform a two-fold empirical strategy. First, an exploratory study in manual tests of three systems: the Ubuntu Operating System, the Brazilian Electronic Voting Machine, and the User Interface of a large smartphone manufacturer. We use our findings to propose a catalog of eight test smells and identification rules based on syntactical and morphological text analysis, validating our catalog with 24 in-company test engineers. Second, using our proposals, we create a tool based on Natural Language Processing (NLP) to analyze the subject systems' tests, validating the results. Results: We observed the occurrence of eight test smells. A survey of 24 in-company test professionals showed that 80.7% agreed with our catalog definitions and examples. Our NLP-based tool achieved a precision of 92%, recall of 95%, and f-measure of 93.5%, and its execution evidenced 13,169 occurrences of our cataloged test smells in the analyzed systems. Conclusion: We contribute with a catalog of natural language test smells and novel detection strategies that better explore the capabilities of current NLP mechanisms with promising results and reduced effort to analyze tests written in different languages.

Submitted 2 August, 2023; originally announced August 2023.

Comments: The 17th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2023
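
As a quick consistency check on the figures quoted in the abstract above, the f-measure can be recomputed from the reported precision and recall. The sketch below assumes the abstract uses the standard balanced F1 score (the harmonic mean of precision and recall); the function name `f_measure` and its `beta` parameter are illustrative and not taken from the paper or its tool.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score, the weighted harmonic mean of precision and recall.

    With beta = 1.0 this is the balanced F1 score (assumed here to be the
    "f-measure" the abstract refers to).
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    denominator = beta ** 2 * precision + recall
    if denominator == 0.0:
        # Both inputs are zero: define the score as 0 instead of dividing by zero.
        return 0.0
    return (1.0 + beta ** 2) * precision * recall / denominator


# Precision and recall as reported in the abstract.
score = f_measure(0.92, 0.95)
print(f"{score:.1%}")  # prints 93.5%
```

Under that assumption, the recomputed value agrees with the 93.5% stated in the abstract.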

arXiv:2308.00886 [pdf]

Enhancing Machine Learning Performance with Continuous In-Session Ground Truth Scores: Pilot Study on Objective Skeletal Muscle Pain Intensity Prediction

Authors: Boluwatife E. Faremi, Jonathon Stavres, Nuno Oliveira, Zhaoxian Zhou, Andrew H. Sung

Abstract: Machine learning (ML) models trained on subjective self-report scores struggle to objectively classify pain accurately due to the significant variance between real-time pain experiences and recorded scores afterwards. This study developed two devices for acquisition of real-time, continuous in-session pain scores and gathering of ANS-modulated electrodermal activity (EDA). The experiment recruited N = 24 subjects who underwent a post-exercise circulatory occlusion (PECO) with stretch, inducing discomfort. Subject data were stored in a custom pain platform, facilitating extraction of time-domain EDA features and in-session ground truth scores. Moreover, post-experiment visual analog scale (VAS) scores were collected from each subject. Machine learning models, namely Multi-layer Perceptron (MLP) and Random Forest (RF), were trained using corresponding objective EDA features combined with in-session scores and post-session scores, respectively. Over a 10-fold cross-validation, the macro-averaged geometric mean score revealed MLP and RF models trained with objective EDA features and in-session scores achieved superior performance (75.9% and 78.3%) compared to models trained with post-session scores (70.3% and 74.6%) respectively. This pioneering study demonstrates that using continuous in-session ground truth scores significantly enhances ML performance in pain intensity characterization, overcoming ground truth sparsity-related issues, data imbalance, and high variance. This study informs future objective-based ML pain system training.

Submitted 1 August, 2023; originally announced August 2023.

Comments: 18 pages, 2-page Appendix, 7 figures

ACM Class: B.7; D.2.5; D.2.9; H.2.8; H.2.1; I.2; J.2; J.6; K.6.3

arXiv:2211.07315 [pdf]

doi 10.1590/0100-6045.2022.V45N4.NN

Counterfactual Analysis by Algorithmic Complexity: A metric between possible worlds

Authors: Nicholas Kluge Corrêa, Nythamar Fernandes De Oliveira

Abstract: Counterfactuals have become an important area of interdisciplinary interest, especially in logic, philosophy of language, epistemology, metaphysics, psychology, decision theory, and even artificial intelligence. In this study, we propose a new form of analysis for counterfactuals: analysis by algorithmic complexity. Inspired by Lewis-Stalnaker's Possible Worlds Semantics, the proposed method allows for a new interpretation of the debate between David Lewis and Robert Stalnaker regarding the Limit and Singularity assumptions. Besides other results, we offer a new way to answer the problems raised by Goodman and Quine regarding vagueness, context-dependence, and the non-monotonicity of counterfactuals. Engaging in a dialogue with literature, this study will seek to bring new insights and tools to this debate. We hope our method of analysis can make counterfactuals more understandable in an intuitively plausible way, and a philosophically justifiable manner, aligned with the way we usually think about counterfactual propositions and our imaginative reasoning.

Submitted 14 November, 2022; originally announced November 2022.

Journal ref: November 2022, Manuscrito 46(4):1-34

arXiv:2210.15289 [pdf]

On the Efficiency of Ethics as a Governing Tool for Artificial Intelligence

Authors: Nicholas Kluge Corrêa, Nythamar De Oliveira, Diogo Massmann

Abstract: The 4th Industrial Revolution is the culmination of the digital age. Nowadays, technologies such as robotics, nanotechnology, genetics, and artificial intelligence promise to transform our world and the way we live. Artificial Intelligence Ethics and Safety is an emerging research field that has been gaining popularity in recent years. Several private, public and non-governmental organizations have published guidelines proposing ethical principles for regulating the use and development of autonomous intelligent systems. Meta-analyses of the AI Ethics research field point to convergence on certain principles that supposedly govern the AI industry. However, little is known about the effectiveness of this form of Ethics. In this paper, we would like to conduct a critical analysis of the current state of AI Ethics and suggest that this form of governance based on principled ethical guidelines is not sufficient to norm the AI industry and its developers. We believe that drastic changes are necessary, both in the training processes of professionals in the fields related to the development of software and intelligent systems and in the increased regulation of these professionals and their industry. To this end, we suggest that law should benefit from recent contributions from bioethics, to make the contributions of AI ethics to governance explicit in legal terms.

Submitted 27 October, 2022; originally announced October 2022.

arXiv:2210.04794 [pdf, other]

Towards a case-based learning approach to support software architecture education

Authors: Brauner R. N. Oliveira, Elisa Y. Nakagawa

Abstract: Software architecture education remains challenging for instructors, students, and software industry professionals. Several initiatives have been proposed to mitigate the inherent challenges, including games, supporting tools, collaborative courses, and hands-on projects. Case-based learning has been introduced in software architecture, and its benefits are recognized. However, choosing the right cases that cover the stated learning objectives and developing learning activities to achieve high-order learning are also challenging. The main goal of this paper is to present a case-based learning approach that guides the development of learning objectives, the finding and selection of real-world software architecture cases, and the design of instructional activities. We applied our approach in software architecture related courses during the past few years. The results show that it can leverage the ways to adequately explore cases for educational purposes while also motivating instructors and students to engage with software architecture education.

Submitted 12 September, 2022; originally announced October 2022.

arXiv:2209.06932 [pdf, other]

Optimizing Connectivity through Network Gradients for Restricted Boltzmann Machines

Authors: A. C. N. de Oliveira, D. R. Figueiredo

Abstract: Leveraging sparse networks to connect successive layers in deep neural networks has recently been shown to provide benefits to large scale state-of-the-art models. However, network connectivity also plays a significant role on the learning performance of shallow networks, such as the classic Restricted Boltzmann Machines (RBM). Efficiently finding sparse connectivity patterns that improve the learning performance of shallow networks is a fundamental problem. While recent principled approaches explicitly include network connections as model parameters that must be optimized, they often rely on explicit penalization or have network sparsity as a hyperparameter. This work presents the Network Connectivity Gradients (NCG), a method to find optimal connectivity patterns for RBMs based on the idea of network gradients: computing the gradient of every possible connection, given a specific connection pattern, and using the gradient to drive a continuous connection strength parameter that in turn is used to determine the connection pattern. Thus, learning RBM parameters and learning network connections is truly jointly performed, albeit with different learning rates, and without changes to the model's classic objective function. The method is applied to the MNIST and other data sets showing that better RBM models are found for the benchmark tasks of sample generation and input classification. Results also show that NCG is robust to network initialization, both adding and removing network connections while learning.

Submitted 3 December, 2022; v1 submitted 14 September, 2022; originally announced September 2022.

arXiv:2208.14375 [pdf]

doi 10.1016/j.compbiomed.2017.05.013

Automated recognition of the pericardium contour on processed CT images using genetic algorithms

Authors: E. O. Rodrigues, L. O. Rodrigues, L. S. N. Oliveira, A. Conci, P. Liatsis

Abstract: This work proposes the use of Genetic Algorithms (GA) in tracing and recognizing the pericardium contour of the human heart using Computed Tomography (CT) images. We assume that each slice of the pericardium can be modelled by an ellipse, the parameters of which need to be optimally determined. An optimal ellipse would be one that closely follows the pericardium contour and, consequently, separates appropriately the epicardial and mediastinal fats of the human heart. Tracing and automatically identifying the pericardium contour aids in medical diagnosis. Usually, this process is done manually or not done at all due to the effort required. Besides, detecting the pericardium may improve previously proposed automated methodologies that separate the two types of fat associated with the human heart. Quantification of these fats provides important health risk marker information, as they are associated with the development of certain cardiovascular pathologies. Finally, we conclude that GA offers satisfactory solutions in a feasible amount of processing time.

Submitted 30 August, 2022; originally announced August 2022.

Journal ref: Computers in Biology and Medicine, Volume 87, 2017, Pages 38-45, ISSN 0010-4825

arXiv:2207.01595 [pdf, ps, other]

Deep Learning for Short-term Instant Energy Consumption Forecasting in the Manufacturing Sector

Authors: Nuno Oliveira, Norberto Sousa, Isabel Praça

Abstract: Electricity is a volatile power source that requires great planning and resource management for both the short and long term. More specifically, in the short term, accurate instant energy consumption forecasting contributes greatly to improving the efficiency of buildings, opening new avenues for the adoption of renewable energy. In that regard, data-driven approaches, namely the ones based on machine learning, are beginning to be preferred over more traditional ones since they provide not only simpler ways of deployment but also state-of-the-art results. In that sense, this work applies and compares the performance of several deep learning algorithms, LSTM, CNN, mixed CNN-LSTM and TCN, in a real testbed within the manufacturing sector. The experimental results suggest that the TCN is the most reliable method for predicting instant energy consumption in the short term.

Submitted 4 July, 2022; originally announced July 2022.

arXiv:2206.11922 [pdf]

doi 10.1016/j.patter.2023.100857

Worldwide AI Ethics: a review of 200 guidelines and recommendations for AI governance

Authors: Nicholas Kluge Corrêa, Camila Galvão, James William Santos, Carolina Del Pino, Edson Pontes Pinto, Camila Barbosa, Diogo Massmann, Rodrigo Mambrini, Luiza Galvão, Edmund Terem, Nythamar de Oliveira

Abstract: The utilization of artificial intelligence (AI) applications has experienced tremendous growth in recent years, bringing forth numerous benefits and conveniences. However, this expansion has also provoked ethical concerns, such as privacy breaches, algorithmic discrimination, security and reliability issues, transparency, and other unintended consequences. To determine whether a global consensus exists regarding the ethical principles that should govern AI applications and to contribute to the formation of future regulations, this paper conducts a meta-analysis of 200 governance policies and ethical guidelines for AI usage published by public bodies, academic institutions, private companies, and civil society organizations worldwide. We identified at least 17 resonating principles prevalent in the policies and guidelines of our dataset, released as an open-source database and tool. We present the limitations of performing a global scale analysis study paired with a critical analysis of our findings, presenting areas of consensus that should be incorporated into future regulatory efforts. All components tied to this work can be found in https://nkluge-correa.github.io/worldwide_AI-ethics/

Submitted 19 February, 2024; v1 submitted 23 June, 2022; originally announced June 2022.

Journal ref: Patterns, VOLUME 4, ISSUE 10, 100857, OCTOBER 13, 2023

arXiv:2206.11866 [pdf]

doi 10.1007/978-3-031-20859-1_13

A Multi-Policy Framework for Deep Learning-Based Fake News Detection

Authors: João Vitorino, Tiago Dias, Tiago Fonseca, Nuno Oliveira, Isabel Praça

Abstract: Connectivity plays an ever-increasing role in modern society, with people all around the world having easy access to rapidly disseminated information. However, a more interconnected society enables the spread of intentionally false information. To mitigate the negative impacts of fake news, it is essential to improve detection methodologies. This work introduces Multi-Policy Statement Checker (MPSC), a framework that automates fake news detection by using deep learning techniques to analyze a statement itself and its related news articles, predicting whether it is seemingly credible or suspicious. The proposed framework was evaluated using four merged datasets containing real and fake news. Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU) and Bidirectional Encoder Representations from Transformers (BERT) models were trained to utilize both lexical and syntactic features, and their performance was evaluated. The obtained results demonstrate that a multi-policy analysis reliably identifies suspicious statements, which can be advantageous for fake news detection.

Submitted 1 June, 2022; originally announced June 2022.

Comments: 10 pages, 1 table, 3 figures, DCAI 2022 conference

arXiv:2203.04234 [pdf]

doi 10.3390/fi14040108

Adaptative Perturbation Patterns: Realistic Adversarial Learning for Robust Intrusion Detection

Authors: João Vitorino, Nuno Oliveira, Isabel Praça

Abstract: Adversarial attacks pose a major threat to machine learning and to the systems that rely on it. In the cybersecurity domain, adversarial cyber-attack examples capable of evading detection are especially concerning. Nonetheless, an example generated for a domain with tabular data must be realistic within that domain. This work establishes the fundamental constraint levels required to achieve realism and introduces the Adaptative Perturbation Pattern Method (A2PM) to fulfill these constraints in a gray-box setting. A2PM relies on pattern sequences that are independently adapted to the characteristics of each class to create valid and coherent data perturbations. The proposed method was evaluated in a cybersecurity case study with two scenarios: Enterprise and Internet of Things (IoT) networks. Multilayer Perceptron (MLP) and Random Forest (RF) classifiers were created with regular and adversarial training, using the CIC-IDS2017 and IoT-23 datasets. In each scenario, targeted and untargeted attacks were performed against the classifiers, and the generated examples were compared with the original network traffic flows to assess their realism. The obtained results demonstrate that A2PM provides a scalable generation of realistic adversarial examples, which can be advantageous for both adversarial training and attacks.

Submitted 29 March, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

Comments: 18 pages, 6 tables, 10 figures, Future Internet journal

arXiv:2201.07866 [pdf]

A Practical Approach of Actions for FAIRification Workflows

Authors: Natalia Queiroz de Oliveira, Vânia Borges, Henrique F. Rodrigues, Maria Luiza Machado Campos, Giseli Rabello Lopes

Abstract: Since their proposal in 2016, the FAIR principles have been largely discussed by different communities and initiatives involved in the development of infrastructures to enhance support for data findability, accessibility, interoperability, and reuse. One of the challenges in implementing these principles lies in defining a well-delimited process with organized and detailed actions. This paper presents a workflow of actions that is being adopted in the VODAN BR pilot for generating FAIR (meta)data for COVID-19 research. It provides the understanding of each step of the process, establishing their contribution. In this work, we also evaluate potential tools to (semi)automatize (meta)data treatment whenever possible. Although defined for a particular use case, it is expected that this workflow can be applied for other epidemical research and in other domains, benefiting the entire scientific community.

Submitted 19 January, 2022; originally announced January 2022.

Comments: Preprint. Submitted to MTSR2021 on 25th October 2021. 12 pages. To be published in "Metadata and Semantic Research"

arXiv:2112.14821 [pdf, other]

Anomaly Detection in Cyber-Physical Systems: Reconstruction of a Prediction Error Feature Space

Authors: Nuno Oliveira, Norberto Sousa, Jorge Oliveira, Isabel Praça

Abstract: Cyber-physical systems are infrastructures that use digital information such as network communications and sensor readings to control entities in the physical world. Many cyber-physical systems in airports, hospitals and nuclear power plants are regarded as critical infrastructures since a disruption of its normal functionality can result in negative consequences for the society. In the last few years, some security solutions for cyber-physical systems based on artificial intelligence have been proposed. Nevertheless, knowledge domain is required to properly setup and train artificial intelligence algorithms. Our work proposes a novel anomaly detection framework based on error space reconstruction, where genetic algorithms are used to perform hyperparameter optimization of machine learning methods. The proposed method achieved an F1-score of 87.89% in the SWaT dataset.

Submitted 29 December, 2021; originally announced December 2021.

arXiv:2112.01103 [pdf]

A tool to support the investigation and visualization of cyber and/or physical incidents

Authors: Inês Macedo, Sinan Wanous, Nuno Oliveira, Orlando Sousa, Isabel Praça

Abstract: Investigating efficiently the data collected from a system's activity can help to detect malicious attempts and better understand the context behind past incident occurrences. Nowadays, several solutions can be used to monitor system activities to detect probable abnormalities and malfunctions. However, most of these systems overwhelm their users with vast amounts of information, making it harder for them to perceive incident occurrences and their context. Our approach combines a dynamic and intuitive user interface with Machine Learning forecasts to provide an intelligent investigation tool that facilitates the security operator's work. Our system can also act as an enhanced and fully automated decision support mechanism that provides suggestions about possible incident occurrences.

Submitted 2 December, 2021; originally announced December 2021.

arXiv:2111.10280 [pdf]

A Hybrid Approach for an Interpretable and Explainable Intrusion Detection System

Authors: Tiago Dias, Nuno Oliveira, Norberto Sousa, Isabel Praça, Orlando Sousa

Abstract: Cybersecurity has been a concern for quite a while now. In the latest years, cyberattacks have been increasing in size and complexity, fueled by significant advances in technology. Nowadays, there is an unavoidable necessity of protecting systems and data crucial for business continuity. Hence, many intrusion detection systems have been created in an attempt to mitigate these threats and contribute to a timelier detection. This work proposes an interpretable and explainable hybrid intrusion detection system, which makes use of artificial intelligence methods to achieve better and more long-lasting security. The system combines experts' written rules and dynamic knowledge continuously generated by a decision tree algorithm as new shreds of evidence emerge from network activity.

Submitted 19 November, 2021; originally announced November 2021.

Comments: 11 pages, 5 figures, 1 table, ISDA conference

arXiv:2109.12386 [pdf, other]

A Multi-Agent System for Autonomous Mobile Robot Coordination

Authors: Norberto Sousa, Nuno Oliveira, Isabel Praça

Abstract: The automation of internal logistics and inventory-related tasks is one of the main challenges of modern-day manufacturing corporations since it allows a more effective application of their human resources. Nowadays, Autonomous Mobile Robots (AMR) are state of the art technologies for such applications due to their great adaptability in dynamic environments, replacing more traditional solutions such as Automated Guided Vehicles (AGV), which are quite limited in terms of flexibility and require expensive facility updates for their installation. The application of Artificial Intelligence (AI) to increase AMRs capabilities has been contributing for the development of more sophisticated and efficient robots. Nevertheless, multi-robot coordination and cooperation for solving complex tasks is still a hot research line with increasing interest. This work proposes a Multi-Agent System for coordinating multiple TIAGo robots in tasks related to the manufacturing ecosystem such as the transportation and dispatching of raw materials, finished products and tools. Furthermore, the system is showcased in a realistic simulation using both Gazebo and Robot Operating System (ROS).

Submitted 25 September, 2021; originally announced September 2021.

arXiv:2107.02753 [pdf, other]

Machine Learning for Network-based Intrusion Detection Systems: an Analysis of the CIDDS-001 Dataset

Authors: José Carneiro, Nuno Oliveira, Norberto Sousa, Eva Maia, Isabel Praça

Abstract: With the increasing amount of reliance on digital data and computer networks by corporations and the public in general, the occurrence of cyber attacks has become a great threat to the normal functioning of our society. Intrusion detection systems seek to address this threat by preemptively detecting attacks in real time while attempting to block them or minimizing their damage. These systems can function in many ways being some of them based on artificial intelligence methods. Datasets containing both normal network traffic and cyber attacks are used for training these algorithms so that they can learn the underlying patterns of network-based data. The CIDDS-001 is one of the most used datasets for network-based intrusion detection research. Regarding this dataset, in the majority of works published so far, the Class label was used for training machine learning algorithms. However, there is another label in the CIDDS-001, AttackType, that seems very promising for this purpose and remains considerably unexplored. This work seeks to make a comparison between two machine learning models, K-Nearest Neighbours and Random Forest, which were trained with both these labels in order to ascertain whether AttackType can produce reliable results in comparison with the Class label.

Submitted 2 July, 2021; originally announced July 2021.

arXiv:2107.00082 [pdf, other]

A Search Engine for Scientific Publications: a Cybersecurity Case Study

Authors: Nuno Oliveira, Norberto Sousa, Isabel Praça

Abstract: Cybersecurity is a very challenging topic of research nowadays, as digitalization increases the interaction of people, software and services on the Internet by means of technology devices and networks connected to it. The field is broad and has a lot of unexplored ground under numerous disciplines such as management, psychology, and data science. Its large disciplinary spectrum and many significant research topics generate a considerable amount of information, making it hard for us to find what we are looking for when researching a particular subject. This work proposes a new search engine for scientific publications which combines both information retrieval and reading comprehension algorithms to extract answers from a collection of domain-specific documents. The proposed solution although being applied to the context of cybersecurity exhibited great generalization capabilities and can be easily adapted to perform under other distinct knowledge domains.

Submitted 30 June, 2021; originally announced July 2021.

arXiv:2103.07953 [pdf, other]

doi 10.1109/ACCESS.2021.3137633

A new interpretable unsupervised anomaly detection method based on residual explanation

Authors: David F. N. Oliveira, Lucio F. Vismari, Alexandre M. Nascimento, Jorge R. de Almeida Jr, Paulo S. Cugnasca, Joao B. Camargo Jr, Leandro Almeida, Rafael Gripp, Marcelo Neves

Abstract: Despite the superior performance in modeling complex patterns to address challenging problems, the black-box nature of Deep Learning (DL) methods impose limitations to their application in real-world critical domains. The lack of a smooth manner for enabling human reasoning about the black-box decisions hinder any preventive action to unexpected events, in which may lead to catastrophic consequences. To tackle the unclearness from black-box models, interpretability became a fundamental requirement in DL-based systems, leveraging trust and knowledge by providing ways to understand the model's behavior. Although a current hot topic, further advances are still needed to overcome the existing limitations of the current interpretability methods in unsupervised DL-based models for Anomaly Detection (AD). Autoencoders (AE) are the core of unsupervised DL-based for AD applications, achieving best-in-class performance. However, due to their hybrid aspect to obtain the results (by requiring additional calculations out of network), only agnostic interpretable methods can be applied to AE-based AD. These agnostic methods are computationally expensive to process a large number of parameters. In this paper we present the RXP (Residual eXPlainer), a new interpretability method to deal with the limitations for AE-based AD in large-scale systems. It stands out for its implementation simplicity, low computational cost and deterministic behavior, in which explanations are obtained through the deviation analysis of reconstructed input features. In an experiment using data from a real heavy-haul railway line, the proposed method achieved superior performance compared to SHAP, demonstrating its potential to support decision making in large scale critical systems.

Submitted 14 March, 2021; originally announced March 2021.

Comments: 8 pages

ACM Class: I.2.6; I.2.1

arXiv:2010.10510 [pdf, other]

Compiling quantamorphisms for the IBM Q Experience

Authors: Ana Neri, Rui Soares Barbosa, José N. Oliveira

Abstract: Based on the connection between the categorical derivation of classical programs from specifications and the category-theoretic approach to quantum physics, this paper contributes to extending the laws of classical program algebra to quantum programming. This aims at building correct-by-construction quantum circuits to be deployed on quantum devices such as those available at the IBM Q Experience. Quantum circuit reversibility is ensured by minimal complements, extended recursively. Measurements are postponed to the end of such recursive computations, termed "quantamorphisms", thus maximising the quantum effect. Quantamorphisms are classical catamorphisms which, extended to ensure quantum reversibility, implement quantum cycles (vulg. for-loops) and quantum folds on lists. By Kleisli correspondence, quantamorphisms can be written as monadic functional programs with quantum parameters. This enables the use of Haskell, a monadic functional programming language, to perform the experimental work. Such calculated quantum programs prepared in Haskell are pushed through Quipper to the Qiskit interface to IBM Q quantum devices. The generated quantum circuits - often quite large - exhibit the predicted behaviour. However, running them on real quantum devices incurs into a significant amount of errors. As quantum devices are constantly evolving, an increase in reliability is likely in the near future, allowing for our programs to run more accurately.

Submitted 21 October, 2020; originally announced October 2020.

Comments: 18 pages

arXiv:2010.07018 [pdf]

doi 10.6531/JFS.202109_26(1).0005

Singularity and Coordination Problems: Pandemic Lessons from 2020

Authors: Nicholas Kluge Corrêa, Nythamar De Oliveira

Abstract: Are there any indications that a Technological Singularity may be on the horizon? In trying to answer these questions, the authors made a small introduction to the area of safety research in artificial intelligence. The authors review some of the current paradigms in the development of autonomous intelligent systems, searching for evidence that may indicate the coming of a possible Technological Singularity. Finally, the authors present a reflection using the COVID-19 pandemic, something that showed that global society's biggest problem in managing existential risks is its lack of coordination skills as a global society.

Submitted 1 October, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

Report number: 2021,Vol. 26(1) 61--74

Journal ref: Journal of Futures Studies, 2021

arXiv:2008.02783 [pdf]

doi 10.15448/1984-6746.2020.2.37439

Modelos dinâmicos aplicados à aprendizagem de valores em inteligência artificial

Authors: Nicholas Kluge Corrêa, Nythamar De Oliveira

Abstract: Experts in Artificial Intelligence (AI) development predict that advances in the development of intelligent systems and agents will reshape vital areas in our society. Nevertheless, if such an advance is not made prudently and critically, reflexively, it can result in negative outcomes for humanity. For this reason, several researchers in the area have developed a robust, beneficial, and safe concept of AI for the preservation of humanity and the environment. Currently, several of the open problems in the field of AI research arise from the difficulty of avoiding unwanted behaviors of intelligent agents and systems, and at the same time specifying what we really want such systems to do, especially when we look for the possibility of intelligent agents acting in several domains over the long term. It is of utmost importance that artificial intelligent agents have their values aligned with human values, given the fact that we cannot expect an AI to develop human moral values simply because of its intelligence, as discussed in the Orthogonality Thesis. Perhaps this difficulty comes from the way we are addressing the problem of expressing objectives, values, and ends, using representational cognitive methods. A solution to this problem would be the dynamic approach proposed by Dreyfus, whose phenomenological philosophy shows that the human experience of being-in-the-world in several aspects is not well represented by the symbolic or connectionist cognitive method, especially in regards to the question of learning values. A possible approach to this problem would be to use theoretical models such as SED (situated embodied dynamics) to address the values learning problem in AI.

Submitted 29 July, 2020; originally announced August 2020.

Comments: in Portuguese

Journal ref: Veritas 65(2):1-15 (2020)

arXiv:2007.16200 [pdf, other]

Quantum One-class Classification With a Distance-based Classifier

Authors: Nicolas M. de Oliveira, Lucas P. de Albuquerque, Wilson R. de Oliveira, Teresa B. Ludermir, Adenilton J. da Silva

Abstract: The advancement of technology in Quantum Computing has brought possibilities for the execution of algorithms in real quantum devices. However, the existing errors in the current quantum hardware and the low number of available qubits make it necessary to use solutions that use fewer qubits and fewer operations, mitigating such obstacles. Hadamard Classifier (HC) is a distance-based quantum machine learning model for pattern recognition. We present a new classifier based on HC named Quantum One-class Classifier (QOCC) that consists of a minimal quantum machine learning model with fewer operations and qubits, thus being able to mitigate errors from NISQ (Noisy Intermediate-Scale Quantum) computers. Experimental results were obtained by running the proposed classifier on a quantum device and show that QOCC has advantages over HC.

Submitted 6 May, 2021; v1 submitted 31 July, 2020; originally announced July 2020.

Comments: Accepted for publication in The International Joint Conference on Neural Networks (IJCNN), 2021

arXiv:2007.04477 [pdf]

doi 10.47289/AIEJ20210716-2

Good AI for the Present of Humanity Democratizing AI Governance

Authors: Nicholas Kluge Corrêa, Nythamar de Oliveira

Abstract: What do Cyberpunk and AI Ethics have to do with each other? Cyberpunk is a sub-genre of science fiction that explores the post-human relationships between human experience and technology. One similarity between AI Ethics and Cyberpunk literature is that both seek to explore future social and ethical problems that our technological advances may bring upon society. In recent years, an increasing number of ethical matters involving AI have been pointed and debated, and several ethical principles and guides have been suggested as governance policies for the tech industry. However, would this be the role of AI Ethics? To serve as a soft and ambiguous version of the law? We would like to advocate in this article for a more Cyberpunk way of doing AI Ethics, with a more democratic way of governance. In this study, we will seek to expose some of the deficits of the underlying power structures of the AI industry, and suggest that AI governance be subject to public opinion, so that good AI can become good AI for all.

Submitted 16 August, 2021; v1 submitted 8 July, 2020; originally announced July 2020.

Journal ref: The AI Ethics Journal (2021)

arXiv:2005.05538 [pdf]

doi 10.6394/aoristo.v2i4.27982

Dynamic Cognition Applied to Value Learning in Artificial Intelligence

Authors: Nythamar de Oliveira, Nicholas Kluge Corrêa

Abstract: Experts in Artificial Intelligence (AI) development predict that advances in the development of intelligent systems and agents will reshape vital areas in our society. Nevertheless, if such an advance isn't done with prudence, it can result in negative outcomes for humanity. For this reason, several researchers in the area are trying to develop a robust, beneficial, and safe concept of artificial intelligence. Currently, several of the open problems in the field of AI research arise from the difficulty of avoiding unwanted behaviors of intelligent agents, and at the same time specifying what we want such systems to do. It is of utmost importance that artificial intelligent agents have their values aligned with human values, given the fact that we cannot expect an AI to develop our moral preferences simply because of its intelligence, as discussed in the Orthogonality Thesis. Perhaps this difficulty comes from the way we are addressing the problem of expressing objectives, values, and ends, using representational cognitive methods. A solution to this problem would be the dynamic cognitive approach proposed by Dreyfus, whose phenomenological philosophy defends that the human experience of being-in-the-world cannot be represented by the symbolic or connectionist cognitive methods. A possible approach to this problem would be to use theoretical models such as SED (situated embodied dynamics) to address the values learning problem in AI.

Submitted 23 August, 2021; v1 submitted 11 May, 2020; originally announced May 2020.

Journal ref: Aoristo - International Journal of Phenomenology, Hermeneutics and Metaphysics (2021)

arXiv:2001.10063 [pdf, other]

Towards Open-Set Semantic Segmentation of Aerial Images

Authors: Caio C. V. da Silva, Keiller Nogueira, Hugo N. Oliveira, Jefersson A. dos Santos

Abstract: Classical and more recently deep computer vision methods are optimized for visible spectrum images, commonly encoded in grayscale or RGB colorspaces acquired from smartphones or cameras. A more uncommon source of images exploited in the remote sensing field are satellite and aerial images. However, the development of pattern recognition approaches for these data is relatively recent, mainly due to the limited availability of this type of images, as until recently they were used exclusively for military purposes. Access to aerial imagery, including spectral information, has been increasing mainly due to the low cost of drones, cheapening of imaging satellite launch costs, and novel public datasets. Usually remote sensing applications employ computer vision techniques strictly modeled for classification tasks in closed set scenarios. However, real-world tasks rarely fit into closed set contexts, frequently presenting previously unknown classes, characterizing them as open set scenarios. Focusing on this problem, this is the first paper to study and develop semantic segmentation techniques for open set scenarios applied to remote sensing images. The main contributions of this paper are: 1) a discussion of related works in open set semantic segmentation, showing evidence that these techniques can be adapted for open set remote sensing tasks; 2) the development and evaluation of a novel approach for open set semantic segmentation. Our method yielded competitive results when compared to closed set methods for the same dataset.

Submitted 27 January, 2020; originally announced January 2020.

arXiv:1904.10370 [pdf]

A survey on Big Data and Machine Learning for Chemistry

Authors: Jose F Rodrigues Jr, Larisa Florea, Maria C F de Oliveira, Dermot Diamond, Osvaldo N Oliveira Jr

Abstract: Herein we review aspects of leading-edge research and innovation in chemistry which exploits big data and machine learning (ML), two computer science fields that combine to yield machine intelligence. ML can accelerate the solution of intricate chemical problems and even solve problems that otherwise would not be tractable. But the potential benefits of ML come at the cost of big data production; that is, the algorithms, in order to learn, demand large volumes of data of various natures and from different sources, from materials properties to sensor data. In the survey, we propose a roadmap for future developments, with emphasis on materials discovery and chemical sensing, and within the context of the Internet of Things (IoT), both prominent research fields for ML in the context of big data. In addition to providing an overview of recent advances, we elaborate upon the conceptual and practical limitations of big data and ML applied to chemistry, outlining processes, discussing pitfalls, and reviewing cases of success and failure.

Submitted 23 April, 2019; originally announced April 2019.

MSC Class: 74Exx; 74Fxx; 97Rxx

arXiv:1809.00641 [pdf, other]

Typed Linear Algebra for Efficient Analytical Querying

Authors: João M. Afonso, Gabriel D. Fernandes, João P. Fernandes, Filipe Oliveira, Bruno M. Ribeiro, Rogério Pontes, José N. Oliveira, Alberto J. Proença

Abstract: This paper uses typed linear algebra (LA) to represent data and perform analytical querying in a single, unified framework. The typed approach offers strong type checking (as in modern programming languages) and a diagrammatic way of expressing queries (paths in LA diagrams). A kernel of LA operators has been implemented so that paths extracted from LA diagrams can be executed. The approach is validated and evaluated taking TPC-H benchmark queries as reference. The performance of the LA-based approach is compared with popular database competitors (PostgreSQL and MySQL).

Submitted 3 September, 2018; originally announced September 2018.

arXiv:1709.09013 [pdf, ps, other]

Programming from Metaphorisms

Authors: J. N. Oliveira

Abstract: This paper presents a study of the metaphorism pattern of relational specification, showing how it can be refined into recursive programs. Metaphorisms express input-output relationships which preserve relevant information while at the same time some intended optimization takes place. Text processing, sorting, representation changers, etc., are examples of metaphorisms. The kind of metaphorism refinement studied in this paper is a strategy known as change of virtual data structure. By framing metaphorisms in the class of (inductive) regular relations, sufficient conditions are given for such implementations to be calculated using relation algebra. The strategy is illustrated with examples including the derivation of the quicksort and mergesort algorithms, showing what they have in common and what makes them different from the very start of development.

Submitted 19 September, 2017; originally announced September 2017.

arXiv:1705.04187 [pdf, other]

doi 10.1016/j.physa.2017.12.054

On the role of words in the network structure of texts: application to authorship attribution

Authors: Camilo Akimushkin, Diego R. Amancio, Osvaldo N. Oliveira Jr

Abstract: Well-established automatic analyses of texts mainly consider frequencies of linguistic units, e.g. letters, words and bigrams, while methods based on co-occurrence networks consider the structure of texts regardless of the nodes label (i.e. the words semantics). In this paper, we reconcile these distinct viewpoints by introducing a generalized similarity measure to compare texts which accounts for both the network structure of texts and the role of individual words in the networks. We use the similarity measure for authorship attribution of three collections of books, each composed of 8 authors and 10 books per author. High accuracy rates were obtained with typical values from 90% to 98.75%, much higher than with the traditional TF-IDF approach for the same collections. These accuracies are also higher than taking only the topology of networks into account. We conclude that the different properties of specific words on the macroscopic scale structure of a whole text are as relevant as their frequency of appearance; conversely, considering the identity of nodes brings further knowledge about a piece of text represented as a network.

Submitted 11 May, 2017; originally announced May 2017.

Journal ref: Physica A v. 495, p. 49-58, 2018

arXiv:1704.08088 [pdf, other]

Enriching Complex Networks with Word Embeddings for Detecting Mild Cognitive Impairment from Speech Transcripts

Authors: Leandro B. dos Santos, Edilson A. Corrêa Jr, Osvaldo N. Oliveira Jr, Diego R. Amancio, Letícia L. Mansur, Sandra M. Aluísio

Abstract: Mild Cognitive Impairment (MCI) is a mental disorder difficult to diagnose. Linguistic features, mainly from parsers, have been used to detect MCI, but this is not suitable for large-scale assessments. MCI disfluencies produce non-grammatical speech that requires manual or high precision automatic correction of transcripts. In this paper, we modeled transcripts into complex networks and enriched them with word embedding (CNE) to better represent short texts produced in neuropsychological assessments. The network measurements were applied with well-known classifiers to automatically identify MCI in transcripts, in a binary classification task. A comparison was made with the performance of traditional approaches using Bag of Words (BoW) and linguistic features for three datasets: DementiaBank in English, and Cinderella and Arizona-Battery in Portuguese. Overall, CNE provided higher accuracy than using only complex networks, while Support Vector Machine was superior to other classifiers. CNE provided the highest accuracies for DementiaBank and Cinderella, but BoW was more efficient for the Arizona-Battery dataset probably owing to its short narratives. The approach using linguistic features yielded higher accuracy if the transcriptions of the Cinderella dataset were manually revised. Taken together, the results indicate that complex networks enriched with embedding are promising for detecting MCI in large-scale assessments.

Submitted 26 April, 2017; originally announced April 2017.

Comments: Published in Annual Meeting of the Association for Computational Linguistics 2017

arXiv:1610.08201 [pdf, ps, other]

doi 10.4204/EPTCS.228.5

An Enhanced Model for Stochastic Coordination

Authors: Nuno Oliveira, Luis Soares Barbosa

Abstract: Applications developed over the cloud coordinate several, often anonymous, computational resources, distributed over different execution nodes, within flexible architectures. Coordination models able to represent quantitative data provide a powerful basis for their analysis and validation. This paper extends IMCreo, a semantic model for Stochastic reo based on interactive Markov chains, to enhance its scalability, by regarding each channel and node, as well as interface components, as independent stochastic processes that may (or may not) synchronise with the rest of the coordination circuit.

Submitted 26 October, 2016; originally announced October 2016.

Comments: In Proceedings iFMCloud 2016, arXiv:1610.07700

Journal ref: EPTCS 228, 2016, pp. 35-45

arXiv:1608.01965 [pdf, other]

doi 10.1371/journal.pone.0170527

Text authorship identified using the dynamics of word co-occurrence networks

Authors: Camilo Akimushkin, Diego R. Amancio, Osvaldo N. Oliveira Jr

Abstract: The identification of authorship in disputed documents still requires human expertise, which is now unfeasible for many tasks owing to the large volumes of text and authors in practical applications. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with equal number of linguistic tokens, from which time series were created for 12 topological metrics. The series were proven to be stationary (p-value>0.05), which permits the use of distribution moments as learning attributes. With an optimized supervised learning procedure using a Radial Basis Function Network, 68 out of 80 texts were correctly classified, i.e. a remarkable 85% author matching success rate. Therefore, fluctuations in purely dynamic network metrics were found to characterize authorship, thus opening the way for the description of texts in terms of small evolving networks. Moreover, the approach introduced allows for comparison of texts with diverse characteristics in a simple, fast fashion.

Submitted 29 July, 2016; originally announced August 2016.

Journal ref: PLoS ONE 12(1): e0170527, 2017
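
Illustrative note (not part of the arXiv record): the abstract above describes splitting a text into sections with an equal number of tokens and tracking topological metrics of the word co-occurrence network of each section. The sketch below is only a rough illustration of that idea. It assumes that two words co-occur when they are adjacent, that tokens are obtained by whitespace splitting, and that mean degree stands in for the 12 metrics used in the paper; none of these choices is stated in the abstract, and the function names are hypothetical.

```python
from collections import defaultdict


def cooccurrence_edges(tokens):
    """Count directed edges between adjacent, distinct tokens (assumed linking rule)."""
    edges = defaultdict(int)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            edges[(left, right)] += 1
    return dict(edges)


def degrees(edges):
    """Number of distinct neighbours of each word, ignoring edge direction."""
    neighbours = defaultdict(set)
    for left, right in edges:
        neighbours[left].add(right)
        neighbours[right].add(left)
    return {word: len(adjacent) for word, adjacent in neighbours.items()}


def mean_degree_series(text, section_len):
    """Mean degree of the co-occurrence network of each full-length section."""
    tokens = text.lower().split()
    series = []
    for start in range(0, len(tokens) - section_len + 1, section_len):
        section = tokens[start:start + section_len]
        deg = degrees(cooccurrence_edges(section))
        series.append(sum(deg.values()) / len(deg) if deg else 0.0)
    return series
```

Moments of such a series (mean, variance and so on) would then play the role of the learning attributes mentioned in the abstract; the classifier itself is not sketched here.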

arXiv:1510.03004 [pdf]

Assessing the Value of Peer-Produced Information for Exploratory Search

Authors: Elizeu Santos-Neto, Flavio Figueiredo, Nigini Oliveira, Nazareno Andrade, Jussara Almeida, Matei Ripeanu

Abstract: Tagging is a popular feature that supports several collaborative tasks, including search, as tags produced by one user can help others find relevant content. However, task performance depends on the existence of 'good' tags. A first step towards creating incentives for users to produce 'good' tags is the quantification of their value in the first place. This work fills this gap by combining qualitative and quantitative research methods. In particular, using contextual interviews, we first determine aspects that influence users' perception of tags' value for exploratory search. Next, we formalize some of the identified aspects and propose an information-theoretical method with provable properties that quantifies the two most important aspects (according to the qualitative analysis) that influence the perception of tag value: the ability of a tag to reduce the search space while retrieving relevant items to the user. The evaluation on real data shows that our method is accurate: tags that users consider more important have higher value than tags in which users have not expressed interest.

Submitted 10 October, 2015; originally announced October 2015.

Comments: 12 pages

arXiv:1506.05690 [pdf, other]

doi 10.1016/j.joi.2016.03.008

Using network science and text analytics to produce surveys in a scientific topic

Authors: Filipi N. Silva, Diego R. Amancio, Maria Bardosova, Osvaldo N. Oliveira Jr., Luciano da F. Costa

Abstract: The use of science to understand its own structure is becoming popular, but understanding the organization of knowledge areas is still limited because some patterns are only discoverable with proper computational treatment of large-scale datasets. In this paper, we introduce a network-based methodology combined with text analytics to construct the taxonomy of science fields. The methodology is illustrated with application to two topics: complex networks (CN) and photonic crystals (PC). We built citation networks using data from the Web of Science and used a community detection algorithm for partitioning to obtain science maps of the fields considered. We also created an importance index for text analytics in order to obtain keywords that define the communities. A dendrogram of the relatedness among the subtopics was also obtained. Among the interesting patterns that emerged from the analysis, we highlight the identification of two well-defined communities in the PC area, which is consistent with the known existence of two distinct communities of researchers in the area: telecommunication engineers and physicists. With the methodology, it was also possible to assess the interdisciplinarity and time evolution of subtopics defined by the keywords. The automatic tools described here are potentially useful not only to provide an overview of scientific areas but also to assist scientists in performing systematic research on a specific topic.

Submitted 16 March, 2016; v1 submitted 18 June, 2015; originally announced June 2015.

Journal ref: Journal of Informetrics 10 (2016) pp. 487-502

arXiv:1412.6853 [pdf, other]

Musical elements in the discrete-time representation of sound

Authors: Renato Fabbri, Vilson Vieira da Silva Junior, Antônio Carlos Silvano Pessotti, Débora Cristina Corrêa, Osvaldo N. Oliveira Jr

Abstract: The representation of basic elements of music in terms of discrete audio signals is often used in software for musical creation and design. Nevertheless, there is no unified approach that relates these elements to the discrete samples of digitized sound. In this article, each musical element is related by equations and algorithms to the discrete-time samples of sounds, and each of these relations is implemented in scripts within a software toolbox, referred to as MASS (Music and Audio in Sample Sequences). The fundamental element, the musical note with duration, volume, pitch and timbre, is related quantitatively to characteristics of the digital signal. Internal variations of a note, such as tremolos, vibratos and spectral fluctuations, are also considered, which enables the synthesis of notes inspired by real instruments and new sonorities. With this representation of notes, resources are provided for the generation of higher scale musical structures, such as rhythmic meter, pitch intervals and cycles. This framework enables precise and trustworthy scientific experiments and data sonification, and is useful for education and art. The efficacy of MASS is confirmed by the synthesis of small musical pieces using basic notes, elaborated notes and notes in music, which reflects the organization of the toolbox and thus of this article. It is possible to synthesize whole albums through collage of the scripts and settings specified by the user. With the open source paradigm, the toolbox can be promptly scrutinized, expanded in co-authorship processes and used with freedom by musicians, engineers and other interested parties. In fact, MASS has already been employed for diverse purposes which include music production, artistic presentations, psychoacoustic experiments and computer language diffusion where the appeal of audiovisual artifacts is exploited for education.

Submitted 26 October, 2017; v1 submitted 21 December, 2014; originally announced December 2014.

Comments: A software toolbox, a Python Package, musical pieces and further documents are in: https://github.com/ttm/mass

arXiv:1403.4513 [pdf, other]

doi 10.1088/1742-5468/2012/08/P08010

A quantitative approach to evolution of music and philosophy

Authors: Vilson Vieira, Renato Fabbri, Gonzalo Travieso, Osvaldo N. Oliveira Jr., Luciano da Fontoura Costa

Abstract: The development of new statistical and computational methods is increasingly making it possible to bridge the gap between hard sciences and humanities. In this study, we propose an approach based on a quantitative evaluation of attributes of objects in fields of humanities, from which concepts such as dialectics and opposition are formally defined mathematically. As case studies, we analyzed the temporal evolution of classical music and philosophy by obtaining data for 8 features characterizing the corresponding fields for 7 well-known composers and philosophers, which were treated with multivariate statistics and pattern recognition methods. A bootstrap method was applied to avoid statistical bias caused by the small sample data set, with which hundreds of artificial composers and philosophers were generated, influenced by the 7 names originally chosen. Upon defining indices for opposition, skewness and counter-dialectics, we confirmed the intuitive analysis of historians in that classical music evolved according to a master-apprentice tradition, while in philosophy changes were driven by opposition. Though these case studies were meant only to show the possibility of treating phenomena in humanities quantitatively, including a quantitative measure of concepts such as dialectics and opposition, the results are encouraging for further application of the approach presented here to many other areas, since it is entirely generic.

Submitted 13 November, 2013; originally announced March 2014.

Comments: arXiv admin note: substantial text overlap with arXiv:1109.4653

MSC Class: 62A01

Journal ref: J. Stat. Mech. (2012) P08010

arXiv:1312.4818 [pdf, ps, other]

Typing linear algebra: A biproduct-oriented approach

Authors: Hugo Daniel Macedo, José N. Oliveira

Abstract: Interested in formalizing the generation of fast running code for linear algebra applications, the authors show how an index-free, calculational approach to matrix algebra can be developed by regarding matrices as morphisms of a category with biproducts. This shifts the traditional view of matrices as indexed structures to a type-level perspective analogous to that of the pointfree algebra of programming. The derivation of fusion, cancellation and abide laws from the biproduct equations makes it easy to calculate algorithms implementing matrix multiplication, the central operation of matrix algebra, ranging from its divide-and-conquer version to its vectorization implementation. From errant attempts to learn how particular products and coproducts emerge from biproducts, not only blocked matrix algebra is rediscovered but also a way of extending other operations (e.g. Gaussian elimination) blockwise, in a calculational style, is found. The prospect of building biproduct-based type checkers for computer algebra systems such as Matlab™ is also considered.

Submitted 17 December, 2013; originally announced December 2013.

Comments: Science of Computer Programming (2013)

arXiv:1311.3687 [pdf, ps, other]

Calculating risk in functional programming

Authors: Daniel Murta, Jose Nuno Oliveira

Abstract: In the trend towards tolerating hardware unreliability, accuracy is exchanged for cost savings. Running on less reliable machines, "functionally correct" code becomes risky and one needs to know how risk propagates so as to mitigate it. Risk estimation, however, seems to live outside the average programmer's technical competence and core practice. In this paper we propose that risk be constructively handled in functional programming by (a) writing programs which may choose between expected and faulty behaviour, and by (b) reasoning about them in a linear algebra extension to the standard, à la Bird-de Moor, algebra of programming. In particular, the propagation of faults across standard program transformation techniques known as tupling and fusion is calculated, enabling the fault of the whole to be expressed in terms of the faults of its parts.

Submitted 14 November, 2013; originally announced November 2013.

arXiv:1311.1266 [pdf, other]

doi 10.1007/s11192-014-1381-9

Topological-collaborative approach for disambiguating authors' names in collaborative networks

Authors: Diego R. Amancio, Osvaldo N. Oliveira Jr., Luciano da F. Costa

Abstract: Concepts and methods of complex networks have been employed to uncover patterns in a myriad of complex systems. Unfortunately, the relevance and significance of these patterns strongly depend on the reliability of the data sets. In the study of collaboration networks, for instance, unavoidable noise pervading authors' collaboration datasets arises when authors share the same name. To address this problem, we derive a hybrid approach based on authors' collaboration patterns and on topological features of collaborative networks. Our results show that the combination of strategies, in most cases, performs better than the traditional approach which disregards topological features. We also show that the main factor for improving the discriminability of homonymous authors is the average distance between authors. Finally, we show that it is possible to predict the weighting associated to each strategy compounding the hybrid system by examining the discrimination obtained from the traditional analysis of collaboration patterns. Since the methodology devised here is generic, our approach is potentially useful to classify many other networked systems governed by complex interactions.

Submitted 30 June, 2014; v1 submitted 5 November, 2013; originally announced November 2013.

Comments: To appear in Scientometrics, 2014

Journal ref: Scientometrics 102 (1), 465--485, 2015

arXiv:1310.7769 [pdf, other]

doi 10.1016/j.physa.2017.04.109

Temporal stability in human interaction networks

Authors: Renato Fabbri, Ricardo Fabbri, Deborah C. Antunes, Marilia M. Pisani, Osvaldo N. Oliveira Jr

Abstract: This paper reports on stable (or invariant) properties of human interaction networks, with benchmarks derived from public email lists. Activity, recognized through messages sent, along time and topology were observed in snapshots in a timeline, and at different scales. Our analysis shows that activity is practically the same for all networks across timescales ranging from seconds to months. The principal components of the participants in the topological metrics space remain practically unchanged as different sets of messages are considered. The activity of participants follows the expected scale-free trace, thus yielding the hub, intermediary and peripheral classes of vertices by comparison against the Erdős-Rényi model. The relative sizes of these three sectors are essentially the same for all email lists and the same over time. Typically, $<15\%$ of the vertices are hubs, $15$-$45\%$ are intermediary and $>45\%$ are peripheral vertices. Similar results for the distribution of participants in the three sectors and for the relative importance of the topological metrics were obtained for 12 additional networks from Facebook, Twitter and ParticipaBR. These properties are consistent with the literature and may be general for human interaction networks, which has important implications for establishing a typology of participants based on quantitative criteria.

Submitted 28 October, 2017; v1 submitted 29 October, 2013; originally announced October 2013.

Comments: See ancillary Supporting Information PDF file for further tables and figures. More information on code and further files can be found at https://github.com/ttm/articleStabilityInteractionNetworks

ACM Class: I.5.3

Journal ref: Physica A: Statistical Mechanics and its Applications, Volume 486, 15 November 2017, Pages 92-105

arXiv:1309.0040 [pdf, ps, other]

doi 10.1103/PhysRevLett.112.148701

Enhanced Flow in Small-World Networks

Authors: Cláudio L. N. Oliveira, Pablo A. Morais, André A. Moreira, José S. Andrade Jr

Abstract: The small-world property is known to have a profound effect on the navigation efficiency of complex networks [J. M. Kleinberg, Nature 406, 845 (2000)]. Accordingly, the proper addition of shortcuts to a regular substrate can lead to the formation of a highly efficient structure for information propagation. Here we show that enhanced flow properties can also be observed in these complex topologies. Precisely, our model is a network built from an underlying regular lattice over which long-range connections are randomly added according to the probability, $P_{ij}\sim r_{ij}^{-\alpha}$, where $r_{ij}$ is the Manhattan distance between nodes $i$ and $j$, and the exponent $\alpha$ is a controlling parameter. The mean two-point global conductance of the system is computed by considering that each link has a local conductance given by $g_{ij}\propto r_{ij}^{-\delta}$, where $\delta$ determines the extent of the geographical limitations (costs) on the long-range connections. Our results show that the best flow conditions are obtained for $\delta=0$ with $\alpha=0$, while for $\delta\gg 1$ the overall conductance always increases with $\alpha$. For $\delta\approx 1$, $\alpha=d$ becomes the optimal exponent, where $d$ is the topological dimension of the substrate. Interestingly, this exponent is identical to the one obtained for optimal navigation in small-world networks using decentralized algorithms.

Submitted 30 August, 2013; originally announced September 2013.
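
Illustrative note (not part of the arXiv record): the abstract above adds long-range connections with probability $P_{ij}\sim r_{ij}^{-\alpha}$, where $r_{ij}$ is the Manhattan distance. The sketch below shows one way such a shortcut could be sampled for a single source node. It assumes a two-dimensional square lattice without periodic boundaries and normalises the weights over all other nodes; the lattice details, the function names and the normalisation are assumptions rather than the authors' procedure, and the conductance calculation is not attempted.

```python
import random


def manhattan(a, b):
    """Manhattan distance between two lattice sites given as (x, y) tuples."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def shortcut_weights(source, size, alpha):
    """Unnormalised weights r^(-alpha) for every other site of a size x size lattice."""
    weights = {}
    for x in range(size):
        for y in range(size):
            node = (x, y)
            if node != source:
                weights[node] = manhattan(source, node) ** (-alpha)
    return weights


def sample_shortcut(source, size, alpha, rng):
    """Draw the far end of one long-range link with probability proportional to r^(-alpha)."""
    weights = shortcut_weights(source, size, alpha)
    nodes = list(weights)
    return rng.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0]


# Usage: one shortcut from the corner of a 10 x 10 lattice with alpha = 2.
target = sample_shortcut((0, 0), 10, 2, random.Random(0))
```

With alpha = 0 every site is equally likely, and larger alpha concentrates shortcuts on nearby sites, which is the trade-off the abstract explores.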

arXiv:1308.6295 [pdf, ps, other]

doi 10.1088/1742-5468/2015/03/P03003

Robustness of community structure to node removal

Authors: Diego R. Amancio, Osvaldo N. Oliveira Jr., Luciano da F. Costa

Abstract: The identification of modular structures is essential for characterizing real networks formed by a mesoscopic level of organization where clusters contain nodes with a high internal degree of connectivity. Many methods have been developed to unveil community structures, but only a few studies have probed their suitability in incomplete networks. Here we assess the accuracy of community detection techniques in incomplete networks generated in sampling processes. We show that the walktrap and fast greedy algorithms are highly accurate for detecting the modular structure of incomplete complex networks even if many of their nodes are removed. Furthermore, we implemented an approach that improved the time performance of the walktrap and fast greedy algorithms, while retaining the accuracy rate in identifying the community membership of nodes. Taken together, our results show that this new approach can be applied to speed up virtually any community detection method in dense complex networks, as is the case for similarity networks.

Submitted 6 February, 2015; v1 submitted 28 August, 2013; originally announced August 2013.

Journal ref: J. Stat. Mech. (2015) P03003

arXiv:1303.0350 [pdf, ps, other]

doi 10.1016/j.physa.2012.04.011

Structure-semantics interplay in complex networks and its effects on the predictability of similarity in texts

Authors: Diego R. Amancio, Osvaldo N. Oliveira Jr., Luciano da F. Costa

Abstract: There are different ways to define similarity for grouping similar texts into clusters, as the concept of similarity may depend on the purpose of the task. For instance, in topic extraction similar texts mean those within the same semantic field, whereas in author recognition stylistic features should be considered. In this study, we introduce ways to classify texts employing concepts of complex networks, which may be able to capture syntactic, semantic and even pragmatic features. The interplay between the various metrics of the complex networks is analyzed with three applications, namely identification of machine translation (MT) systems, evaluation of quality of machine translated texts and authorship recognition. We shall show that topological features of the networks representing texts can enhance the ability to identify MT systems in particular cases. For evaluating the quality of MT texts, on the other hand, high correlation was obtained with methods capable of capturing the semantics. This was expected because the gold standards used are themselves based on word co-occurrence. Notwithstanding, the Katz similarity, which involves semantics and structure in the comparison of texts, achieved the highest correlation with the NIST measurement, indicating that in some cases the combination of both approaches can improve the ability to quantify quality in MT. In authorship recognition, again the topological features were relevant in some contexts, though for the books and authors analyzed good results were obtained with semantic features as well. Because hybrid approaches encompassing semantic and topological features have not been extensively used, we believe that the methodology proposed here may be useful to enhance text classification considerably, as it combines well-established strategies.

Submitted 1 March, 2013; originally announced March 2013.

Journal ref: Physica A 391 18 4406-4419, (2012)

arXiv:1303.0347 [pdf, other]

doi 10.1371/journal.pone.0067310

Probing the statistical properties of unknown texts: application to the Voynich Manuscript

Authors: Diego R. Amancio, Eduardo G. Altmann, Diego Rybski, Osvaldo N. Oliveira Jr., Luciano da F. Costa

Abstract: While the use of statistical physics methods to analyze large corpora has been useful to unveil many patterns in texts, no comprehensive investigation has been performed investigating the properties of statistical measurements across different languages and texts. In this study we propose a framework that aims at determining if a text is compatible with a natural language and which languages are closest to it, without any knowledge of the meaning of the words. The approach is based on three types of statistical measurements, i.e. obtained from first-order statistics of word properties in a text, from the topology of complex networks representing text, and from intermittency concepts where text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese in order to quantify the dependency of the different measurements on the language and on the story being told in the book. The metrics found to be informative in distinguishing real texts from their shuffled versions include assortativity, degree and selectivity of words. As an illustration, we analyze an undeciphered medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with natural languages and incompatible with random texts. We also obtain candidates for key-words of the Voynich Manuscript which could be helpful in the effort of deciphering it. Because we were able to identify statistical measurements that are more dependent on the syntax than on the semantics, the framework may also serve for text analysis in language-dependent applications.

Submitted 1 March, 2013; originally announced March 2013.

Journal ref: PLoS ONE 8(7): e67310 (2013)

arXiv:1302.4504 [pdf, other]

doi 10.1209/0295-5075/99/48002

On the use of topological features and hierarchical characterization for disambiguating names in collaborative networks

Authors: Diego R. Amancio, Osvaldo N. Oliveira Jr., Luciano da F. Costa

Abstract: Many features of complex systems can now be unveiled by applying statistical physics methods to treat them as social networks. The power of the analysis may be limited, however, by the presence of ambiguity in names, e.g., caused by homonymy in collaborative networks. In this paper we show that the ability to distinguish between homonymous authors is enhanced when longer-distance connections are considered, rather than looking at only the immediate neighbors of a node in the collaborative network. Optimized results were obtained upon using the 3rd hierarchy in connections. Furthermore, reasonable distinction among authors could also be achieved upon using pattern recognition strategies for the data generated from the topology of the collaborative network. These results were obtained with a network from papers in the arXiv repository, into which homonymy was deliberately introduced to test the methods with a controlled, reliable dataset. In all cases, several methods of supervised and unsupervised machine learning were used, leading to the same overall results. The suitability of using deeper hierarchies and network topology was confirmed with a real database of movie actors, with the additional finding that the distinguishing ability can be further enhanced by combining topology features and long-range connections in the collaborative network.

Submitted 18 February, 2013; originally announced February 2013.

Journal ref: Europhysics Letters (2012) 99 48002

Showing 1–50 of 59 results for author: Oliveira, N