-
Learning EFSM Models with Registers in Guards
Authors:
Germán Vega,
Roland Groz,
Catherine Oriat,
Michael Foster,
Neil Walkinshaw,
Adenilso Simão
Abstract:
This paper presents an active inference method for Extended Finite State Machines, where inputs and outputs are parametrized, and transitions can be conditioned by guards involving input parameters and internal variables called registers. The method applies to (software) systems that cannot be reset, so it learns an EFSM model of the system on a single trace.
Submitted 11 June, 2024;
originally announced June 2024.
-
SPARQL Generation: an analysis on fine-tuning OpenLLaMA for Question Answering over a Life Science Knowledge Graph
Authors:
Julio C. Rangel,
Tarcisio Mendes de Farias,
Ana Claudia Sima,
Norio Kobayashi
Abstract:
The recent success of Large Language Models (LLM) in a wide range of Natural Language Processing applications opens the path towards novel Question Answering Systems over Knowledge Graphs leveraging LLMs. However, one of the main obstacles preventing their implementation is the scarcity of training data for the task of translating questions into corresponding SPARQL queries, particularly in the case of domain-specific KGs. To overcome this challenge, in this study, we evaluate several strategies for fine-tuning the OpenLlama LLM for question answering over life science knowledge graphs. In particular, we propose an end-to-end data augmentation approach for extending a set of existing queries over a given knowledge graph towards a larger dataset of semantically enriched question-to-SPARQL query pairs, enabling fine-tuning even for datasets where these pairs are scarce. In this context, we also investigate the role of semantic "clues" in the queries, such as meaningful variable names and inline comments. Finally, we evaluate our approach over the real-world Bgee gene expression knowledge graph and we show that semantic clues can improve model performance by up to 33% compared to a baseline with random variable names and no comments included.
Submitted 7 February, 2024;
originally announced February 2024.
-
Human Protein Protein Interaction Networks: A Topological Comparison Review
Authors:
Rodrigo Henrique Ramos,
Cynthia de Oliveira Lage Ferreira,
Adenilso Simao
Abstract:
Protein-Protein Interaction Networks aim to model the interactome, providing a powerful tool for understanding the complex relationships governing cellular processes. These networks have numerous applications, including functional enrichment, discovering cancer driver genes, identifying drug targets, and more. Various databases make protein-protein networks available for many species, including Homo sapiens. This work topologically compares four Homo sapiens networks using a coarse-to-fine approach, comparing global characteristics, sub-network topology, specific nodes centrality, and interaction significance. Results show that the four human protein networks share many common protein-encoding genes and some global measures, but significantly differ in the interactions and neighbourhood. Small sub-networks from cancer pathways performed better than the whole networks, indicating an improved topological consistency in functional pathways. The centrality analysis shows that the same genes play different roles in different networks. We discuss how studies and analyses that rely on protein-protein networks for humans should consider their similarities and distinctions.
Submitted 3 October, 2023;
originally announced October 2023.
-
On the Potential of Artificial Intelligence Chatbots for Data Exploration of Federated Bioinformatics Knowledge Graphs
Authors:
Ana-Claudia Sima,
Tarcisio Mendes de Farias
Abstract:
In this paper, we present work in progress on the role of artificial intelligence (AI) chatbots, such as ChatGPT, in facilitating data access to federated knowledge graphs. In particular, we provide examples from the field of bioinformatics, to illustrate the potential use of Conversational AI to describe datasets, as well as generate and explain (federated) queries across datasets for the benefit of domain experts.
Submitted 20 April, 2023;
originally announced April 2023.
-
Online Recognition of Incomplete Gesture Data to Interface Collaborative Robots
Authors:
M. A. Simão,
O. Gibaru,
P. Neto
Abstract:
Online recognition of gestures is critical for intuitive human-robot interaction (HRI) and for further pushing collaborative robotics into the market, making robots accessible to more people. The problem is that it is difficult to achieve accurate gesture recognition in real unstructured environments, often using distorted and incomplete multisensory data. This paper introduces an HRI framework to classify large vocabularies of interwoven static gestures (SGs) and dynamic gestures (DGs) captured with wearable sensors. DG features are obtained by applying data dimensionality reduction to raw data from sensors (resampling with cubic interpolation and principal component analysis). Experimental tests were conducted using the UC2017 hand gesture dataset with samples from eight different subjects. The classification models show an accuracy of 95.6% for a library of 24 SGs with a random forest and 99.3% for 10 DGs using artificial neural networks. These results compare equally or favorably with different commonly used classifiers. Long short-term memory deep networks achieved similar performance in online frame-by-frame classification using raw incomplete data, performing better in terms of accuracy than static models with specially crafted features, but worse in training and inference time. The recognized gestures are used to teleoperate a robot in a collaborative process that consists of preparing a breakfast meal.
Submitted 13 April, 2023;
originally announced April 2023.
-
Federating and querying heterogeneous and distributed Web APIs and triple stores
Authors:
Tarcisio Mendes de Farias,
Christophe Dessimoz,
Aaron Ayllon Benitez,
Chen Yang,
Jiao Long,
Ana-Claudia Sima
Abstract:
Today's international corporations such as BASF, a leading company in the crop protection industry, produce and consume more and more data that are often fragmented and accessible through Web APIs. In addition, part of the proprietary and public data of BASF's interest are stored in triple stores and accessible with the SPARQL query language. Homogenizing the data access modes and the underlying semantics of the data without modifying or replicating the original data sources become important requirements to achieve data integration and interoperability. In this work, we propose a federated data integration architecture within an industrial setup, that relies on an ontology-based data access method. Our performance evaluation in terms of query response time showed that most queries can be answered in under 1 second.
Submitted 3 June, 2022;
originally announced June 2022.
-
BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes
Authors:
Mosè Manni,
Matthew R Berkeley,
Mathieu Seppey,
Felipe A Simao,
Evgeny M Zdobnov
Abstract:
Methods for evaluating the quality of genomic and metagenomic data are essential to aid genome assembly and to correctly interpret the results of subsequent analyses. BUSCO estimates the completeness and redundancy of processed genomic data based on universal single-copy orthologs. Here we present new functionalities and major improvements of the BUSCO software, as well as the renewal and expansion of the underlying datasets in sync with the OrthoDB v10 release. Among the major novelties, BUSCO now enables phylogenetic placement of the input sequence to automatically select the most appropriate dataset for the assessment, allowing the analysis of metagenome-assembled genomes of unknown origin. A newly introduced genome workflow improves efficiency and runtimes, especially on large eukaryotic genomes. BUSCO is the only tool capable of assessing both eukaryotic and prokaryotic species, and can be applied to various data types, from genome assemblies and metagenomic bins, to transcriptomes and gene sets.
Submitted 22 June, 2021;
originally announced June 2021.
-
Bio-SODA: Enabling Natural Language Question Answering over Knowledge Graphs without Training Data
Authors:
Ana Claudia Sima,
Tarcisio Mendes de Farias,
Maria Anisimova,
Christophe Dessimoz,
Marc Robinson-Rechavi,
Erich Zbinden,
Kurt Stockinger
Abstract:
The problem of natural language processing over structured data has become a growing research field, both within the relational database and the Semantic Web community, with significant efforts involved in question answering over knowledge graphs (KGQA). However, many of these approaches are either specifically targeted at open-domain question answering using DBpedia, or require large training datasets to translate a natural language question to SPARQL in order to query the knowledge graph. Hence, these approaches often cannot be applied directly to complex scientific datasets where no prior training data is available.
In this paper, we focus on the challenges of natural language processing over knowledge graphs of scientific datasets. In particular, we introduce Bio-SODA, a natural language processing engine that does not require training data in the form of question-answer pairs for generating SPARQL queries. Bio-SODA uses a generic graph-based approach for translating user questions to a ranked list of SPARQL candidate queries. Furthermore, Bio-SODA uses a novel ranking algorithm that includes node centrality as a measure of relevance for selecting the best SPARQL candidate query. Our experiments with real-world datasets across several scientific domains, including the official bioinformatics Question Answering over Linked Data (QALD) challenge, show that Bio-SODA outperforms publicly available KGQA systems by an F1-score of at least 20% and by an even higher factor on more complex bioinformatics datasets.
Submitted 14 June, 2021; v1 submitted 28 April, 2021;
originally announced April 2021.
-
INODE: Building an End-to-End Data Exploration System in Practice [Extended Vision]
Authors:
Sihem Amer-Yahia,
Georgia Koutrika,
Frederic Bastian,
Theofilos Belmpas,
Martin Braschler,
Ursin Brunner,
Diego Calvanese,
Maximilian Fabricius,
Orest Gkini,
Catherine Kosten,
Davide Lanti,
Antonis Litke,
Hendrik Lücke-Tieke,
Francesco Alessandro Massucci,
Tarcisio Mendes de Farias,
Alessandro Mosca,
Francesco Multari,
Nikolaos Papadakis,
Dimitris Papadopoulos,
Yogendra Patil,
Aurélien Personnaz,
Guillem Rull,
Ana Sima,
Ellery Smith,
Dimitrios Skoutas
, et al. (3 additional authors not shown)
Abstract:
A full-fledged data exploration system must combine different access modalities with a powerful concept of guiding the user in the exploration process, by being reactive and anticipative both for data discovery and for data linking. Such systems are a real opportunity for our community to cater to users with different domain and data science expertise. We introduce INODE -- an end-to-end data exploration system -- that leverages, on the one hand, Machine Learning and, on the other hand, semantics for the purpose of Data Management (DM). Our vision is to develop a classic unified, comprehensive platform that provides extensive access to open datasets, and we demonstrate it in three significant use cases in the fields of Cancer Biomarker Research, Research and Innovation Policy Making, and Astrophysics. INODE offers sustainable services in (a) data modeling and linking, (b) integrated query processing using natural language, (c) guidance, and (d) data exploration through visualization, thus facilitating the user in discovering new insights. We demonstrate that our system is uniquely accessible to a wide range of users from larger scientific communities to the public. Finally, we briefly illustrate how this work paves the way for new research opportunities in DM.
Submitted 9 April, 2021;
originally announced April 2021.
-
A Systematic Study of Cross-Project Defect Prediction With Meta-Learning
Authors:
Faimison Porto,
Leandro Minku,
Emilia Mendes,
Adenilso Simao
Abstract:
The prediction of defects in a target project based on data from external projects is called Cross-Project Defect Prediction (CPDP). Several methods have been proposed to improve the predictive performance of CPDP models. However, there is a lack of comparison among state-of-the-art methods. Moreover, previous work has shown that the most suitable method for a project can vary according to the project being predicted. This makes the choice of which method to use difficult. We provide an extensive experimental comparison of 31 CPDP methods derived from state-of-the-art approaches, applied to 47 versions of 15 open source software projects. Four methods stood out as presenting the best performances across datasets. However, the most suitable among these methods still varies according to the project being predicted. Therefore, we propose and evaluate a meta-learning solution designed to automatically select and recommend the most suitable CPDP method for a project. Our results show that the meta-learning solution is able to learn from previous experiences and recommend suitable methods dynamically. When compared to the base methods, however, the proposed solution showed only a minor difference in performance. These results provide valuable knowledge about the possibilities and limitations of a meta-learning solution applied to CPDP.
Submitted 31 May, 2019; v1 submitted 16 February, 2018;
originally announced February 2018.
-
Generating Complete and Finite Test Suite for ioco: Is It Possible?
Authors:
Adenilso Simao,
Alexandre Petrenko
Abstract:
Testing from Input/Output Transition Systems has been intensely investigated. The conformance between the implementation and the specification is often determined by the so-called ioco-relation. However, generating tests for ioco is usually hindered by the problem of conflicts between inputs and outputs. Moreover, the generation is mainly based on nondeterministic methods, which may deliver complete test suites but require an unbounded number of executions. In this paper, we investigate whether it is possible to construct a finite test suite which is complete in a predefined fault domain for the classical ioco relation even in the presence of input/output conflicts. We demonstrate that it is possible under certain assumptions about the specification and implementation, by proposing a method for complete test generation, based on a traditional method developed for FSM.
Submitted 27 March, 2014;
originally announced March 2014.
-
Polarization properties of diffractively produced Λ_c^+
Authors:
Yu. Arestov,
F. R. A. Simao
Abstract:
The Pomeron-gluon-gluon interaction is considered in the QCD-based model for the charmed baryon production in the process Pomeron + p --> Λ_c^+ + X.
The polarization of the produced heavy quark is induced effectively through the non-perturbative long-range interaction with the gluon field of the type [sigma*rotA]. The x_F-dependence of Λ_c^+ polarization, P(x_F,p_T), has been studied. Its absolute value depends on the model parameter a and it appears to be sizeable in the wide range of a values: when a ranges from 0.1 to 1.0, the polarization P(x_F,p_T) varies from -0.2 to -0.5 at x_F ~ 0.5 and p_T in the interval 1 - 2 GeV/c.
Submitted 20 November, 1998;
originally announced November 1998.
-
Asymmetry studies in Lambda 0/Lambda 0-bar, Xi-/Xi+ and Omega-/Omega+ production
Authors:
J. C. Anjos,
J. Magnin,
F. R. A. Simao,
J. Solano
Abstract:
We present a study on hyperon/anti-hyperon production asymmetries in the framework of the recombination model. The production asymmetries for Lambda 0/Lambda 0-bar, Xi-/Xi+ and Omega-/Omega+ are studied as a function of x_F. Predictions of the model are compared to preliminary data on hyperon/anti-hyperon production asymmetries in 500 GeV/c pi- p interactions from the Fermilab E791 experiment. The model predicts a growing asymmetry with the number of valence quarks shared by the target and the produced hyperons in the x_F < 0 region. In the positive x_F region, the model predicts constant asymmetries for Lambda 0/Lambda 0-bar and Omega-/Omega+ production and a growing asymmetry with x_F for Xi-/Xi+. We found a qualitatively good agreement between the model predictions and data, showing that recombination is a competitive mechanism in the hadronization process.
Submitted 24 June, 1998; v1 submitted 17 June, 1998;
originally announced June 1998.
-
Hyperon production asymmetries in 500 GeV/c pion nucleus interactions
Authors:
J. Solano,
J. Magnin,
F. R. A. Simao,
E791 collaboration
Abstract:
We present a preliminary study from Fermilab experiment E791 of Lambda^0 / Lambda^0 bar, Xi^- / Xi^+ and Omega^- /Omega^+ production asymmetries from pi^- nucleus interactions at 500 GeV/c. The production asymmetries for these particles are studied as a function of x_F and pt^2. We observed an asymmetry in the target fragmentation region for Lambda^0's larger than that for Xi's, suggesting diquark effects. The asymmetry for Omega's is significantly smaller than for the other two hyperons, consistent with the fact that Omega's do not share valence quarks with either the pion or the target particle. In the beam fragmentation region, the asymmetry tends to 0.1 for both Lambda^0's and Xi's. The asymmetries vs pt^2 are approximately constant for the three strange baryons under study.
Submitted 20 November, 1997; v1 submitted 31 October, 1997;
originally announced October 1997.
-
The $Λ_0$ Polarization and the Recombination Mechanism
Authors:
G. Herrera,
J. Magnin,
Luis M. Montaño,
F. R. A. Simão
Abstract:
We use the recombination and the Thomas Precession Model to obtain a prediction for the $Λ_0$ polarization in the $p+p \to Λ_0+X$ reaction. We study the effect of the recombination function on the $Λ_0$ polarization.
Submitted 5 February, 1997;
originally announced February 1997.
-
The Charm of the Proton and the $Λ_c^{+}$ Production
Authors:
J. dos Anjos,
G. Herrera,
J. Magnin,
F. R. A. Simão
Abstract:
We propose a two component model for charmed baryon production in $pp$ collisions consisting of the conventional parton fusion mechanism and fragmentation plus quarks recombination in which a $ud$ valence diquark from the proton recombines with a $c$-sea quark to produce a $Λ_c^+$. Our two-component model is compared with the intrinsic charm two-component model and experimental data.
Submitted 5 February, 1997;
originally announced February 1997.
-
Production and polarization of $Λ_c^+$ and the charm of the proton
Authors:
J. C. Anjos,
G. Herrera,
J. Magnin,
F. R. A. Simao
Abstract:
We propose a two-component model involving the parton fusion mechanism and recombination of a $ud$ valence diquark with a sea $c$-quark of the incident proton to describe $Λ_c^+$ inclusive production in $pp$ collisions. We also study the polarization of the produced $Λ_c^+$ in the framework of the Thomas Precession Model for polarization. We show that a measurement of the $Λ_c$ polarization is a sensitive test of its production mechanism. In particular, the intrinsic charm model predicts a positive polarization for the $Λ_c$ within the framework of the Thomas Precession Model, while according to the model presented here the $Λ_c$ polarization should be negative. The measurement of the $Λ_c$ polarization provides a close examination of intrinsic charm Fock states in the proton and gives interesting information about the hadroproduction of charm.
Submitted 24 April, 1997; v1 submitted 5 February, 1997;
originally announced February 1997.