-
Lessons learned to boost a bioinformatics knowledge base reusability, the Bgee experience
Authors:
Tarcisio Mendes de Farias,
Julien Wollbrett,
Marc Robinson-Rechavi,
Frederic Bastian
Abstract:
Background: Enhancing interoperability of bioinformatics knowledge bases is a high-priority requirement to maximize data reusability and thus increase their utility, such as the return on investment for biomedical research. A knowledge base may provide useful information for life scientists and other knowledge bases, but it only acquires exchange value once the knowledge base is (re)used, and without interoperability the utility lies dormant. Results: In this article, we discuss several approaches to boost interoperability depending on the interoperable parts. The findings are driven by several real-world scenario examples that were mostly implemented by Bgee, a well-established gene expression database. To better justify that the findings are transferable, for each Bgee interoperability experience, we also highlight similar implementations by major bioinformatics knowledge bases. Moreover, we discuss ten general main lessons learnt. These lessons can be applied in the context of any bioinformatics knowledge base to foster data reusability. Conclusions: This work provides pragmatic methods and transferable skills to promote reusability of bioinformatics knowledge bases by focusing on interoperability.
Submitted 4 July, 2023; v1 submitted 22 March, 2023;
originally announced March 2023.
-
Creation and unification of development and life stage ontologies for animals
Authors:
Anne Niknejad,
Christopher J. Mungall,
David Osumi-Sutherland,
Marc Robinson-Rechavi,
Frederic B. Bastian
Abstract:
With the new era of genomics, an increasing number of animal species are amenable to large-scale data generation. This has led to the emergence of new multi-species ontologies to annotate and organize these data. While anatomy and cell types are well covered by these efforts, information regarding development and life stages is also critical in the annotation of animal data. Its lack can hamper our ability to answer comparative biology questions and to interpret functional results. We present here a collection of development and life stage ontologies for 21 animal species, and their merging into a common multi-species ontology. This work has allowed the integration and comparison of transcriptomics data in 52 animal species.
Submitted 24 June, 2022;
originally announced June 2022.
-
Bio-SODA: Enabling Natural Language Question Answering over Knowledge Graphs without Training Data
Authors:
Ana Claudia Sima,
Tarcisio Mendes de Farias,
Maria Anisimova,
Christophe Dessimoz,
Marc Robinson-Rechavi,
Erich Zbinden,
Kurt Stockinger
Abstract:
The problem of natural language processing over structured data has become a growing research field, both within the relational database and the Semantic Web community, with significant efforts involved in question answering over knowledge graphs (KGQA). However, many of these approaches are either specifically targeted at open-domain question answering using DBpedia, or require large training datasets to translate a natural language question to SPARQL in order to query the knowledge graph. Hence, these approaches often cannot be applied directly to complex scientific datasets where no prior training data is available.
In this paper, we focus on the challenges of natural language processing over knowledge graphs of scientific datasets. In particular, we introduce Bio-SODA, a natural language processing engine that does not require training data in the form of question-answer pairs for generating SPARQL queries. Bio-SODA uses a generic graph-based approach for translating user questions to a ranked list of SPARQL candidate queries. Furthermore, Bio-SODA uses a novel ranking algorithm that includes node centrality as a measure of relevance for selecting the best SPARQL candidate query. Our experiments with real-world datasets across several scientific domains, including the official bioinformatics Question Answering over Linked Data (QALD) challenge, show that Bio-SODA outperforms publicly available KGQA systems by an F1-score of at least 20% and by an even higher factor on more complex bioinformatics datasets.
Submitted 14 June, 2021; v1 submitted 28 April, 2021;
originally announced April 2021.
-
Molecular evolution and gene function
Authors:
Marc Robinson-Rechavi
Abstract:
One of the basic questions of phylogenomics is how gene function evolves, whether among species or inside gene families. In this chapter, we provide a brief overview of the problems associated with defining gene function in a manner which allows comparisons which are both large scale and evolutionarily relevant. The main source of functional data, despite its limitations, is transcriptomics. Functional data provides information on evolutionary mechanisms primarily by showing which functional classes of genes evolve under stronger or weaker purifying or adaptive selection, and on which classes of mutations (e.g., substitutions or duplications). However, the example of the "ortholog conjecture" shows that we are still not at a point where we can confidently study phylogenomically the evolution of gene function at a precise scale.
Submitted 7 October, 2019; v1 submitted 4 October, 2019;
originally announced October 2019.
-
Detecting patterns of species diversification in the presence of both rate shifts and mass extinctions
Authors:
Sacha Laurent,
Marc Robinson-Rechavi,
Nicolas Salamin
Abstract:
Recent methodological advances are enabling better examination of speciation and extinction processes and patterns. A major open question is the origin of large discrepancies in species number between groups of the same age. Existing frameworks to model this diversity either focus on changes between lineages, neglecting global effects such as mass extinctions, or focus on changes over time which would affect all lineages. Yet it seems probable that both lineage differences and mass extinctions affect the same groups. Here we used simulations to test the performance of two widely used methods, under complex scenarios. We report good performances, although with a tendency to over-predict events when increasing the complexity of the scenario. Overall, we find that lineage shifts are better detected than mass extinctions. This work has significance for assessing the methods currently used for estimating changes in diversification using phylogenies and developing new tests.
Submitted 31 August, 2015; v1 submitted 22 April, 2014;
originally announced April 2014.
-
Patterns of positive selection in seven ant genomes
Authors:
Julien Roux,
Eyal Privman,
Sebastien Moretti,
Josephine T. Daub,
Marc Robinson-Rechavi,
Laurent Keller
Abstract:
The evolution of ants is marked by remarkable adaptations that allowed the development of very complex social systems. To identify how ant-specific adaptations are associated with patterns of molecular evolution, we searched for signs of positive selection on amino-acid changes in proteins. We identified 24 functional categories of genes which were enriched for positively selected genes in the ant lineage. We also reanalyzed genome-wide datasets in bees and flies with the same methodology, to check whether positive selection was specific to ants or also present in other insects. Notably, genes implicated in immunity were enriched for positively selected genes in the three lineages, ruling out the hypothesis that the evolution of hygienic behaviors in social insects caused a major relaxation of selective pressure on immune genes. Our scan also indicated that genes implicated in neurogenesis and olfaction started to undergo increased positive selection before the evolution of sociality in Hymenoptera. Finally, the comparison between these three lineages allowed us to pinpoint molecular evolution patterns that were specific to the ant lineage. In particular, there was ant-specific recurrent positive selection on genes with mitochondrial functions, suggesting that mitochondrial activity was improved during the evolution of this lineage. This might have been an important step toward the evolution of extreme lifespan that is a hallmark of ants.
Submitted 7 May, 2014; v1 submitted 19 November, 2013;
originally announced November 2013.
-
IQRray, a new method for Affymetrix microarray quality control, and the homologous organ conservation score, a new benchmark method for quality control metrics
Authors:
Marta Rosikiewicz,
Marc Robinson-Rechavi
Abstract:
Motivation: Microarray results accumulated in public repositories are widely re-used in meta-analytical studies and secondary databases. The quality of the data obtained with this technology varies from experiment to experiment, and an efficient method for quality assessment is necessary to ensure their reliability. Results: The lack of a good benchmark has hampered evaluation of existing methods for quality control. In this study we propose a new independent quality metric that is based on evolutionary conservation of expression profiles. We show, using 11 large organ-specific datasets, that IQRray, a new quality metric developed by us, exhibits the highest correlation with this reference metric, among 14 metrics tested. IQRray outperforms other methods in identification of poor quality arrays in datasets composed of arrays from many independent experiments. In contrast, the performance of methods designed for detecting outliers in a single experiment, like NUSE and RLE, was low because of the inability of these methods to detect datasets containing only low quality arrays, and the fact that the scores cannot be directly compared between experiments. Availability: The R implementation of IQRray is available at: ftp://lausanne.isb-sib.ch/pub/databases/Bgee/general/IQRray.R
Submitted 8 October, 2013;
originally announced October 2013.
-
The hourglass and the early conservation models - co-existing evolutionary patterns in vertebrate development
Authors:
Barbara Piasecka,
Pawel Lichocki,
Sebastien Moretti,
Sven Bergmann,
Marc Robinson-Rechavi
Abstract:
Developmental constraints have been postulated to limit the space of feasible phenotypes and thus shape animal evolution. These constraints have been suggested to be the strongest during either early or mid-embryogenesis, which corresponds to the early conservation model or the hourglass model, respectively. Conflicting results have been reported, but in recent studies of animal transcriptomes the hourglass model has been favored. Studies usually report descriptive statistics calculated for all genes over all developmental time points. This introduces dependencies between the sets of compared genes, and may lead to biased results. Here we overcome this problem using an alternative modular analysis. We used the Iterative Signature Algorithm to identify distinct modules of genes co-expressed specifically in consecutive stages of zebrafish development. We then performed a detailed comparison of several gene properties between modules, allowing for a less biased and more powerful analysis. Notably, our analysis corroborated the hourglass pattern only at the regulatory level, with sequences of regulatory regions being most conserved for genes expressed in mid-development, but not at the level of gene sequence, age or expression, in contrast to some previous studies. The early conservation model was supported with gene duplication and birth that were the most rare for genes expressed in early development. Finally, for all gene properties we observed the least conservation for genes expressed in late development or adult, consistent with both models. Overall, with the modular approach, we showed that different levels of molecular evolution follow different patterns of developmental constraints. Thus both models are valid, but with respect to different genomic features.
Submitted 13 March, 2013; v1 submitted 24 October, 2012;
originally announced October 2012.
-
gcodeml: A Grid-enabled Tool for Detecting Positive Selection in Biological Evolution
Authors:
Sébastien Moretti,
Riccardo Murri,
Sergio Maffioletti,
Arnold Kuzniar,
Briséïs Castella,
Nicolas Salamin,
Marc Robinson-Rechavi,
Heinz Stockinger
Abstract:
One of the important questions in biological evolution is to know if certain changes along protein coding genes have contributed to the adaptation of species. This problem is known to be biologically complex and computationally very expensive. It, therefore, requires efficient Grid or cluster solutions to overcome the computational challenge. We have developed a Grid-enabled tool (gcodeml) that relies on the PAML (codeml) package to help analyse large phylogenetic datasets on both Grids and computational clusters. Although we report on results for gcodeml, our approach is applicable and customisable to related problems in biology or other scientific domains.
Submitted 14 March, 2012;
originally announced March 2012.
-
Developmental constraints on vertebrate genome evolution
Authors:
J. Roux,
M. Robinson-Rechavi
Abstract:
Constraints in embryonic development are thought to bias the direction of evolution by making some changes less likely, and others more likely, depending on their consequences on ontogeny. Here, we characterize the constraints acting on genome evolution in vertebrates. We used gene expression data from two vertebrates: zebrafish, using a microarray experiment spanning 14 stages of development, and mouse, using EST counts for 26 stages of development. We show that, in both species, genes expressed early in development (1) have a more dramatic effect of knock-out or mutation and (2) are more likely to revert to single copy after whole genome duplication, relative to genes expressed late. This supports high constraints on early stages of vertebrate development, making them less open to innovations (gene gain or gene loss). Results are robust to different sources of data: gene expression from microarrays, ESTs, or in situ hybridizations; and mutants from directed KO, transgenic insertions, point mutations, or morpholinos. We determine the pattern of these constraints, which differs from the model used to describe vertebrate morphological conservation ("hourglass" model). While morphological constraints reach a maximum at mid-development (the "phylotypic" stage), genomic constraints appear to decrease in a monotonic manner over developmental time.
Submitted 6 March, 2012;
originally announced March 2012.
-
Rapid divergence of the ecdysone receptor in Diptera and Lepidoptera suggests coevolution between ECR and USP-RXR
Authors:
François Bonneton,
Dominique Zelus,
Thomas Iwema,
Marc Robinson-Rechavi,
Vincent Laudet
Abstract:
Ecdysteroid hormones are major regulators in reproduction and development of insects, including larval molts and metamorphosis. The functional ecdysone receptor is a heterodimer of ECR (NR1H1) and USP-RXR (NR2B4), which is the orthologue of vertebrate retinoid X receptors (RXR alpha, beta, gamma). Both proteins belong to the superfamily of nuclear hormone receptors, ligand-dependent transcription factors that share two conserved domains: the DNA-binding domain (DBD) and the ligand-binding domain (LBD). In order to gain further insight into the evolution of metamorphosis and gene regulation by ecdysone in arthropods, we performed a phylogenetic analysis of both partners of the heterodimer ECR/USP-RXR. Overall, 38 USP-RXR and 19 ECR protein sequences, from 33 species, have been used for this analysis. Interestingly, sequence alignments and structural comparisons reveal high divergence rates, for both ECR and USP-RXR, specifically among Diptera and Lepidoptera. The most impressive differences affect the ligand-binding domain of USP-RXR. In addition, ECR sequences show variability in other domains, namely the DNA-binding and the carboxy-terminal F domains. Our data provide the first evidence that ECR and USP-RXR may have coevolved during holometabolous insect diversification, leading to a functional divergence of the ecdysone receptor. These results have general implications on fundamental aspects of insect development, evolution of nuclear receptors, and the design of specific insecticides.
Submitted 7 June, 2008;
originally announced June 2008.