Lessons learned to boost a bioinformatics knowledge base reusability, the Bgee experience

Authors: Tarcisio Mendes de Farias, Julien Wollbrett, Marc Robinson-Rechavi, Frederic Bastian

Abstract: Background: Enhancing interoperability of bioinformatics knowledge bases is a high-priority requirement to maximize data reusability, and thus increase their utility, such as the return on investment for biomedical research. A knowledge base may provide useful information for life scientists and other knowledge bases, but it only acquires exchange value once the knowledge base is (re)used, and without interoperability the utility lies dormant. Results: In this article, we discuss several approaches to boost interoperability depending on the interoperable parts. The findings are driven by several real-world scenario examples that were mostly implemented by Bgee, a well-established gene expression database. To better justify that the findings are transferable, for each Bgee interoperability experience, we also highlight similar implementations by major bioinformatics knowledge bases. Moreover, we discuss ten general main lessons learnt. These lessons can be applied in the context of any bioinformatics knowledge base to foster data reusability. Conclusions: This work provides pragmatic methods and transferable skills to promote reusability of bioinformatics knowledge bases by focusing on interoperability.

Submitted 4 July, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

arXiv:2206.12231 [pdf]

Creation and unification of development and life stage ontologies for animals

Authors: Anne Niknejad, Christopher J. Mungall, David Osumi-Sutherland, Marc Robinson-Rechavi, Frederic B. Bastian

Abstract: With the new era of genomics, an increasing number of animal species are amenable to large-scale data generation. This has led to the emergence of new multi-species ontologies to annotate and organize these data. While anatomy and cell types are well covered by these efforts, information regarding development and life stages is also critical in the annotation of animal data. Its lack can hamper our ability to answer comparative biology questions and to interpret functional results. We present here a collection of development and life stage ontologies for 21 animal species, and their merging into a common multi-species ontology. This work has allowed the integration and comparison of transcriptomics data in 52 animal species.

Submitted 24 June, 2022; originally announced June 2022.

Comments: 2 pages, 1 table, accepted at Bio-Ontologies COSI ISMB 2022 conference. AN developed species-specific ontologies, links to Uberon. CJM and DOS developed Uberon life-stage ontology, ontology design principles. DOS developed fly ontology. MRR contributed work supervision. FBB supervised the work, built the integration to Uberon, and wrote the paper. CJM, DOS and FBB maintain the repository

ACM Class: J.3; I.2.4

arXiv:2104.13744 [pdf, other]

Bio-SODA: Enabling Natural Language Question Answering over Knowledge Graphs without Training Data

Authors: Ana Claudia Sima, Tarcisio Mendes de Farias, Maria Anisimova, Christophe Dessimoz, Marc Robinson-Rechavi, Erich Zbinden, Kurt Stockinger

Abstract: The problem of natural language processing over structured data has become a growing research field, both within the relational database and the Semantic Web community, with significant efforts involved in question answering over knowledge graphs (KGQA). However, many of these approaches are either specifically targeted at open-domain question answering using DBpedia, or require large training datasets to translate a natural language question to SPARQL in order to query the knowledge graph. Hence, these approaches often cannot be applied directly to complex scientific datasets where no prior training data is available. In this paper, we focus on the challenges of natural language processing over knowledge graphs of scientific datasets. In particular, we introduce Bio-SODA, a natural language processing engine that does not require training data in the form of question-answer pairs for generating SPARQL queries. Bio-SODA uses a generic graph-based approach for translating user questions to a ranked list of SPARQL candidate queries. Furthermore, Bio-SODA uses a novel ranking algorithm that includes node centrality as a measure of relevance for selecting the best SPARQL candidate query. Our experiments with real-world datasets across several scientific domains, including the official bioinformatics Question Answering over Linked Data (QALD) challenge, show that Bio-SODA outperforms publicly available KGQA systems by an F1-score of at least 20% and by an even higher factor on more complex bioinformatics datasets.

Submitted 14 June, 2021; v1 submitted 28 April, 2021; originally announced April 2021.

Journal ref: 33rd International Conference on Scientific and Statistical Database Management (SSDBM 2021)

arXiv:1910.01940 [pdf]

Molecular evolution and gene function

Authors: Marc Robinson-Rechavi

Abstract: One of the basic questions of phylogenomics is how gene function evolves, whether among species or inside gene families. In this chapter, we provide a brief overview of the problems associated with defining gene function in a manner which allows comparisons which are both large scale and evolutionarily relevant. The main source of functional data, despite its limitations, is transcriptomics. Functional data provides information on evolutionary mechanisms primarily by showing which functional classes of genes evolve under stronger or weaker purifying or adaptive selection, and on which classes of mutations (e.g., substitutions or duplications). However, the example of the "ortholog conjecture" shows that we are still not at a point where we can confidently study phylogenomically the evolution of gene function at a precise scale.

Submitted 7 October, 2019; v1 submitted 4 October, 2019; originally announced October 2019.

Comments: To be published in book "Phylogenomics" (Nicolas Galtier, Céline Scornavacca, Frédéric Delsuc, Eds.)

arXiv:1404.5441 [pdf, other]

doi 10.1186/s12862-015-0432-z

Detecting patterns of species diversification in the presence of both rate shifts and mass extinctions

Authors: Sacha Laurent, Marc Robinson-Rechavi, Nicolas Salamin

Abstract: Recent methodological advances are enabling better examination of speciation and extinction processes and patterns. A major open question is the origin of large discrepancies in species number between groups of the same age. Existing frameworks to model this diversity either focus on changes between lineages, neglecting global effects such as mass extinctions, or focus on changes over time which would affect all lineages. Yet it seems probable that both lineage differences and mass extinctions affect the same groups. Here we used simulations to test the performance of two widely used methods, under complex scenarios. We report good performances, although with a tendency to over-predict events when increasing the complexity of the scenario. Overall, we find that lineage shifts are better detected than mass extinctions. This work has significance for assessing the methods currently used for estimating changes in diversification using phylogenies and developing new tests.

Submitted 31 August, 2015; v1 submitted 22 April, 2014; originally announced April 2014.

Comments: 34 pages, 11 figures

Journal ref: BMC Evolutionary Biology 2015 15:157

arXiv:1311.4706 [pdf]

Patterns of positive selection in seven ant genomes

Authors: Julien Roux, Eyal Privman, Sebastien Moretti, Josephine T. Daub, Marc Robinson-Rechavi, Laurent Keller

Abstract: The evolution of ants is marked by remarkable adaptations that allowed the development of very complex social systems. To identify how ant-specific adaptations are associated with patterns of molecular evolution, we searched for signs of positive selection on amino-acid changes in proteins. We identified 24 functional categories of genes which were enriched for positively selected genes in the ant lineage. We also reanalyzed genome-wide datasets in bees and flies with the same methodology, to check whether positive selection was specific to ants or also present in other insects. Notably, genes implicated in immunity were enriched for positively selected genes in the three lineages, ruling out the hypothesis that the evolution of hygienic behaviors in social insects caused a major relaxation of selective pressure on immune genes. Our scan also indicated that genes implicated in neurogenesis and olfaction started to undergo increased positive selection before the evolution of sociality in Hymenoptera. Finally, the comparison between these three lineages allowed us to pinpoint molecular evolution patterns that were specific to the ant lineage. In particular, there was ant-specific recurrent positive selection on genes with mitochondrial functions, suggesting that mitochondrial activity was improved during the evolution of this lineage. This might have been an important step toward the evolution of extreme lifespan that is a hallmark of ants.

Submitted 7 May, 2014; v1 submitted 19 November, 2013; originally announced November 2013.

arXiv:1310.2129 [pdf]

IQRray, a new method for Affymetrix microarray quality control, and the homologous organ conservation score, a new benchmark method for quality control metrics

Authors: Marta Rosikiewicz, Marc Robinson-Rechavi

Abstract: Motivation: Microarray results accumulated in public repositories are widely re-used in meta-analytical studies and secondary databases. The quality of the data obtained with this technology varies from experiment to experiment, and an efficient method for quality assessment is necessary to ensure their reliability. Results: The lack of a good benchmark has hampered evaluation of existing methods for quality control. In this study we propose a new independent quality metric that is based on evolutionary conservation of expression profiles. We show, using 11 large organ-specific datasets, that IQRray, a new quality metric developed by us, exhibits the highest correlation with this reference metric, among 14 metrics tested. IQRray outperforms other methods in identification of poor quality arrays in a dataset composed of arrays from many independent experiments. In contrast, the performance of methods designed for detecting outliers in a single experiment, like NUSE and RLE, was low because of the inability of these methods to detect datasets containing only low quality arrays, and the fact that the scores cannot be directly compared between experiments. Availability: The R implementation of IQRray is available at: ftp://lausanne.isb-sib.ch/pub/databases/Bgee/general/IQRray.R

Submitted 8 October, 2013; originally announced October 2013.

arXiv:1210.6444 [pdf, ps, other]

The hourglass and the early conservation models - co-existing evolutionary patterns in vertebrate development

Authors: Barbara Piasecka, Pawel Lichocki, Sebastien Moretti, Sven Bergmann, Marc Robinson-Rechavi

Abstract: Developmental constraints have been postulated to limit the space of feasible phenotypes and thus shape animal evolution. These constraints have been suggested to be the strongest during either early or mid-embryogenesis, which corresponds to the early conservation model or the hourglass model, respectively. Conflicting results have been reported, but in recent studies of animal transcriptomes the hourglass model has been favored. Studies usually report descriptive statistics calculated for all genes over all developmental time points. This introduces dependencies between the sets of compared genes, and may lead to biased results. Here we overcome this problem using an alternative modular analysis. We used the Iterative Signature Algorithm to identify distinct modules of genes co-expressed specifically in consecutive stages of zebrafish development. We then performed a detailed comparison of several gene properties between modules, allowing for a less biased and more powerful analysis. Notably, our analysis corroborated the hourglass pattern only at the regulatory level, with sequences of regulatory regions being most conserved for genes expressed in mid-development, but not at the level of gene sequence, age or expression, in contrast to some previous studies. The early conservation model was supported with gene duplication and birth that were the most rare for genes expressed in early development. Finally, for all gene properties we observed the least conservation for genes expressed in late development or adult, consistent with both models.
Overall, with the modular approach, we showed that different levels of molecular evolution follow different patterns of developmental constraints. Thus both models are valid, but with respect to different genomic features.

Submitted 13 March, 2013; v1 submitted 24 October, 2012; originally announced October 2012.

arXiv:1203.3092 [pdf]

gcodeml: A Grid-enabled Tool for Detecting Positive Selection in Biological Evolution

Authors: Sébastien Moretti, Riccardo Murri, Sergio Maffioletti, Arnold Kuzniar, Briséïs Castella, Nicolas Salamin, Marc Robinson-Rechavi, Heinz Stockinger

Abstract: One of the important questions in biological evolution is to know if certain changes along protein coding genes have contributed to the adaptation of species. This problem is known to be biologically complex and computationally very expensive. It, therefore, requires efficient Grid or cluster solutions to overcome the computational challenge. We have developed a Grid-enabled tool (gcodeml) that relies on the PAML (codeml) package to help analyse large phylogenetic datasets on both Grids and computational clusters. Although we report on results for gcodeml, our approach is applicable and customisable to related problems in biology or other scientific domains.

Submitted 14 March, 2012; originally announced March 2012.

Comments: 10 pages, 4 figures. To appear in the HealthGrid 2012 conf

arXiv:1203.1471 [pdf]

doi 10.1371/journal.pgen.1000311

Developmental constraints on vertebrate genome evolution

Authors: J. Roux, M. Robinson-Rechavi

Abstract: Constraints in embryonic development are thought to bias the direction of evolution by making some changes less likely, and others more likely, depending on their consequences on ontogeny. Here, we characterize the constraints acting on genome evolution in vertebrates. We used gene expression data from two vertebrates: zebrafish, using a microarray experiment spanning 14 stages of development, and mouse, using EST counts for 26 stages of development. We show that, in both species, genes expressed early in development (1) have a more dramatic effect of knock-out or mutation and (2) are more likely to revert to single copy after whole genome duplication, relative to genes expressed late. This supports high constraints on early stages of vertebrate development, making them less open to innovations (gene gain or gene loss). Results are robust to different sources of data: gene expression from microarrays, ESTs, or in situ hybridizations; and mutants from directed KO, transgenic insertions, point mutations, or morpholinos. We determine the pattern of these constraints, which differs from the model used to describe vertebrate morphological conservation ("hourglass" model). While morphological constraints reach a maximum at mid-development (the "phylotypic" stage), genomic constraints appear to decrease in a monotonous manner over developmental time.

Submitted 6 March, 2012; originally announced March 2012.

Journal ref: PLoS Genetics 4 (2008) e1000311

arXiv:0806.1267 [pdf]

Rapid divergence of the ecdysone receptor in Diptera and Lepidoptera suggests coevolution between ECR and USP-RXR

Authors: François Bonneton, Dominique Zelus, Thomas Iwema, Marc Robinson-Rechavi, Vincent Laudet

Abstract: Ecdysteroid hormones are major regulators in reproduction and development of insects, including larval molts and metamorphosis. The functional ecdysone receptor is a heterodimer of ECR (NR1H1) and USP-RXR (NR2B4), which is the orthologue of vertebrate retinoid X receptors (RXR alpha, beta, gamma). Both proteins belong to the superfamily of nuclear hormone receptors, ligand-dependent transcription factors that share two conserved domains: the DNA-binding domain (DBD) and the ligand-binding domain (LBD). In order to gain further insight into the evolution of metamorphosis and gene regulation by ecdysone in arthropods, we performed a phylogenetic analysis of both partners of the heterodimer ECR/USP-RXR. Overall, 38 USP-RXR and 19 ECR protein sequences, from 33 species, have been used for this analysis. Interestingly, sequence alignments and structural comparisons reveal high divergence rates, for both ECR and USP-RXR, specifically among Diptera and Lepidoptera. The most impressive differences affect the ligand-binding domain of USP-RXR. In addition, ECR sequences show variability in other domains, namely the DNA-binding and the carboxy-terminal F domains. Our data provide the first evidence that ECR and USP-RXR may have coevolved during holometabolous insect diversification, leading to a functional divergence of the ecdysone receptor. These results have general implications on fundamental aspects of insect development, evolution of nuclear receptors, and the design of specific insecticides.

Submitted 7 June, 2008; originally announced June 2008.

Journal ref: Molecular Biology and Evolution 4, 20 (2003) 541-553

Showing 1–11 of 11 results for author: Robinson-Rechavi, M