-
Exploring Reproducibility and FAIR Principles in Data Science Using Ecological Niche Modeling as a Case Study
Authors:
Maria Luiza Mondelli,
A. Townsend Peterson,
Luiz M. R. Gadelha Jr
Abstract:
Reproducibility is a fundamental requirement of the scientific process since it enables outcomes to be replicated and verified. Computational scientific experiments can benefit from improved reproducibility for many reasons, including validation of results and reuse by other scientists. However, designing reproducible experiments remains a challenge and hence the need for developing methodologies and tools that can support this process. Here, we propose a conceptual model for reproducibility to specify its main attributes and properties, along with a framework that allows for computational experiments to be findable, accessible, interoperable, and reusable. We present a case study in ecological niche modeling to demonstrate and evaluate the implementation of this framework.
Submitted 31 August, 2019;
originally announced September 2019.
-
A survey of biodiversity informatics: Concepts, practices, and challenges
Authors:
Luiz M. R. Gadelha Jr.,
Pedro C. de Siracusa,
Artur Ziviani,
Eduardo Couto Dalcin,
Helen Michelle Affe,
Marinez Ferreira de Siqueira,
Luís Alexandre Estevão da Silva,
Douglas A. Augusto,
Eduardo Krempser,
Marcia Chame,
Raquel Lopes Costa,
Pedro Milet Meirelles,
Fabiano Thompson
Abstract:
The unprecedented size of the human population, along with its associated economic activities, has an ever-increasing impact on global environments. Across the world, countries are concerned about growing resource consumption and the capacity of ecosystems to provide those resources. To effectively conserve biodiversity, it is essential to make indicators and knowledge openly available to decision-makers in ways that they can effectively use. The development and deployment of mechanisms to produce these indicators depend on having access to trustworthy data from field surveys and automated sensors, biological collections, molecular data, and historic academic literature. The transformation of this raw data into synthesized information that is fit for use requires going through many refinement steps. The methodologies and techniques used to manage and analyze this data comprise an area often called biodiversity informatics (or e-Biodiversity). Biodiversity data follows a life cycle consisting of planning, collection, certification, description, preservation, discovery, integration, and analysis. Researchers, whether producers or consumers of biodiversity data, will likely perform activities related to at least one of these steps. This article explores each stage of the life cycle of biodiversity data, discussing its methodologies, tools, and challenges.
Submitted 7 December, 2020; v1 submitted 29 September, 2018;
originally announced October 2018.
-
BioWorkbench: A High-Performance Framework for Managing and Analyzing Bioinformatics Experiments
Authors:
Maria Luiza Mondelli,
Thiago Magalhães,
Guilherme Loss,
Michael Wilde,
Ian Foster,
Marta Mattoso,
Daniel S. Katz,
Helio J. C. Barbosa,
Ana Tereza R. Vasconcelos,
Kary Ocaña,
Luiz M. R. Gadelha Jr
Abstract:
Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing (HPC) techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems (SWfMS) and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives by using queries to a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%. We also show how the application of machine learning techniques can enrich the analysis process.
Submitted 11 January, 2018;
originally announced January 2018.