
Four principles for improved statistical ecology

Authors: Gordana Popovic, Tanya J. Mason, Tiago A. Marques, Joanne Potts, Szymon M. Drobniak, Rocío Joo, Res Altwegg, Carolyn C. I. Burns, Michael A. McCarthy, Alison Johnston, Shinichi Nakagawa, Louise McMillan, Kadambari Devarajan, Patrick L. Taggart, Alison C. Wunderlich, Magdalena M. Mair, Juan Andrés Martínez-Lanfranco, Malgorzata Lagisz, Patrice P. Pottier

Abstract: Increasing attention has been drawn to the misuse of statistical methods over recent years, with particular concern about the prevalence of practices such as poor experimental design, cherry-picking and inadequate reporting. These failures are largely unintentional and no more common in ecology than in other scientific disciplines, with many of them easily remedied given the right guidance. Originating from a discussion at the 2020 International Statistical Ecology Conference, we show how ecologists can build their research following four guiding principles for impactful statistical research practices: 1. Define a focused research question, then plan sampling and analysis to answer it; 2. Develop a model that accounts for the distribution and dependence of your data; 3. Emphasise effect sizes to replace statistical significance with ecological relevance; 4. Report your methods and findings in sufficient detail so that your research is valid and reproducible. Listed in approximate order of importance, these principles provide a framework for experimental design and reporting that guards against unsound practices. Starting with a well-defined research question allows researchers to create an efficient study to answer it, and guards against poor research practices that lead to false positives and poor replicability. Correct and appropriate statistical models give sound conclusions; good reporting practices and a focus on ecological relevance make results impactful and replicable. Illustrated with an example from a recent study into the impact of disturbance on upland swamps, this paper explains the rationale for the selection and use of effective statistical practices and provides practical guidance for ecologists seeking to improve their use of statistical methods.

Submitted 2 February, 2023; originally announced February 2023.

Comments: 19 pages, 2 figures

arXiv:2108.12471 [pdf, other]

Machine learning on DNA-encoded library count data using an uncertainty-aware probabilistic loss function

Authors: Katherine S. Lim, Andrew G. Reidenbach, Bruce K. Hua, Jeremy W. Mason, Christopher J. Gerry, Paul A. Clemons, Connor W. Coley

Abstract: DNA-encoded library (DEL) screening and quantitative structure-activity relationship (QSAR) modeling are two techniques used in drug discovery to find small molecules that bind a protein target. Applying QSAR modeling to DEL data can facilitate the selection of compounds for off-DNA synthesis and evaluation. Such a combined approach has been shown recently by training binary classifiers to learn DEL enrichments of aggregated "disynthons" to accommodate the sparse and noisy nature of DEL data. However, a binary classifier cannot distinguish between different levels of enrichment, and information is potentially lost during disynthon aggregation. Here, we demonstrate a regression approach to learning DEL enrichments of individual molecules using a custom negative log-likelihood loss function that effectively denoises DEL data and introduces opportunities for visualization of learned structure-activity relationships (SAR). Our approach explicitly models the Poisson statistics of the sequencing process used in the DEL experimental workflow under a frequentist view. We illustrate this approach on a dataset of 108k compounds screened against CAIX, and a dataset of 5.7M compounds screened against sEH and SIRT2. Due to the treatment of uncertainty in the data through the negative log-likelihood loss function, the models can ignore low-confidence outliers. While our approach does not demonstrate a benefit for extrapolation to novel structures, we expect our denoising and visualization pipeline to be useful in identifying SAR trends and enriched pharmacophores in DEL data. Further, this approach to uncertainty-aware regression is applicable to other sparse or noisy datasets where the nature of stochasticity is known or can be modeled; in particular, the Poisson enrichment ratio metric we use can apply to other settings that compare sequencing count data between two experimental conditions.

Submitted 27 April, 2022; v1 submitted 27 August, 2021; originally announced August 2021.

arXiv:1512.04591 [pdf, other]

The Prisoner's dilemma as a cancer model

Authors: Jeffrey West, Zaki Hasnain, Jeremy Mason, Paul K. Newton

Abstract: Tumor development is an evolutionary process in which a heterogeneous population of cells with differential growth capabilities compete for resources in order to gain a proliferative advantage. What are the minimal ingredients needed to recreate some of the emergent features of such a developing complex ecosystem? What is a tumor doing before we can detect it? We outline a mathematical model, driven by a stochastic Moran process, in which cancer cells and healthy cells compete for dominance in the population. Each is assigned payoffs according to a Prisoner's Dilemma evolutionary game where the healthy cells are the cooperators and the cancer cells are the defectors. With point mutational dynamics, heredity, and a fitness landscape controlling birth and death rates, natural selection acts on the cell population and simulated "cancer-like" features emerge, such as Gompertzian tumor growth driven by heterogeneity, the log-kill law which (linearly) relates therapeutic dose density to the (log) probability of cancer cell survival, and the Norton-Simon hypothesis which (linearly) relates tumor regression rates to tumor growth rates. We highlight the utility, clarity, and power that such models provide, despite (and because of) their simplicity and built-in assumptions.

Submitted 16 January, 2016; v1 submitted 14 December, 2015; originally announced December 2015.

arXiv:1501.00682 [pdf, ps, other]

Quasi-Conscious Multivariate Systems

Authors: Jonathan Mason

Abstract: Conscious experience is awash with underlying relationships. Moreover, for various brain regions such as the visual cortex, the system is biased toward some states. Representing this bias using a probability distribution shows that the system can define expected quantities. The mathematical theory in the present paper links these facts by using expected float entropy (efe), which is a measure of the expected amount of information needed, to specify the state of the system, beyond what is already known about the system from relationships that appear as parameters. Under the requirement that the relationship parameters minimise efe, the brain defines relationships. It is proposed that when a brain state is interpreted in the context of these relationships the brain state acquires meaning in the form of the relational content of the associated experience. For a given set, the theory represents relationships using weighted relations which assign continuous weights, from 0 to 1, to the elements of the Cartesian product of that set. The relationship parameters include weighted relations on the nodes of the system and on their set of states. Examples obtained using Monte-Carlo methods (where relationship parameters are chosen uniformly at random) suggest that efe distributions with long left tails are most important.

Submitted 9 August, 2015; v1 submitted 4 January, 2015; originally announced January 2015.

Comments: 33 pages (double spacing), 11 figures, 15 Tables

arXiv:1407.6959 [pdf]

doi 10.1089/cmb.2014.0290

A scalable method for molecular network reconstruction identifies properties of targets and mutations in acute myeloid leukemia

Authors: Edison Ong, Anthony Szedlak, Yunyi Kang, Peyton Smith, Nicholas Smith, Madison McBride, Darren Finlay, Kristiina Vuori, James Mason, Edward D. Ball, Carlo Piermarocchi, Giovanni Paternostro

Abstract: A key aim of systems biology is the reconstruction of molecular networks; however, we do not yet have networks that integrate information from all datasets available for a particular clinical condition. This is in part due to the limited scalability, in terms of required computational time and power, of existing algorithms. Network reconstruction methods should also be scalable in the sense of allowing scientists from different backgrounds to efficiently integrate additional data. We present a network model of acute myeloid leukemia (AML). In the current version (AML 2.1) we have used gene expression data (both microarray and RNA-seq) from five different studies comprising a total of 771 AML samples and a protein-protein interactions dataset. Our scalable network reconstruction method is in part based on the well-known property of gene expression correlation among interacting molecules. The difficulty of distinguishing between direct and indirect interactions is addressed by optimizing the coefficient of variation of gene expression, using a validated gold standard dataset of direct interactions. Computational time is much reduced compared to other network reconstruction methods. A key feature is the study of the reproducibility of interactions found in independent clinical datasets. An analysis of the most significant clusters, and of the network properties (intraset efficiency, degree, betweenness centrality and PageRank) of common AML mutations demonstrated the biological significance of the network. A statistical analysis of the response of blast cells from eleven AML patients to a library of kinase inhibitors provided an experimental validation of the network. A combination of network and experimental data identified CDK1, CDK2, CDK4 and CDK6 and other kinases as potential therapeutic targets in AML.

Submitted 25 July, 2014; originally announced July 2014.

Journal ref: Journal of Computational Biology. April 2015, 22(4): 253-265

arXiv:1203.3113 [pdf, ps, other]

doi 10.1002/cplx.21431

Consciousness and the structuring property of typical data

Authors: Jonathan W. Mason

Abstract: The theoretical base for consciousness, in particular an explanation of how consciousness is defined by the brain, has long been sought by science. We propose a partial theory of consciousness as relations defined by typical data. The theory is based on the idea that a brain state on its own is almost meaningless but in the context of the typical brain states, defined by the brain's structure, a particular brain state is highly structured by relations. The proposed theory can be applied and tested both theoretically and experimentally. Precisely how typical data determines relations is fully established using discrete mathematics.

Submitted 31 December, 2012; v1 submitted 14 March, 2012; originally announced March 2012.

Comments: 16 pages, 8 figures, First submitted for publication March 2012

MSC Class: 92B20; 91E30

Showing 1–6 of 6 results for author: Mason, J