-
Novel community data in ecology -- properties and prospects
Authors:
Florian Hartig,
Nerea Abrego,
Alex Bush,
Jonathan M. Chase,
Gurutzeta Guillera-Arroita,
Mathew A. Leibold,
Otso Ovaskainen,
Loïc Pellissier,
Maximilian Pichler,
Giovanni Poggiato,
Laura Pollock,
Sara Si-Moussi,
Wilfried Thuiller,
Duarte S. Viana,
David I. Warton,
Damaris Zurell,
Douglas W. Yu
Abstract:
New technologies for acquiring biological information such as eDNA, acoustic or optical sensors, make it possible to generate spatial community observations at unprecedented scales. The potential of these novel community data to standardize community observations at high spatial, temporal, and taxonomic resolution and at large spatial scale ('many rows and many columns') has been widely discussed, but so far, there has been little integration of these data with ecological models and theory. Here, we review these developments and highlight emerging solutions, focusing on statistical methods for analyzing novel community data, in particular joint species distribution models; the new ecological questions that can be answered with these data; and the potential implications of these developments for policy and conservation.
Submitted 19 January, 2024;
originally announced January 2024.
-
Global simulation envelopes for diagnostic plots in regression models
Authors:
David I. Warton
Abstract:
Residual plots are often used to interrogate regression model assumptions, but interpreting them requires an understanding of how much sampling variation to expect when assumptions are satisfied. In this paper, we propose constructing global envelopes around data (or around trends fitted to data) on residual plots, exploiting recent advances that enable construction of global envelopes around functions by simulation. While the proposed tools are primarily intended as a graphical aid, they can be interpreted as formal tests of model assumptions, which enables the study of their properties via simulation experiments. We considered three model scenarios -- fitting a linear model, generalized linear model or generalized linear mixed model -- and explored the power of global simulation envelope tests constructed around data on quantile-quantile plots, or around trend lines on residual vs fits plots or scale-location plots. Global envelope tests compared favorably to commonly used tests of assumptions at detecting violations of distributional and linearity assumptions. Freely available R software (ecostats::plotenvelope) enables application of these tools to any fitted model that has methods for the simulate, residuals and predict functions.
Submitted 24 October, 2022; v1 submitted 2 August, 2022;
originally announced August 2022.
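To make the idea of a global simulation envelope concrete, here is a minimal Python sketch for a normal linear model. It is an illustration only and not necessarily the construction used by ecostats::plotenvelope: it assumes a simple "studentised maximum deviation" envelope, in which sorted residuals are simulated from the fitted model and the band is widened until about 95% of the simulated curves lie entirely inside it. The function name qq_global_envelope and all settings are hypothetical.

```python
import numpy as np

def qq_global_envelope(X, y, n_sim=999, level=0.95, seed=0):
    """Global envelope around sorted residuals of a normal linear model.

    Assumption: a studentised max-deviation envelope, chosen for brevity.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    sigma = np.sqrt(resid @ resid / (n - p))
    observed = np.sort(resid)

    # Simulate new responses from the fitted model, refit, and keep the
    # sorted residuals of each refit as one simulated "curve".
    sims = np.empty((n_sim, n))
    for i in range(n_sim):
        y_sim = fitted + rng.normal(0.0, sigma, size=n)
        b_sim, *_ = np.linalg.lstsq(X, y_sim, rcond=None)
        sims[i] = np.sort(y_sim - X @ b_sim)

    # Standardise each curve pointwise and record its largest deviation.
    centre = sims.mean(axis=0)
    spread = sims.std(axis=0, ddof=1)
    max_dev = np.max(np.abs(sims - centre) / spread, axis=1)

    # One critical value for the whole curve gives simultaneous coverage.
    c = np.quantile(max_dev, level)
    lower = centre - c * spread
    upper = centre + c * spread
    inside = bool(np.all((observed >= lower) & (observed <= upper)))
    return observed, lower, upper, inside

# Illustrative data (made up): a correctly specified linear model.
demo_rng = np.random.default_rng(1)
x_demo = demo_rng.normal(size=50)
X_demo = np.column_stack([np.ones(50), x_demo])
y_demo = 1.0 + 2.0 * x_demo + demo_rng.normal(size=50)
obs, lo, hi, ok = qq_global_envelope(X_demo, y_demo, n_sim=199)
print("observed curve inside envelope:", ok)
```

Because a single critical value is applied to the whole curve, the observed residuals leaving the band anywhere can be read as evidence against the model at roughly the stated level, which is what distinguishes a global envelope from a pointwise one.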
-
Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays
Authors:
Łukasz Kidziński,
Francis K. C. Hui,
David I. Warton,
Trevor Hastie
Abstract:
Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets with thousands of observational units or responses.
In this article, we propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood and then using a Newton method and Fisher scoring to learn the model parameters. Computationally, our method is noticeably faster and more stable, enabling GLLVM fits to much larger matrices than previously possible. We apply our method on a dataset of 48,000 observational units with over 2,000 observed species in each unit and find that most of the variability can be explained with a handful of factors. We publish an easy-to-use implementation of our proposed fitting algorithm.
Submitted 27 January, 2022; v1 submitted 6 October, 2020;
originally announced October 2020.
-
Multi-species distribution modeling using penalized mixture of regressions
Authors:
Francis K. C. Hui,
David I. Warton,
Scott D. Foster
Abstract:
Multi-species distribution modeling, which relates the occurrence of multiple species to environmental variables, is an important tool used by ecologists for both predicting the distribution of species in a community and identifying the important variables driving species co-occurrences. Recently, Dunstan, Foster and Darnell [Ecol. Model. 222 (2011) 955-963] proposed using finite mixture of regression (FMR) models for multi-species distribution modeling, where species are clustered based on their environmental response to form a small number of "archetypal responses." As an illustrative example, they applied their mixture model approach to a presence-absence data set of 200 marine organisms, collected along the Great Barrier Reef in Australia. Little attention, however, was given to the problem of model selection: since the archetypes (mixture components) may depend on different but likely overlapping sets of covariates, a method is needed for performing variable selection on all components simultaneously. In this article, we consider using penalized likelihood functions for variable selection in FMR models. We propose two penalties which exploit the grouped structure of the covariates, that is, each covariate is represented by a group of coefficients, one for each component. This leads to an attractive form of shrinkage that allows a covariate to be removed from all components simultaneously. Both penalties are shown to possess specific forms of variable selection consistency, with simulations indicating they outperform other methods which do not take into account the grouped structure. When applied to the Great Barrier Reef data set, penalized FMR models offer more insight into the important variables driving species co-occurrence in the marine community (compared to previous results where no model selection was conducted), while offering a computationally stable method of modeling complex species-environment relationships (through regularization).
Submitted 16 September, 2015;
originally announced September 2015.
-
Poisson point process models solve the "pseudo-absence problem" for presence-only data in ecology
Authors:
David I. Warton,
Leah C. Shepherd
Abstract:
Presence-only data, point locations where a species has been recorded as being present, are often used in modeling the distribution of a species as a function of a set of explanatory variables, whether to map species occurrence, to understand its association with the environment, or to predict its response to environmental change. Currently, ecologists most commonly analyze presence-only data by adding randomly chosen "pseudo-absences" to the data such that it can be analyzed using logistic regression, an approach which has weaknesses in model specification, in interpretation, and in implementation. To address these issues, we propose Poisson point process modeling of the intensity of presences. We also derive a link between the proposed approach and logistic regression: specifically, we show that as the number of pseudo-absences increases (in a regular or uniform random arrangement), logistic regression slope parameters and their standard errors converge to those of the corresponding Poisson point process model. We discuss the practical implications of these results. In particular, point process modeling offers a framework for choice of the number and location of pseudo-absences, both of which are currently chosen by ad hoc and sometimes ineffective methods in ecology, a point which we illustrate by example.
Submitted 15 November, 2010;
originally announced November 2010.
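The convergence result stated in this abstract can be checked numerically with a small sketch. The following Python example is an illustration under made-up assumptions, not the authors' code: presences are simulated on the unit interval from a log-linear intensity with slope 2, pseudo-absences are placed on a regular grid, and both models are fitted with a hand-written Newton solver. All function names (design, grid, newton, fit_ppm, fit_logistic) are hypothetical.

```python
import numpy as np

def design(x):
    # Intercept plus a single covariate (here the location itself).
    return np.column_stack([np.ones(len(x)), x])

def grid(m):
    # Regular grid of m midpoints on [0, 1].
    return (np.arange(m) + 0.5) / m

def newton(beta, grad_hess, n_iter=50):
    # Plain Newton iterations on a concave log-likelihood.
    for _ in range(n_iter):
        g, H = grad_hess(beta)
        beta = beta - np.linalg.solve(H, g)
    return beta

def fit_ppm(pres, quad, weights):
    # Poisson point process log-likelihood, with the integral of the
    # intensity approximated by a weighted sum over quadrature points.
    Xp, Xq = design(pres), design(quad)
    def grad_hess(beta):
        mu = weights * np.exp(Xq @ beta)
        return Xp.sum(axis=0) - Xq.T @ mu, -(Xq * mu[:, None]).T @ Xq
    start = np.array([np.log(len(pres) / weights.sum()), 0.0])
    return newton(start, grad_hess)

def fit_logistic(pres, absences):
    # Logistic regression of presences (1) against pseudo-absences (0).
    X = design(np.concatenate([pres, absences]))
    y = np.concatenate([np.ones(len(pres)), np.zeros(len(absences))])
    def grad_hess(beta):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        return X.T @ (y - p), -(X * (p * (1.0 - p))[:, None]).T @ X
    start = np.array([np.log(len(pres) / len(absences)), 0.0])
    return newton(start, grad_hess)

# Simulated presences (assumption): intensity proportional to exp(2 s)
# on [0, 1], about 300 points, drawn by inverting the CDF.
rng = np.random.default_rng(0)
n_pres = rng.poisson(300)
u = rng.uniform(size=n_pres)
pres = np.log1p(u * (np.exp(2.0) - 1.0)) / 2.0

quad = grid(20000)
ppm = fit_ppm(pres, quad, np.full(len(quad), 1.0 / len(quad)))
print("PPM slope:", ppm[1])
for m in (100, 1000, 10000, 200000):
    print(m, "pseudo-absences, logistic slope:", fit_logistic(pres, grid(m))[1])
```

Only the slope is compared, because the logistic intercept shifts by roughly the log of the number of pseudo-absences and has no direct interpretation as an intensity.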