-
An Analysis of Pacing Profiles in Sprint Kayak Racing Using Functional Principal Components and Hidden Markov Models
Authors:
Harry Estreich,
Nicola Bullock,
Mark Osborne,
Edgar Santos-Fernandez,
Paul Pao-Yen Wu
Abstract:
This study analysed sprint kayak pacing profiles in order to categorise and compare an athlete's race profile throughout their career. We used functional principal component analysis of normalised velocity data for 500m and 1000m races to quantify pacing. The first four principal components explained 90.77% of the variation over 500m and 78.80% over 1000m. These principal components were then associated with unique pacing characteristics with the first component defined as a dropoff in velocity and the second component defined as a kick. We then applied a Hidden Markov model to categorise each profile over an athlete's career, using the PC scores, into different types of race profiles. This model included age and event type and we identified a trend for a higher dropoff in development pathway athletes. Using the four different race profile types, four athletes had all their race profiles throughout their careers analysed. It was identified that an athlete's pacing profile can and does change throughout their career as an athlete matures. This information provides coaches, practitioners and athletes with expectations as to how pacing profiles can be expected to change across the course of an athlete's career.
Submitted 7 July, 2024;
originally announced July 2024.
-
Bayesian Design for Sampling Anomalous Spatio-Temporal Data
Authors:
Katie Buchhorn,
Kerrie Mengersen,
Edgar Santos-Fernandez,
James McGree
Abstract:
Data collected from arrays of sensors are essential for informed decision-making in various systems. However, the presence of anomalies can compromise the accuracy and reliability of insights drawn from the collected data or information obtained via statistical analysis. This study aims to develop a robust Bayesian optimal experimental design (BOED) framework with anomaly detection methods for high-quality data collection. We introduce a general framework that involves anomaly generation, detection and error scoring when searching for an optimal design. This method is demonstrated using two comprehensive simulated case studies: the first study uses a spatial dataset, and the second uses a spatio-temporal river network dataset. As a baseline approach, we employed a commonly used prediction-based utility function based on minimising errors. Results illustrate the trade-off between predictive accuracy and anomaly detection performance for our method under various design scenarios. An optimal design robust to anomalies ensures the collection and analysis of more trustworthy data, playing a crucial role in understanding the dynamics of complex systems such as the environment, therefore enabling informed decisions in monitoring, management, and response.
Submitted 15 March, 2024;
originally announced March 2024.
-
Conditional normalization in time series analysis
Authors:
Puwasala Gamakumara,
Edgar Santos-Fernandez,
Priyanga Dilini Talagala,
Rob J. Hyndman,
Kerrie Mengersen,
Catherine Leigh
Abstract:
Time series often reflect variation associated with other related variables. Controlling for the effect of these variables is useful when modeling or analysing the time series. We introduce a novel approach to normalize time series data conditional on a set of covariates. We do this by modeling the conditional mean and the conditional variance of the time series with generalized additive models using a set of covariates. The conditional mean and variance are then used to normalize the time series. We illustrate the use of conditionally normalized series using two applications involving river network data. First, we show how these normalized time series can be used to impute missing values in the data. Second, we show how the normalized series can be used to estimate the conditional autocorrelation function and conditional cross-correlation functions via additive models. Finally we use the conditional cross-correlations to estimate the time it takes water to flow between two locations in a river network.
Submitted 21 May, 2023;
originally announced May 2023.
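A minimal Python sketch of the normalisation step described in the abstract above. The paper models the conditional mean and variance with generalized additive models; this sketch swaps in per-level sample means and variances of a single discrete covariate, purely to keep the example self-contained. The function name `conditional_normalize` and the toy data are illustrative assumptions, not the authors' code.

```python
from collections import defaultdict
from math import sqrt


def conditional_normalize(y, x):
    """Normalise y given a discrete covariate x.

    Stand-in for the GAM-based approach: the conditional mean and
    variance are estimated as the sample mean and (population)
    variance of y within each level of x.
    """
    groups = defaultdict(list)
    for yi, xi in zip(y, x):
        groups[xi].append(yi)

    # Estimate the conditional mean and variance for each covariate level.
    stats = {}
    for level, vals in groups.items():
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats[level] = (mean, var)

    # Subtract the conditional mean and divide by the conditional
    # standard deviation; a zero-variance level maps to 0.0.
    normalized = []
    for yi, xi in zip(y, x):
        mean, var = stats[xi]
        normalized.append((yi - mean) / sqrt(var) if var > 0 else 0.0)
    return normalized


# Hypothetical toy series: observations under "low" and "high" flow.
y = [1.0, 3.0, 10.0, 30.0]
x = ["low", "low", "high", "high"]
z = conditional_normalize(y, x)
```

After normalisation both covariate levels are on a common scale, which is what makes the normalised series usable for imputation or correlation estimates across conditions.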
-
Increasing trust in new data sources: crowdsourcing image classification for ecology
Authors:
Edgar Santos-Fernandez,
Julie Vercelloni,
Aiden Price,
Grace Heron,
Bryce Christensen,
Erin E. Peterson,
Kerrie Mengersen
Abstract:
Crowdsourcing methods facilitate the production of scientific information by non-experts. This form of citizen science (CS) is becoming a key source of complementary data in many fields to inform data-driven decisions and study challenging problems. However, concerns about the validity of these data often constrain their utility. In this paper, we focus on the use of citizen science data in addressing complex challenges in environmental conservation. We consider this issue from three perspectives. First, we present a literature scan of papers that have employed Bayesian models with citizen science in ecology. Second, we compare several popular majority vote algorithms and introduce a Bayesian item response model that estimates and accounts for participants' abilities after adjusting for the difficulty of the images they have classified. The model also enables participants to be clustered into groups based on ability. Third, we apply the model in a case study involving the classification of corals from underwater images from the Great Barrier Reef, Australia. We show that the model achieved superior results in general and, for difficult tasks, a weighted consensus method that uses only groups of experts and experienced participants produced better performance measures. Moreover, we found that participants learn as they have more classification opportunities, which substantially increases their abilities over time. Overall, the paper demonstrates the feasibility of CS for answering complex and challenging ecological questions when these data are appropriately analysed. This serves as motivation for future work to increase the efficacy and trustworthiness of this emerging source of data.
Submitted 1 May, 2023;
originally announced May 2023.
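The difference between a plain majority vote and a weighted consensus, both mentioned in the abstract above, can be illustrated with a short Python sketch. The weights here are hypothetical ability scores supplied by hand; in the paper they would come from the fitted Bayesian item response model, which this sketch does not implement. The function names are assumptions for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Return the most frequent label (ties go to the first label seen)."""
    return Counter(labels).most_common(1)[0][0]


def weighted_vote(labels, weights):
    """Return the label with the largest total weight.

    `weights` holds one hypothetical ability score per participant.
    """
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Three participants classify the same image point (toy data).
labels = ["coral", "algae", "algae"]
abilities = [0.9, 0.2, 0.3]  # assumed ability scores, not estimated here

simple = majority_vote(labels)
weighted = weighted_vote(labels, abilities)
```

In this toy case the two rules disagree: two low-ability participants outvote one high-ability participant under the majority rule, but not under the weighted rule.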
-
Graph Neural Network-Based Anomaly Detection for River Network Systems
Authors:
Katie Buchhorn,
Edgar Santos-Fernandez,
Kerrie Mengersen,
Robert Salomone
Abstract:
Water is the lifeblood of river networks, and its quality plays a crucial role in sustaining both aquatic ecosystems and human societies. Real-time monitoring of water quality is increasingly reliant on in-situ sensor technology. Anomaly detection is crucial for identifying erroneous patterns in sensor data, but can be a challenging task due to the complexity and variability of the data, even under normal conditions. This paper presents a solution to the challenging task of anomaly detection for river network sensor data, which is essential for accurate and continuous monitoring. We use a graph neural network model, the recently proposed Graph Deviation Network (GDN), which employs graph attention-based forecasting to capture the complex spatio-temporal relationships between sensors. We propose an alternate anomaly scoring method, GDN+, based on the learned graph. To evaluate the model's efficacy, we introduce new benchmarking simulation experiments with highly-sophisticated dependency structures and subsequence anomalies of various types. We further examine the strengths and weaknesses of this baseline approach, GDN, in comparison to other benchmarking methods on complex real-world river network data. Findings suggest that GDN+ outperforms the baseline approach in high-dimensional data, while also providing improved interpretability. We also introduce software called gnnad.
Submitted 31 May, 2023; v1 submitted 18 April, 2023;
originally announced April 2023.
-
Being Bayesian in the 2020s: opportunities and challenges in the practice of modern applied Bayesian statistics
Authors:
Joshua J. Bon,
Adam Bretherton,
Katie Buchhorn,
Susanna Cramb,
Christopher Drovandi,
Conor Hassan,
Adrianne L. Jenner,
Helen J. Mayfield,
James M. McGree,
Kerrie Mengersen,
Aiden Price,
Robert Salomone,
Edgar Santos-Fernandez,
Julie Vercelloni,
Xiaoyu Wang
Abstract:
Building on a strong foundation of philosophy, theory, methods and computation over the past three decades, Bayesian approaches are now an integral part of the toolkit for most statisticians and data scientists. Whether they are dedicated Bayesians or opportunistic users, applied professionals can now reap many of the benefits afforded by the Bayesian paradigm. In this paper, we touch on six modern opportunities and challenges in applied Bayesian statistics: intelligent data collection, new data sources, federated analysis, inference for implicit models, model transfer and purposeful software products.
Submitted 17 January, 2023; v1 submitted 17 November, 2022;
originally announced November 2022.
-
clusterBMA: Bayesian model averaging for clustering
Authors:
Owen Forbes,
Edgar Santos-Fernandez,
Paul Pao-Yen Wu,
Hong-Bo Xie,
Paul E. Schwenn,
Jim Lagopoulos,
Lia Mills,
Dashiell D. Sacks,
Daniel F. Hermens,
Kerrie Mengersen
Abstract:
Various methods have been developed to combine inference across multiple sets of results for unsupervised clustering, within the ensemble clustering literature. The approach of reporting results from one 'best' model out of several candidate clustering models generally ignores the uncertainty that arises from model selection, and results in inferences that are sensitive to the particular model and parameters chosen. Bayesian model averaging (BMA) is a popular approach for combining results across multiple models that offers some attractive benefits in this setting, including probabilistic interpretation of the combined cluster structure and quantification of model-based uncertainty.
In this work we introduce clusterBMA, a method that enables weighted model averaging across results from multiple unsupervised clustering algorithms. We use clustering internal validation criteria to develop an approximation of the posterior model probability, used for weighting the results from each model. From a consensus matrix representing a weighted average of the clustering solutions across models, we apply symmetric simplex matrix factorisation to calculate final probabilistic cluster allocations. In addition to outperforming other ensemble clustering methods on simulated data, clusterBMA offers unique features including probabilistic allocation to averaged clusters, combining allocation probabilities from 'hard' and 'soft' clustering algorithms, and measuring model-based uncertainty in averaged cluster allocation. This method is implemented in an accompanying R package of the same name.
Submitted 25 March, 2023; v1 submitted 9 September, 2022;
originally announced September 2022.
-
Bayesian Design with Sampling Windows for Complex Spatial Processes
Authors:
Katie Buchhorn,
Kerrie Mengersen,
Edgar Santos-Fernandez,
Erin E. Peterson,
James M. McGree
Abstract:
Optimal design facilitates intelligent data collection. In this paper, we introduce a fully Bayesian design approach for spatial processes with complex covariance structures, like those typically exhibited in natural ecosystems. Coordinate Exchange algorithms are commonly used to find optimal design points. However, collecting data at specific points is often infeasible in practice. Currently, there is no provision to allow for flexibility in the choice of design. We also propose an approach to find Bayesian sampling windows, rather than points, via Gaussian process emulation to identify regions of high design efficiency across a multi-dimensional space. These developments are motivated by two ecological case studies: monitoring water temperature in a river network system in the northwestern United States and monitoring submerged coral reefs off the north-west coast of Australia.
Submitted 10 June, 2022;
originally announced June 2022.
-
On the intrinsic dimensionality of Covid-19 data: a global perspective
Authors:
Abhishek Varghese,
Edgar Santos-Fernandez,
Francesco Denti,
Antonietta Mira,
Kerrie Mengersen
Abstract:
This paper aims to develop a global perspective of the complexity of the relationship between the standardised per-capita growth rate of Covid-19 cases, deaths, and the OxCGRT Covid-19 Stringency Index, a measure describing a country's stringency of lockdown policies. To achieve our goal, we use a heterogeneous intrinsic dimension estimator implemented as a Bayesian mixture model, called Hidalgo. We identify that the Covid-19 dataset may project onto two low-dimensional manifolds without significant information loss. The low dimensionality suggests strong dependency among the standardised growth rates of cases and deaths per capita and the OxCGRT Covid-19 Stringency Index for a country over 2020-2021. Given the low dimensional structure, it may be feasible to model observable Covid-19 dynamics with few parameters. Importantly, we identify spatial autocorrelation in the intrinsic dimension distribution worldwide. Moreover, we highlight that high-income countries are more likely to lie on low-dimensional manifolds, likely arising from aging populations, comorbidities, and increased per capita mortality burden from Covid-19. Finally, we temporally stratify the dataset to examine the intrinsic dimension at a more granular level throughout the Covid-19 pandemic.
Submitted 8 March, 2022;
originally announced March 2022.
-
SSNbayes: An R package for Bayesian spatio-temporal modelling on stream networks
Authors:
Edgar Santos-Fernandez,
Jay M. Ver Hoef,
James M. McGree,
Daniel J. Isaak,
Kerrie Mengersen,
Erin E. Peterson
Abstract:
Spatio-temporal models are widely used in many research areas from ecology to epidemiology. However, most covariance functions describe spatial relationships based on Euclidean distance only. In this paper, we introduce the R package SSNbayes for fitting Bayesian spatio-temporal models and making predictions on branching stream networks. SSNbayes provides a linear regression framework with multiple options for incorporating spatial and temporal autocorrelation. Spatial dependence is captured using stream distance and flow connectivity while temporal autocorrelation is modelled using vector autoregression approaches. SSNbayes provides the functionality to make predictions across the whole network, compute exceedance probabilities and other probabilistic estimates such as the proportion of suitable habitat. We illustrate the functionality of the package using a stream temperature dataset collected in Idaho, USA.
Submitted 14 February, 2022;
originally announced February 2022.
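An exceedance probability of the kind mentioned in the abstract above is commonly approximated as the share of posterior predictive draws above a threshold. The Python sketch below shows that calculation under stated assumptions: the draws and the threshold are made-up values, and `exceedance_probability` is a hypothetical helper, not part of the SSNbayes R package.

```python
def exceedance_probability(draws, threshold):
    """Fraction of posterior draws strictly above `threshold`."""
    if not draws:
        raise ValueError("draws must not be empty")
    return sum(1 for d in draws if d > threshold) / len(draws)


# Hypothetical posterior predictive draws of stream temperature (deg C)
# at one location and time point.
draws = [12.1, 13.4, 15.2, 16.8]
p_exceed = exceedance_probability(draws, threshold=15.0)
```

With real model output there would be thousands of draws per location, and the same calculation would be repeated at every prediction site on the network.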
-
Bayesian spatio-temporal models for stream networks
Authors:
Edgar Santos-Fernandez,
Jay M. Ver Hoef,
Erin E. Peterson,
James McGree,
Daniel Isaak,
Kerrie Mengersen
Abstract:
Spatio-temporal models are widely used in many research areas including ecology. The recent proliferation of the use of in-situ sensors in streams and rivers supports space-time water quality modelling and monitoring in near real-time. A new family of spatio-temporal models is introduced. These models incorporate spatial dependence using stream distance while temporal autocorrelation is captured using vector autoregression approaches. Several variations of these novel models are proposed using a Bayesian framework. The results show that our proposed models perform well using spatio-temporal data collected from real stream networks, particularly in terms of out-of-sample RMSPE. This is illustrated considering a case study of water temperature data in the northwestern United States.
Submitted 14 February, 2022; v1 submitted 5 March, 2021;
originally announced March 2021.
-
Correcting misclassification errors in crowdsourced ecological data: A Bayesian perspective
Authors:
Edgar Santos-Fernandez,
Erin E. Peterson,
Julie Vercelloni,
Em Rushworth,
Kerrie Mengersen
Abstract:
Many research domains use data elicited from "citizen scientists" when a direct measure of a process is expensive or infeasible. However, participants may report incorrect estimates or classifications due to their lack of skill. We demonstrate how Bayesian hierarchical models can be used to learn about latent variables of interest, while accounting for the participants' abilities. The model is described in the context of an ecological application that involves crowdsourced classifications of georeferenced coral-reef images from the Great Barrier Reef, Australia. The latent variable of interest is the proportion of coral cover, which is a common indicator of coral reef health. The participants' abilities are expressed in terms of sensitivity and specificity of a correctly classified set of points on the images. The model also incorporates a spatial component, which allows prediction of the latent variable in locations that have not been surveyed. We show that the model outperforms traditional weighted-regression approaches used to account for uncertainty in citizen science data. Our approach produces more accurate regression coefficients and provides a better characterization of the latent process of interest. This new method is implemented in the probabilistic programming language Stan and can be applied to a wide number of problems that rely on uncertain citizen science data.
Submitted 1 June, 2020;
originally announced June 2020.
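The abstract above expresses participants' abilities as sensitivity and specificity. As a rough illustration of those two quantities only, the Python sketch below computes their empirical versions for one participant from points whose true class is known. The paper estimates them within a Bayesian hierarchical model in Stan; the function name and the toy data here are assumptions.

```python
def sensitivity_specificity(truth, reported):
    """Empirical sensitivity and specificity for binary classifications.

    `truth` and `reported` are equal-length sequences of 0/1 values,
    where 1 means "hard coral present" at an image point.
    """
    pairs = list(zip(truth, reported))
    tp = sum(1 for t, r in pairs if t and r)          # true positives
    fn = sum(1 for t, r in pairs if t and not r)      # false negatives
    tn = sum(1 for t, r in pairs if not t and not r)  # true negatives
    fp = sum(1 for t, r in pairs if not t and r)      # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative point")
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: five image points classified by one hypothetical participant.
truth = [1, 1, 1, 0, 0]
reported = [1, 1, 0, 0, 1]
sens, spec = sensitivity_specificity(truth, reported)
```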
-
Bayesian item response models for citizen science ecological data
Authors:
Edgar Santos-Fernandez,
Kerrie Mengersen
Abstract:
So-called 'citizen science' data elicited from crowds has become increasingly popular in many fields including ecology. However, the quality of this information is frequently debated within the scientific community. Therefore, modern citizen science implementations require measures of the users' proficiency that account for the difficulty of the tasks. We introduce a new methodological framework of item response and linear logistic test models with application to citizen science data used in ecology research. This approach accommodates spatial autocorrelation within the item difficulties and produces relevant ecological measures of species and site-related difficulties, discriminatory power and guessing behavior. These, along with estimates of the subject abilities, allow better management of these programs and provide deeper insights. This paper also highlights the fitting of item response models to big data via divide-and-conquer. We found that the suggested methods outperform the traditional item response models in terms of RMSE, accuracy, and WAIC based on leave-one-out cross-validation on simulated and empirical data. We present a comprehensive implementation using a case study of species identification in the Serengeti, Tanzania. R and Stan code is provided for full reproducibility. Multiple statistical illustrations and visualizations are given, which allow practitioners to extrapolate to a wide range of citizen science ecological problems.
Submitted 25 May, 2020; v1 submitted 15 March, 2020;
originally announced March 2020.
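The abstract above refers to item difficulty, discriminatory power and guessing behaviour. These are the three item parameters of the standard three-parameter logistic (3PL) item response function, sketched below in Python. The abstract does not state the paper's exact parameterisation, so the 3PL form, the function name and the example values are assumptions for illustration.

```python
from math import exp


def p_correct(ability, difficulty, discrimination=1.0, guessing=0.0):
    """Probability of a correct response under a standard 3PL model.

    ability        -- latent skill of the participant
    difficulty     -- latent difficulty of the item (e.g. a species)
    discrimination -- how sharply the item separates skill levels
    guessing       -- lower bound on success from pure guessing
    """
    logistic = 1.0 / (1.0 + exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - guessing) * logistic


# A participant whose ability equals the item difficulty succeeds
# half of the time when there is no guessing.
p_no_guess = p_correct(ability=0.0, difficulty=0.0)

# With a guessing floor, the same participant does better than 0.5.
p_with_guess = p_correct(ability=0.0, difficulty=0.0, guessing=0.2)
```

Setting `discrimination=1.0` and `guessing=0.0` reduces the function to the one-parameter (Rasch-type) form, in which only ability and difficulty matter.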
-
The role of intrinsic dimension in high-resolution player tracking data -- Insights in basketball
Authors:
Edgar Santos-Fernandez,
Francesco Denti,
Kerrie Mengersen,
Antonietta Mira
Abstract:
A new range of statistical analyses has emerged in sports after the introduction of high-resolution player tracking technology, specifically in basketball. However, these high-dimensional data are often challenging for statistical inference and decision making. In this article, we employ Hidalgo, a state-of-the-art Bayesian mixture model that allows the estimation of heterogeneous intrinsic dimensions (ID) within a dataset, and propose some theoretical enhancements. ID results can be interpreted as indicators of variability and complexity of basketball plays and games. This technique allows classification and clustering of NBA basketball players' movement and shot chart data. Analyzing movement data, Hidalgo identifies key stages of offensive actions such as creating space for passing, preparation/shooting and following through. We found that the ID value spikes, reaching a peak between 4 and 8 seconds in the offensive part of the court, after which it declines. In shot charts, we obtained groups of shots that produce substantially higher and lower success rates. Overall, game-winners tend to have a larger intrinsic dimension, which is an indication of more unpredictability and unique shot placements. Similarly, we found higher ID values in plays when the score margin is small compared to large-margin ones. These outcomes could be exploited by coaches to obtain better offensive/defensive results.
Submitted 10 February, 2020;
originally announced February 2020.
-
Monitoring through many eyes: Integrating disparate datasets to improve monitoring of the Great Barrier Reef
Authors:
Erin E Peterson,
Edgar Santos-Fernández,
Carla Chen,
Sam Clifford,
Julie Vercelloni,
Alan Pearse,
Ross Brown,
Bryce Christensen,
Allan James,
Ken Anthony,
Jennifer Loder,
Manuel González-Rivero,
Chris Roelfsema,
M. Julian Caley,
Tomasz Bednarz,
Kerrie Mengersen
Abstract:
Numerous organisations collect data in the Great Barrier Reef (GBR), but they are rarely analysed together due to different program objectives, methods, and data quality. We developed a weighted spatiotemporal Bayesian model and used it to integrate image-based hard coral data collected by professional and citizen scientists, who captured and/or classified underwater images. We used the model to predict coral cover across the GBR with estimates of uncertainty, thus filling gaps in space and time where no data exist. Additional data increased the model's predictive ability by 43 percent, but did not affect model inferences about pressures (e.g. bleaching and cyclone damage). Thus, effective integration of professional and high-volume citizen data could enhance the capacity and cost efficiency of monitoring programs. This general approach is equally viable for other variables collected in the marine environment or other ecosystems, opening up new opportunities to integrate data and provide pathways for community engagement and stewardship.
Submitted 27 March, 2019; v1 submitted 15 August, 2018;
originally announced August 2018.