-
Expected Points Above Average: A Novel NBA Player Metric Based on Bayesian Hierarchical Modeling
Authors:
Benjamin Williams,
Erin M. Schliep,
Bailey Fosdick,
Ryan Elmore
Abstract:
Team and player evaluation in professional sport is extremely important given the financial implications of success and failure. It is especially critical to identify and retain elite shooters in the National Basketball Association (NBA), one of the premier basketball leagues worldwide, because the ultimate goal of the game is to score more points than one's opponent. To this end, we propose two novel basketball metrics: "expected points" for team-based comparisons and "expected points above average (EPAA)" as a player-evaluation tool. Both metrics leverage posterior samples from a Bayesian hierarchical modeling framework to cluster teams and players based on their shooting propensities and abilities. We illustrate the concepts for the top 100 shot takers over the last decade and offer EPAA as an additional tool for evaluating players.
Submitted 16 May, 2024;
originally announced May 2024.
-
Analyzing whale calling through Hawkes process modeling
Authors:
Bokgyeong Kang,
Erin M. Schliep,
Alan E. Gelfand,
Tina M. Yack,
Christopher W. Clark,
Robert S. Schick
Abstract:
Sound is assumed to be the primary modality of communication among marine mammal species. Analyzing acoustic recordings helps to understand the function of the acoustic signals as well as the possible impact of anthropogenic noise on acoustic behavior. Motivated by a dataset from a network of hydrophones in Cape Cod Bay, Massachusetts, utilizing automatically detected calls in recordings, we study the communication process of the endangered North Atlantic right whale. For right whales, an "up-call" is known as a contact call, and ensuing counter-calling between individuals is presumed to facilitate group cohesion. We present novel spatiotemporal excitement modeling consisting of a background process and a counter-call process. The background process intensity incorporates the influences of diel patterns and ambient noise on occurrence. The counter-call intensity captures potential excitement, that is, calling that elicits further calling behavior. Call incidence is found to be clustered in space and time; a call seems to excite more calls nearer to it in time and space. We find evidence that whales make more calls during twilight hours, respond to other whales nearby, and are likely to remain quiet in the presence of increased ambient noise.
Submitted 18 April, 2024;
originally announced April 2024.
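As a rough illustration of the self-exciting structure described in this abstract, the Python sketch below evaluates a conditional intensity as a background rate plus contributions from earlier calls. The exponential decay in time, the Gaussian kernel in space, the constant background rate, and all names (`conditional_intensity`, `alpha`, `beta`, `sigma`) are assumptions made for illustration only. In the paper the background intensity depends on diel patterns and ambient noise, and its kernels may differ.

```python
import math
from typing import Sequence, Tuple

Event = Tuple[float, float, float]  # (time, x, y) of a detected call


def conditional_intensity(
    t: float,
    x: float,
    y: float,
    history: Sequence[Event],
    background: float,
    alpha: float,
    beta: float,
    sigma: float,
) -> float:
    """Background rate plus excitation from calls that occurred before time t.

    Assumed (hypothetical) kernel choices:
      temporal: beta * exp(-beta * lag), an exponential density in the time lag
      spatial:  isotropic 2-D Gaussian density with standard deviation sigma
    alpha scales how strongly each past call excites new calls.
    """
    excitation = 0.0
    for ti, xi, yi in history:
        if ti >= t:
            continue  # only strictly earlier calls can excite
        temporal = beta * math.exp(-beta * (t - ti))
        dist_sq = (x - xi) ** 2 + (y - yi) ** 2
        spatial = math.exp(-dist_sq / (2.0 * sigma**2)) / (2.0 * math.pi * sigma**2)
        excitation += alpha * temporal * spatial
    return background + excitation
```

Under these assumed kernels, the intensity right after a call is highest close to the caller and decays toward the background rate as the time lag or distance grows, which mirrors the clustering in space and time reported in the abstract.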
-
Assessing Marine Mammal Abundance: A Novel Data Fusion
Authors:
Erin M. Schliep,
Alan E. Gelfand,
Christopher W. Clark,
Charles M. Mayo,
Brigid McKenna,
Susan E. Parks,
Tina M. Yack,
Robert S. Schick
Abstract:
Marine mammals are increasingly vulnerable to human disturbance and climate change. Their diving behavior leads to limited visual access during data collection, making studying the abundance and distribution of marine mammals challenging. In theory, using data from more than one observation modality should lead to better informed predictions of abundance and distribution. With a focus on North Atlantic right whales, we consider the fusion of two data sources to learn about their abundance and distribution. The first source is aerial distance sampling, which provides the spatial locations of whales detected in the region. The second source is passive acoustic monitoring (PAM), returning calls received at hydrophones placed on the ocean floor. Due to limited time on the surface and detection limitations arising from sampling effort, aerial distance sampling only provides a partial realization of locations. With PAM, we never observe numbers or locations of individuals. To address these challenges, we develop a novel thinned point pattern data fusion. Our approach leads to improved inference regarding abundance and distribution of North Atlantic right whales throughout Cape Cod Bay, Massachusetts in the US. We demonstrate performance gains of our approach compared with inference from a single source through both simulation and real data.
Submitted 12 October, 2023;
originally announced October 2023.
-
A Review of Data-Driven Discovery for Dynamic Systems
Authors:
Joshua S. North,
Christopher K. Wikle,
Erin M. Schliep
Abstract:
Many real-world scientific processes are governed by complex nonlinear dynamic systems that can be represented by differential equations. Recently, there has been increased interest in learning, or discovering, the forms of the equations driving these complex nonlinear dynamic systems using data-driven approaches. In this paper, we review the current literature on data-driven discovery for dynamic systems. We provide a categorization of the different approaches for data-driven discovery and a unified mathematical framework to show the relationship between the approaches. Importantly, we discuss the role of statistics in the data-driven discovery field, describe a possible approach by which the problem can be cast in a statistical framework, and provide avenues for future work.
Submitted 19 October, 2022;
originally announced October 2022.
-
A Bayesian Approach for Spatio-Temporal Data-Driven Dynamic Equation Discovery
Authors:
Joshua S. North,
Christopher K. Wikle,
Erin M. Schliep
Abstract:
Differential equations based on physical principles are used to represent complex dynamic systems in all fields of science and engineering. Through repeated use in both academia and industry, these equations have been shown to represent real-world dynamics well. Since the true dynamics of these complex systems are generally unknown, learning the governing equations can improve our understanding of the mechanisms driving the systems. Here, we develop a Bayesian approach to data-driven discovery of non-linear spatio-temporal dynamic equations. Our approach can accommodate measurement noise and missing data, both of which are common in real-world data, and accounts for parameter uncertainty. The proposed framework is illustrated using three simulated systems with varying amounts of observational uncertainty and missing data and applied to a real-world system to infer the temporal evolution of the vorticity of the streamfunction.
Submitted 6 September, 2022;
originally announced September 2022.
-
A Bayesian Hidden Semi-Markov Model with Covariate-Dependent State Duration Parameters for High-Frequency Environmental Data
Authors:
Shirley Rojas-Salazar,
Erin M. Schliep,
Christopher K. Wikle,
Emily H. Stanley,
Stephen R. Carpenter,
Noah R. Lottig
Abstract:
Environmental time series data observed at high frequencies can be studied with approaches such as hidden Markov and semi-Markov models (HMM and HSMM). HSMMs extend the HMM by explicitly modeling the time spent in each state. In a discrete-time HSMM, the duration in each state can be modeled with a zero-truncated Poisson distribution, where the duration parameter may be state-specific but constant in time. We extend the HSMM by allowing the state-specific duration parameters to vary in time and model them as a function of known covariates observed over a period of time leading up to a state transition. In addition, we propose a data subsampling approach given that high-frequency data can violate the conditional independence assumption of the HSMM. We apply the model to high-frequency data collected by an instrumented buoy in Lake Mendota. We model the phycocyanin concentration, which is used in aquatic systems to estimate the relative abundance of blue-green algae, and identify important time-varying effects associated with the duration in each state.
Submitted 21 September, 2021;
originally announced September 2021.
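The zero-truncated Poisson duration distribution mentioned in this abstract has probability mass P(D = d) = λ^d e^{-λ} / (d! (1 − e^{-λ})) for d = 1, 2, ... The Python sketch below evaluates that mass function and shows one way a duration parameter could depend on covariates. The log link in `duration_rate` and all function and argument names are assumptions for illustration; the abstract does not state the link function the authors use.

```python
import math
from typing import Sequence


def ztp_pmf(d: int, lam: float) -> float:
    """Zero-truncated Poisson probability of a state duration of d time steps."""
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    if d < 1:
        return 0.0  # durations of zero are excluded by the truncation
    # Work on the log scale; -expm1(-lam) equals 1 - exp(-lam) without cancellation.
    log_p = d * math.log(lam) - lam - math.lgamma(d + 1) - math.log(-math.expm1(-lam))
    return math.exp(log_p)


def ztp_mean(lam: float) -> float:
    """Expected duration under the zero-truncated Poisson."""
    return lam / -math.expm1(-lam)


def duration_rate(
    intercept: float, coefficients: Sequence[float], covariates: Sequence[float]
) -> float:
    """Hypothetical log-link: lam = exp(intercept + sum_j coefficient_j * covariate_j)."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, covariates))
    return math.exp(linear)
```

For example, `ztp_pmf(3, duration_rate(0.0, [0.5], [2.0]))` gives the probability of a three-step stay when a single covariate takes the value 2.0 under the assumed link. Letting the covariate vector change over time is what makes the duration parameter time-varying.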
-
Correcting spatial Gaussian process parameter and prediction variance estimation under informative sampling
Authors:
Erin M. Schliep,
Christopher K. Wikle,
Ranadeep Daw
Abstract:
Informative sampling designs can impact spatial prediction, or kriging, in two important ways. First, the sampling design can bias spatial covariance parameter estimation, which in turn can bias spatial kriging estimates. Second, even with unbiased estimates of the spatial covariance parameters, since the kriging variance is a function of the observation locations, these estimates will vary based on the sample and overestimate the population-based estimates. In this work, we develop a weighted composite likelihood approach to improve spatial covariance parameter estimation under informative sampling designs. Then, given these parameter estimates, we propose three approaches to quantify the effects of the sampling design on the variance estimates in spatial prediction. These results can be used to make informed decisions for population-based inference. We illustrate our approaches using a comprehensive simulation study. Then, we apply our methods to perform spatial prediction on nitrate concentration in wells located throughout central California.
Submitted 27 August, 2021;
originally announced August 2021.
-
A Bayesian Hidden Semi-Markov Model with Covariate-Dependent State Duration Parameters for High-Frequency Data from Wearable Devices
Authors:
Shirley Rojas-Salazar,
Erin M. Schliep,
Christopher K. Wikle,
Matthew Hawkey
Abstract:
Data collected by wearable devices in sports provide valuable information about an athlete's behavior such as their activity, performance, and ability. These time series data can be studied with approaches such as hidden Markov and semi-Markov models (HMM and HSMM) for varied purposes including activity recognition and event detection. HSMMs extend the HMM by explicitly modeling the time spent in each state. In a discrete-time HSMM, the duration in each state can be modeled with a zero-truncated Poisson distribution, where the duration parameter may be state-specific but constant in time. We extend the HSMM by allowing the state-specific duration parameters to vary in time and model them as a function of known covariates derived from the wearable device and observed over a period of time leading up to a state transition. In addition, we propose a data subsampling approach given that high-frequency data from wearable devices can violate the conditional independence assumption of the HSMM. We apply the model to wearable device data collected on a soccer referee in a Major League Soccer game. We model the referee's physiological response to the game demands and identify important time-varying effects of these demands associated with the duration in each state.
Submitted 20 October, 2020;
originally announced October 2020.
-
Distributed lag models to identify the cumulative effects of training and recovery in athletes using multivariate ordinal wellness data
Authors:
Erin M. Schliep,
Toryn L. J. Schafer,
Matthew Hawkey
Abstract:
Subjective wellness data can provide important information on the well-being of athletes and be used to maximize player performance and detect and prevent injury. Wellness data, which are often ordinal and multivariate, include metrics relating to the physical, mental, and emotional status of the athlete. Training and recovery can have significant short- and long-term effects on athlete wellness, and these effects can vary across individuals. We develop a joint multivariate latent factor model for ordinal response data to investigate the effects of training and recovery on athlete wellness. We use a latent factor distributed lag model to capture the cumulative effects of training and recovery through time. Current efforts using subjective wellness data have averaged over these metrics to create a univariate summary of wellness; however, this approach can mask important information in the data. Our multivariate model leverages each ordinal variable and can be used to identify the relative importance of each in monitoring athlete wellness. The model is applied to athlete daily wellness, training, and recovery data collected across two Major League Soccer seasons.
Submitted 18 May, 2020;
originally announced May 2020.
-
On the spatial and temporal shift in the archetypal seasonal temperature cycle as driven by annual and semi-annual harmonics
Authors:
Joshua S. North,
Erin M. Schliep,
Christopher K. Wikle
Abstract:
Statistical methods are required to evaluate and quantify the uncertainty in environmental processes, such as land and sea surface temperature, in a changing climate. Typically, annual harmonics are used to characterize the variation in the seasonal temperature cycle. However, an often overlooked feature of the climate seasonal cycle is the semi-annual harmonic, which can account for a significant portion of the variance of the seasonal cycle and varies in amplitude and phase across space. Together, the spatial variation in the annual and semi-annual harmonics can play an important role in driving processes that are tied to seasonality (e.g., ecological and agricultural processes). We propose a multivariate spatio-temporal model to quantify the spatial and temporal change in minimum and maximum temperature seasonal cycles as a function of the annual and semi-annual harmonics. Our approach captures spatial dependence, temporal dynamics, and multivariate dependence of these harmonics through spatially and temporally varying coefficients. We apply the model to minimum and maximum temperature over North America for the years 1979 to 2018. Formal model inference within the Bayesian paradigm enables the identification of regions experiencing significant changes in minimum and maximum temperature seasonal cycles due to the relative effects of changes in the two harmonics.
Submitted 15 March, 2020;
originally announced March 2020.
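To make the annual and semi-annual harmonics in this abstract concrete, the Python sketch below recovers the amplitude and phase of the k-th harmonic from one complete, evenly spaced seasonal cycle at a single location. It is a simple Fourier projection and not the authors' Bayesian spatio-temporal model. The function name `harmonic`, the 365-day cycle, and the synthetic series are assumptions for illustration.

```python
import math
from typing import Sequence, Tuple


def harmonic(y: Sequence[float], k: int) -> Tuple[float, float]:
    """Amplitude and phase of the k-th harmonic of one full cycle of data.

    Assumes y covers exactly one period with equal spacing and no gaps, so the
    cosine and sine terms are orthogonal and the projection matches the least
    squares fit of A * cos(2*pi*k*t/n - phase). k=1 is the annual harmonic and
    k=2 the semi-annual harmonic when y is one year of daily values.
    """
    n = len(y)
    a = 2.0 / n * sum(v * math.cos(2.0 * math.pi * k * t / n) for t, v in enumerate(y))
    b = 2.0 / n * sum(v * math.sin(2.0 * math.pi * k * t / n) for t, v in enumerate(y))
    return math.hypot(a, b), math.atan2(b, a)


# Synthetic daily temperatures (illustrative values, not data from the paper).
days = 365
series = [
    10.0
    + 8.0 * math.cos(2.0 * math.pi * t / days - 0.5)
    + 3.0 * math.cos(4.0 * math.pi * t / days - 1.2)
    for t in range(days)
]
annual_amplitude, annual_phase = harmonic(series, 1)
semiannual_amplitude, semiannual_phase = harmonic(series, 2)
```

Each harmonic contributes half its squared amplitude to the variance of the seasonal cycle, so comparing these amplitudes indicates how much the semi-annual term matters at a location.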
-
Long-term Spatial Modeling for Characteristics of Extreme Heat Events
Authors:
Erin M. Schliep,
Alan E. Gelfand,
Jesus Abaurrea,
Jesus Asin,
Maria A. Beamonte,
Ana C. Cebrian
Abstract:
There is increasing evidence that global warming manifests itself in more frequent warm days and that heat waves will become more frequent. Presently, a formal definition of a heat wave is not agreed upon in the literature. To avoid this debate, we consider extreme heat events (EHEs), which, at a given location, are well-defined as a run of consecutive days above an associated local threshold. Characteristics of EHEs are of primary interest, such as incidence and duration, as well as the magnitude of the average exceedance and maximum exceedance above the threshold during the EHE.
Using approximately 60-year time series of daily maximum temperature data collected at 18 locations in a given region, we propose a spatio-temporal model to study the characteristics of EHEs over time. The model enables prediction of the behavior of EHE characteristics at unobserved locations within the region. Specifically, our approach employs a two-state space-time model for EHEs with local thresholds where one state defines above threshold daily maximum temperatures and the other below threshold temperatures. We show that our model is able to recover the EHE characteristics of interest and outperforms a corresponding autoregressive model that ignores thresholds based on out-of-sample prediction.
Submitted 29 June, 2020; v1 submitted 2 March, 2020;
originally announced March 2020.
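The definition of an extreme heat event in this abstract, a run of consecutive days above a local threshold, translates directly into a small computation. The Python sketch below extracts such runs from a daily maximum temperature series and records the duration, average exceedance, and maximum exceedance named in the abstract. The names `HeatEvent`, `extract_events`, and `min_run` are hypothetical, and the sketch only summarizes observed data; it is not the two-state space-time model the authors propose.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class HeatEvent:
    start: int              # index of the first day of the run
    duration: int           # number of consecutive days above the threshold
    mean_exceedance: float  # average of (temperature - threshold) over the run
    max_exceedance: float   # largest (temperature - threshold) in the run


def extract_events(
    tmax: Sequence[float], threshold: float, min_run: int = 1
) -> List[HeatEvent]:
    """Find runs of consecutive days with tmax strictly above the threshold."""
    if min_run < 1:
        raise ValueError("min_run must be at least 1")
    events: List[HeatEvent] = []
    run: List[float] = []
    start = 0
    # A sentinel that is never above the threshold closes a run reaching the last day.
    for day, temp in enumerate(list(tmax) + [float("-inf")]):
        if temp > threshold:
            if not run:
                start = day
            run.append(temp - threshold)
        elif run:
            if len(run) >= min_run:
                events.append(HeatEvent(start, len(run), sum(run) / len(run), max(run)))
            run = []
    return events
```

Incidence over a period is then the number of events returned, for example `len(extract_events(series, threshold))` for one location and its local threshold.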
-
Identifying and characterizing extrapolation in multivariate response data
Authors:
Meridith L Bartley,
Ephraim M Hanks,
Erin M Schliep,
Patricia A Soranno,
Tyler Wagner
Abstract:
Extrapolation is defined as making predictions beyond the range of the data used to estimate a statistical model. In ecological studies, it is not always obvious when and where extrapolation occurs because of the multivariate nature of the data. Previous work on identifying extrapolation has focused on univariate response data, but these methods are not directly applicable to multivariate response data, which are increasingly common in ecological investigations. In this paper, we extend previous work that identified extrapolation by applying the predictive variance from the univariate setting to the multivariate case. We illustrate our approach through an analysis of jointly modeled lake nutrients and indicators of algal biomass and water clarity in over 7000 inland lakes from across the Northeast and Mid-west US. In addition, we illustrate novel exploratory approaches for identifying regions of covariate space where extrapolation is more likely to occur using classification and regression trees.
Submitted 12 November, 2019; v1 submitted 17 June, 2019;
originally announced June 2019.
-
Multilevel latent Gaussian process model for mixed discrete and continuous multivariate response data
Authors:
Erin M. Schliep,
Jennifer A. Hoeting
Abstract:
We propose a Bayesian model for mixed ordinal and continuous multivariate data to evaluate a latent spatial Gaussian process. Our proposed model can be used in many contexts where mixed continuous and discrete multivariate responses are observed in an effort to quantify an unobservable continuous measurement. In our example, the latent, or unobservable, measurement is wetland condition. While predicted values of the latent wetland condition variable produced by the model at each location do not hold any intrinsic value, the relative magnitudes of the wetland condition values are of interest. In addition, by including point-referenced covariates in the model, we are able to make predictions at new locations for both the latent random variable and the multivariate response. Lastly, the model produces ranks of the multivariate responses in relation to the unobserved latent random field. This is an important result as it allows us to determine which response variables are most closely correlated with the latent variable. Our approach offers an alternative to traditional indices based on best professional judgment that are frequently used in ecology. We apply our model to assess wetland condition in the North Platte and Rio Grande River Basins in Colorado. The model facilitates a comparison of wetland condition at multiple locations and ranks the importance of in-field measurements.
Submitted 23 March, 2013; v1 submitted 18 May, 2012;
originally announced May 2012.