-
Model-assisted estimation of domain totals, areas, and densities in two-stage sample survey designs
Authors:
Hans-Erik Andersen,
Göran Ståhl,
Bruce D. Cook,
Douglas C. Morton,
Andrew O. Finley
Abstract:
Model-assisted, two-stage forest survey sampling designs provide a means to combine airborne remote sensing data, collected in a sampling mode, with field plot data to increase the precision of national forest inventory estimates, while maintaining important properties of design-based inventories, such as unbiased estimation and quantification of uncertainty. In this study, we present a comprehensive set of model-assisted estimators for domain-level attributes in a two-stage sampling design, including new estimators for densities, and compare the performance of these estimators with standard poststratified estimators. Simulation was used to assess the statistical properties (bias, variability) of these estimators, with both simple random and systematic sampling configurations, and indicated that 1) all estimators were generally unbiased, and 2) the use of lidar in a sampling mode increased the precision of the estimators at all assessed field sampling intensities, with particularly marked increases in precision at lower field sampling intensities. Variance estimators are generally unbiased for model-assisted estimators without poststratification, while model-assisted estimators with poststratification were increasingly biased as field sampling intensity decreased. In general, these results indicate that airborne remote sensing, collected in a sampling mode, can be used to increase the efficiency of national forest inventories.
Submitted 16 February, 2024;
originally announced February 2024.
-
A Tidy Framework and Infrastructure to Systematically Assemble Spatio-temporal Indexes from Multivariate Data
Authors:
H. Sherry Zhang,
Dianne Cook,
Ursula Laa,
Nicolas Langrené,
Patricia Menéndez
Abstract:
Indexes are useful for summarizing multivariate information into single metrics for monitoring, communicating, and decision-making. While most work has focused on defining new indexes for specific purposes, more attention needs to be directed towards making it possible to understand index behavior in different data conditions, and to determine how their structure affects their values and variation in values. Here we discuss a modular data pipeline recommendation to assemble indexes. It is universally applicable to index computation and allows investigation of index behavior as part of the development procedure. One can compute indexes with different parameter choices, adjust steps in the index definition by adding, removing, and swapping them to experiment with various index designs, calculate uncertainty measures, and assess index robustness. The paper presents three examples to illustrate the pipeline framework usage: comparison of two different indexes designed to monitor the spatio-temporal distribution of drought in Queensland, Australia; the effect of dimension reduction choices on the Global Gender Gap Index (GGGI) on countries ranking; and how to calculate bootstrap confidence intervals for the Standardized Precipitation Index (SPI). The methods are supported by a new R package, called tidyindex.
Submitted 13 May, 2024; v1 submitted 11 January, 2024;
originally announced January 2024.
-
Frame to frame interpolation for high-dimensional data visualisation using the woylier package
Authors:
Zoljargal Batsaikhan,
Dianne Cook,
Ursula Laa
Abstract:
The woylier package implements tour interpolation paths between frames using Givens rotations. This provides an alternative to the geodesic interpolation between planes currently available in the tourr package. Tours are used to visualise high-dimensional data and models, to detect clustering, anomalies and non-linear relationships. Frame-to-frame interpolation can be useful for projection pursuit guided tours when the index is not rotationally invariant. It also provides a way to specifically reach a given target frame. We demonstrate the method for exploring non-linear relationships between currency cross-rates.
Submitted 14 November, 2023;
originally announced November 2023.
-
A Clustering Algorithm to Organize Satellite Hotspot Data for the Purpose of Tracking Bushfires Remotely
Authors:
Weihao Li,
Emily Dodwell,
Dianne Cook
Abstract:
This paper proposes a spatiotemporal clustering algorithm and its implementation in the R package spotoroo. This work is motivated by the catastrophic bushfires in Australia throughout the summer of 2019-2020 and made possible by the availability of satellite hotspot data. The algorithm is inspired by two existing spatiotemporal clustering algorithms but makes enhancements to cluster points spatially in conjunction with their movement across consecutive time periods. It also allows for the adjustment of key parameters, if required, for different locations and satellite data sources. Bushfire data from Victoria, Australia, is used to illustrate the algorithm and its use within the package.
Submitted 21 August, 2023;
originally announced August 2023.
-
A Plot is Worth a Thousand Tests: Assessing Residual Diagnostics with the Lineup Protocol
Authors:
Weihao Li,
Dianne Cook,
Emi Tanaka,
Susan VanderPlas
Abstract:
Regression experts consistently recommend plotting residuals for model diagnosis, despite the availability of many numerical hypothesis test procedures designed to use residuals to assess problems with a model fit. Here we provide evidence for why this is good advice using data from a visual inference experiment. We show how conventional tests are too sensitive, which means that too often the conclusion would be that the model fit is inadequate. The experiment uses the lineup protocol which puts a residual plot in the context of null plots. This helps generate reliable and consistent reading of residual plots for better model diagnosis. It can also help in an obverse situation where a conventional test would fail to detect a problem with a model due to contaminated data. The lineup protocol also detects a range of departures from good residuals simultaneously. Supplemental materials for the article are available online.
Submitted 24 March, 2024; v1 submitted 11 August, 2023;
originally announced August 2023.
-
Multiple Imputation Approaches for Epoch-level Accelerometer data in Trials
Authors:
Mia S. Tackney,
Elizabeth Williamson,
Derek G. Cook,
Elizabeth Limb,
Tess Harris,
James Carpenter
Abstract:
Clinical trials that investigate interventions on physical activity often use accelerometers to measure step count at a very granular level, often in 5-second epochs. Participants typically wear the accelerometer for a week-long period at baseline, and for one or more week-long follow-up periods after the intervention. The data is usually aggregated to provide daily or weekly step counts for the primary analysis. Missing data are common as participants may not wear the device as per protocol. Approaches to handling missing data in the literature have largely defined missingness on the day level using a threshold on daily wear time, which leads to loss of information on the time of day when data are missing. We propose an approach to identifying and classifying missingness at the finer epoch-level, and then present two approaches to handling missingness. Firstly, we present a parametric approach which takes into account the number of missing epochs per day. Secondly, we describe a non-parametric approach to Multiple Imputation (MI) where missing periods during the day are replaced by donor data from the same person where possible, or data from a different person who is matched on demographic and physical activity-related variables. Our simulation studies comparing these approaches in a number of settings show that the non-parametric approach leads to estimates of the effect of treatment that are least biased while maintaining small standard errors. We illustrate the application of these different MI strategies to the analysis of the 2017 PACE-UP Trial. The proposed framework of classifying missingness and applying MI at the epoch-level is likely to be applicable to a number of different outcomes and data from other wearable devices.
Submitted 30 March, 2023;
originally announced March 2023.
-
Performance is not enough: the story told by a Rashomon quartet
Authors:
Przemyslaw Biecek,
Hubert Baniecki,
Mateusz Krzyzinski,
Dianne Cook
Abstract:
The usual goal of supervised learning is to find the best model, the one that optimizes a particular performance measure. However, what if the explanation provided by this model is completely different from another model and different again from another model despite all having similarly good fit statistics? Is it possible that the equally effective models put the spotlight on different relationships in the data? Inspired by Anscombe's quartet, this paper introduces a Rashomon Quartet, i.e. a set of four models built on a synthetic dataset which have practically identical predictive performance. However, the visual exploration reveals distinct explanations of the relations in the data. This illustrative example aims to encourage the use of methods for model visualization to compare predictive models beyond their performance.
Submitted 11 April, 2024; v1 submitted 26 February, 2023;
originally announced February 2023.
-
Models to support forest inventory and small area estimation using sparsely sampled LiDAR: A case study involving G-LiHT LiDAR in Tanana, Alaska
Authors:
Andrew O. Finley,
Hans-Erik Andersen,
Chad Babcock,
Bruce D. Cook,
Douglas C. Morton,
Sudipto Banerjee
Abstract:
A two-stage hierarchical Bayesian model is developed and implemented to estimate forest biomass density and total given sparsely sampled LiDAR and georeferenced forest inventory plot measurements. The model is motivated by the United States Department of Agriculture (USDA) Forest Service Forest Inventory and Analysis (FIA) objective to provide biomass estimates for the remote Tanana Inventory Unit (TIU) in interior Alaska. The proposed model yields stratum-level biomass estimates for arbitrarily sized areas. Model-based estimates are compared with the TIU FIA design-based post-stratified estimates. Model-based small area estimates (SAEs) for two experimental forests within the TIU are compared with each forest's design-based estimates generated using a dense network of independent inventory plots. Model parameter estimates and biomass predictions are informed using FIA plot measurements, LiDAR data that are spatially aligned with a subset of the FIA plots, and complete coverage remotely detected data used to define landuse/landcover stratum and percent forest canopy cover. Results support a model-based approach to estimating forest variables when inventory data are sparse or resources limit collection of enough data to achieve desired accuracy and precision using design-based methods.
Submitted 31 January, 2024; v1 submitted 13 February, 2023;
originally announced February 2023.
-
A Study on a User-Controlled Radial Tour for Variable Importance in High-Dimensional Data
Authors:
Nicholas Spyrison,
Dianne Cook,
Kim Marriott
Abstract:
Principal component analysis is a long-standing go-to method for exploring multivariate data. The principal components are linear combinations of the original variables, ordered by descending variance. The first few components typically provide a good visual summary of the data. Tours also make linear projections of the original variables but offer many different views, like examining the data from different directions. The grand tour shows a smooth sequence of projections as an animation following interpolations between random target bases. The manual radial tour rotates the selected variable's contribution into and out of a projection. This allows the importance of the variable to structure in the projection to be assessed. This work describes a mixed-design user study evaluating the radial tour's efficacy compared with principal component analysis and the grand tour. A supervised classification task is assigned to participants who evaluate variable attribution of the separation between two classes. Their accuracy in assigning the variable importance is measured across various factors. Data were collected from 108 crowdsourced participants, who performed two trials with each visual for 648 trials in total. Mixed model regression finds strong evidence that the radial tour results in a large increase in accuracy over the alternatives. Participants also reported a preference for the radial tour in comparison to the other two methods.
Submitted 30 December, 2022;
originally announced January 2023.
-
New and simplified manual controls for projection and slice tours, with application to exploring classification boundaries in high dimensions
Authors:
Ursula Laa,
Alex Aumann,
Dianne Cook,
German Valencia
Abstract:
This paper describes new user controls for examining high-dimensional data using low-dimensional linear projections and slices. A user can interactively change the contribution of a given variable to a low-dimensional projection, which is useful for exploring the sensitivity of structure to particular variables. The user can also interactively shift the center of a slice, for example, to explore how structure changes in local subspaces. The Mathematica package as well as example notebooks are provided, which contain functions enabling the user to experiment with these new manual controls, with one specifically for exploring regions and boundaries produced by classification models. The advantage of Mathematica is its linear algebra capabilities, and interactive cursor location controls. Some limited implementation has also been made available in the R package tourr.
Submitted 11 October, 2022;
originally announced October 2022.
-
Quantifying and correcting geolocation error in spaceborne LiDAR forest canopy observations using high spatial accuracy ALS: A Bayesian model approach
Authors:
Elliot S. Shannon,
Andrew O. Finley,
Daniel J. Hayes,
Sylvia N. Noralez,
Aaron R. Weiskittel,
Bruce D. Cook,
Chad Babcock
Abstract:
Geolocation error in spaceborne sampling light detection and ranging (LiDAR) measurements of forest structure can compromise forest attribute estimates and degrade integration with georeferenced field measurements or other remotely sensed data. Data integration is especially problematic when geolocation error is not well quantified. We propose a general model that uses airborne laser scanning (ALS) data to quantify and correct geolocation error in spaceborne sampling LiDAR. To illustrate the model, LiDAR data from NASA Goddard's LiDAR Hyperspectral & Thermal Imager (G-LiHT) was used with a subset of LiDAR data from NASA's Global Ecosystem Dynamics Investigation (GEDI). The model accommodates multiple canopy height metrics derived from a simulated GEDI footprint kernel using spatially coincident G-LiHT, and incorporates both additive and multiplicative mapping between the canopy height metrics generated from both datasets. A Bayesian implementation provides probabilistic uncertainty quantification in both parameter and geolocation error estimates. Results show a systematic geolocation error of 9.62 m in the southwest direction. In addition, estimated geolocation errors within GEDI footprints were highly variable, with results showing a ~0.45 probability the true footprint center is within 20 m. Estimating and correcting geolocation error via the model outlined here can help inform subsequent efforts to integrate spaceborne LiDAR data, like GEDI, with other georeferenced data.
Submitted 23 August, 2023; v1 submitted 23 September, 2022;
originally announced September 2022.
-
A Journey from Wild to Textbook Data to Reproducibly Refresh the Wages Data from the National Longitudinal Survey of Youth Database
Authors:
Dewi Amaliah,
Dianne Cook,
Emi Tanaka,
Kate Hyde,
Nicholas Tierney
Abstract:
Textbook data is essential for teaching statistics and data science methods because it is clean, allowing the instructor to focus on methodology. Ideally textbook data sets are refreshed regularly, especially when they are subsets taken from an on-going data collection. It is also important to use contemporary data for teaching, to imbue the sense that the methodology is relevant today. This paper describes the trials and tribulations of refreshing a textbook data set on wages, extracted from the National Longitudinal Survey of Youth (NLSY79) in the early 1990s. The data is useful for teaching modeling and exploratory analysis of longitudinal data. Subsets of NLSY79, including the wages data, can be found in supplementary files from numerous textbooks and research articles. The NLSY79 database has been continuously updated through to 2018, so new records are available. Here we describe our journey to refresh the wages data, and document the process so that the data can be regularly updated into the future. Our journey was difficult because the steps and decisions taken to get from the raw data to the wages textbook subset have not been clearly articulated. We have been diligent to provide a reproducible workflow for others to follow, which also hopefully inspires more attempts at refreshing data for teaching. Three new data sets and the code to produce them are provided in the open source R package called `yowie`.
Submitted 12 May, 2022;
originally announced May 2022.
-
Exploring Local Explanations of Nonlinear Models Using Animated Linear Projections
Authors:
Nicholas Spyrison,
Dianne Cook,
Przemyslaw Biecek
Abstract:
The increased predictive power of machine learning models comes at the cost of increased complexity and loss of interpretability, particularly in comparison to parametric statistical models. This trade-off has led to the emergence of eXplainable AI (XAI) which provides methods, such as local explanations (LEs) and local variable attributions (LVAs), to shed light on how a model uses predictors to arrive at a prediction. These provide a point estimate of the linear variable importance in the vicinity of a single observation. However, LVAs tend not to effectively handle association between predictors. To understand how the interaction between predictors affects the variable importance estimate, we can convert LVAs into linear projections and use the radial tour. This is also useful for learning how a model has made a mistake, or the effect of outliers, or the clustering of observations. The approach is illustrated with examples from categorical (penguin species, chocolate types) and quantitative (soccer/football salaries, house prices) response models. The methods are implemented in the R package cheem, available on CRAN.
Submitted 18 January, 2024; v1 submitted 11 May, 2022;
originally announced May 2022.
-
cubble: An R Package for Organizing and Wrangling Multivariate Spatio-temporal Data
Authors:
H. Sherry Zhang,
Dianne Cook,
Ursula Laa,
Nicolas Langrené,
Patricia Menéndez
Abstract:
Multivariate spatio-temporal data refers to multiple measurements taken across space and time. For many analyses, spatial and time components can be separately studied: for example, to explore the temporal trend of one variable for a single spatial location, or to model the spatial distribution of one variable at a given time. However, for some studies, it is important to analyse different aspects of the spatio-temporal data simultaneously, for instance, temporal trends of multiple variables across locations. In order to facilitate the study of different portions or combinations of spatio-temporal data, we introduce a new data structure, cubble, with a suite of functions enabling easy slicing and dicing on the different spatio-temporal components. The proposed cubble structure ensures that all the components of the data are easy to access and manipulate while providing flexibility for data analysis. In addition, cubble facilitates visual and numerical explorations of the data while easing data wrangling and modelling. The cubble structure and the functions provided in the cubble R package equip users with the capability to handle hierarchical spatial and temporal structures. The cubble structure and the tools implemented in the package are illustrated with different examples of Australian climate data.
Submitted 10 January, 2024; v1 submitted 30 April, 2022;
originally announced May 2022.
-
Absolute and Relative Bias in Eight Common Observational Study Designs: Evidence from a Meta-analysis
Authors:
Jelena Zurovac,
Thomas D. Cook,
John Deke,
Mariel M. Finucane,
Duncan Chaplin,
Jared S. Coopersmith,
Michael Barna,
Lauren Vollmer Forrow
Abstract:
Observational studies are needed when experiments are not possible. Within study comparisons (WSC) compare observational and experimental estimates that test the same hypothesis using the same treatment group, outcome, and estimand. Meta-analyzing 39 of them, we compare mean bias and its variance for the eight observational designs that result from combining whether there is a pretest measure of the outcome or not, whether the comparison group is local to the treatment group or not, and whether there is a relatively rich set of other covariates or not. Of these eight designs, one combines all three design elements, another has none, and the remainder include any one or two. We found that both the mean and variance of bias decline as design elements are added, with the lowest mean and smallest variance in a design with all three elements. The probability of bias falling within 0.10 standard deviations of the experimental estimate varied from 59 to 83 percent in Bayesian analyses and from 86 to 100 percent in non-Bayesian ones -- the ranges depending on the level of data aggregation. But confounding remains possible due to each of the eight observational study design cells including a different set of WSC studies.
Submitted 15 November, 2021; v1 submitted 12 November, 2021;
originally announced November 2021.
-
A Review of the State-of-the-Art on Tours for Dynamic Visualization of High-dimensional Data
Authors:
Stuart Lee,
Dianne Cook,
Natalia da Silva,
Ursula Laa,
Earo Wang,
Nick Spyrison,
H. Sherry Zhang
Abstract:
This article discusses a high-dimensional visualization technique called the tour, which can be used to view data in more than three dimensions. We review the theory and history behind the technique, as well as modern software developments and applications of the tour that are being found across the sciences and machine learning.
Submitted 19 April, 2021; v1 submitted 16 April, 2021;
originally announced April 2021.
-
Visual Diagnostics for Constrained Optimisation with Application to Guided Tours
Authors:
H. Sherry Zhang,
Dianne Cook,
Ursula Laa,
Nicolas Langrené,
Patricia Menéndez
Abstract:
A guided tour helps to visualise high-dimensional data by showing low-dimensional projections along a projection pursuit optimisation path. Projection pursuit is a generalisation of principal component analysis, in the sense that different indexes are used to define the interestingness of the projected data. While much work has been done in developing new indexes in the literature, less has been done on understanding the optimisation. Index functions can be noisy, might have multiple local maxima as well as an optimal maximum, and are constrained to generate orthonormal projection frames, which complicates the optimisation. In addition, projection pursuit is primarily used for exploratory data analysis, and finding the local maxima is also useful. The guided tour is especially useful for exploration, because it conducts geodesic interpolation connecting steps in the optimisation and shows how the projected data changes as a maximum is approached. This work provides new visual diagnostics for examining a choice of optimisation procedure, based on the provision of a new data object which collects information throughout the optimisation. It has helped to diagnose and fix several problems with the projection pursuit guided tour. This work might be useful more broadly for diagnosing optimisers, and comparing their performance. The diagnostics are implemented in the R package, ferrn.
Submitted 7 April, 2021;
originally announced April 2021.
-
Envelopes for multivariate linear regression with linearly constrained coefficients
Authors:
Dennis Cook,
Liliana Forzani,
Lan Liu
Abstract:
A constrained multivariate linear model is a multivariate linear model with the columns of its coefficient matrix constrained to lie in a known subspace. This class of models includes those typically used to study growth curves and longitudinal data. Envelope methods have been proposed to improve estimation efficiency in the class of unconstrained multivariate linear models, but have not yet been developed for constrained models; we develop them in this article. We first compare the standard envelope estimator based on an unconstrained multivariate model with the standard estimator arising from a constrained multivariate model in terms of bias and efficiency. Then, to further improve efficiency, we propose a novel envelope estimator based on a constrained multivariate model. Novel envelope-based testing methods are also proposed. We provide support for our proposals by simulations and by studying the classical dental data and data from the China Health and Nutrition Survey and a study of probiotic capacity to reduce Salmonella infection.
Submitted 2 January, 2021;
originally announced January 2021.
-
Casting Multiple Shadows: High-Dimensional Interactive Data Visualisation with Tours and Embeddings
Authors:
Stuart Lee,
Ursula Laa,
Dianne Cook
Abstract:
Non-linear dimensionality reduction (NLDR) methods such as t-distributed stochastic neighbour embedding (t-SNE) are ubiquitous in the natural sciences; however, the appropriate use of these methods is difficult because of their complex parameterisations, and analysts must make trade-offs in order to identify structure in the visualisation of an NLDR technique. We present visual diagnostics for the pragmatic usage of NLDR methods by combining them with a technique called the tour. A tour is a sequence of interpolated linear projections of multivariate data onto a lower dimensional space. The sequence is displayed as a dynamic visualisation, allowing a user to see the shadows the high-dimensional data casts in a lower dimensional view. By linking the tour to an NLDR view, we can preserve global structure and, through user interactions like linked brushing, observe where the NLDR view may be misleading. We display several case studies, from both simulations and single cell transcriptomics, that show our approach is useful for cluster orientation tasks.
Submitted 10 December, 2020;
originally announced December 2020.
-
brolgar: An R package to BRowse Over Longitudinal Data Graphically and Analytically in R
Authors:
Nicholas J Tierney,
Dianne Cook,
Tania Prvan
Abstract:
Longitudinal (panel) data provide the opportunity to examine temporal patterns of individuals, because measurements are collected on the same person at different, and often irregular, time points. The data is typically visualised using a "spaghetti plot", where a line plot is drawn for each individual. When overlaid in one plot, it can have the appearance of a bowl of spaghetti. With even a small number of subjects, these plots are too overloaded to be read easily. The interesting aspects of individual differences are lost in the noise. Longitudinal data is often modelled with a hierarchical linear model to capture the overall trends, and variation among individuals, while accounting for various levels of dependence. However, these models can be difficult to fit, and can miss unusual individual patterns. Better visual tools can help to diagnose longitudinal models, and better capture the individual experiences. This paper introduces the R package, brolgar (BRowse over Longitudinal data Graphically and Analytically in R), which provides tools to identify and summarise interesting individual patterns in longitudinal data.
Submitted 2 December, 2020;
originally announced December 2020.
-
Fundamentals of path analysis in the social sciences
Authors:
R. Dennis Cook,
Liliana Forzani
Abstract:
Motivated by a recent series of diametrically opposed articles on the relative value of statistical methods for the analysis of path diagrams in the social sciences, we discuss from a primarily theoretical perspective selected fundamental aspects of path modeling and analysis based on a common reflexive setting. Since there is a paucity of technical support evident in the debate, our aim is to connect it to mainline statistics literature and to address selected foundational issues that may help move the discourse. We do not intend to advocate for or against a particular method or analysis philosophy.
Submitted 12 November, 2020;
originally announced November 2020.
-
Enveloped Huber Regression
Authors:
Le Zhou,
R. Dennis Cook,
Hui Zou
Abstract:
Huber regression (HR) is a popular robust alternative to least squares regression when the error follows a heavy-tailed distribution. We propose a new method called the enveloped Huber regression (EHR) by considering the envelope assumption that there exists some subspace of the predictors that has no association with the response, which is referred to as the immaterial part. More efficient estimation is achieved via the removal of the immaterial part. Unlike the envelope least squares (ENV) model, whose estimation is based on maximum normal likelihood, the EHR model is estimated through the generalized method of moments. The asymptotic normality of the EHR estimator is established, and it is shown that EHR is more efficient than HR. Moreover, EHR is more efficient than ENV when the error distribution is heavy-tailed, while maintaining a small efficiency loss when the error distribution is normal. Our theory also covers the heteroscedastic case in which the error may depend on the covariates. Extensive simulation studies confirm the messages from the asymptotic theory. EHR is further illustrated on a real dataset.
Submitted 30 October, 2020;
originally announced November 2020.
-
Visualizing probability distributions across bivariate cyclic temporal granularities
Authors:
Sayani Gupta,
Rob J Hyndman,
Dianne Cook,
Antony Unwin
Abstract:
Deconstructing a time index into time granularities can assist in exploration and automated analysis of large temporal data sets. This paper describes classes of time deconstructions using linear and cyclic time granularities. Linear granularities respect the linear progression of time such as hours, days, weeks and months. Cyclic granularities can be circular such as hour-of-the-day, quasi-circular such as day-of-the-month, and aperiodic such as public holidays. The hierarchical structure of granularities creates a nested ordering: hour-of-the-day and second-of-the-minute are single-order-up. Hour-of-the-week is multiple-order-up, because it passes over day-of-the-week. Methods are provided for creating all possible granularities for a time index. A recommendation algorithm provides an indication whether a pair of granularities can be meaningfully examined together (a "harmony"), or when they cannot (a "clash").
Time granularities can be used to create data visualizations to explore for periodicities, associations and anomalies. The granularities form categorical variables (ordered or unordered) which induce groupings of the observations. Assuming a numeric response variable, the resulting graphics are then displays of distributions compared across combinations of categorical variables.
The methods implemented in the open source R package `gravitas` are consistent with a tidy workflow, with probability distributions examined using the range of graphics available in `ggplot2`.
Submitted 2 October, 2020;
originally announced October 2020.
-
Burning sage: Reversing the curse of dimensionality in the visualization of high-dimensional data
Authors:
Ursula Laa,
Dianne Cook,
Stuart Lee
Abstract:
In high-dimensional data analysis, the curse of dimensionality implies that points tend to be far away from the center of the distribution and on the edge of high-dimensional space. In contrast, projected data tends to clump at the center. This gives a sense that any structure near the center of the projection is obscured, whether this is true or not. A transformation to reverse the curse is defined in this paper, which uses radial transformations on the projected data. It is integrated seamlessly into the grand tour algorithm, and we have called it a burning sage tour, to indicate that it reverses the curse. The work is implemented in the tourr package in R. Several case studies are included that show how the sage visualizations enhance exploratory clustering and classification problems.
Submitted 23 September, 2020;
originally announced September 2020.
-
Transfer Learning for Activity Recognition in Mobile Health
Authors:
Yuchao Ma,
Andrew T. Campbell,
Diane J. Cook,
John Lach,
Shwetak N. Patel,
Thomas Ploetz,
Majid Sarrafzadeh,
Donna Spruijt-Metz,
Hassan Ghasemzadeh
Abstract:
While activity recognition from inertial sensors holds potential for mobile health, differences in sensing platforms and user movement patterns cause performance degradation. Aiming to address these challenges, we propose a transfer learning framework, TransFall, for sensor-based activity recognition. TransFall's design contains a two-tier data transformation, a label estimation layer, and a model generation layer to recognize activities for the new scenario. We validate TransFall analytically and empirically.
Submitted 12 July, 2020;
originally announced July 2020.
-
Multi-Source Deep Domain Adaptation with Weak Supervision for Time-Series Sensor Data
Authors:
Garrett Wilson,
Janardhan Rao Doppa,
Diane J. Cook
Abstract:
Domain adaptation (DA) offers a valuable means to reuse data and models for new problem domains. However, robust techniques have not yet been considered for time series data with varying amounts of data availability. In this paper, we make three main contributions to fill this gap. First, we propose a novel Convolutional deep Domain Adaptation model for Time Series data (CoDATS) that significantly improves accuracy and training time over state-of-the-art DA strategies on real-world sensor data benchmarks. By utilizing data from multiple source domains, we increase the usefulness of CoDATS to further improve accuracy over prior single-source methods, particularly on complex time series datasets that have high variability between domains. Second, we propose a novel Domain Adaptation with Weak Supervision (DA-WS) method by utilizing weak supervision in the form of target-domain label distributions, which may be easier to collect than additional data labels. Third, we perform comprehensive experiments on diverse real-world datasets to evaluate the effectiveness of our domain adaptation and weak supervision methods. Results show that CoDATS for single-source DA significantly improves over the state-of-the-art methods, and we achieve additional improvements in accuracy using data from multiple source domains and weakly supervised signals. Code is available at: https://github.com/floft/codats
Submitted 22 May, 2020;
originally announced May 2020.
-
Hole or grain? A Section Pursuit Index for Finding Hidden Structure in Multiple Dimensions
Authors:
Ursula Laa,
Dianne Cook,
Andreas Buja,
German Valencia
Abstract:
Multivariate data is often visualized using linear projections, produced by techniques such as principal component analysis, linear discriminant analysis, and projection pursuit. A problem with projections is that they obscure low and high density regions near the center of the distribution. Sections, or slices, can help to reveal them. This paper develops a section pursuit method, building on the extensive work in projection pursuit, to search for interesting slices of the data. Linear projections are used to define sections of the parameter space, and to calculate interestingness by comparing the distribution of observations inside and outside a section. By optimizing this index, it is possible to reveal features such as holes (low density) or grains (high density). The optimization is incorporated into a guided tour so that the search for structure can be dynamic. The approach can be useful for problems where data distributions depart from uniform or normal, as in visually exploring nonlinear manifolds, and functions in multivariate space. Two applications of section pursuit are shown: exploring decision boundaries from classification models, and exploring subspaces induced by complex inequality conditions from a multiple-parameter model. The new methods are available in R, in the tourr package.
Submitted 10 March, 2022; v1 submitted 28 April, 2020;
originally announced April 2020.
-
A slice tour for finding hollowness in high-dimensional data
Authors:
Ursula Laa,
Dianne Cook,
German Valencia
Abstract:
Taking projections of high-dimensional data is a common analytical and visualisation technique in statistics for working with high-dimensional problems. Sectioning, or slicing, through high dimensions is less common, but can be useful for visualising data with concavities, or non-linear structure. It is associated with conditional distributions in statistics, and also linked brushing between plots in interactive data visualisation. This short technical note describes a simple approach for slicing in the orthogonal space of projections obtained when running a tour, thus presenting the viewer with an interpolated sequence of sliced projections. The method has been implemented in R as an extension to the tourr package, and can be used to explore for concave and non-linear structures in multivariate distributions.
Submitted 23 October, 2019;
originally announced October 2019.
-
A Survey of Techniques All Classifiers Can Learn from Deep Networks: Models, Optimizations, and Regularization
Authors:
Alireza Ghods,
Diane J Cook
Abstract:
Deep neural networks have introduced novel and useful tools to the machine learning community. Other types of classifiers can potentially make use of these tools as well to improve their performance and generality. This paper reviews the current state of the art for deep learning classifier technologies that are being used outside of deep neural networks. Non-network classifiers can employ many components found in deep neural network architectures. In this paper, we review the feature learning, optimization, and regularization methods that form a core of deep network technologies. We then survey non-neural network learning algorithms that make innovative use of these methods to improve classification. Because many opportunities and challenges still exist, we discuss directions that can be pursued to expand the area of deep learning for a variety of classification algorithms.
Submitted 27 September, 2019; v1 submitted 10 September, 2019;
originally announced September 2019.
-
Conjugate Nearest Neighbor Gaussian Process Models for Efficient Statistical Interpolation of Large Spatial Data
Authors:
Shinichiro Shirota,
Andrew O. Finley,
Bruce D. Cook,
Sudipto Banerjee
Abstract:
A key challenge in spatial statistics is the analysis of massive spatially-referenced data sets. Such analyses often proceed from Gaussian process specifications that can produce rich and robust inference, but involve dense covariance matrices that lack computationally exploitable structures. The matrix computations required for fitting such models involve floating point operations in cubic order of the number of spatial locations and dynamic memory storage in quadratic order. Recent developments in spatial statistics offer a variety of massively scalable approaches. Bayesian inference and hierarchical models, in particular, have gained popularity due to their richness and flexibility in accommodating spatial processes. Our current contribution is to provide computationally efficient exact algorithms for spatial interpolation of massive data sets using scalable spatial processes. We combine low-rank Gaussian processes with efficient sparse approximations. Following recent work by [1], we model the low-rank process using a Gaussian predictive process (GPP) and the residual process as a sparsity-inducing nearest-neighbor Gaussian process (NNGP). A key contribution here is to implement these models using exact conjugate Bayesian modeling to avoid expensive iterative algorithms. Through simulation studies, we evaluate the performance of the proposed approach and the robustness of our models, especially for long range prediction. We apply our approaches to remotely sensed light detection and ranging (LiDAR) data collected over the US Forest Service Tanana Inventory Unit (TIU) in a remote portion of Interior Alaska.
Submitted 23 July, 2019;
originally announced July 2019.
-
Multi-Purposing Domain Adaptation Discriminators for Pseudo Labeling Confidence
Authors:
Garrett Wilson,
Diane J. Cook
Abstract:
Often domain adaptation is performed using a discriminator (domain classifier) to learn domain-invariant feature representations so that a classifier trained on labeled source data will generalize well to unlabeled target data. A line of research stemming from semi-supervised learning uses pseudo labeling to directly generate "pseudo labels" for the unlabeled target data and trains a classifier on the now-labeled target data, where the samples are selected or weighted based on some measure of confidence. In this paper, we propose multi-purposing the discriminator to not only aid in producing domain-invariant representations but also to provide pseudo labeling confidence.
Submitted 17 July, 2019;
originally announced July 2019.
-
Using tours to visually investigate properties of new projection pursuit indexes with application to problems in physics
Authors:
Ursula Laa,
Dianne Cook
Abstract:
Projection pursuit is used to find interesting low-dimensional projections of high-dimensional data by optimizing an index over all possible projections. Most indexes have been developed to detect departure from known distributions, such as normality, or to find separations between known groups. Here, we are interested in finding projections revealing potentially complex bivariate patterns, using new indexes constructed from scagnostics and a maximum information coefficient, with a purpose to detect unusual relationships between model parameters describing physics phenomena. The performance of these indexes is examined with respect to ideal behaviour, using simulated data, and then applied to problems from gravitational wave astronomy. The implementation builds upon the projection pursuit tools available in the R package, tourr, with indexes constructed from code in the R packages, scagnostics, minerva and mbgraphic.
Submitted 13 January, 2020; v1 submitted 31 January, 2019;
originally announced February 2019.
-
A new tidy data structure to support exploration and modeling of temporal data
Authors:
Earo Wang,
Dianne Cook,
Rob J Hyndman
Abstract:
Mining temporal data for information is often inhibited by a multitude of formats: irregular or multiple time intervals, point events that need aggregating, multiple observational units or repeated measurements on multiple individuals, and heterogeneous data types. On the other hand, the software supporting time series modeling and forecasting makes strict assumptions on the data to be provided, typically requiring a matrix of numeric data with implicit time indexes. Going from raw data to model-ready data is painful. This work presents a cohesive and conceptual framework for organizing and manipulating temporal data, which in turn flows into visualization, modeling and forecasting routines. Tidy data principles are extended to temporal data by: (1) mapping the semantics of a dataset into its physical layout; (2) including an explicitly declared index variable representing time; (3) incorporating a "key" comprising single or multiple variables to uniquely identify units over time. This tidy data representation most naturally supports thinking of operations on the data as building blocks, forming part of a "data pipeline" in time-based contexts. A sound data pipeline facilitates a fluent workflow for analyzing temporal data. The infrastructure of tidy temporal data has been implemented in the R package "tsibble".
Submitted 13 February, 2019; v1 submitted 29 January, 2019;
originally announced January 2019.
-
A Survey of Unsupervised Deep Domain Adaptation
Authors:
Garrett Wilson,
Diane J. Cook
Abstract:
Deep learning has produced state-of-the-art results for a variety of tasks. While such approaches for supervised learning have performed well, they assume that training and testing data are drawn from the same distribution, which may not always be the case. As a complement to this challenge, single-source unsupervised domain adaptation can handle situations where a network is trained on labeled data from a source domain and unlabeled data from a related but different target domain with the goal of performing well at test-time on the target domain. Many single-source and typically homogeneous unsupervised deep domain adaptation approaches have thus been developed, combining the powerful, hierarchical representations from deep learning with domain adaptation to reduce reliance on potentially-costly target data labels. This survey will compare these approaches by examining alternative methods, the unique and common elements, results, and theoretical insights. We follow this with a look at application areas and open research directions.
Submitted 6 February, 2020; v1 submitted 6 December, 2018;
originally announced December 2018.
-
Calendar-based graphics for visualizing people's daily schedules
Authors:
Earo Wang,
Dianne Cook,
Rob J Hyndman
Abstract:
Calendars are broadly used in society to display temporal information and events. This paper describes a new R package with functionality to organize and display temporal data, collected at sub-daily resolution, in a calendar layout. The function `frame_calendar` uses linear algebra on the date variable to restructure data into a format lending itself to calendar layouts. The user can apply the grammar of graphics to create plots inside each calendar cell, and thus the displays synchronize neatly with ggplot2 graphics. The motivating application is studying pedestrian behavior in Melbourne, Australia, based on counts which are captured at hourly intervals by sensors scattered around the city. Faceting by the usual features, such as day and month, was insufficient to examine the behavior. Making displays in a monthly calendar format helps to understand pedestrian patterns relative to events such as work days, weekends, holidays, and special events. The layout algorithm has several format options and variations. It is implemented in the R package sugrrants.
Submitted 22 October, 2018;
originally announced October 2018.
-
A convex formulation for high-dimensional sparse sliced inverse regression
Authors:
Kean Ming Tan,
Zhaoran Wang,
Tong Zhang,
Han Liu,
R. Dennis Cook
Abstract:
Sliced inverse regression is a popular tool for sufficient dimension reduction, which replaces covariates with a minimal set of their linear combinations without loss of information on the conditional distribution of the response given the covariates. The estimated linear combinations include all covariates, making results difficult to interpret and perhaps unnecessarily variable, particularly when the number of covariates is large. In this paper, we propose a convex formulation for fitting sparse sliced inverse regression in high dimensions. Our proposal estimates the subspace of the linear combinations of the covariates directly and performs variable selection simultaneously. We solve the resulting convex optimization problem via the linearized alternating direction method of multipliers algorithm, and establish an upper bound on the subspace distance between the estimated and the true subspaces. Through numerical studies, we show that our proposal is able to identify the correct covariates in the high-dimensional setting.
Submitted 17 September, 2018;
originally announced September 2018.
-
Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations
Authors:
Nicholas J Tierney,
Dianne H Cook
Abstract:
Despite the large body of research on missing value distributions and imputation, there is comparatively little literature with a focus on how to make it easy to handle, explore, and impute missing values in data. This paper addresses this gap. The new methodology builds upon tidy data principles, with the goal of integrating missing value handling as a key part of data analysis workflows. We define a new data structure, and a suite of new operations. Together, these provide a connected framework for handling, exploring, and imputing missing values. These methods are available in the R package `naniar`.
Submitted 14 May, 2020; v1 submitted 6 September, 2018;
originally announced September 2018.
-
A Projection Pursuit Forest Algorithm for Supervised Classification
Authors:
Natalia da Silva,
Dianne Cook,
Eun-Kyung Lee
Abstract:
This paper presents a new ensemble learning method for classification problems called projection pursuit random forest (PPF). PPF uses the PPtree algorithm introduced in Lee et al. (2013). In PPF, trees are constructed by splitting on linear combinations of randomly chosen variables. Projection pursuit is used to choose a projection of the variables that best separates the classes. Utilizing linear combinations of variables to separate classes takes the correlation between variables into account, which allows PPF to outperform a traditional random forest when separations between groups occur in combinations of variables.
The method presented here can be used in multi-class problems and is implemented in an R (R Core Team, 2018) package, PPforest, which is available on CRAN, with development versions at https://github.com/natydasilva/PPforest.
Submitted 25 July, 2018; v1 submitted 18 July, 2018;
originally announced July 2018.
-
Spatial Factor Models for High-Dimensional and Large Spatial Data: An Application in Forest Variable Mapping
Authors:
Daniel Taylor-Rodriguez,
Andrew O. Finley,
Abhirup Datta,
Chad Babcock,
Hans-Erik Andersen,
Bruce D. Cook,
Douglas C. Morton,
Sudipto Banerjee
Abstract:
Gathering information about forest variables is an expensive and arduous activity. As such, directly collecting the data required to produce high-resolution maps over large spatial domains is infeasible. Next generation collection initiatives of remotely sensed Light Detection and Ranging (LiDAR) data are specifically aimed at producing complete-coverage maps over large spatial domains. Given that LiDAR data and forest characteristics are often strongly correlated, it is possible to make use of the former to model, predict, and map forest variables over regions of interest. This entails dealing with the high-dimensional ($\sim 10^2$) spatially dependent LiDAR outcomes over a large number of locations ($\sim 10^5$ to $10^6$). With this in mind, we develop the Spatial Factor Nearest Neighbor Gaussian Process (SF-NNGP) model, and embed it in a two-stage approach that connects the spatial structure found in LiDAR signals with forest variables. We provide a simulation experiment that demonstrates inferential and predictive performance of the SF-NNGP, and use the two-stage modeling strategy to generate complete-coverage maps of forest variables with associated uncertainty over a large region of boreal forests in interior Alaska.
Submitted 8 November, 2018; v1 submitted 6 January, 2018;
originally announced January 2018.
-
Multivariate Design of Experiments for Engineering Dimensional Analysis
Authors:
Daniel J. Eck,
Christopher J. Nachtsheim,
R. Dennis Cook,
Thomas A. Albrecht
Abstract:
We consider the design of dimensional analysis experiments when there is more than a single response. We first give a brief overview of dimensional analysis experiments and the dimensional analysis (DA) procedure. The validity of the DA method for univariate responses was established by the Buckingham $\Pi$ theorem in the early 20th century. We extend the theorem to the multivariate case, develop basic criteria for multivariate design of DA and give guidelines for design construction. Finally, we illustrate the construction of designs for DA experiments for an example involving the design of a heat exchanger.
Submitted 7 August, 2018; v1 submitted 4 August, 2017;
originally announced August 2017.
-
Geostatistical estimation of forest biomass in interior Alaska combining Landsat-derived tree cover, sampled airborne lidar and field observations
Authors:
Chad Babcock,
Andrew O. Finley,
Hans-Erik Andersen,
Robert Pattison,
Bruce D. Cook,
Douglas C. Morton,
Michael Alonzo,
Ross Nelson,
Timothy Gregoire,
Liviu Ene,
Terje Gobakken,
Erik Næsset
Abstract:
The goal of this research was to develop and examine the performance of a geostatistical coregionalization modeling approach for combining field inventory measurements, strip samples of airborne lidar and Landsat-based remote sensing data products to predict aboveground biomass (AGB) in interior Alaska's Tanana Valley. The proposed modeling strategy facilitates pixel-level mapping of AGB density predictions across the entire spatial domain. Additionally, the coregionalization framework allows for statistically sound estimation of total AGB for arbitrary areal units within the study area, a key advance to support diverse management objectives in interior Alaska. This research focuses on appropriate characterization of prediction uncertainty in the form of posterior predictive coverage intervals and standard deviations. Using the framework detailed here, it is possible to quantify estimation uncertainty for any spatial extent, ranging from pixel-level predictions of AGB density to estimates of AGB stocks for the full domain. The lidar-informed coregionalization models consistently outperformed their counterpart lidar-free models in terms of point-level predictive performance and total AGB precision. Additionally, the inclusion of Landsat-derived forest cover as a covariate further improved estimation precision in regions with lower lidar sampling intensity. Our findings also demonstrate that model-based approaches that do not explicitly account for residual spatial dependence can grossly underestimate uncertainty, resulting in falsely precise estimates of AGB. On the other hand, in a geostatistical setting, residual spatial structure can be modeled within a Bayesian hierarchical framework to obtain statistically defensible assessments of uncertainty for AGB estimates.
Submitted 20 December, 2017; v1 submitted 9 May, 2017;
originally announced May 2017.
-
Interactive Graphics for Visually Diagnosing Forest Classifiers in R
Authors:
Natalia da Silva,
Dianne Cook,
Eun-Kyung Lee
Abstract:
This paper describes structuring data and constructing plots to explore forest classification models interactively. A forest classifier is an example of an ensemble, produced by bagging multiple trees. The process of bagging and combining results from multiple trees produces numerous diagnostics which, with interactive graphics, can provide a lot of insight into class structure in high dimensions. Various aspects are explored in this paper, to assess model complexity, individual model contributions, variable importance and dimension reduction, and uncertainty in prediction associated with individual observations. The ideas are applied to the random forest algorithm, and to the projection pursuit forest, but could be more broadly applied to other bagged ensembles. Interactive graphics are built in R, using the ggplot2, plotly, and shiny packages.
Submitted 8 April, 2017;
originally announced April 2017.
-
Combining Envelope Methodology and Aster Models for Variance Reduction in Life History Analyses
Authors:
Daniel J. Eck,
Charles J. Geyer,
R. Dennis Cook
Abstract:
Precise estimation of expected Darwinian fitness, the expected lifetime number of offspring of an organism, is a central component of life history analysis. The aster model serves as a defensible statistical model for distributions of Darwinian fitness. The aster model is equipped to incorporate the major life stages an organism travels through, which separately may affect Darwinian fitness. Envelope methodology reduces asymptotic variability by establishing a link between unknown parameters of interest and the asymptotic covariance matrices of their estimators. It is known both theoretically and in applications that incorporation of envelope methodology reduces asymptotic variability. We develop an envelope framework, including a new envelope estimator, that is appropriate for aster analyses. The level of precision provided by our methods allows researchers to draw stronger conclusions about the driving forces of Darwinian fitness from their life history analyses than they could with the aster model alone. Our methods are illustrated on a simulated dataset and a life history analysis of Mimulus guttatus flowers is provided. Useful variance reduction is obtained in both analyses.
Submitted 27 February, 2018; v1 submitted 26 January, 2017;
originally announced January 2017.
-
Weighted envelope estimation to handle variability in model selection
Authors:
Daniel J. Eck,
R. Dennis Cook
Abstract:
Envelope methodology can provide substantial efficiency gains in multivariate statistical problems, but in some applications the estimation of the envelope dimension can induce selection volatility that may mitigate those gains. Current envelope methodology does not account for the added variance that can result from this selection. In this article, we circumvent dimension selection volatility through the development of a weighted envelope estimator. Theoretical justification is given for our estimator and validity of the residual bootstrap for estimating its asymptotic variance is established. A simulation study and an analysis on a real data set illustrate the utility of our weighted envelope estimator.
Submitted 14 April, 2017; v1 submitted 3 January, 2017;
originally announced January 2017.
-
Matrix-Variate Regressions and Envelope Models
Authors:
Shanshan Ding,
R. Dennis Cook
Abstract:
Modern technology often generates data with complex structures in which both response and explanatory variables are matrix-valued. Existing methods in the literature are able to tackle matrix-valued predictors but are rather limited for matrix-valued responses. In this article, we study matrix-variate regressions for such data, where the response Y on each experimental unit is a random matrix and the predictor X can be either a scalar, a vector, or a matrix, treated as non-stochastic in terms of the conditional distribution Y|X. We propose models for matrix-variate regressions and then develop envelope extensions of these models. Under the envelope framework, redundant variation can be eliminated in estimation and the number of parameters can be notably reduced when the matrix-variate dimension is large, possibly resulting in significant gains in efficiency. The proposed methods are applicable to high dimensional settings.
Submitted 30 July, 2017; v1 submitted 5 May, 2016;
originally announced May 2016.
-
Joint hierarchical models for sparsely sampled high-dimensional LiDAR and forest variables
Authors:
Andrew O. Finley,
Sudipto Banerjee,
Yuzhen Zhou,
Bruce D. Cook,
Chad Babcock
Abstract:
Recent advancements in remote sensing technology, specifically Light Detection and Ranging (LiDAR) sensors, provide the data needed to quantify forest characteristics at a fine spatial resolution over large geographic domains. From an inferential standpoint, there is interest in prediction and interpolation of the often sparsely sampled and spatially misaligned LiDAR signals and forest variables. We propose a fully process-based Bayesian hierarchical model for above ground biomass (AGB) and LiDAR signals. The process-based framework offers richness in inferential capabilities, e.g., inference on the entire underlying processes instead of estimates only at pre-specified points. Key challenges we obviate include misalignment between the AGB observations and LiDAR signals and the high-dimensionality in the model emerging from LiDAR signals in conjunction with the large number of spatial locations. We offer simulation experiments to evaluate our proposed models and also apply them to a challenging dataset comprising LiDAR and spatially coinciding forest inventory variables collected on the Penobscot Experimental Forest (PEF), Maine. Our key substantive contributions include AGB data products with associated measures of uncertainty for the PEF and, more broadly, a methodology that should find use in a variety of current and upcoming forest variable mapping efforts using sparsely sampled remotely sensed high-dimensional data.
Submitted 5 December, 2016; v1 submitted 23 March, 2016;
originally announced March 2016.
-
Algorithms for Envelope Estimation II
Authors:
Dennis Cook,
Liliana Forzani,
Zhihua Su
Abstract:
We propose a new algorithm for envelope estimation, along with a new root-n consistent method for computing starting values. The new algorithm, which does not require optimization over a Grassmannian, is shown by simulation to be much faster and typically more accurate than the best existing algorithm proposed by Cook and Zhang (2015c).
Submitted 12 September, 2015;
originally announced September 2015.
-
Model Choice and Diagnostics for Linear Mixed-Effects Models Using Statistics on Street Corners
Authors:
Adam Loy,
Heike Hofmann,
Dianne Cook
Abstract:
The complexity of linear mixed-effects (LME) models means that traditional diagnostics are rendered less effective. This is due to a breakdown of asymptotic results, boundary issues, and visible patterns in residual plots that are introduced by the model fitting process. Some of these issues are well known and adjustments have been proposed. Working with LME models typically requires that the analyst keeps track of all the special circumstances that may arise. In this paper we illustrate a simpler but generally applicable approach to diagnosing LME models. We explain how to use new visual inference methods for these purposes. The approach provides a unified framework for diagnosing LME fits and for model selection. We illustrate the use of this approach on several commonly available data sets. A large-scale Amazon Turk study was used to validate the methods. R code is provided for the analyses.
Submitted 6 December, 2016; v1 submitted 24 February, 2015;
originally announced February 2015.
-
Enabling Interactivity on Displays of Multivariate Time Series and Longitudinal Data
Authors:
Xiaoyue Cheng,
Dianne Cook,
Heike Hofmann
Abstract:
Temporal data is information measured in the context of time. This contextual structure provides components that need to be explored to understand the data and that can form the basis of interactions applied to the plots. In multivariate time series we expect to see temporal dependence, long term and seasonal trends and cross-correlations. In longitudinal data we also expect within and between subject dependence. Time series and longitudinal data, although analyzed differently, are often plotted using similar displays. We provide a taxonomy of interactions on plots that can enable exploring temporal components of these data types, and describe how to build these interactions using data transformations. Because temporal data is often accompanied by other types of data, we also describe how to link the temporal plots with other displays of data. The ideas are conceptualized into a data pipeline for temporal data, and implemented in the R package cranvas. This package provides many different types of interactive graphics that can be used together to explore data or diagnose a model fit.
Submitted 20 December, 2014;
originally announced December 2014.
-
Dynamic spatial regression models for space-varying forest stand tables
Authors:
Andrew O. Finley,
Sudipto Banerjee,
Aaron R. Weiskittel,
Chad Babcock,
Bruce D. Cook
Abstract:
Many forest management planning decisions are based on information about the number of trees by species and diameter per unit area. This information is commonly summarized in a stand table, where a stand is defined as a group of forest trees of sufficiently uniform species composition, age, condition, or productivity to be considered a homogeneous unit for planning purposes. Typically, information used to construct stand tables is gleaned from observed subsets of the forest selected using a probability-based sampling design. Such sampling campaigns are expensive and hence only a small number of sample units are typically observed. This data paucity means that stand tables can only be estimated for relatively large areal units. Contemporary forest management planning and spatially explicit ecosystem models require stand table input at higher spatial resolution than can be affordably provided using traditional approaches. We propose a dynamic multivariate Poisson spatial regression model that accommodates both spatial correlation between observed diameter distributions and correlation between tree counts across diameter classes within each location. To improve fit and prediction at unobserved locations, diameter-specific intensities can be estimated using auxiliary data such as management history or remotely sensed information. The proposed model is used to analyze a diverse forest inventory dataset collected on the United States Forest Service Penobscot Experimental Forest in Bradley, Maine. Results demonstrate that explicitly modeling the residual spatial structure via a multivariate Gaussian process and incorporating information about forest structure from LiDAR covariates improve model fit and can provide high spatial resolution stand table maps with associated estimates of uncertainty.
Submitted 3 November, 2014;
originally announced November 2014.