-
Testing for a difference in means of a single feature after clustering
Authors:
Yiqun T. Chen,
Lucy L. Gao
Abstract:
For many applications, it is critical to interpret and validate groups of observations obtained via clustering. A common validation approach involves testing differences in feature means between observations in two estimated clusters. In this setting, classical hypothesis tests lead to an inflated Type I error rate. To overcome this problem, we propose a new test for the difference in means in a single feature between a pair of clusters obtained using hierarchical or $k$-means clustering. The test based on the proposed $p$-value controls the selective Type I error rate in finite samples and can be efficiently computed. We further illustrate the validity and power of our proposal in simulation and demonstrate its use on single-cell RNA-sequencing data.
Submitted 27 November, 2023;
originally announced November 2023.
-
Negative binomial count splitting for single-cell RNA sequencing data
Authors:
Anna Neufeld,
Joshua Popp,
Lucy L. Gao,
Alexis Battle,
Daniela Witten
Abstract:
The analysis of single-cell RNA sequencing (scRNA-seq) data often involves fitting a latent variable model to learn a low-dimensional representation for the cells. Validating such a model poses a major challenge. If we could sequence the same set of cells twice, we could use one dataset to fit a latent variable model and the other to validate it. In reality, we cannot sequence the same set of cells twice. Poisson count splitting was recently proposed as a way to work backwards from a single observed Poisson data matrix to obtain independent Poisson training and test matrices that could have arisen from two independent sequencing experiments conducted on the same set of cells. However, the Poisson count splitting approach requires that the original data are exactly Poisson distributed: in the presence of any overdispersion, the resulting training and test datasets are not independent. In this paper, we introduce negative binomial count splitting, which extends Poisson count splitting to the more flexible negative binomial setting. Given an $n \times p$ dataset from a negative binomial distribution, we use Dirichlet-multinomial sampling to create two or more independent $n \times p$ negative binomial datasets. We show that this procedure outperforms Poisson count splitting in simulation, and apply it to validate clusters of kidney cells from a human fetal cell atlas.
Submitted 24 July, 2023;
originally announced July 2023.
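The Dirichlet-multinomial step mentioned in the abstract above can be sketched for a single count as follows. This is a minimal illustration, not the authors' implementation: the function name `nb_count_split`, the fold proportions `eps`, and the assumption that the negative binomial overdispersion (size) parameter `b` is known are all introduced here for the example. Under those assumptions, if the input count is negative binomial with mean mu and size b, fold k should be negative binomial with mean eps_k * mu and size eps_k * b, with the folds independent of each other.

```python
import random


def nb_count_split(x, b, eps=(0.5, 0.5), rng=None):
    """Split one count x into len(eps) folds via Dirichlet-multinomial sampling.

    x   : observed non-negative integer count
    b   : assumed-known negative binomial size (overdispersion) parameter
    eps : fold proportions, positive and summing to 1 (hypothetical name)
    """
    rng = rng or random.Random()
    # Draw fold probabilities from Dirichlet(eps_1 * b, ..., eps_K * b),
    # built by normalising independent gamma variables.
    gammas = [rng.gammavariate(e * b, 1.0) for e in eps]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    # Multinomial(x, probs): assign each of the x units to one fold.
    folds = [0] * len(eps)
    for k in rng.choices(range(len(eps)), weights=probs, k=x):
        folds[k] += 1
    return folds


if __name__ == "__main__":
    rng = random.Random(0)
    row = [0, 3, 12, 40]
    split = [nb_count_split(x, b=5.0, eps=(0.5, 0.5), rng=rng) for x in row]
    train = [s[0] for s in split]
    test = [s[1] for s in split]
    print(train, test)
```

Applying the function entry by entry to an $n \times p$ matrix gives the training and test matrices; in practice `b` would have to be estimated, which this sketch does not address.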
-
Generalized Data Thinning Using Sufficient Statistics
Authors:
Ameer Dharamshi,
Anna Neufeld,
Keshav Motwani,
Lucy L. Gao,
Daniela Witten,
Jacob Bien
Abstract:
Our goal is to develop a general strategy to decompose a random variable $X$ into multiple independent random variables, without sacrificing any information about unknown parameters. A recent paper showed that for some well-known natural exponential families, $X$ can be "thinned" into independent random variables $X^{(1)}, \ldots, X^{(K)}$, such that $X = \sum_{k=1}^K X^{(k)}$. These independent random variables can then be used for various model validation and inference tasks, including in contexts where traditional sample splitting fails. In this paper, we generalize their procedure by relaxing this summation requirement and simply asking that some known function of the independent random variables exactly reconstruct $X$. This generalization of the procedure serves two purposes. First, it greatly expands the families of distributions for which thinning can be performed. Second, it unifies sample splitting and data thinning, which on the surface seem to be very different, as applications of the same principle. This shared principle is sufficiency. We use this insight to perform generalized thinning operations for a diverse set of families.
Submitted 11 June, 2023; v1 submitted 22 March, 2023;
originally announced March 2023.
-
Necessary and sufficient conditions for multiple objective optimal regression designs
Authors:
Lucy L. Gao,
Jane J. Ye,
Shangzhi Zeng,
Julie Zhou
Abstract:
We typically construct optimal designs based on a single objective function. To better capture the breadth of an experiment's goals, we could instead construct a multiple objective optimal design based on multiple objective functions. While algorithms have been developed to find multi-objective optimal designs (e.g. efficiency-constrained and maximin optimal designs), it is far less clear how to verify the optimality of a solution obtained from an algorithm. In this paper, we provide theoretical results characterizing optimality for efficiency-constrained and maximin optimal designs on a discrete design space. We demonstrate how to use our results in conjunction with linear programming algorithms to verify optimality.
Submitted 8 March, 2023;
originally announced March 2023.
-
Data thinning for convolution-closed distributions
Authors:
Anna Neufeld,
Ameer Dharamshi,
Lucy L. Gao,
Daniela Witten
Abstract:
We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This very general proposal is applicable to any convolution-closed distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, and binomial distributions, among others. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the usual approach of cross-validation via sample splitting, especially in settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis, for which traditional sample splitting is unattractive or unavailable.
Submitted 20 November, 2023; v1 submitted 17 January, 2023;
originally announced January 2023.
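As a concrete instance of the thinning idea in the abstract above, the Gaussian case with known variance can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function name `gaussian_thin` and the tuning parameter `eps` are chosen here, and the standard deviation `sigma` is assumed known. Given X ~ N(mu, sigma^2), drawing X1 | X ~ N(eps * X, eps * (1 - eps) * sigma^2) and setting X2 = X - X1 yields X1 ~ N(eps * mu, eps * sigma^2) and X2 ~ N((1 - eps) * mu, (1 - eps) * sigma^2), and the two parts are uncorrelated (hence independent, being jointly Gaussian).

```python
import random


def gaussian_thin(x, sigma, eps=0.5, rng=None):
    """Thin one Gaussian observation x into two parts that sum to x.

    sigma : assumed-known standard deviation of x
    eps   : fraction of the information assigned to the first part (0 < eps < 1)
    """
    rng = rng or random.Random()
    # Conditional draw: mean eps * x, standard deviation sqrt(eps * (1 - eps)) * sigma.
    x1 = rng.gauss(eps * x, (eps * (1 - eps)) ** 0.5 * sigma)
    return x1, x - x1


if __name__ == "__main__":
    rng = random.Random(0)
    x = rng.gauss(2.0, 1.0)          # one observation from N(2, 1)
    x1, x2 = gaussian_thin(x, sigma=1.0, eps=0.5, rng=rng)
    print(x, x1, x2)
```

One part can then be used to fit a model (for example, to cluster) and the other to evaluate it, which is the role sample splitting plays when observations, rather than single values, can be divided.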
-
Inference after latent variable estimation for single-cell RNA sequencing data
Authors:
Anna Neufeld,
Lucy L. Gao,
Joshua Popp,
Alexis Battle,
Daniela Witten
Abstract:
In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the individual cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-values and confidence intervals in the second step will fail to achieve statistical guarantees such as Type 1 error control. Furthermore, approaches such as sample splitting that can be applied to solve similar problems in other settings are not applicable in this context. In this paper, we introduce count splitting, a flexible framework that allows us to carry out valid inference in this setting, for virtually any latent variable estimation technique and inference approach, under a Poisson assumption. We demonstrate the Type 1 error control and power of count splitting in a simulation study, and apply count splitting to a dataset of pluripotent stem cells differentiating to cardiomyocytes.
Submitted 18 October, 2022; v1 submitted 1 July, 2022;
originally announced July 2022.
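The count splitting step referred to in the abstract above can be sketched with binomial thinning: each observed count is divided between a training and a test matrix by keeping each unit in the training part with probability `eps`. This is a minimal sketch rather than the authors' code; the function name `count_split` and the parameter name `eps` are assumptions made for the example. If the entries are Poisson with means Lambda, the two outputs should be independent Poisson matrices with means eps * Lambda and (1 - eps) * Lambda.

```python
import random


def count_split(counts, eps=0.5, rng=None):
    """Split a matrix of counts (list of rows) into train and test matrices.

    Each unit of every count goes to the training matrix with probability eps,
    i.e. train[i][j] | counts[i][j] ~ Binomial(counts[i][j], eps).
    """
    rng = rng or random.Random()
    train, test = [], []
    for row in counts:
        # Binomial draw written as a sum of Bernoulli(eps) indicators.
        tr = [sum(rng.random() < eps for _ in range(x)) for x in row]
        train.append(tr)
        test.append([x - t for x, t in zip(row, tr)])
    return train, test


if __name__ == "__main__":
    X = [[3, 0, 7], [1, 12, 5]]      # toy cells-by-genes count matrix
    X_train, X_test = count_split(X, eps=0.5, rng=random.Random(0))
    print(X_train)
    print(X_test)
```

The latent variable (for example, a clustering of the cells) would then be estimated on `X_train` only, and the per-gene tests carried out on `X_test`.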
-
Tree-Values: selective inference for regression trees
Authors:
Anna C. Neufeld,
Lucy L. Gao,
Daniela M. Witten
Abstract:
We consider conducting inference on the output of the Classification and Regression Tree (CART) [Breiman et al., 1984] algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake.
Submitted 17 October, 2022; v1 submitted 14 June, 2021;
originally announced June 2021.
-
Selective Inference for Hierarchical Clustering
Authors:
Lucy L. Gao,
Jacob Bien,
Daniela Witten
Abstract:
Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their means. To address this problem, in this paper, we propose a selective inference approach to test for a difference in means between two clusters. Our procedure controls the selective type I error rate by accounting for the fact that the choice of null hypothesis was made based on the data. We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly-used linkages. We apply our method to simulated data and to single-cell RNA-sequencing data.
Submitted 31 October, 2022; v1 submitted 4 December, 2020;
originally announced December 2020.
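The inflation described in the abstract above is easy to reproduce in a toy simulation. The sketch below does not implement the paper's selective test; it only illustrates the problem, and it swaps in simpler ingredients that are assumptions of this example: one-dimensional data drawn from a single N(0, 1) population (so there are no true clusters), an exhaustive 2-means split in place of hierarchical clustering, and a two-sample z-test with known variance as the "classical" test. All function names are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def two_means_1d(xs):
    """Optimal 2-means split of 1-D data: try every cut of the sorted values."""
    s = sorted(xs)
    best_cut, best_wss = None, float("inf")
    for cut in range(1, len(s)):
        left, right = s[:cut], s[cut:]
        ml, mr = fmean(left), fmean(right)
        wss = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if wss < best_wss:
            best_cut, best_wss = cut, wss
    return s[:best_cut], s[best_cut:]


def naive_z_test_pvalue(a, b, sigma=1.0):
    """Two-sided z-test for equal means, ignoring that a and b came from clustering."""
    z = (fmean(a) - fmean(b)) / (sigma * (1 / len(a) + 1 / len(b)) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def naive_rejection_rate(n=30, reps=200, alpha=0.05, seed=0):
    """Fraction of null datasets in which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]   # one population, no clusters
        a, b = two_means_1d(xs)
        rejections += naive_z_test_pvalue(a, b) < alpha
    return rejections / reps


if __name__ == "__main__":
    print("naive rejection rate under the null:", naive_rejection_rate())
```

A valid level-0.05 test would reject in about 5% of these null datasets; the naive test rejects far more often because the groups were chosen to make the means differ.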
-
Minimax D-optimal designs for multivariate regression models with multi-factors
Authors:
Lucy L. Gao,
Julie Zhou
Abstract:
In multi-response regression models, the error covariance matrix is never known in practice. Thus, there is a need for optimal designs which are robust against possible misspecification of the error covariance matrix. In this paper, we approximate the error covariance matrix with a neighbourhood of covariance matrices, in order to define minimax D-optimal designs which are robust against small departures from an assumed error covariance matrix. It is well known that the optimization problems associated with robust designs are non-convex, which makes it challenging to construct robust designs analytically or numerically, even for one-response regression models. We show that the objective function for the minimax D-optimal design is a difference of two convex functions. This leads us to develop a flexible algorithm for computing minimax D-optimal designs, which can be applied to any multi-response model with a discrete design space. We also derive several theoretical results for minimax D-optimal designs, including scale invariance and reflection symmetry.
Submitted 1 October, 2019;
originally announced October 2019.
-
Testing for Association in Multi-View Network Data
Authors:
Lucy L. Gao,
Daniela Witten,
Jacob Bien
Abstract:
In this paper, we consider data consisting of multiple networks, each comprising a different edge set on a common set of nodes. Many models have been proposed for the analysis of such multi-view network data under the assumption that the data views are closely related. In this paper, we provide tools for evaluating this assumption. In particular, we ask: given two networks that each follow a stochastic block model, is there an association between the latent community memberships of the nodes in the two networks? To answer this question, we extend the stochastic block model for a single network view to the two-view setting, and develop a new hypothesis test for the null hypothesis that the latent community memberships in the two data views are independent. We apply our test to protein-protein interaction data from the HINT database (Das and Yu, 2012). We find evidence of a weak association between the latent community memberships of proteins defined with respect to binary interaction data and the latent community memberships of proteins defined with respect to co-complex association data. We also extend this proposal to the setting of a network with node covariates.
Submitted 22 March, 2021; v1 submitted 25 September, 2019;
originally announced September 2019.
-
Are Clusterings of Multiple Data Views Independent?
Authors:
Lucy L. Gao,
Jacob Bien,
Daniela Witten
Abstract:
In the Pioneer 100 (P100) Wellness Project (Price and others, 2017), multiple types of data are collected on a single set of healthy participants at multiple timepoints in order to characterize and optimize wellness. One way to do this is to identify clusters, or subgroups, among the participants, and then to tailor personalized health recommendations to each subgroup. It is tempting to cluster the participants using all of the data types and timepoints, in order to fully exploit the available information. However, clustering the participants based on multiple data views implicitly assumes that a single underlying clustering of the participants is shared across all data views. If this assumption does not hold, then clustering the participants using multiple data views may lead to spurious results. In this paper, we seek to evaluate the assumption that there is some underlying relationship among the clusterings from the different data views, by asking the question: are the clusters within each data view dependent or independent? We develop a new test for answering this question, which we then apply to clinical, proteomic, and metabolomic data, across two distinct timepoints, from the P100 study. We find that while the subgroups of the participants defined with respect to any single data type seem to be dependent across time, the clustering among the participants based on one data type (e.g. proteomic data) appears not to be associated with the clustering based on another data type (e.g. clinical data).
Submitted 12 January, 2019;
originally announced January 2019.