
Showing 1–11 of 11 results for author: Gao, L L

Searching in archive stat.
  1. arXiv:2311.16375  [pdf, other]

    stat.ME q-bio.QM stat.AP

    Testing for a difference in means of a single feature after clustering

    Authors: Yiqun T. Chen, Lucy L. Gao

    Abstract: For many applications, it is critical to interpret and validate groups of observations obtained via clustering. A common validation approach involves testing differences in feature means between observations in two estimated clusters. In this setting, classical hypothesis tests lead to an inflated Type I error rate. To overcome this problem, we propose a new test for the difference in means in a s…

    Submitted 27 November, 2023; originally announced November 2023.

    MSC Class: 62H30; 62H15; 62P10

  2. arXiv:2307.12985  [pdf, other]

    stat.ME stat.AP

    Negative binomial count splitting for single-cell RNA sequencing data

    Authors: Anna Neufeld, Joshua Popp, Lucy L. Gao, Alexis Battle, Daniela Witten

    Abstract: The analysis of single-cell RNA sequencing (scRNA-seq) data often involves fitting a latent variable model to learn a low-dimensional representation for the cells. Validating such a model poses a major challenge. If we could sequence the same set of cells twice, we could use one dataset to fit a latent variable model and the other to validate it. In reality, we cannot sequence the same set of cell…

    Submitted 24 July, 2023; originally announced July 2023.

  3. arXiv:2303.12931  [pdf, other]

    stat.ME math.ST stat.ML

    Generalized Data Thinning Using Sufficient Statistics

    Authors: Ameer Dharamshi, Anna Neufeld, Keshav Motwani, Lucy L. Gao, Daniela Witten, Jacob Bien

    Abstract: Our goal is to develop a general strategy to decompose a random variable $X$ into multiple independent random variables, without sacrificing any information about unknown parameters. A recent paper showed that for some well-known natural exponential families, $X$ can be "thinned" into independent random variables $X^{(1)}, \ldots, X^{(K)}$, such that $X = \sum_{k=1}^K X^{(k)}$. These independent r…

    Submitted 11 June, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

  4. arXiv:2303.04746  [pdf, other]

    stat.ME

    Necessary and sufficient conditions for multiple objective optimal regression designs

    Authors: Lucy L. Gao, Jane J. Ye, Shangzhi Zeng, Julie Zhou

    Abstract: We typically construct optimal designs based on a single objective function. To better capture the breadth of an experiment's goals, we could instead construct a multiple objective optimal design based on multiple objective functions. While algorithms have been developed to find multi-objective optimal designs (e.g. efficiency-constrained and maximin optimal designs), it is far less clear how to v…

    Submitted 8 March, 2023; originally announced March 2023.

  5. arXiv:2301.07276  [pdf, other]

    stat.ME stat.ML

    Data thinning for convolution-closed distributions

    Authors: Anna Neufeld, Ameer Dharamshi, Lucy L. Gao, Daniela Witten

    Abstract: We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This very general proposal is applicable to any convolution-closed distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, a…

    Submitted 20 November, 2023; v1 submitted 17 January, 2023; originally announced January 2023.

  6. arXiv:2207.00554  [pdf, ps, other]

    stat.ME stat.AP

    Inference after latent variable estimation for single-cell RNA sequencing data

    Authors: Anna Neufeld, Lucy L. Gao, Joshua Popp, Alexis Battle, Daniela Witten

    Abstract: In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the individual cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-…

    Submitted 18 October, 2022; v1 submitted 1 July, 2022; originally announced July 2022.

    Comments: 43 pages, 7 figures

  7. arXiv:2106.07816  [pdf, other]

    stat.ME stat.ML

    Tree-Values: selective inference for regression trees

    Authors: Anna C. Neufeld, Lucy L. Gao, Daniela M. Witten

    Abstract: We consider conducting inference on the output of the Classification and Regression Tree (CART) [Breiman et al., 1984] algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting infer…

    Submitted 17 October, 2022; v1 submitted 14 June, 2021; originally announced June 2021.

  8. arXiv:2012.02936  [pdf, other]

    stat.ME stat.ML

    Selective Inference for Hierarchical Clustering

    Authors: Lucy L. Gao, Jacob Bien, Daniela Witten

    Abstract: Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their mean…

    Submitted 31 October, 2022; v1 submitted 4 December, 2020; originally announced December 2020.

    Comments: Final accepted version

  9. arXiv:1910.00745  [pdf, other]

    stat.ME

    Minimax D-optimal designs for multivariate regression models with multi-factors

    Authors: Lucy L. Gao, Julie Zhou

    Abstract: In multi-response regression models, the error covariance matrix is never known in practice. Thus, there is a need for optimal designs which are robust against possible misspecification of the error covariance matrix. In this paper, we approximate the error covariance matrix with a neighbourhood of covariance matrices, in order to define minimax D-optimal designs which are robust against small dep…

    Submitted 1 October, 2019; originally announced October 2019.

  10. arXiv:1909.11640  [pdf, other]

    stat.ME stat.ML

    Testing for Association in Multi-View Network Data

    Authors: Lucy L. Gao, Daniela Witten, Jacob Bien

    Abstract: In this paper, we consider data consisting of multiple networks, each comprised of a different edge set on a common set of nodes. Many models have been proposed for the analysis of such multi-view network data under the assumption that the data views are closely related. In this paper, we provide tools for evaluating this assumption. In particular, we ask: given two networks that each follow a sto…

    Submitted 22 March, 2021; v1 submitted 25 September, 2019; originally announced September 2019.

  11. arXiv:1901.03905  [pdf, other]

    stat.ME stat.ML

    Are Clusterings of Multiple Data Views Independent?

    Authors: Lucy L. Gao, Jacob Bien, Daniela Witten

    Abstract: In the Pioneer 100 (P100) Wellness Project (Price and others, 2017), multiple types of data are collected on a single set of healthy participants at multiple timepoints in order to characterize and optimize wellness. One way to do this is to identify clusters, or subgroups, among the participants, and then to tailor personalized health recommendations to each subgroup. It is tempting to cluster th…

    Submitted 12 January, 2019; originally announced January 2019.

    Comments: 20 pages, 4 figures, 1 table (main text); 15 pages, 9 figures (supplement)