Skip to main content

Showing 1–12 of 12 results for author: Tan, Y S

Searching in archive stat. Search in all archives.
.
  1. arXiv:2406.19958  [pdf, other

    stat.ML cs.LG math.ST

    The Computational Curse of Big Data for Bayesian Additive Regression Trees: A Hitting Time Analysis

    Authors: Yan Shuo Tan, Omer Ronen, Theo Saarinen, Bin Yu

    Abstract: Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by theoretical guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data generative settings and for appropriate prior choices. In th… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.

    MSC Class: 62G08; 65C40

  2. arXiv:2309.09880  [pdf, ps, other

    stat.ML cs.LG

    Error Reduction from Stacked Regressions

    Authors: Xin Chen, Jason M. Klusowski, Yan Shuo Tan

    Abstract: Stacking regressions is an ensemble technique that forms linear combinations of different regression estimators to enhance predictive accuracy. The conventional approach uses cross-validation data to generate predictions from the constituent estimators, and least-squares with nonnegativity constraints to learn the combination weights. In this paper, we learn these weights analogously by minimizing… ▽ More

    Submitted 26 September, 2023; v1 submitted 18 September, 2023; originally announced September 2023.

  3. arXiv:2307.01932  [pdf, other

    stat.ME cs.AI cs.LG stat.ML

    MDI+: A Flexible Random Forest-Based Feature Importance Framework

    Authors: Abhineet Agarwal, Ana M. Kenney, Yan Shuo Tan, Tiffany M. Tang, Bin Yu

    Abstract: Mean decrease in impurity (MDI) is a popular feature importance measure for random forests (RFs). We show that the MDI for a feature $X_k$ in each tree in an RF is equivalent to the unnormalized $R^2$ value in a linear regression of the response on the collection of decision stumps that split on $X_k$. We use this interpretation to propose a flexible feature importance framework called MDI+. Speci… ▽ More

    Submitted 4 July, 2023; originally announced July 2023.

  4. arXiv:2210.09352  [pdf, other

    stat.ML cs.AI cs.LG math.ST

    A Mixing Time Lower Bound for a Simplified Version of BART

    Authors: Omer Ronen, Theo Saarinen, Yan Shuo Tan, James Duncan, Bin Yu

    Abstract: Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression algorithm. The posterior is a distribution over sums of decision trees, and predictions are made by averaging approximate samples from the posterior. The combination of strong predictive performance and the ability to provide uncertainty measures has led BART to be commonly used in the social sciences, bios… ▽ More

    Submitted 17 October, 2022; originally announced October 2022.

  5. arXiv:2202.00858  [pdf, other

    cs.LG cs.AI stat.AP stat.ME stat.ML

    Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods

    Authors: Abhineet Agarwal, Yan Shuo Tan, Omer Ronen, Chandan Singh, Bin Yu

    Abstract: Tree-based models such as decision trees and random forests (RF) are a cornerstone of modern machine-learning practice. To mitigate overfitting, trees are typically regularized by a variety of techniques that modify their structure (e.g. pruning). We introduce Hierarchical Shrinkage (HS), a post-hoc algorithm that does not modify the tree structure, and instead regularizes the tree by shrinking th… ▽ More

    Submitted 1 February, 2022; originally announced February 2022.

  6. arXiv:2201.11931  [pdf, other

    cs.LG cs.AI stat.AP stat.ME stat.ML

    Fast Interpretable Greedy-Tree Sums

    Authors: Yan Shuo Tan, Chandan Singh, Keyan Nasseri, Abhineet Agarwal, James Duncan, Omer Ronen, Matthew Epland, Aaron Kornblith, Bin Yu

    Abstract: Modern machine learning has achieved impressive prediction performance, but often sacrifices interpretability, a critical consideration in high-stakes domains such as medicine. In such settings, practitioners often use highly interpretable decision tree models, but these suffer from inductive bias against additive structure. To overcome this bias, we propose Fast Interpretable Greedy-Tree Sums (FI… ▽ More

    Submitted 8 July, 2023; v1 submitted 27 January, 2022; originally announced January 2022.

  7. arXiv:2110.09626  [pdf, other

    stat.ML cs.IT cs.LG

    A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds

    Authors: Yan Shuo Tan, Abhineet Agarwal, Bin Yu

    Abstract: Decision trees are important both as interpretable models amenable to high-stakes decision-making, and as building blocks of ensemble methods such as random forests and gradient boosting. Their statistical properties, however, are not well understood. The most cited prior works have focused on deriving pointwise consistency guarantees for CART in a classical nonparametric regression setting. We ta… ▽ More

    Submitted 18 October, 2021; originally announced October 2021.

  8. arXiv:2008.10109  [pdf, other

    stat.ME cs.LG stat.AP

    Stable discovery of interpretable subgroups via calibration in causal studies

    Authors: Raaz Dwivedi, Yan Shuo Tan, Briton Park, Mian Wei, Kevin Horgan, David Madigan, Bin Yu

    Abstract: Building on Yu and Kumbier's PCS framework and for randomized experiments, we introduce a novel methodology for Stable Discovery of Interpretable Subgroups via Calibration (StaDISC), with large heterogeneous treatment effects. StaDISC was developed during our re-analysis of the 1999-2000 VIGOR study, an 8076 patient randomized controlled trial (RCT), that compared the risk of adverse events from a… ▽ More

    Submitted 28 September, 2020; v1 submitted 23 August, 2020; originally announced August 2020.

    Comments: Raaz Dwivedi and Yan Shuo Tan are joint first authors and contributed equally to this work. 52 pages, 8 Figures, 9 Tables. To appear in International Statistical Review, 2020

  9. Curating a COVID-19 data repository and forecasting county-level death counts in the United States

    Authors: Nick Altieri, Rebecca L. Barter, James Duncan, Raaz Dwivedi, Karl Kumbier, Xiao Li, Robert Netzorg, Briton Park, Chandan Singh, Yan Shuo Tan, Tiffany Tang, Yu Wang, Chao Zhang, Bin Yu

    Abstract: As the COVID-19 outbreak evolves, accurate forecasting continues to play an extremely important role in informing policy decisions. In this paper, we present our continuous curation of a large data repository containing COVID-19 information from a range of sources. We use this data to develop predictions and corresponding prediction intervals for the short-term trajectory of COVID-19 cumulative de… ▽ More

    Submitted 9 August, 2020; v1 submitted 16 May, 2020; originally announced May 2020.

    Comments: Authors ordered alphabetically. All authors contributed significantly to this work. All collected data, modeling code, forecasts, and visualizations are updated daily and available at \url{https://github.com/Yu-Group/covid19-severity-prediction}

    Journal ref: Published in Harvard Data Science Review, 2020

  10. arXiv:1910.12837  [pdf, other

    stat.ML cs.IT cs.LG math.NA math.OC

    Online Stochastic Gradient Descent with Arbitrary Initialization Solves Non-smooth, Non-convex Phase Retrieval

    Authors: Yan Shuo Tan, Roman Vershynin

    Abstract: In recent literature, a general two step procedure has been formulated for solving the problem of phase retrieval. First, a spectral technique is used to obtain a constant-error initial estimate, following which, the estimate is refined to arbitrary precision by first-order optimization of a non-convex loss function. Numerical experiments, however, seem to suggest that simply running the iterative… ▽ More

    Submitted 28 October, 2019; originally announced October 2019.

    MSC Class: 65K10

    Journal ref: Journal of Machine Learning Research, 24(58), 1-47 (2023)

  11. arXiv:1709.04744  [pdf, ps, other

    cs.CV cs.LG stat.ML

    Subspace Clustering using Ensembles of $K$-Subspaces

    Authors: John Lipor, David Hong, Yan Shuo Tan, Laura Balzano

    Abstract: Subspace clustering is the unsupervised grou** of points lying near a union of low-dimensional linear subspaces. Algorithms based directly on geometric properties of such data tend to either provide poor empirical performance, lack theoretical guarantees, or depend heavily on their initialization. We present a novel geometric approach to the subspace clustering problem that leverages ensembles o… ▽ More

    Submitted 6 January, 2021; v1 submitted 14 September, 2017; originally announced September 2017.

  12. arXiv:1704.01041  [pdf, ps, other

    cs.LG math.PR stat.ML

    Polynomial Time and Sample Complexity for Non-Gaussian Component Analysis: Spectral Methods

    Authors: Yan Shuo Tan, Roman Vershynin

    Abstract: The problem of Non-Gaussian Component Analysis (NGCA) is about finding a maximal low-dimensional subspace $E$ in $\mathbb{R}^n$ so that data points projected onto $E$ follow a non-gaussian distribution. Although this is an appropriate model for some real world data analysis problems, there has been little progress on this problem over the last decade. In this paper, we attempt to address this st… ▽ More

    Submitted 4 April, 2017; originally announced April 2017.

    MSC Class: 68Q87