Showing 1–50 of 69 results for author: Hu, L

Searching in archive stat.
  1. arXiv:2406.19531  [pdf, other]

    stat.ML cs.LG

    Forward and Backward State Abstractions for Off-policy Evaluation

    Authors: Meiling Hao, Pingfan Su, Liyuan Hu, Zoltan Szabo, Qingyuan Zhao, Chengchun Shi

    Abstract: Off-policy evaluation (OPE) is crucial for evaluating a target policy's impact offline before its deployment. However, achieving accurate OPE in large state spaces remains challenging. This paper studies state abstractions, originally designed for policy learning, in the context of OPE. Our contributions are three-fold: (i) We define a set of irrelevance conditions central to learning state abstracti…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 42 pages, 5 figures

    ACM Class: G.3; I.2.6; G.1.2

  2. arXiv:2404.13503  [pdf, other]

    cs.LG cs.DS stat.ML

    Predict to Minimize Swap Regret for All Payoff-Bounded Tasks

    Authors: Lunjia Hu, Yifan Wu

    Abstract: A sequence of predictions is calibrated if and only if it induces no swap regret to all down-stream decision tasks. We study the Maximum Swap Regret (MSR) of predictions for binary events: the swap regret maximized over all downstream tasks with bounded payoffs. Previously, the best online prediction algorithm for minimizing MSR is obtained by minimizing the K1 calibration error, which upper bound…

    Submitted 24 April, 2024; v1 submitted 20 April, 2024; originally announced April 2024.

  3. arXiv:2402.13187  [pdf, other]

    cs.LG cs.DS stat.CO stat.ML

    Testing Calibration in Nearly-Linear Time

    Authors: Lunjia Hu, Arun Jambulapati, Kevin Tian, Chutong Yang

    Abstract: In the recent literature on machine learning and decision making, calibration has emerged as a desirable and widely-studied statistical property of the outputs of binary prediction models. However, the algorithmic aspects of measuring model calibration have remained relatively less well-explored. Motivated by [BGHN23], which proposed a rigorous framework for measuring distances to calibration, we…

    Submitted 21 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

  4. arXiv:2402.07821  [pdf, other]

    cs.LG cs.CC cs.DS math.ST stat.ML

    On Computationally Efficient Multi-Class Calibration

    Authors: Parikshit Gopalan, Lunjia Hu, Guy N. Rothblum

    Abstract: Consider a multi-class labelling problem, where the labels can take values in $[k]$, and a predictor predicts a distribution over the labels. In this work, we study the following foundational question: Are there notions of multi-class calibration that give strong guarantees of meaningful predictions and can be achieved in time and sample complexities polynomial in $k$? Prior notions of calibration…

    Submitted 8 June, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

    Comments: In COLT 2024

  5. arXiv:2402.02306  [pdf, other]

    stat.ME stat.CO stat.ML

    A flexible Bayesian g-formula for causal survival analyses with time-dependent confounding

    Authors: Xinyuan Chen, Liangyuan Hu, Fan Li

    Abstract: In longitudinal observational studies with a time-to-event outcome, a common objective in causal analysis is to estimate the causal survival curve under hypothetical intervention scenarios within the study cohort. The g-formula is a particularly useful tool for this analysis. To enhance the traditional parametric g-formula approach, we developed a more adaptable Bayesian g-formula estimator, which…

    Submitted 28 June, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

  6. arXiv:2309.02426  [pdf]

    stat.ML cs.LG

    Monotone Tree-Based GAMI Models by Adapting XGBoost

    Authors: Linwei Hu, Soroush Aramideh, Jie Chen, Vijayan N. Nair

    Abstract: Recent papers have used machine learning architecture to fit low-order functional ANOVA models with main effects and second-order interactions. These GAMI (GAM + Interaction) models are directly interpretable as the functional main effects and interactions can be easily plotted and visualized. Unfortunately, it is not easy to incorporate the monotonicity requirement into the existing GAMI models b…

    Submitted 5 September, 2023; originally announced September 2023.

    Comments: 12 pages

  7. arXiv:2309.02417  [pdf]

    stat.ML cs.LG

    Computing SHAP Efficiently Using Model Structure Information

    Authors: Linwei Hu, Ke Wang

    Abstract: SHAP (SHapley Additive exPlanations) has become a popular method to attribute the prediction of a machine learning model on an input to its features. One main challenge of SHAP is the computation time. An exact computation of Shapley values requires exponential time complexity. Therefore, many approximation methods are proposed in the literature. In this paper, we propose methods that can compute…

    Submitted 5 September, 2023; originally announced September 2023.

    Comments: 15 pages

  8. arXiv:2307.07346  [pdf, other]

    cs.LG stat.AP

    A testing-based approach to assess the clusterability of categorical data

    Authors: Lianyu Hu, Junjie Dong, Mudi Jiang, Yan Liu, Zengyou He

    Abstract: The objective of clusterability evaluation is to check whether a clustering structure exists within the data set. As a crucial yet often-overlooked issue in cluster analysis, it is essential to conduct such a test before applying any clustering algorithm. If a data set is unclusterable, any subsequent clustering analysis would not yield valid results. Despite its importance, the majority of existi…

    Submitted 14 July, 2023; originally announced July 2023.

    Comments: 19 pages, 13 figures

  9. arXiv:2305.18764  [pdf, other]

    cs.LG math.ST stat.ML

    When Does Optimizing a Proper Loss Yield Calibration?

    Authors: Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu, Preetum Nakkiran

    Abstract: Optimizing proper loss functions is popularly believed to yield predictors with good calibration properties; the intuition being that for such losses, the global optimum is to predict the ground-truth probabilities, which is indeed calibrated. However, typical machine learning models are trained to approximately minimize loss over restricted families of predictors, that are unlikely to contain the…

    Submitted 8 December, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

    Comments: In NeurIPS 2023. Selected for spotlight presentation

  10. arXiv:2305.15670  [pdf]

    stat.ML cs.LG

    Interpretable Machine Learning based on Functional ANOVA Framework: Algorithms and Comparisons

    Authors: Linwei Hu, Vijayan N. Nair, Agus Sudjianto, Aijun Zhang, Jie Chen

    Abstract: In the early days of machine learning (ML), the emphasis was on developing complex algorithms to achieve best predictive performance. To understand and explain the model results, one had to rely on post hoc explainability techniques, which are known to have limitations. Recently, with the recognition that interpretability is just as important, researchers are compromising on small increases in pre…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: 24 pages, 15 figures. arXiv admin note: substantial text overlap with arXiv:2207.06950

  11. arXiv:2305.05276  [pdf, other]

    cs.LG stat.ME

    Causal Discovery from Subsampled Time Series with Proxy Variables

    Authors: Mingzhou Liu, Xinwei Sun, Lingjing Hu, Yizhou Wang

    Abstract: Inferring causal structures from time series data is the central interest of many scientific inquiries. A major barrier to such inference is the problem of subsampling, i.e., the frequency of measurement is much lower than that of causal influence. To overcome this problem, numerous methods have been proposed, yet either was limited to the linear case or failed to achieve identifiability. In this…

    Submitted 24 December, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  12. arXiv:2304.09424  [pdf, other]

    cs.LG cs.AI stat.ML

    Loss Minimization Yields Multicalibration for Large Neural Networks

    Authors: Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu, Adam Tauman Kalai, Preetum Nakkiran

    Abstract: Multicalibration is a notion of fairness for predictors that requires them to provide calibrated predictions across a large set of protected groups. Multicalibration is known to be a distinct goal from loss minimization, even for simple predictors such as linear functions. In this work, we consider the setting where the protected groups can be represented by neural networks of size $k$, and the…

    Submitted 7 December, 2023; v1 submitted 19 April, 2023; originally announced April 2023.

    Comments: In ITCS 2024

  13. arXiv:2211.09101  [pdf, other]

    cs.LG cs.CC cs.DS stat.ML

    Comparative Learning: A Sample Complexity Theory for Two Hypothesis Classes

    Authors: Lunjia Hu, Charlotte Peale

    Abstract: In many learning theory problems, a central role is played by a hypothesis class: we might assume that the data is labeled according to a hypothesis in the class (usually referred to as the realizable setting), or we might evaluate the learned model by comparing it with the best hypothesis in the class (the agnostic setting). Taking a step beyond these classic setups that involve only a single h…

    Submitted 16 November, 2022; originally announced November 2022.

    Comments: In ITCS 2023

  14. arXiv:2211.03983  [pdf, other]

    stat.ML cs.AI cs.LG

    Doubly Inhomogeneous Reinforcement Learning

    Authors: Liyuan Hu, Mengbing Li, Chengchun Shi, Zhenke Wu, Piotr Fryzlewicz

    Abstract: This paper studies reinforcement learning (RL) in doubly inhomogeneous environments under temporal non-stationarity and subject heterogeneity. In a number of applications, it is commonplace to encounter datasets generated by system dynamics that may change over time and population, challenging high-quality sequential decision making. Nonetheless, most existing RL solutions require either temporal…

    Submitted 12 November, 2022; v1 submitted 7 November, 2022; originally announced November 2022.

  15. arXiv:2211.03956  [pdf, other]

    cs.LG stat.AP

    Significance-Based Categorical Data Clustering

    Authors: Lianyu Hu, Mudi Jiang, Yan Liu, Zengyou He

    Abstract: Although numerous algorithms have been proposed to solve the categorical data clustering problem, how to assess the statistical significance of a set of categorical clusters remains unaddressed. To fill this void, we employ the likelihood ratio test to derive a test statistic that can serve as a significance-based objective function in categorical data clustering. Consequently, a new clustering…

    Submitted 7 November, 2022; originally announced November 2022.

    Comments: 36 pages, 6 figures

  16. arXiv:2210.13497  [pdf, other]

    cs.LG cs.IT math.ST stat.ML

    Subspace Recovery from Heterogeneous Data with Non-isotropic Noise

    Authors: John Duchi, Vitaly Feldman, Lunjia Hu, Kunal Talwar

    Abstract: Recovering linear subspaces from data is a fundamental and important task in statistics and machine learning. Motivated by heterogeneity in Federated Learning settings, we study a basic formulation of this problem: the principal component analysis (PCA), with a focus on dealing with irregular noise. Our data come from $n$ users with user $i$ contributing data samples from a $d$-dimensional distrib…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: In NeurIPS 2022

  17. arXiv:2210.04100  [pdf, other]

    stat.ME

    Doubly robust estimation and sensitivity analysis for marginal structural quantile models

    Authors: Chao Cheng, Liangyuan Hu, Fan Li

    Abstract: The marginal structural quantile model (MSQM) provides a unique lens to understand the causal effect of a time-varying treatment on the full distribution of potential outcomes. Under the semiparametric framework, we derive the efficient influence function for the MSQM, from which a new doubly robust estimator is proposed for point estimation and inference. We show that the doubly robust estimator…

    Submitted 10 February, 2024; v1 submitted 8 October, 2022; originally announced October 2022.

  18. arXiv:2207.06950  [pdf]

    stat.ML cs.LG

    Using Model-Based Trees with Boosting to Fit Low-Order Functional ANOVA Models

    Authors: Linwei Hu, Jie Chen, Vijayan N. Nair

    Abstract: Low-order functional ANOVA (fANOVA) models have been rediscovered in the machine learning (ML) community under the guise of inherently interpretable machine learning. Explainable Boosting Machines or EBM (Lou et al. 2013) and GAMI-Net (Yang et al. 2021) are two recently proposed ML algorithms for fitting functional main effects and second-order interactions. We propose a new algorithm, called GAMI…

    Submitted 15 December, 2023; v1 submitted 14 July, 2022; originally announced July 2022.

    Comments: 25 pages plus appendix

  19. arXiv:2207.05214  [pdf]

    stat.ML cs.LG

    Shapley Computations Using Surrogate Model-Based Trees

    Authors: Zhipu Zhou, Jie Chen, Linwei Hu

    Abstract: Shapley-related techniques have gained attention as both global and local interpretation tools because of their desirable properties. However, their computation using conditional expectations is computationally expensive. Approximation methods suggested in the literature have limitations. This paper proposes the use of a surrogate model-based tree to compute Shapley and SHAP values based on condit…

    Submitted 11 July, 2022; originally announced July 2022.

  20. arXiv:2206.08271  [pdf, other]

    stat.ME stat.AP

    A new method for clustered survival data: Estimation of treatment effect heterogeneity and variable selection

    Authors: Liangyuan Hu

    Abstract: We recently developed a new method riAFT-BART to draw causal inferences about population treatment effect on patient survival from clustered and censored survival data while accounting for the multilevel data structure. The practical utility of this method goes beyond the estimation of population average treatment effect. In this work, we exposit how riAFT-BART can be used to solve two important s…

    Submitted 11 August, 2023; v1 submitted 16 June, 2022; originally announced June 2022.

    Comments: 38 pages, 14 figures, 8 tables

  21. arXiv:2206.00381  [pdf, ps, other]

    physics.soc-ph cs.SI stat.AP

    The statistical nature of h-index of a network node

    Authors: Yan Liu, Mudi Jiang, Lianyu Hu, Zengyou He

    Abstract: Evaluating the importance of a network node is a crucial task in network science and graph data mining. H-index is a popular centrality measure for this task; however, there is still a lack of its interpretation from a rigorous statistical aspect. Here we show the statistical nature of h-index from the perspective of order statistics, and we obtain a new family of centrality indices by generalizin…

    Submitted 19 May, 2023; v1 submitted 1 June, 2022; originally announced June 2022.

  22. arXiv:2204.12868  [pdf]

    stat.ML cs.LG

    Performance and Interpretability Comparisons of Supervised Machine Learning Algorithms: An Empirical Study

    Authors: Alice J. Liu, Arpita Mukherjee, Linwei Hu, Jie Chen, Vijayan N. Nair

    Abstract: This paper compares the performances of three supervised machine learning algorithms in terms of predictive ability and model interpretation on structured or tabular data. The algorithms considered were scikit-learn implementations of extreme gradient boosting machines (XGB) and random forests (RFs), and feedforward neural networks (FFNNs) from TensorFlow. The paper is organized in a findings-base…

    Submitted 5 May, 2022; v1 submitted 27 April, 2022; originally announced April 2022.

  23. arXiv:2204.12365  [pdf]

    stat.ML cs.LG

    Explaining Adverse Actions in Credit Decisions Using Shapley Decomposition

    Authors: Vijayan N. Nair, Tianshu Feng, Linwei Hu, Zach Zhang, Jie Chen, Agus Sudjianto

    Abstract: When a financial institution declines an application for credit, an adverse action (AA) is said to occur. The applicant is then entitled to an explanation for the negative decision. This paper focuses on credit decisions based on a predictive model for probability of default and proposes a methodology for AA explanation. The problem involves identifying the important predictors responsible for the…

    Submitted 26 April, 2022; originally announced April 2022.

    Comments: 20 pages, 8 figures

  24. arXiv:2203.04536  [pdf, other]

    cs.LG cs.CC cs.DS stat.ML

    Metric Entropy Duality and the Sample Complexity of Outcome Indistinguishability

    Authors: Lunjia Hu, Charlotte Peale, Omer Reingold

    Abstract: We give the first sample complexity characterizations for outcome indistinguishability, a theoretical framework of machine learning recently introduced by Dwork, Kim, Reingold, Rothblum, and Yona (STOC 2021). In outcome indistinguishability, the goal of the learner is to output a predictor that cannot be distinguished from the target predictor by a class $D$ of distinguishers examining the outcome…

    Submitted 9 March, 2022; originally announced March 2022.

    Comments: 37 pages. To appear in ALT 2022

  25. arXiv:2202.08318  [pdf, other]

    stat.ME stat.AP

    A flexible approach for causal inference with multiple treatments and clustered survival outcomes

    Authors: Liangyuan Hu, Jiayi Ji, Ronald D. Ennis, Joseph W. Hogan

    Abstract: When drawing causal inferences about the effects of multiple treatments on clustered survival outcomes using observational data, we need to address implications of the multilevel data structure, multiple treatments, censoring and unmeasured confounding for causal analyses. Few off-the-shelf causal inference tools are available to simultaneously tackle these issues. We develop a flexible random-int…

    Submitted 16 February, 2022; originally announced February 2022.

    Comments: 33 pages, 10 figures; 6 tables

  26. arXiv:2110.10276  [pdf, other]

    stat.ME stat.AP stat.CO

    CIMTx: An R package for causal inference with multiple treatments using observational data

    Authors: Liangyuan Hu, Jiayi Ji

    Abstract: CIMTx provides efficient and unified functions to implement modern methods for causal inferences with multiple treatments using observational data with a focus on binary outcomes. The methods include regression adjustment, inverse probability of treatment weighting, Bayesian additive regression trees, regression adjustment with multivariate spline of the generalized propensity score, vector matchi…

    Submitted 14 September, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

    Comments: 17 pages, 5 figures, 2 tables

  27. arXiv:2110.09697  [pdf, other]

    stat.ML cs.LG stat.CO

    abess: A Fast Best Subset Selection Library in Python and R

    Authors: Jin Zhu, Xueqin Wang, Liyuan Hu, Junhao Huang, Kangkang Jiang, Yanhang Zhang, Shiyun Lin, Junxian Zhu

    Abstract: We introduce a new library named abess that implements a unified framework of best-subset selection for solving diverse machine learning problems, e.g., linear regression, classification, and principal component analysis. Particularly, the abess certifiably gets the optimal solution within polynomial time with high probability under the linear model. Our efficient implementation allows abess to a…

    Submitted 16 June, 2022; v1 submitted 18 October, 2021; originally announced October 2021.

    Journal ref: Journal of Machine Learning Research (2022)

  28. arXiv:2109.13368  [pdf, other]

    stat.ME stat.AP

    Estimating the causal effects of multiple intermittent treatments with application to COVID-19

    Authors: Liangyuan Hu, Jiayi Ji, Himanshu Joshi, Erick Scott, Fan Li

    Abstract: To draw real-world evidence about the comparative effectiveness of multiple time-varying treatments on patient survival, we develop a joint marginal structural survival model and a novel weighting strategy to account for time-varying confounding and censoring. Our methods formulate complex longitudinal treatments with multiple start/stop switches as the recurrent events with discontinuous interval…

    Submitted 4 August, 2023; v1 submitted 27 September, 2021; originally announced September 2021.

    Comments: 79 pages

  29. arXiv:2108.02836  [pdf, ps, other]

    stat.ME stat.OT

    Discussion on "Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects" by Hahn, Murray and Carvalho

    Authors: Liangyuan Hu

    Abstract: Hahn et al. (2020) offers an extensive study to explicate and evaluate the performance of the BCF model in different settings and provides a detailed discussion about its utility in causal inference. It is a welcomed addition to the causal machine learning literature. I will emphasize the contribution of the BCF model to the field of causal inference through discussions on two topics: 1) the diffe…

    Submitted 5 August, 2021; originally announced August 2021.

    Journal ref: Bayesian Analysis 2020: 15 (3), 1020-1023

  30. arXiv:2108.00331  [pdf, other]

    cs.LG cs.CR math.OC stat.ML

    Faster Rates of Private Stochastic Convex Optimization

    Authors: Jinyan Su, Lijie Hu, Di Wang

    Abstract: In this paper, we revisit the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) and provide excess population risks for some special classes of functions that are faster than the previous results of general convex and strongly convex functions. In the first part of the paper, we study the case where the population risk function satisfies the Tsybakov Noise Condition (TNC) w…

    Submitted 16 January, 2022; v1 submitted 31 July, 2021; originally announced August 2021.

    Comments: To appear in The 33rd International Conference on Algorithmic Learning Theory. In this version, we fixed some typos and corrected the proof of the lower bound

  31. arXiv:2107.11136  [pdf, other]

    cs.LG cs.CR stat.ML

    High Dimensional Differentially Private Stochastic Optimization with Heavy-tailed Data

    Authors: Lijie Hu, Shuo Ni, Hanshen Xiao, Di Wang

    Abstract: As one of the most fundamental problems in machine learning, statistics and differential privacy, Differentially Private Stochastic Convex Optimization (DP-SCO) has been extensively studied in recent years. However, most of the previous work can only handle either regular data distribution or irregular data in the low dimensional space case. To better understand the challenges arising from irregul…

    Submitted 9 August, 2021; v1 submitted 23 July, 2021; originally announced July 2021.

  32. arXiv:2107.09730  [pdf, ps, other]

    stat.ME stat.AP

    A flexible approach for variable selection in large-scale healthcare database studies with missing covariate and outcome data

    Authors: Jung-Yi Joyce Lin, Liangyuan Hu, Chuyue Huang, Steven Lawrence, Usha Govindarajulu

    Abstract: Prior work has shown that combining bootstrap imputation with tree-based machine learning variable selection methods can provide good performances achievable on fully observed data when covariate and outcome data are missing at random (MAR). This approach however is computationally expensive, especially on large-scale datasets. We propose an inference-based method, called RR-BART, which leverages…

    Submitted 13 April, 2022; v1 submitted 20 July, 2021; originally announced July 2021.

    Comments: 16 pages, 3 figures, 3 tables

  33. arXiv:2106.15566  [pdf, other]

    cs.LG cs.CG cs.DS stat.ML

    Near-Optimal Explainable $k$-Means for All Dimensions

    Authors: Moses Charikar, Lunjia Hu

    Abstract: Many clustering algorithms are guided by certain cost functions such as the widely-used $k$-means cost. These algorithms divide data points into clusters with often complicated boundaries, creating difficulties in explaining the clustering decision. In a recent work, Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020) introduced explainable clustering, where the cluster boundaries are axis-par…

    Submitted 4 November, 2021; v1 submitted 29 June, 2021; originally announced June 2021.

    Comments: 34 pages, 2 figures, to appear in SODA 2022

  34. arXiv:2104.02769  [pdf, other]

    stat.ME stat.AP stat.ML

    Variable selection with missing data in both covariates and outcomes: Imputation and machine learning

    Authors: Liangyuan Hu, Jung-Yi Joyce Lin, Jiayi Ji

    Abstract: The missing data issue is ubiquitous in health studies. Variable selection in the presence of both missing covariates and outcomes is an important statistical research topic but has been less studied. Existing literature focuses on parametric regression techniques that provide direct parameter estimates of the regression model. Flexible nonparametric machine learning methods considerably mitigate…

    Submitted 7 July, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

    Comments: 29 pages, 17 figures, 4 tables

  35. arXiv:2103.00605  [pdf, other]

    stat.ME

    Propensity Score Weighting Analysis of Survival Outcomes Using Pseudo-observations

    Authors: Shuxi Zeng, Fan Li, Liangyuan Hu, Fan Li

    Abstract: Survival outcomes are common in comparative effectiveness studies and require unique handling because they are usually incompletely observed due to right-censoring. A "once for all" approach for causal inference with survival outcomes constructs pseudo-observations and allows standard methods such as propensity score weighting to proceed as if the outcomes are completely observed. For a general…

    Submitted 18 December, 2021; v1 submitted 28 February, 2021; originally announced March 2021.

    Comments: 40 pages, 2 figures, 1 table

  36. arXiv:2012.06093  [pdf, other]

    stat.ME

    A flexible sensitivity analysis approach for unmeasured confounding with multiple treatments and a binary outcome with application to SEER-Medicare lung cancer data

    Authors: Liangyuan Hu, Jungang Zou, Chenyang Gu, Jiayi Ji, Michael Lopez, Minal Kale

    Abstract: In the absence of a randomized experiment, a key assumption for drawing causal inference about treatment effects is the ignorable treatment assignment. Violations of the ignorability assumption may lead to biased treatment effect estimates. Sensitivity analysis helps gauge how causal conclusions will be altered in response to the potential magnitude of departure from the ignorability assumption. H…

    Submitted 13 August, 2021; v1 submitted 10 December, 2020; originally announced December 2020.

    Comments: 36 pages, 12 figures, 9 tables

  37. arXiv:2010.13520  [pdf, other]

    cs.LG cs.CR stat.ML

    Differentially Private (Gradient) Expectation Maximization Algorithm with Statistical Guarantees

    Authors: Di Wang, Jiahao Ding, Lijie Hu, Zejun Xie, Miao Pan, Jinhui Xu

    Abstract: (Gradient) Expectation Maximization (EM) is a widely used algorithm for estimating the maximum likelihood of mixture models or incomplete data problems. A major challenge facing this popular technique is how to effectively preserve the privacy of sensitive data. Previous research on this problem has already led to the discovery of some Differentially Private (DP) algorithms for (Gradient) EM. How…

    Submitted 16 January, 2022; v1 submitted 21 October, 2020; originally announced October 2020.

    Comments: Submitted. arXiv admin note: text overlap with arXiv:2010.09576

  38. arXiv:2008.10351  [pdf, other]

    cs.CV cs.LG stat.ML

    Model Generalization in Deep Learning Applications for Land Cover Mapping

    Authors: Lucas Hu, Caleb Robinson, Bistra Dilkina

    Abstract: Recent work has shown that deep learning models can be used to classify land-use data from geospatial satellite imagery. We show that when these deep learning models are trained on data from specific continents/seasons, there is a high degree of variability in model performance on out-of-sample continents/seasons. This suggests that just because a model accurately predicts land-use classes in one…

    Submitted 17 June, 2021; v1 submitted 8 August, 2020; originally announced August 2020.

    Comments: 9 pages, 7 figures, 5 tables

    ACM Class: I.4.6; I.2.10

  39. arXiv:2008.08071  [pdf, other]

    cs.DS cs.LG math.ST stat.ML

    Robust Mean Estimation on Highly Incomplete Data with Arbitrary Outliers

    Authors: Lunjia Hu, Omer Reingold

    Abstract: We study the problem of robustly estimating the mean of a $d$-dimensional distribution given $N$ examples, where most coordinates of every example may be missing and $\varepsilon N$ examples may be arbitrarily corrupted. Assuming each coordinate appears in a constant factor more than $\varepsilon N$ examples, we show algorithms that estimate the mean of the distribution with information-theoretica…

    Submitted 3 May, 2021; v1 submitted 18 August, 2020; originally announced August 2020.

    Comments: 29 pages, 2 figures. Published in AISTATS 2021. More details in the proof of Claim 14

  40. arXiv:2008.07687  [pdf, other]

    stat.ME cs.CY stat.ML

    Estimation of causal effects of multiple treatments in healthcare database studies with rare outcomes

    Authors: Liangyuan Hu, Chenyang Gu

    Abstract: The preponderance of large-scale healthcare databases provides abundant opportunities for comparative effectiveness research. Evidence necessary to making informed treatment decisions often relies on comparing effectiveness of multiple treatment options on outcomes of interest observed in a small number of individuals. Causal inference with multiple treatments and rare outcomes is a subject that ha…

    Submitted 2 October, 2020; v1 submitted 17 August, 2020; originally announced August 2020.

    Comments: 15 pages, 3 tables, 2 figures

  41. arXiv:2008.07044  [pdf, other]

    stat.AP stat.ML

    Estimating heterogeneous survival treatment effect in observational data using machine learning

    Authors: Liangyuan Hu, Jiayi Ji, Fan Li

    Abstract: Methods for estimating heterogeneous treatment effect in observational data have largely focused on continuous or binary outcomes, and have been relatively less vetted with survival outcomes. Using flexible machine learning methods in the counterfactual framework is a promising approach to address challenges due to complex individual characteristics, to which treatments need to be tailored. To eva…

    Submitted 19 May, 2021; v1 submitted 16 August, 2020; originally announced August 2020.

    Comments: 23 pages, 5 figures, 3 tables

    Journal ref: Statistics in Medicine, 2021;00:1-23 (2021)

  42. arXiv:2008.04059  [pdf]

    q-fin.GN cs.LG stat.ML

    Supervised Machine Learning Techniques: An Overview with Applications to Banking

    Authors: Linwei Hu, Jie Chen, Joel Vaughan, Hanyu Yang, Kelly Wang, Agus Sudjianto, Vijayan N. Nair

    Abstract: This article provides an overview of Supervised Machine Learning (SML) with a focus on applications to banking. The SML techniques covered include Bagging (Random Forest or RF), Boosting (Gradient Boosting Machine or GBM) and Neural Networks (NNs). We begin with an introduction to ML tasks and techniques. This is followed by a description of: i) tree-based ensemble algorithms including Bagging wit…

    Submitted 28 July, 2020; originally announced August 2020.

  43. arXiv:2007.14528  [pdf

    stat.ML cs.LG

    Surrogate Locally-Interpretable Models with Supervised Machine Learning Algorithms

    Authors: Linwei Hu, Jie Chen, Vijayan N. Nair, Agus Sudjianto

    Abstract: Supervised Machine Learning (SML) algorithms, such as Gradient Boosting, Random Forest, and Neural Networks, have become popular in recent years due to their superior predictive performance over traditional statistical methods. However, their complexity makes the results hard to interpret without additional tools. There has been a lot of recent work in developing global and local diagnostics for i…

    Submitted 28 July, 2020; originally announced July 2020.

  44. arXiv:2007.01784  [pdf, ps, other

    stat.ME math.ST

    Unified statistical inference for a novel nonlinear dynamic functional/longitudinal data model

    Authors: Lixia Hu, Tao Huang, Jinhong You

    Abstract: In light of recent work studying massive functional/longitudinal data, such as the resulting data from the COVID-19 pandemic, we propose a novel functional/longitudinal data model which is a combination of the popular varying coefficient (VC) model and additive model. We call it Semi-VCAM in which the response could be a functional/longitudinal variable, and the explanatory variables could be a mi…

    Submitted 3 July, 2020; originally announced July 2020.

    Comments: 29 pages; 4 figures

  45. arXiv:2006.08292  [pdf, ps, other

    cs.LG stat.ML

    Robust Locality-Aware Regression for Labeled Data Classification

    Authors: Liangchen Hu, Wensheng Zhang

    Abstract: With the dramatic increase of dimensions in the data representation, extracting latent low-dimensional features becomes of the utmost importance for efficient classification. Aiming at the problems of unclear margin representation and difficulty in revealing the data manifold structure in most of the existing linear discriminant methods, we propose a new discriminant feature extraction framework,…

    Submitted 15 June, 2020; originally announced June 2020.

  46. arXiv:2001.06483  [pdf, other

    stat.ME stat.AP

    Estimation of Causal Effects of Multiple Treatments in Observational Studies with a Binary Outcome

    Authors: Liangyuan Hu, Chenyang Gu, Michael Lopez, Jiayi Ji, Juan Wisnivesky

    Abstract: There is a dearth of robust methods to estimate the causal effects of multiple treatments when the outcome is binary. This paper uses two unique sets of simulations to propose and evaluate the use of Bayesian Additive Regression Trees (BART) in such settings. First, we compare BART to several approaches that have been proposed for continuous outcomes, including inverse probability of treatment wei…

    Submitted 16 January, 2020; originally announced January 2020.

    Comments: 3 figures, 3 tables. arXiv admin note: text overlap with arXiv:1901.04312

  47. arXiv:1911.13238  [pdf, other

    cs.IT cs.LG eess.SP stat.ML

    Machine Learning-based Signal Detection for PMH Signals in Load-modulated MIMO System

    Authors: Jinle Zhu, Qiang Li, Li Hu, Hongyang Chen, Nirwan Ansari

    Abstract: Phase Modulation on the Hypersphere (PMH) is a power efficient modulation scheme for the \textit{load-modulated} multiple-input multiple-output (MIMO) transmitters with central power amplifiers (CPA). However, it is difficult to obtain the precise channel state information (CSI), and the traditional optimal maximum likelihood (ML) detection scheme incurs high complexity which increases exponential…

    Submitted 24 November, 2019; originally announced November 2019.

    Comments: with example

  48. arXiv:1910.14025  [pdf, ps, other

    cs.IR cs.CL cs.LG stat.ML

    Graph Neural News Recommendation with Long-term and Short-term Interest Modeling

    Authors: Linmei Hu, Chen Li, Chuan Shi, Cheng Yang, Chao Shao

    Abstract: With the information explosion of news articles, personalized news recommendation has become important for users to quickly find news that they are interested in. Existing methods on news recommendation mainly include collaborative filtering methods which rely on direct user-item interactions and content based methods which characterize the content of user reading history. Although these methods h…

    Submitted 7 November, 2019; v1 submitted 30 October, 2019; originally announced October 2019.

  49. arXiv:1910.00482  [pdf, other

    cs.LG cs.CR stat.ML

    Estimating Smooth GLM in Non-interactive Local Differential Privacy Model with Public Unlabeled Data

    Authors: Di Wang, Lijie Hu, Huanyu Zhang, Marco Gaboardi, Jinhui Xu

    Abstract: In this paper, we study the problem of estimating smooth Generalized Linear Models (GLMs) in the Non-interactive Local Differential Privacy (NLDP) model. Different from its classical setting, our model allows the server to access some additional public but unlabeled data. In the first part of the paper we focus on GLMs. Specifically, we first consider the case where each data record is i.i.d. samp…

    Submitted 20 August, 2022; v1 submitted 1 October, 2019; originally announced October 2019.

    Comments: Revised version, fix some errors in the first version

  50. arXiv:1908.01901  [pdf, other

    cs.LG eess.IV stat.ML

    Fully-automated patient-level malaria assessment on field-prepared thin blood film microscopy images, including Supplementary Information

    Authors: Charles B. Delahunt, Mayoore S. Jaiswal, Matthew P. Horning, Samantha Janko, Clay M. Thompson, Sourabh Kulhare, Liming Hu, Travis Ostbye, Grace Yun, Roman Gebrehiwot, Benjamin K. Wilson, Earl Long, Stephane Proux, Dionicia Gamboa, Peter Chiodini, Jane Carter, Mehul Dhorda, David Isaboke, Bernhards Ogutu, Wellington Oyibo, Elizabeth Villasis, Kyaw Myo Tun, Christine Bachman, David Bell, Courosh Mehanian

    Abstract: Malaria is a life-threatening disease affecting millions. Microscopy-based assessment of thin blood films is a standard method to (i) determine malaria species and (ii) quantitate high-parasitemia infections. Full automation of malaria microscopy by machine learning (ML) is a challenging task because field-prepared slides vary widely in quality and presentation, and artifacts often heavily outnumb…

    Submitted 11 September, 2022; v1 submitted 5 August, 2019; originally announced August 2019.

    Comments: 16 pages, 13 figures

    MSC Class: 68T10 ACM Class: I.5.0