-
Forward and Backward State Abstractions for Off-policy Evaluation
Authors:
Meiling Hao,
Pengfan Su,
Liyuan Hu,
Zoltan Szabo,
Qingyuan Zhao,
Chengchun Shi
Abstract:
Off-policy evaluation (OPE) is crucial for evaluating a target policy's impact offline before its deployment. However, achieving accurate OPE in large state spaces remains challenging. This paper studies state abstractions, originally designed for policy learning, in the context of OPE. Our contributions are three-fold: (i) We define a set of irrelevance conditions central to learning state abstractions for OPE. (ii) We derive sufficient conditions for achieving irrelevance in Q-functions and marginalized importance sampling ratios, the latter obtained by constructing a time-reversed Markov decision process (MDP) based on the observed MDP. (iii) We propose a novel two-step procedure that sequentially projects the original state space into a smaller space, which substantially reduces the sample complexity of OPE arising from high cardinality.
Submitted 27 June, 2024;
originally announced June 2024.
-
Predict to Minimize Swap Regret for All Payoff-Bounded Tasks
Authors:
Lunjia Hu,
Yifan Wu
Abstract:
A sequence of predictions is calibrated if and only if it induces no swap regret in any downstream decision task. We study the Maximum Swap Regret (MSR) of predictions for binary events: the swap regret maximized over all downstream tasks with bounded payoffs. Previously, the best online prediction algorithm for minimizing MSR was obtained by minimizing the $K_1$ calibration error, which upper bounds MSR up to a constant factor. However, recent work (Qiao and Valiant, 2021) gives an $\Omega(T^{0.528})$ lower bound for the worst-case expected $K_1$ calibration error incurred by any randomized algorithm in $T$ rounds, presenting a barrier to achieving better rates for MSR. Several relaxations of MSR have been considered to overcome this barrier, via external regret (Kleinberg et al., 2023) and regret bounds depending polynomially on the number of actions in downstream tasks (Noarov et al., 2023; Roth and Shi, 2024). We show that the barrier can be surpassed without any relaxations: we give an efficient randomized prediction algorithm that guarantees $O(\sqrt{T}\log T)$ expected MSR. We also discuss the economic utility of calibration by viewing MSR as a decision-theoretic calibration error metric and study its relationship to existing metrics.
Submitted 24 April, 2024; v1 submitted 20 April, 2024;
originally announced April 2024.
-
Testing Calibration in Nearly-Linear Time
Authors:
Lunjia Hu,
Arun Jambulapati,
Kevin Tian,
Chutong Yang
Abstract:
In the recent literature on machine learning and decision making, calibration has emerged as a desirable and widely-studied statistical property of the outputs of binary prediction models. However, the algorithmic aspects of measuring model calibration have remained relatively underexplored. Motivated by [BGHN23], which proposed a rigorous framework for measuring distances to calibration, we initiate the algorithmic study of calibration through the lens of property testing. We define the problem of calibration testing from samples: given $n$ draws from a distribution $\mathcal{D}$ on (prediction, binary outcome) pairs, our goal is to distinguish between the case where $\mathcal{D}$ is perfectly calibrated and the case where $\mathcal{D}$ is $\varepsilon$-far from calibration.
We make the simple observation that the empirical smooth calibration linear program can be reformulated as an instance of minimum-cost flow on a highly-structured graph, and design an exact dynamic programming-based solver for it which runs in time $O(n\log^2(n))$ and solves the calibration testing problem information-theoretically optimally in the same time. This improves upon state-of-the-art black-box linear program solvers requiring $\Omega(n^{\omega})$ time, where $\omega > 2$ is the exponent of matrix multiplication. We also develop algorithms for tolerant variants of our testing problem that improve upon black-box linear program solvers, and give sample complexity lower bounds for alternative calibration measures to the one considered in this work. Finally, we present experiments showing that the testing problem we define faithfully captures standard notions of calibration, and that our algorithms scale efficiently to accommodate large sample sizes.
Submitted 21 June, 2024; v1 submitted 20 February, 2024;
originally announced February 2024.
-
On Computationally Efficient Multi-Class Calibration
Authors:
Parikshit Gopalan,
Lunjia Hu,
Guy N. Rothblum
Abstract:
Consider a multi-class labelling problem, where the labels can take values in $[k]$, and a predictor predicts a distribution over the labels. In this work, we study the following foundational question: Are there notions of multi-class calibration that give strong guarantees of meaningful predictions and can be achieved in time and sample complexities polynomial in $k$? Prior notions of calibration exhibit a tradeoff between computational efficiency and expressivity: they either suffer from having sample complexity exponential in $k$, or needing to solve computationally intractable problems, or give rather weak guarantees.
Our main contribution is a notion of calibration that achieves all these desiderata: we formulate a robust notion of projected smooth calibration for multi-class predictions, and give new recalibration algorithms for efficiently calibrating predictors under this definition with complexity polynomial in $k$. Projected smooth calibration gives strong guarantees for all downstream decision makers who want to use the predictor for binary classification problems of the form "does the label belong to a subset $T \subseteq [k]$?", for example, "is this an image of an animal?". It ensures that the probabilities predicted by summing the probabilities assigned to labels in $T$ are close to some perfectly calibrated binary predictor for that task. We also show that natural strengthenings of our definition are computationally hard to achieve: they run into information-theoretic barriers or computational intractability. Underlying both our upper and lower bounds is a tight connection that we prove between multi-class calibration and the well-studied problem of agnostic learning in the (standard) binary prediction setting.
Submitted 8 June, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
A flexible Bayesian g-formula for causal survival analyses with time-dependent confounding
Authors:
Xinyuan Chen,
Liangyuan Hu,
Fan Li
Abstract:
In longitudinal observational studies with a time-to-event outcome, a common objective in causal analysis is to estimate the causal survival curve under hypothetical intervention scenarios within the study cohort. The g-formula is a particularly useful tool for this analysis. To enhance the traditional parametric g-formula approach, we developed a more adaptable Bayesian g-formula estimator, which incorporates Bayesian additive regression trees (BART) into the modeling of the time-evolving generative components, aiming to mitigate bias due to model misspecification. Specifically, we introduce a more general class of g-formulas for discrete survival data that can incorporate longitudinal balancing scores, which serve as an effective method for dimension reduction and are vital when dealing with an expanding array of time-varying confounders. The minimum sufficient formulation of these longitudinal balancing scores is linked to the nature of treatment regimes, whether static or dynamic. For each type of treatment regime, we provide posterior sampling algorithms grounded in the BART framework. We have conducted simulation studies to illustrate the empirical performance of the proposed method and further demonstrate its practical utility using data from the Yale New Haven Health System's (YNHHS) electronic health records.
Submitted 28 June, 2024; v1 submitted 3 February, 2024;
originally announced February 2024.
-
Monotone Tree-Based GAMI Models by Adapting XGBoost
Authors:
Linwei Hu,
Soroush Aramideh,
Jie Chen,
Vijayan N. Nair
Abstract:
Recent papers have used machine learning architectures to fit low-order functional ANOVA models with main effects and second-order interactions. These GAMI (GAM + Interaction) models are directly interpretable as the functional main effects and interactions can be easily plotted and visualized. Unfortunately, it is not easy to incorporate the monotonicity requirement into the existing GAMI models based on boosted trees, such as EBM (Lou et al. 2013) and GAMI-Lin-T (Hu et al. 2022). This paper considers models of the form $f(x)=\sum_{j,k}f_{j,k}(x_j, x_k)$ and develops monotone tree-based GAMI models, called monotone GAMI-Tree, by adapting the XGBoost algorithm. It is straightforward to fit a monotone model to $f(x)$ using the options in XGBoost. However, the fitted model is still a black box. We take a different approach: i) use a filtering technique to determine the important interactions, ii) fit a monotone XGBoost model with the selected interactions, and finally iii) parse and purify the results to get a monotone GAMI model. Simulated datasets are used to demonstrate the behaviors of mono-GAMI-Tree and EBM, both of which use piecewise constant fits. Note that the monotonicity requirement is for the full model. In certain situations, the main effects will also be monotone. But, as seen in the examples, the interactions will not be monotone.
Submitted 5 September, 2023;
originally announced September 2023.
-
Computing SHAP Efficiently Using Model Structure Information
Authors:
Linwei Hu,
Ke Wang
Abstract:
SHAP (SHapley Additive exPlanations) has become a popular method to attribute the prediction of a machine learning model on an input to its features. One main challenge of SHAP is the computation time. Exact computation of Shapley values takes exponential time. Therefore, many approximation methods have been proposed in the literature. In this paper, we propose methods that can compute SHAP exactly in polynomial time or even faster for SHAP definitions that satisfy our additivity and dummy assumptions (e.g., kernel SHAP and baseline SHAP). We develop different strategies for models with different levels of model structure information: known functional decomposition, known order of model (defined as the highest order of interaction in the model), or unknown order. For the first case, we demonstrate an additive property and a way to compute SHAP from the lower-order functional components. For the second case, we derive formulas that can compute SHAP in polynomial time. Both methods yield exact SHAP results. Finally, if even the order of the model is unknown, we propose an iterative way to approximate Shapley values. The three methods we propose are computationally efficient when the order of the model is not high, which is typically the case in practice. We compare with the sampling approach proposed in Castor & Gomez (2008) using simulation studies to demonstrate the efficacy of our proposed methods.
Submitted 5 September, 2023;
originally announced September 2023.
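To make the exponential-cost remark in the abstract above concrete, here is a minimal brute-force sketch of baseline Shapley values. It is not the authors' method: the function name `baseline_shapley`, the toy model, and the choice to replace absent features with a fixed baseline vector are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def baseline_shapley(model, x, baseline):
    """Exact Shapley values by enumerating every feature subset.

    Hypothetical helper: features outside the coalition are set to
    their baseline value. Cost is exponential in len(x).
    """
    d = len(x)

    def value(coalition):
        # Evaluate the model with coalition features taken from x
        # and all remaining features taken from the baseline.
        z = [x[i] if i in coalition else baseline[i] for i in range(d)]
        return model(z)

    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Toy second-order model: one main effect and one pairwise interaction.
def toy_model(z):
    return 2 * z[0] + z[1] * z[2]


phi = baseline_shapley(toy_model, [1, 1, 1], [0, 0, 0])
```

In this toy case the main effect is attributed entirely to feature 0 and the interaction term is split equally between features 1 and 2, which is the kind of low-order structure the abstract proposes to exploit instead of enumerating all subsets.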
-
A testing-based approach to assess the clusterability of categorical data
Authors:
Lianyu Hu,
Junjie Dong,
Mudi Jiang,
Yan Liu,
Zengyou He
Abstract:
The objective of clusterability evaluation is to check whether a clustering structure exists within the data set. As a crucial yet often-overlooked issue in cluster analysis, it is essential to conduct such a test before applying any clustering algorithm. If a data set is unclusterable, any subsequent clustering analysis would not yield valid results. Despite its importance, the majority of existing studies focus on numerical data, leaving the clusterability evaluation issue for categorical data as an open problem. Here we present TestCat, a testing-based approach to assess the clusterability of categorical data in terms of an analytical $p$-value. The key idea underlying TestCat is that clusterable categorical data possess many strongly correlated attribute pairs and hence the sum of chi-squared statistics of all attribute pairs is employed as the test statistic for $p$-value calculation. We apply our method to a set of benchmark categorical data sets, showing that TestCat outperforms those solutions based on existing clusterability evaluation methods for numeric data. To the best of our knowledge, our work provides the first way to effectively recognize the clusterability of categorical data in a statistically sound manner.
Submitted 14 July, 2023;
originally announced July 2023.
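The test statistic described in the TestCat abstract above, the sum of chi-squared statistics over all attribute pairs, can be sketched as follows. This is only an illustration under assumptions: the helper names `chi_squared` and `pairwise_chi_squared_sum` are hypothetical, and the analytical $p$-value calculation from the paper is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def chi_squared(xs, ys):
    """Pearson chi-squared statistic for two categorical columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row = Counter(xs)
    col = Counter(ys)
    stat = 0.0
    for a in row:
        for b in col:
            # Expected count under independence of the two attributes.
            expected = row[a] * col[b] / n
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def pairwise_chi_squared_sum(rows):
    """Sum the chi-squared statistic over every pair of attributes."""
    columns = list(zip(*rows))
    return sum(
        chi_squared(columns[i], columns[j])
        for i, j in combinations(range(len(columns)), 2)
    )


# Two perfectly associated attributes versus two independent ones.
correlated = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
independent = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
stat_correlated = pairwise_chi_squared_sum(correlated)
stat_independent = pairwise_chi_squared_sum(independent)
```

A larger sum indicates more strongly associated attribute pairs, which is the signal the abstract ties to clusterability.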
-
When Does Optimizing a Proper Loss Yield Calibration?
Authors:
Jarosław Błasiok,
Parikshit Gopalan,
Lunjia Hu,
Preetum Nakkiran
Abstract:
Optimizing proper loss functions is popularly believed to yield predictors with good calibration properties; the intuition being that for such losses, the global optimum is to predict the ground-truth probabilities, which is indeed calibrated. However, typical machine learning models are trained to approximately minimize loss over restricted families of predictors that are unlikely to contain the ground truth. Under what circumstances does optimizing proper loss over a restricted family yield calibrated models? What precise calibration guarantees does it give? In this work, we provide a rigorous answer to these questions. We replace the global optimality with a local optimality condition stipulating that the (proper) loss of the predictor cannot be reduced much by post-processing its predictions with a certain family of Lipschitz functions. We show that any predictor with this local optimality satisfies smooth calibration as defined in Kakade-Foster (2008), Błasiok et al. (2023). Local optimality is plausibly satisfied by well-trained DNNs, which suggests an explanation for why they are calibrated from proper loss minimization alone. Finally, we show that the connection between local optimality and calibration error goes both ways: nearly calibrated predictors are also nearly locally optimal.
Submitted 8 December, 2023; v1 submitted 30 May, 2023;
originally announced May 2023.
-
Interpretable Machine Learning based on Functional ANOVA Framework: Algorithms and Comparisons
Authors:
Linwei Hu,
Vijayan N. Nair,
Agus Sudjianto,
Aijun Zhang,
Jie Chen
Abstract:
In the early days of machine learning (ML), the emphasis was on developing complex algorithms to achieve the best predictive performance. To understand and explain the model results, one had to rely on post hoc explainability techniques, which are known to have limitations. Recently, with the recognition that interpretability is just as important, researchers are compromising on small increases in predictive performance to develop algorithms that are inherently interpretable. While doing so, the ML community has rediscovered the use of low-order functional ANOVA (fANOVA) models that have been known in the statistical literature for some time. This paper starts with a description of challenges with post hoc explainability and reviews the fANOVA framework with a focus on main effects and second-order interactions. This is followed by an overview of two recently developed techniques: Explainable Boosting Machines or EBM (Lou et al., 2013) and GAMI-Net (Yang et al., 2021b). The paper proposes a new algorithm, called GAMI-Lin-T, that also uses trees like EBM, but it does linear fits instead of piecewise constants within the partitions. There are many other differences, including the development of a new interaction filtering algorithm. Finally, the paper uses simulated and real datasets to compare selected ML algorithms. The results show that GAMI-Lin-T and GAMI-Net have comparable performances, and both are generally better than EBM.
Submitted 24 May, 2023;
originally announced May 2023.
-
Causal Discovery from Subsampled Time Series with Proxy Variables
Authors:
Mingzhou Liu,
Xinwei Sun,
Lingjing Hu,
Yizhou Wang
Abstract:
Inferring causal structures from time series data is the central interest of many scientific inquiries. A major barrier to such inference is the problem of subsampling, i.e., the frequency of measurement is much lower than that of causal influence. To overcome this problem, numerous methods have been proposed, yet they were either limited to the linear case or failed to achieve identifiability. In this paper, we propose a constraint-based algorithm that can identify the entire causal structure from subsampled time series, without any parametric constraint. Our observation is that the challenge of subsampling arises mainly from hidden variables at the unobserved time steps. Meanwhile, every hidden variable has an observed proxy, which is essentially itself at some observable time in the future, benefiting from the temporal structure. Based on these, we can leverage the proxies to remove the bias induced by the hidden variables and hence achieve identifiability. Following this intuition, we propose a proxy-based causal discovery algorithm. Our algorithm is nonparametric and can achieve full causal identification. Theoretical advantages are reflected in synthetic and real-world experiments.
Submitted 24 December, 2023; v1 submitted 9 May, 2023;
originally announced May 2023.
-
Loss Minimization Yields Multicalibration for Large Neural Networks
Authors:
Jarosław Błasiok,
Parikshit Gopalan,
Lunjia Hu,
Adam Tauman Kalai,
Preetum Nakkiran
Abstract:
Multicalibration is a notion of fairness for predictors that requires them to provide calibrated predictions across a large set of protected groups. Multicalibration is known to be a distinct goal from loss minimization, even for simple predictors such as linear functions.
In this work, we consider the setting where the protected groups can be represented by neural networks of size $k$, and the predictors are neural networks of size $n > k$. We show that minimizing the squared loss over all neural nets of size $n$ implies multicalibration for all but a bounded number of unlucky values of $n$. We also give evidence that our bound on the number of unlucky values is tight, given our proof technique. Previously, results of the flavor that loss minimization yields multicalibration were known only for predictors that were near the ground truth, hence were rather limited in applicability. Unlike these, our results rely on the expressivity of neural nets and utilize the representation of the predictor.
Submitted 7 December, 2023; v1 submitted 19 April, 2023;
originally announced April 2023.
-
Comparative Learning: A Sample Complexity Theory for Two Hypothesis Classes
Authors:
Lunjia Hu,
Charlotte Peale
Abstract:
In many learning theory problems, a central role is played by a hypothesis class: we might assume that the data is labeled according to a hypothesis in the class (usually referred to as the realizable setting), or we might evaluate the learned model by comparing it with the best hypothesis in the class (the agnostic setting).
Taking a step beyond these classic setups that involve only a single hypothesis class, we introduce comparative learning as a combination of the realizable and agnostic settings in PAC learning: given two binary hypothesis classes $S$ and $B$, we assume that the data is labeled according to a hypothesis in the source class $S$ and require the learned model to achieve an accuracy comparable to the best hypothesis in the benchmark class $B$. Even when both $S$ and $B$ have infinite VC dimensions, comparative learning can still have a small sample complexity. We show that the sample complexity of comparative learning is characterized by the mutual VC dimension $\mathsf{VC}(S,B)$ which we define to be the maximum size of a subset shattered by both $S$ and $B$. We also show a similar result in the online setting, where we give a regret characterization in terms of the mutual Littlestone dimension $\mathsf{Ldim}(S,B)$. These results also hold for partial hypotheses.
We additionally show that the insights necessary to characterize the sample complexity of comparative learning can be applied to characterize the sample complexity of realizable multiaccuracy and multicalibration using the mutual fat-shattering dimension, an analogue of the mutual VC dimension for real-valued hypotheses. This not only solves an open problem proposed by Hu, Peale, Reingold (2022), but also leads to independently interesting results extending classic ones about regression, boosting, and covering number to our two-hypothesis-class setting.
Submitted 16 November, 2022;
originally announced November 2022.
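The definition of the mutual VC dimension in the abstract above (the maximum size of a subset shattered by both $S$ and $B$) can be checked by brute force on small finite classes. The sketch below is illustrative only; representing each hypothesis as a tuple of 0/1 labels over a finite domain, and the names `shatters` and `mutual_vc_dimension`, are assumptions.

```python
from itertools import combinations


def shatters(hypotheses, subset):
    """True if the class realizes all 2^|subset| labelings on subset."""
    patterns = {tuple(h[i] for i in subset) for h in hypotheses}
    return len(patterns) == 2 ** len(subset)


def mutual_vc_dimension(source, benchmark, domain_size):
    """Largest size of a subset shattered by both classes (brute force)."""
    best = 0
    for size in range(1, domain_size + 1):
        found = any(
            shatters(source, s) and shatters(benchmark, s)
            for s in combinations(range(domain_size), size)
        )
        if not found:
            # Subsets of a shattered set are shattered, so no larger
            # jointly shattered subset can exist.
            break
        best = size
    return best


# Each class below has VC dimension 1, but they shatter different points.
S = [(0, 0), (1, 0)]  # shatters only point 0
B = [(0, 0), (0, 1)]  # shatters only point 1
mutual = mutual_vc_dimension(S, B, domain_size=2)
same = mutual_vc_dimension(S, S, domain_size=2)
```

The toy classes show how the mutual dimension can be strictly smaller than either class's own VC dimension, which is the effect the abstract relies on when both classes have large or infinite VC dimension.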
-
Doubly Inhomogeneous Reinforcement Learning
Authors:
Liyuan Hu,
Mengbing Li,
Chengchun Shi,
Zhenke Wu,
Piotr Fryzlewicz
Abstract:
This paper studies reinforcement learning (RL) in doubly inhomogeneous environments under temporal non-stationarity and subject heterogeneity. In a number of applications, it is commonplace to encounter datasets generated by system dynamics that may change over time and population, challenging high-quality sequential decision making. Nonetheless, most existing RL solutions require either temporal stationarity or subject homogeneity, which would result in sub-optimal policies if both assumptions were violated. To address both challenges simultaneously, we propose an original algorithm to determine the "best data chunks" that display similar dynamics over time and across individuals for policy learning, which alternates between most recent change point detection and cluster identification. Our method is general, and works with a wide range of clustering and change point detection algorithms. It is multiply robust in the sense that it takes multiple initial estimators as input and only requires one of them to be consistent. Moreover, by borrowing information over time and population, it allows us to detect weaker signals and has better convergence properties when compared to applying the clustering algorithm per time or the change point detection algorithm per subject. Empirically, we demonstrate the usefulness of our method through extensive simulations and a real data application.
Submitted 12 November, 2022; v1 submitted 7 November, 2022;
originally announced November 2022.
-
Significance-Based Categorical Data Clustering
Authors:
Lianyu Hu,
Mudi Jiang,
Yan Liu,
Zengyou He
Abstract:
Although numerous algorithms have been proposed to solve the categorical data clustering problem, how to assess the statistical significance of a set of categorical clusters remains unaddressed. To fill this void, we employ the likelihood ratio test to derive a test statistic that can serve as a significance-based objective function in categorical data clustering. Consequently, a new clustering algorithm is proposed in which the significance-based objective function is optimized via a Monte Carlo search procedure. As a by-product, we can further calculate an empirical $p$-value to assess the statistical significance of a set of clusters and develop an improved gap statistic for estimating the cluster number. Extensive experimental studies suggest that our method is able to achieve comparable performance to state-of-the-art categorical data clustering algorithms. Moreover, the effectiveness of such a significance-based formulation on statistical cluster validation and cluster number estimation is demonstrated through comprehensive empirical results.
Submitted 7 November, 2022;
originally announced November 2022.
-
Subspace Recovery from Heterogeneous Data with Non-isotropic Noise
Authors:
John Duchi,
Vitaly Feldman,
Lunjia Hu,
Kunal Talwar
Abstract:
Recovering linear subspaces from data is a fundamental and important task in statistics and machine learning. Motivated by heterogeneity in Federated Learning settings, we study a basic formulation of this problem: principal component analysis (PCA), with a focus on dealing with irregular noise. Our data come from $n$ users with user $i$ contributing data samples from a $d$-dimensional distribution with mean $\mu_i$. Our goal is to recover the linear subspace shared by $\mu_1,\ldots,\mu_n$ using the data points from all users, where every data point from user $i$ is formed by adding an independent mean-zero noise vector to $\mu_i$. If we only have one data point from every user, subspace recovery is information-theoretically impossible when the covariance matrices of the noise vectors can be non-spherical, necessitating additional restrictive assumptions in previous work. We avoid these assumptions by leveraging at least two data points from each user, which allows us to design an efficiently-computable estimator under non-spherical and user-dependent noise. We prove an upper bound for the estimation error of our estimator in general scenarios where the number of data points and amount of noise can vary across users, and prove an information-theoretic error lower bound that not only matches the upper bound up to a constant factor, but also holds even for spherical Gaussian noise. This implies that our estimator does not introduce additional estimation error (up to a constant factor) due to irregularity in the noise. We show additional results for a linear regression problem in a similar setup.
Submitted 24 October, 2022;
originally announced October 2022.
-
Doubly robust estimation and sensitivity analysis for marginal structural quantile models
Authors:
Chao Cheng,
Liangyuan Hu,
Fan Li
Abstract:
The marginal structural quantile model (MSQM) provides a unique lens to understand the causal effect of a time-varying treatment on the full distribution of potential outcomes. Under the semiparametric framework, we derive the efficient influence function for the MSQM, from which a new doubly robust estimator is proposed for point estimation and inference. We show that the doubly robust estimator is consistent if either of the models associated with treatment assignment or the potential outcome distributions is correctly specified, and is semiparametrically efficient if both models are correct. To implement the doubly robust MSQM estimator, we propose to solve a smoothed estimating equation to facilitate efficient computation of the point and variance estimates. In addition, we develop a confounding function approach to investigate the sensitivity of several MSQM estimators when the sequential ignorability assumption is violated. Extensive simulations are conducted to examine the finite-sample performance characteristics of the proposed methods. We apply the proposed methods to the Yale New Haven Health System Electronic Health Record data to study the effect of antihypertensive medications on patients with severe hypertension and assess the robustness of findings to unmeasured baseline and time-varying confounding.
Submitted 10 February, 2024; v1 submitted 8 October, 2022;
originally announced October 2022.
-
Using Model-Based Trees with Boosting to Fit Low-Order Functional ANOVA Models
Authors:
Linwei Hu,
Jie Chen,
Vijayan N. Nair
Abstract:
Low-order functional ANOVA (fANOVA) models have been rediscovered in the machine learning (ML) community under the guise of inherently interpretable machine learning. Explainable Boosting Machines or EBM (Lou et al. 2013) and GAMI-Net (Yang et al. 2021) are two recently proposed ML algorithms for fitting functional main effects and second-order interactions. We propose a new algorithm, called GAMI-Tree, that is similar to EBM, but has a number of features that lead to better performance. It uses model-based trees as base learners and incorporates a new interaction filtering method that is better at capturing the underlying interactions. In addition, our iterative training method converges to a model with better predictive performance, and the embedded purification ensures that interactions are hierarchically orthogonal to main effects. The algorithm does not need extensive tuning, and our implementation is fast and efficient. We use simulated and real datasets to compare the performance and interpretability of GAMI-Tree with EBM and GAMI-Net.
Submitted 15 December, 2023; v1 submitted 14 July, 2022;
originally announced July 2022.
-
Shapley Computations Using Surrogate Model-Based Trees
Authors:
Zhipu Zhou,
Jie Chen,
Linwei Hu
Abstract:
Shapley-related techniques have gained attention as both global and local interpretation tools because of their desirable properties. However, their computation using conditional expectations is computationally expensive. Approximation methods suggested in the literature have limitations. This paper proposes the use of a surrogate model-based tree to compute Shapley and SHAP values based on conditional expectation. Simulation studies show that the proposed algorithm provides improvements in accuracy and unifies global Shapley and SHAP interpretation, and that the thresholding method provides a way to trade off running time and accuracy.
Submitted 11 July, 2022;
originally announced July 2022.
-
A new method for clustered survival data: Estimation of treatment effect heterogeneity and variable selection
Authors:
Liangyuan Hu
Abstract:
We recently developed a new method, riAFT-BART, to draw causal inferences about population treatment effect on patient survival from clustered and censored survival data while accounting for the multilevel data structure. The practical utility of this method goes beyond the estimation of population average treatment effect. In this work, we exposit how riAFT-BART can be used to solve two important statistical questions with clustered survival data: estimating the treatment effect heterogeneity and variable selection. Leveraging likelihood-based machine learning, we describe a way in which we can draw posterior samples of the individual survival treatment effect from riAFT-BART model runs, and use the drawn posterior samples to perform an exploratory treatment effect heterogeneity analysis to identify subpopulations who may experience treatment effects that differ from the population average effect. There is sparse literature on methods for variable selection among clustered and censored survival data, particularly ones using flexible modeling techniques. We propose a permutation-based approach using the predictor's variable inclusion proportion supplied by the riAFT-BART model for variable selection. To address the missing data issue frequently encountered in health databases, we propose a strategy to combine bootstrap imputation and riAFT-BART for variable selection among incomplete clustered survival data. We conduct an expansive simulation study to examine the practical operating characteristics of our proposed methods. Finally, we demonstrate the methods via a case study of predictors for in-hospital mortality among severe COVID-19 patients and estimating the heterogeneous treatment effects of three COVID-specific medications. The methods developed in this work are readily available in the $\textsf{R}$ package $\textsf{riAFTBART}$.
Submitted 11 August, 2023; v1 submitted 16 June, 2022;
originally announced June 2022.
-
The statistical nature of h-index of a network node
Authors:
Yan Liu,
Mudi Jiang,
Lianyu Hu,
Zengyou He
Abstract:
Evaluating the importance of a network node is a crucial task in network science and graph data mining. The h-index is a popular centrality measure for this task; however, it still lacks an interpretation from a rigorous statistical perspective. Here we show the statistical nature of the h-index from the perspective of order statistics, and we obtain a new family of centrality indices by generalizing the h-index along this direction. Theoretical and empirical evidence shows that such a statistical interpretation enables us to obtain a general and versatile framework for quantifying the importance of a network node. Under this framework, many new centrality indices can be derived, some of which can be more accurate and robust than the h-index. We believe that this research opens up new avenues for developing more effective indices for node importance quantification from a viewpoint that still remains unexplored.
Submitted 19 May, 2023; v1 submitted 1 June, 2022;
originally announced June 2022.
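As a rough illustration of the order-statistics view mentioned in the abstract above, the sketch below computes the h-index of a node as the largest k such that the k-th largest neighbor degree is at least k. This is the commonly used definition of a node's h-index and is assumed here; the function name `node_h_index` and the adjacency-dict representation are hypothetical choices for the example, not taken from the paper.

```python
def node_h_index(adj, node):
    """h-index of `node`: largest k such that at least k neighbors have degree >= k.

    `adj` is an adjacency dict {vertex: list of neighbors} for an undirected
    graph (an illustrative representation, assumed for this sketch).
    """
    # Neighbor degrees sorted in decreasing order, i.e. the order statistics
    # d_(1) >= d_(2) >= ... of the neighbor-degree sample.
    degrees = sorted((len(adj[v]) for v in adj[node]), reverse=True)
    h = 0
    for k, d in enumerate(degrees, start=1):
        if d >= k:
            h = k
        else:
            break
    return h


# Small undirected graph with edges 0-1, 0-2, 0-3, 1-2, 2-3.
example_graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
h_values = {v: node_h_index(example_graph, v) for v in example_graph}
```

Replacing the threshold rule `d >= k` by another function of the sorted degrees would be one way to experiment with generalized indices, although the specific family proposed in the paper is not given in the abstract.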
-
Performance and Interpretability Comparisons of Supervised Machine Learning Algorithms: An Empirical Study
Authors:
Alice J. Liu,
Arpita Mukherjee,
Linwei Hu,
Jie Chen,
Vijayan N. Nair
Abstract:
This paper compares the performances of three supervised machine learning algorithms in terms of predictive ability and model interpretation on structured or tabular data. The algorithms considered were scikit-learn implementations of extreme gradient boosting machines (XGB) and random forests (RFs), and feedforward neural networks (FFNNs) from TensorFlow. The paper is organized in a findings-based manner, with each section providing general conclusions supported by empirical results from simulation studies that cover a wide range of model complexity and correlation structures among predictors. We considered both continuous and binary responses of different sample sizes.
Overall, XGB and FFNNs were competitive, with FFNNs showing better performance in smooth models and tree-based boosting algorithms performing better in non-smooth models. This conclusion held generally for predictive performance, identification of important variables, and determining correct input-output relationships as measured by partial dependence plots (PDPs). FFNNs generally had less over-fitting, as measured by the difference in performance between training and testing datasets. However, the difference with XGB was often small. RFs did not perform well in general, confirming the findings in the literature. All models exhibited different degrees of bias seen in PDPs, but the bias was especially problematic for RFs. The extent of the biases varied with correlation among predictors, response type, and data set sample size. In general, tree-based models tended to over-regularize the fitted model in the tails of predictor distributions. Finally, as expected, performance was better for continuous responses than for binary responses, and better with larger samples.
Submitted 5 May, 2022; v1 submitted 27 April, 2022;
originally announced April 2022.
-
Explaining Adverse Actions in Credit Decisions Using Shapley Decomposition
Authors:
Vijayan N. Nair,
Tianshu Feng,
Linwei Hu,
Zach Zhang,
Jie Chen,
Agus Sudjianto
Abstract:
When a financial institution declines an application for credit, an adverse action (AA) is said to occur. The applicant is then entitled to an explanation for the negative decision. This paper focuses on credit decisions based on a predictive model for probability of default and proposes a methodology for AA explanation. The problem involves identifying the important predictors responsible for the negative decision and is straightforward when the underlying model is additive. However, it becomes non-trivial even for linear models with interactions. We consider models with low-order interactions and develop a simple and intuitive approach based on first principles. We then show how the methodology generalizes to the well-known Shapley decomposition and the recently proposed concept of Baseline Shapley (B-Shap). Unlike other Shapley techniques in the literature for local interpretability of machine learning results, B-Shap is computationally tractable since it involves just function evaluations. An illustrative case study is used to demonstrate the usefulness of the method. The paper also discusses situations with highly correlated predictors and desirable properties of fitted models in the credit-lending context, such as monotonicity and continuity.
Submitted 26 April, 2022;
originally announced April 2022.
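To make the remark that B-Shap "involves just function evaluations" concrete, here is a minimal sketch of an exact Baseline Shapley computation for a small number of predictors. It assumes the usual B-Shap value function, in which features outside a coalition are set to their baseline values; the function name `baseline_shapley` and the toy model are hypothetical and are not the paper's case study.

```python
from itertools import combinations
from math import factorial


def baseline_shapley(f, x, baseline):
    """Exact Baseline Shapley attributions of f(x) - f(baseline).

    For a coalition S, the value is f evaluated at the point that takes
    x[j] for j in S and baseline[j] otherwise (assumed B-Shap convention).
    Cost grows as 2^d, so this is only meant for small d.
    """
    d = len(x)

    def value(subset):
        z = [x[j] if j in subset else baseline[j] for j in range(d)]
        return f(z)

    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for r in range(d):
            # Shapley weight for coalitions of size r that exclude feature i.
            weight = factorial(r) * factorial(d - r - 1) / factorial(d)
            for coalition in combinations(others, r):
                s = set(coalition)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Toy model with one interaction term (illustrative only).
def toy_model(z):
    return 2 * z[0] + z[1] + z[0] * z[1]


attributions = baseline_shapley(toy_model, x=[1.0, 2.0], baseline=[0.0, 0.0])
```

In this toy case the interaction term is split equally between the two predictors, and the attributions sum to the difference between the model output at `x` and at the baseline.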
-
Metric Entropy Duality and the Sample Complexity of Outcome Indistinguishability
Authors:
Lunjia Hu,
Charlotte Peale,
Omer Reingold
Abstract:
We give the first sample complexity characterizations for outcome indistinguishability, a theoretical framework of machine learning recently introduced by Dwork, Kim, Reingold, Rothblum, and Yona (STOC 2021). In outcome indistinguishability, the goal of the learner is to output a predictor that cannot be distinguished from the target predictor by a class $D$ of distinguishers examining the outcomes generated according to the predictors' predictions.
In the distribution-specific and realizable setting where the learner is given the data distribution together with a predictor class $P$ containing the target predictor, we show that the sample complexity of outcome indistinguishability is characterized by the metric entropy of $P$ w.r.t. the dual Minkowski norm defined by $D$, and equivalently by the metric entropy of $D$ w.r.t. the dual Minkowski norm defined by $P$. This equivalence makes an intriguing connection to the long-standing metric entropy duality conjecture in convex geometry. Our sample complexity characterization implies a variant of metric entropy duality, which we show is nearly tight. In the distribution-free setting, we focus on the case considered by Dwork et al. where $P$ contains all possible predictors, hence the sample complexity only depends on $D$. In this setting, we show that the sample complexity of outcome indistinguishability is characterized by the fat-shattering dimension of $D$.
We also show a strong sample complexity separation between realizable and agnostic outcome indistinguishability in both the distribution-free and the distribution-specific settings. This is in contrast to distribution-free (resp. distribution-specific) PAC learning where the sample complexity in both the realizable and the agnostic settings can be characterized by the VC dimension (resp. metric entropy).
Submitted 9 March, 2022;
originally announced March 2022.
-
A flexible approach for causal inference with multiple treatments and clustered survival outcomes
Authors:
Liangyuan Hu,
Jiayi Ji,
Ronald D. Ennis,
Joseph W. Hogan
Abstract:
When drawing causal inferences about the effects of multiple treatments on clustered survival outcomes using observational data, we need to address implications of the multilevel data structure, multiple treatments, censoring and unmeasured confounding for causal analyses. Few off-the-shelf causal inference tools are available to simultaneously tackle these issues. We develop a flexible random-intercept accelerated failure time model, in which we use Bayesian additive regression trees to capture arbitrarily complex relationships between censored survival times and pre-treatment covariates and use the random intercepts to capture cluster-specific main effects. We develop an efficient Markov chain Monte Carlo algorithm to draw posterior inferences about the population survival effects of multiple treatments and examine the variability in cluster-level effects. We further propose an interpretable sensitivity analysis approach to evaluate the sensitivity of drawn causal inferences about treatment effect to the potential magnitude of departure from the causal assumption of no unmeasured confounding. Expansive simulations empirically validate and demonstrate good practical operating characteristics of our proposed methods. Applying the proposed methods to a dataset on older high-risk localized prostate cancer patients drawn from the National Cancer Database, we evaluate the comparative effects of three treatment approaches on patient survival, and assess the ramifications of potential unmeasured confounding. The methods developed in this work are readily available in the $\textsf{R}$ package $\textsf{riAFTBART}$.
Submitted 16 February, 2022;
originally announced February 2022.
-
CIMTx: An R package for causal inference with multiple treatments using observational data
Authors:
Liangyuan Hu,
Jiayi Ji
Abstract:
CIMTx provides efficient and unified functions to implement modern methods for causal inferences with multiple treatments using observational data with a focus on binary outcomes. The methods include regression adjustment, inverse probability of treatment weighting, Bayesian additive regression trees, regression adjustment with multivariate spline of the generalized propensity score, vector matching and targeted maximum likelihood estimation. In addition, CIMTx illustrates ways in which users can simulate data adhering to the complex data structures in the multiple treatment setting. Furthermore, the CIMTx package offers a unique set of features to address the key causal assumptions: positivity and ignorability. For the positivity assumption, CIMTx demonstrates techniques to identify the common support region for retaining inferential units using inverse probability of treatment weighting, Bayesian additive regression trees and vector matching. To handle the ignorability assumption, CIMTx provides a flexible Monte Carlo sensitivity analysis approach to evaluate how causal conclusions would be altered in response to different magnitudes of departure from ignorable treatment assignment.
Submitted 14 September, 2022; v1 submitted 19 October, 2021;
originally announced October 2021.
-
abess: A Fast Best Subset Selection Library in Python and R
Authors:
Jin Zhu,
Xueqin Wang,
Liyuan Hu,
Junhao Huang,
Kangkang Jiang,
Yanhang Zhang,
Shiyun Lin,
Junxian Zhu
Abstract:
We introduce a new library named abess that implements a unified framework of best-subset selection for solving diverse machine learning problems, e.g., linear regression, classification, and principal component analysis. In particular, abess certifiably obtains the optimal solution in polynomial time with high probability under the linear model. Our efficient implementation allows abess to attain the solution of best-subset selection problems as fast as or even 20x faster than existing competing variable (model) selection toolboxes. Furthermore, it supports common variants like best group subset selection and $\ell_2$ regularized best-subset selection. The core of the library is programmed in C++. For ease of use, a Python library is designed for convenient integration with scikit-learn, and it can be installed from the Python Package Index. In addition, a user-friendly R library is available at the Comprehensive R Archive Network. The source code is available at: https://github.com/abess-team/abess.
Submitted 16 June, 2022; v1 submitted 18 October, 2021;
originally announced October 2021.
-
Estimating the causal effects of multiple intermittent treatments with application to COVID-19
Authors:
Liangyuan Hu,
Jiayi Ji,
Himanshu Joshi,
Erick Scott,
Fan Li
Abstract:
To draw real-world evidence about the comparative effectiveness of multiple time-varying treatments on patient survival, we develop a joint marginal structural survival model and a novel weighting strategy to account for time-varying confounding and censoring. Our methods formulate complex longitudinal treatments with multiple start/stop switches as the recurrent events with discontinuous intervals of treatment eligibility. We derive the weights in continuous time to handle a complex longitudinal dataset without the need to discretize or artificially align the measurement times. We further use machine learning models designed for censored survival data with time-varying covariates and the kernel function estimator of the baseline intensity to efficiently estimate the continuous-time weights. Our simulations demonstrate that the proposed methods provide better bias reduction and nominal coverage probability when analyzing observational longitudinal survival data with irregularly spaced time intervals, compared to conventional methods that require aligned measurement time points. We apply the proposed methods to a large-scale COVID-19 dataset to estimate the causal effects of several COVID-19 treatments on the composite of in-hospital mortality and ICU admission.
Submitted 4 August, 2023; v1 submitted 27 September, 2021;
originally announced September 2021.
-
Discussion on "Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects" by Hahn, Murray and Carvalho
Authors:
Liangyuan Hu
Abstract:
Hahn et al. (2020) offer an extensive study to explicate and evaluate the performance of the BCF model in different settings and provide a detailed discussion about its utility in causal inference. It is a welcome addition to the causal machine learning literature. I will emphasize the contribution of the BCF model to the field of causal inference through discussions on two topics: 1) the difference between the PS in the BCF model and the Bayesian PS in a Bayesian updating approach, 2) an alternative exposition of the role of the PS in outcome modeling based methods for the estimation of causal effects. I will conclude with comments on avenues for future research involving BCF that will be important and much needed in the era of big data.
Submitted 5 August, 2021;
originally announced August 2021.
-
Faster Rates of Private Stochastic Convex Optimization
Authors:
Jinyan Su,
Lijie Hu,
Di Wang
Abstract:
In this paper, we revisit the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) and provide excess population risk bounds for some special classes of functions that are faster than the previous results for general convex and strongly convex functions. In the first part of the paper, we study the case where the population risk function satisfies the Tsybakov Noise Condition (TNC) with some parameter $θ>1$. Specifically, we first show that under some mild assumptions on the loss functions, there is an algorithm whose output could achieve an upper bound of $\tilde{O}((\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log \frac{1}δ}}{nε})^\fracθ{θ-1})$ for $(ε, δ)$-DP when $θ\geq 2$, where $n$ is the sample size and $d$ is the dimension of the space. Then we address the inefficiency issue, improve the upper bounds by $\text{Poly}(\log n)$ factors and extend to the case where $θ\geq \barθ>1$ for some known $\barθ$. Next we show that the excess population risk of population functions satisfying TNC with parameter $θ\geq 2$ is always lower bounded by $Ω((\frac{d}{nε})^\fracθ{θ-1})$ and $Ω((\frac{\sqrt{d\log \frac{1}δ}}{nε})^\fracθ{θ-1})$ for $ε$-DP and $(ε, δ)$-DP, respectively. In the second part, we focus on a special case where the population risk function is strongly convex. Unlike the previous studies, here we assume the loss function is {\em non-negative} and {\em the optimal value of population risk is sufficiently small}. With these additional assumptions, we propose a new method whose output could achieve an upper bound of $O(\frac{d\log\frac{1}δ}{n^2ε^2}+\frac{1}{n^τ})$ for any $τ\geq 1$ in $(ε,δ)$-DP model if the sample size $n$ is sufficiently large.
Submitted 16 January, 2022; v1 submitted 31 July, 2021;
originally announced August 2021.
-
High Dimensional Differentially Private Stochastic Optimization with Heavy-tailed Data
Authors:
Lijie Hu,
Shuo Ni,
Hanshen Xiao,
Di Wang
Abstract:
As one of the most fundamental problems in machine learning, statistics and differential privacy, Differentially Private Stochastic Convex Optimization (DP-SCO) has been extensively studied in recent years. However, most of the previous work can only handle either regular data distributions or irregular data in the low dimensional space case. To better understand the challenges arising from irregular data distributions, in this paper we provide the first study on the problem of DP-SCO with heavy-tailed data in the high dimensional space. In the first part we focus on the problem over some polytope constraint (such as the $\ell_1$-norm ball). We show that if the loss function is smooth and its gradient has bounded second order moment, it is possible to get a (high probability) error bound (excess population risk) of $\tilde{O}(\frac{\log d}{(nε)^\frac{1}{3}})$ in the $ε$-DP model, where $n$ is the sample size and $d$ is the dimensionality of the underlying space. Next, for LASSO, if the data distribution has bounded fourth-order moments, we improve the bound to $\tilde{O}(\frac{\log d}{(nε)^\frac{2}{5}})$ in the $(ε, δ)$-DP model. In the second part of the paper, we study sparse learning with heavy-tailed data. We first revisit the sparse linear model and propose a truncated DP-IHT method whose output could achieve an error of $\tilde{O}(\frac{s^{*2}\log d}{nε})$, where $s^*$ is the sparsity of the underlying parameter. Then we study a more general problem over the sparsity ({\em i.e.,} $\ell_0$-norm) constraint, and show that it is possible to achieve an error of $\tilde{O}(\frac{s^{*\frac{3}{2}}\log d}{nε})$, which is also near optimal up to a factor of $\tilde{O}{(\sqrt{s^*})}$, if the loss function is smooth and strongly convex.
Submitted 9 August, 2021; v1 submitted 23 July, 2021;
originally announced July 2021.
-
A flexible approach for variable selection in large-scale healthcare database studies with missing covariate and outcome data
Authors:
Jung-Yi Joyce Lin,
Liangyuan Hu,
Chuyue Huang,
Steven Lawrence,
Usha Govindarajulu
Abstract:
Prior work has shown that combining bootstrap imputation with tree-based machine learning variable selection methods can recover the performance achievable on fully observed data when covariate and outcome data are missing at random (MAR). This approach, however, is computationally expensive, especially on large-scale datasets. We propose an inference-based method, called RR-BART, which leverages the likelihood-based Bayesian machine learning technique, Bayesian additive regression trees, and uses Rubin's rule to combine the estimates and variances of the variable importance measures on multiply imputed datasets for variable selection in the presence of MAR data. We conduct a representative simulation study to investigate the practical operating characteristics of RR-BART, and compare it with the bootstrap imputation based methods. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome among middle-aged women using data from the Study of Women's Health Across the Nation (SWAN). The simulation study suggests that even in complex conditions of nonlinearity and nonadditivity with a large percentage of missingness, RR-BART can reasonably recover the prediction and variable selection performance achievable on the fully observed data. RR-BART provides the best performance that the bootstrap imputation based methods can achieve with the optimal selection threshold value. In addition, RR-BART demonstrates a substantially stronger ability to detect discrete predictors. Furthermore, RR-BART offers substantial computational savings. When implemented on the SWAN data, RR-BART adds to the literature by selecting a set of predictors that had been less commonly identified as risk factors but had substantial biological justifications.
Submitted 13 April, 2022; v1 submitted 20 July, 2021;
originally announced July 2021.
-
Near-Optimal Explainable $k$-Means for All Dimensions
Authors:
Moses Charikar,
Lunjia Hu
Abstract:
Many clustering algorithms are guided by certain cost functions such as the widely-used $k$-means cost. These algorithms divide data points into clusters with often complicated boundaries, creating difficulties in explaining the clustering decision. In a recent work, Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020) introduced explainable clustering, where the cluster boundaries are axis-parallel hyperplanes and the clustering is obtained by applying a decision tree to the data. The central question here is: how much does the explainability constraint increase the value of the cost function?
Given $d$-dimensional data points, we show an efficient algorithm that finds an explainable clustering whose $k$-means cost is at most $k^{1 - 2/d}\,\mathrm{poly}(d\log k)$ times the minimum cost achievable by a clustering without the explainability constraint, assuming $k,d\ge 2$. Taking the minimum of this bound and the $k\,\mathrm{polylog} (k)$ bound in independent work by Makarychev-Shan (ICML 2021), Gamlath-Jia-Polak-Svensson (2021), or Esfandiari-Mirrokni-Narayanan (2021), we get an improved bound of $k^{1 - 2/d}\,\mathrm{polylog}(k)$, which we show is optimal for every choice of $k,d\ge 2$ up to a poly-logarithmic factor in $k$. For $d = 2$ in particular, we show an $O(\log k\log\log k)$ bound, improving near-exponentially over the previous best bound of $O(k\log k)$ by Laber and Murtinho (ICML 2021).
Submitted 4 November, 2021; v1 submitted 29 June, 2021;
originally announced June 2021.
-
Variable selection with missing data in both covariates and outcomes: Imputation and machine learning
Authors:
Liangyuan Hu,
Jung-Yi Joyce Lin,
Jiayi Ji
Abstract:
The missing data issue is ubiquitous in health studies. Variable selection in the presence of both missing covariates and outcomes is an important statistical research topic but has been less studied. Existing literature focuses on parametric regression techniques that provide direct parameter estimates of the regression model. Flexible nonparametric machine learning methods considerably mitigate the reliance on the parametric assumptions, but do not provide as naturally defined a variable importance measure as the covariate effect native to parametric models. We investigate a general variable selection approach when both the covariates and outcomes can be missing at random and have general missing data patterns. This approach exploits the flexibility of machine learning modeling techniques and bootstrap imputation, which is amenable to nonparametric methods in which the covariate effects are not directly available. We conduct expansive simulations investigating the practical operating characteristics of the proposed variable selection approach, when combined with four tree-based machine learning methods, XGBoost, Random Forests, Bayesian Additive Regression Trees (BART) and Conditional Random Forests, and two commonly used parametric methods, lasso and backward stepwise selection. Numerical results suggest that when combined with bootstrap imputation, XGBoost and BART have the overall best variable selection performance with respect to the $F_1$ score and Type I error across various settings. In general, there is no significant difference in the variable selection performance due to imputation methods. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome with data from the Study of Women's Health Across the Nation.
Submitted 7 July, 2021; v1 submitted 6 April, 2021;
originally announced April 2021.
-
Propensity Score Weighting Analysis of Survival Outcomes Using Pseudo-observations
Authors:
Shuxi Zeng,
Fan Li,
Liangyuan Hu,
Fan Li
Abstract:
Survival outcomes are common in comparative effectiveness studies and require unique handling because they are usually incompletely observed due to right-censoring. A ``once for all'' approach for causal inference with survival outcomes constructs pseudo-observations and allows standard methods such as propensity score weighting to proceed as if the outcomes are completely observed. For a general class of model-free causal estimands with survival outcomes on user-specified target populations, we develop corresponding propensity score weighting estimators based on the pseudo-observations and establish their asymptotic properties. In particular, utilizing the functional delta-method and the von Mises expansion, we derive a new closed-form variance of the weighting estimator that takes into account the uncertainty due to both pseudo-observation calculation and propensity score estimation. This allows valid and computationally efficient inference without resampling. We also prove the optimal efficiency property of the overlap weights within the class of balancing weights for survival outcomes. The proposed methods are applicable to both binary and multiple treatments. Extensive simulations are conducted to explore the operating characteristics of the proposed method versus other commonly used alternatives. We apply the proposed method to compare the causal effects of three popular treatment approaches for prostate cancer patients.
Submitted 18 December, 2021; v1 submitted 28 February, 2021;
originally announced March 2021.
-
A flexible sensitivity analysis approach for unmeasured confounding with multiple treatments and a binary outcome with application to SEER-Medicare lung cancer data
Authors:
Liangyuan Hu,
Jungang Zou,
Chenyang Gu,
Jiayi Ji,
Michael Lopez,
Minal Kale
Abstract:
In the absence of a randomized experiment, a key assumption for drawing causal inference about treatment effects is the ignorable treatment assignment. Violations of the ignorability assumption may lead to biased treatment effect estimates. Sensitivity analysis helps gauge how causal conclusions will be altered in response to the potential magnitude of departure from the ignorability assumption. However, sensitivity analysis approaches for unmeasured confounding in the context of multiple treatments and binary outcomes are scarce. We propose a flexible Monte Carlo sensitivity analysis approach for causal inference in such settings. We first derive the general form of the bias introduced by unmeasured confounding, with emphasis on theoretical properties uniquely relevant to multiple treatments. We then propose methods to encode the impact of unmeasured confounding on potential outcomes and adjust the estimates of causal effects in which the presumed unmeasured confounding is removed. Our proposed methods embed nested multiple imputation within the Bayesian framework, which allows for seamless integration of the uncertainty about the values of the sensitivity parameters and the sampling variability, as well as the use of Bayesian Additive Regression Trees for modeling flexibility. Expansive simulations validate our methods and provide insight into sensitivity analysis with multiple treatments. We use the SEER-Medicare data to demonstrate sensitivity analysis using three treatments for early stage non-small cell lung cancer. The methods developed in this work are readily available in the R package SAMTx.
Submitted 13 August, 2021; v1 submitted 10 December, 2020;
originally announced December 2020.
-
Differentially Private (Gradient) Expectation Maximization Algorithm with Statistical Guarantees
Authors:
Di Wang,
Jiahao Ding,
Lijie Hu,
Zejun Xie,
Miao Pan,
Jinhui Xu
Abstract:
(Gradient) Expectation Maximization (EM) is a widely used algorithm for estimating the maximum likelihood of mixture models or incomplete data problems. A major challenge facing this popular technique is how to effectively preserve the privacy of sensitive data. Previous research on this problem has already led to the discovery of some Differentially Private (DP) algorithms for (Gradient) EM. However, unlike in the non-private case, existing techniques are not yet able to provide finite sample statistical guarantees. To address this issue, we propose in this paper the first DP version of the (Gradient) EM algorithm with statistical guarantees. Moreover, we apply our general framework to three canonical models: Gaussian Mixture Model (GMM), Mixture of Regressions Model (MRM) and Linear Regression with Missing Covariates (RMC). Specifically, for GMM in the DP model, our estimation error is near optimal in some cases. For the other two models, we provide the first finite sample statistical guarantees. Our theory is supported by thorough numerical experiments.
Submitted 16 January, 2022; v1 submitted 21 October, 2020;
originally announced October 2020.
-
Model Generalization in Deep Learning Applications for Land Cover Mapping
Authors:
Lucas Hu,
Caleb Robinson,
Bistra Dilkina
Abstract:
Recent work has shown that deep learning models can be used to classify land-use data from geospatial satellite imagery. We show that when these deep learning models are trained on data from specific continents/seasons, there is a high degree of variability in model performance on out-of-sample continents/seasons. This suggests that just because a model accurately predicts land-use classes in one continent or season does not mean that the model will accurately predict land-use classes in a different continent or season. We then use clustering techniques on satellite imagery from different continents to visualize the differences in landscapes that make geospatial generalization particularly difficult, and summarize our takeaways for future satellite imagery-related applications.
Submitted 17 June, 2021; v1 submitted 8 August, 2020;
originally announced August 2020.
-
Robust Mean Estimation on Highly Incomplete Data with Arbitrary Outliers
Authors:
Lunjia Hu,
Omer Reingold
Abstract:
We study the problem of robustly estimating the mean of a $d$-dimensional distribution given $N$ examples, where most coordinates of every example may be missing and $\varepsilon N$ examples may be arbitrarily corrupted. Assuming each coordinate appears in a constant factor more than $\varepsilon N$ examples, we show algorithms that estimate the mean of the distribution with information-theoretically optimal dimension-independent error guarantees in nearly-linear time $\widetilde O(Nd)$. Our results extend recent work on computationally-efficient robust estimation to a more widely applicable incomplete-data setting.
Submitted 3 May, 2021; v1 submitted 18 August, 2020;
originally announced August 2020.
-
Estimation of causal effects of multiple treatments in healthcare database studies with rare outcomes
Authors:
Liangyuan Hu,
Chenyang Gu
Abstract:
The preponderance of large-scale healthcare databases provides abundant opportunities for comparative effectiveness research. Evidence necessary for making informed treatment decisions often relies on comparing the effectiveness of multiple treatment options on outcomes of interest observed in a small number of individuals. Causal inference with multiple treatments and rare outcomes is a subject that has been treated sparingly in the literature. This paper designs three sets of simulations, representative of the structure of our healthcare database study, and proposes causal analysis strategies for such settings. We investigate and compare the operating characteristics of three types of methods and their variants: Bayesian Additive Regression Trees (BART), regression adjustment on multivariate spline of generalized propensity scores (RAMS) and inverse probability of treatment weighting (IPTW) with multinomial logistic regression or generalized boosted models. Our results suggest that BART and RAMS provide lower bias and mean squared error, and the widely used IPTW methods deliver unfavorable operating characteristics. We illustrate the methods using a case study evaluating the comparative effectiveness of robotic-assisted surgery, video-assisted thoracoscopic surgery and open thoracotomy for treating non-small cell lung cancer.
Submitted 2 October, 2020; v1 submitted 17 August, 2020;
originally announced August 2020.
-
Estimating heterogeneous survival treatment effect in observational data using machine learning
Authors:
Liangyuan Hu,
Jiayi Ji,
Fan Li
Abstract:
Methods for estimating heterogeneous treatment effect in observational data have largely focused on continuous or binary outcomes, and have been relatively less vetted with survival outcomes. Using flexible machine learning methods in the counterfactual framework is a promising approach to address challenges due to complex individual characteristics, to which treatments need to be tailored. To evaluate the operating characteristics of recent survival machine learning methods for the estimation of treatment effect heterogeneity and inform better practice, we carry out a comprehensive simulation study presenting a wide range of settings describing confounded heterogeneous survival treatment effects and varying degrees of covariate overlap. Our results suggest that the nonparametric Bayesian Additive Regression Trees within the framework of accelerated failure time model (AFT-BART-NP) consistently yields the best performance, in terms of bias, precision and expected regret. Moreover, the credible interval estimators from AFT-BART-NP provide close to nominal frequentist coverage for the individual survival treatment effect when the covariate overlap is at least moderate. Including a non-parametrically estimated propensity score as an additional fixed covariate in the AFT-BART-NP model formulation can further improve its efficiency and frequentist coverage. Finally, we demonstrate the application of flexible causal machine learning estimators through a comprehensive case study examining the heterogeneous survival effects of two radiotherapy approaches for localized high-risk prostate cancer.
Submitted 19 May, 2021; v1 submitted 16 August, 2020;
originally announced August 2020.
-
Supervised Machine Learning Techniques: An Overview with Applications to Banking
Authors:
Linwei Hu,
Jie Chen,
Joel Vaughan,
Hanyu Yang,
Kelly Wang,
Agus Sudjianto,
Vijayan N. Nair
Abstract:
This article provides an overview of Supervised Machine Learning (SML) with a focus on applications to banking. The SML techniques covered include Bagging (Random Forest or RF), Boosting (Gradient Boosting Machine or GBM) and Neural Networks (NNs). We begin with an introduction to ML tasks and techniques. This is followed by a description of: i) tree-based ensemble algorithms including Bagging with RF and Boosting with GBMs, ii) Feedforward NNs, iii) a discussion of hyper-parameter optimization techniques, and iv) machine learning interpretability. The paper concludes with a comparison of the features of different ML algorithms. Examples taken from credit risk modeling in banking are used throughout the paper to illustrate the techniques and interpret the results of the algorithms.
Submitted 28 July, 2020;
originally announced August 2020.
-
Surrogate Locally-Interpretable Models with Supervised Machine Learning Algorithms
Authors:
Linwei Hu,
Jie Chen,
Vijayan N. Nair,
Agus Sudjianto
Abstract:
Supervised Machine Learning (SML) algorithms, such as Gradient Boosting, Random Forest, and Neural Networks, have become popular in recent years due to their superior predictive performance over traditional statistical methods. However, their complexity makes the results hard to interpret without additional tools. There has been a lot of recent work in developing global and local diagnostics for interpreting SML models. In this paper, we propose a locally-interpretable model that takes the fitted ML response surface, partitions the predictor space using model-based regression trees, and fits interpretable main-effects models at each of the nodes. We adapt the algorithm to be efficient in dealing with high-dimensional predictors. While the main focus is on interpretability, the resulting surrogate model also has reasonably good predictive performance.
Submitted 28 July, 2020;
originally announced July 2020.
-
Unified statistical inference for a novel nonlinear dynamic functional/longitudinal data model
Authors:
Lixia Hu,
Tao Huang,
Jinhong You
Abstract:
In light of recent work studying massive functional/longitudinal data, such as the resulting data from the COVID-19 pandemic, we propose a novel functional/longitudinal data model which is a combination of the popular varying coefficient (VC) model and additive model. We call it Semi-VCAM, in which the response could be a functional/longitudinal variable, and the explanatory variables could be a mixture of functional/longitudinal and scalar variables. Notably, some of the scalar variables could be categorical variables as well. The Semi-VCAM simultaneously allows for both substantial flexibility and the maintaining of one-dimensional rates of convergence. A local linear smoothing with the aid of an initial B-spline series approximation is developed to estimate the unknown functional effects in the model. To avoid the subjective choice between the sparse and dense cases of the data, we establish the asymptotic theories of the resultant Pilot Estimation Based Local Linear Estimators (PEBLLE) under a unified framework of sparse, dense and ultra-dense cases of the data. Moreover, we construct unified consistent tests to justify whether a parsimonious submodel is sufficient or not. These test methods also avoid the subjective choice between the sparse, dense and ultra-dense cases of the data. Extensive Monte Carlo simulation studies investigating the finite sample performance of the proposed methodologies confirm our asymptotic results. We further illustrate our methodologies via analyzing the COVID-19 data from China and the CD4 data.
Submitted 3 July, 2020;
originally announced July 2020.
-
Robust Locality-Aware Regression for Labeled Data Classification
Authors:
Liangchen Hu,
Wensheng Zhang
Abstract:
With the dramatic increase of dimensions in the data representation, extracting latent low-dimensional features becomes of the utmost importance for efficient classification. Aiming at the problems of unclear margin representation and difficulty in revealing the data manifold structure in most of the existing linear discriminant methods, we propose a new discriminant feature extraction framework, namely Robust Locality-Aware Regression (RLAR). In our model, we introduce a retargeted regression to perform the marginal representation learning adaptively instead of using the general average inter-class margin. Besides, we formulate a new strategy for enhancing the local intra-class compactness of the data manifold, which can achieve the joint learning of locality-aware graph structure and desirable projection matrix. To alleviate the disturbance of outliers and prevent overfitting, we measure the regression term and locality-aware term together with the regularization term by the L2,1 norm. Further, forcing the row sparsity on the projection matrix through the L2,1 norm achieves the cooperation of feature selection and feature extraction. Then, we derive an effective iterative algorithm for solving the proposed model. The experimental results over a range of UCI data sets and other benchmark databases demonstrate that the proposed RLAR outperforms some state-of-the-art approaches.
Submitted 15 June, 2020;
originally announced June 2020.
-
Estimation of Causal Effects of Multiple Treatments in Observational Studies with a Binary Outcome
Authors:
Liangyuan Hu,
Chenyang Gu,
Michael Lopez,
Jiayi Ji,
Juan Wisnivesky
Abstract:
There is a dearth of robust methods to estimate the causal effects of multiple treatments when the outcome is binary. This paper uses two unique sets of simulations to propose and evaluate the use of Bayesian Additive Regression Trees (BART) in such settings. First, we compare BART to several approaches that have been proposed for continuous outcomes, including inverse probability of treatment weighting (IPTW), targeted maximum likelihood estimator (TMLE), vector matching and regression adjustment. Results suggest that under conditions of non-linearity and non-additivity of both the treatment assignment and outcome generating mechanisms, BART, TMLE and IPTW using generalized boosted models (GBM) provide better bias reduction and smaller root mean squared error. BART and TMLE provide more consistent 95 per cent CI coverage and better large-sample convergence property. Second, we supply BART with a strategy to identify a common support region for retaining inferential units and for avoiding extrapolating over areas of the covariate space where common support does not exist. BART retains more inferential units than the generalized propensity score based strategy, and shows lower bias, compared to TMLE or GBM, in a variety of scenarios differing by the degree of covariate overlap. A case study examining the effects of three surgical approaches for non-small cell lung cancer demonstrates the methods.
Submitted 16 January, 2020;
originally announced January 2020.
-
Machine Learning-based Signal Detection for PMH Signals in Load-modulated MIMO System
Authors:
Jinle Zhu,
Qiang Li,
Li Hu,
Hongyang Chen,
Nirwan Ansari
Abstract:
Phase Modulation on the Hypersphere (PMH) is a power efficient modulation scheme for the \textit{load-modulated} multiple-input multiple-output (MIMO) transmitters with central power amplifiers (CPA). However, it is difficult to obtain the precise channel state information (CSI), and the traditional optimal maximum likelihood (ML) detection scheme incurs high complexity which increases exponentially with the number of antennas and the number of bits carried per antenna in the PMH modulation. To detect the PMH signals without knowing the prior CSI, we first propose a signal detection scheme, termed as the hypersphere clustering scheme based on the expectation maximization (EM) algorithm with maximum likelihood detection (HEM-ML). By leveraging machine learning, the proposed detection scheme can accurately obtain information of the channel from a few of the received symbols with little resource cost and achieve comparable detection results as that of the optimal ML detector. To further reduce the computational complexity in the ML detection in HEM-ML, we also propose the second signal detection scheme, termed as the hypersphere clustering scheme based on the EM algorithm with KD-tree detection (HEM-KD). The CSI obtained from the EM algorithm is used to build a spatial KD-tree receiver codebook and the signal detection problem can be transformed into a nearest neighbor search (NNS) problem. The detection complexity of HEM-KD is significantly reduced without any detection performance loss as compared to HEM-ML. Extensive simulation results verify the effectiveness of our proposed detection schemes.
Submitted 24 November, 2019;
originally announced November 2019.
-
Graph Neural News Recommendation with Long-term and Short-term Interest Modeling
Authors:
Linmei Hu,
Chen Li,
Chuan Shi,
Cheng Yang,
Chao Shao
Abstract:
With the information explosion of news articles, personalized news recommendation has become important for users to quickly find news that they are interested in. Existing methods for news recommendation mainly include collaborative filtering methods, which rely on direct user-item interactions, and content-based methods, which characterize the content of user reading history. Although these methods have achieved good performance, they still suffer from the data sparsity problem, since most of them fail to extensively exploit high-order structure information (similar users tend to read similar news articles) in news recommendation systems. In this paper, we propose to build a heterogeneous graph to explicitly model the interactions among users, news and latent topics. The incorporated topic information would help indicate a user's interest and alleviate the sparsity of user-item interactions. Then we take advantage of graph neural networks to learn user and news representations that encode high-order structure information by propagating embeddings over the graph. The learned user embeddings with complete historic user clicks capture the users' long-term interests. We also consider a user's short-term interest using the recent reading history with an attention-based LSTM model. Experimental results on real-world datasets show that our proposed model significantly outperforms state-of-the-art methods on news recommendation.
Submitted 7 November, 2019; v1 submitted 30 October, 2019;
originally announced October 2019.
-
Estimating Smooth GLM in Non-interactive Local Differential Privacy Model with Public Unlabeled Data
Authors:
Di Wang,
Lijie Hu,
Huanyu Zhang,
Marco Gaboardi,
Jinhui Xu
Abstract:
In this paper, we study the problem of estimating smooth Generalized Linear Models (GLMs) in the Non-interactive Local Differential Privacy (NLDP) model. Different from its classical setting, our model allows the server to access some additional public but unlabeled data. In the first part of the paper we focus on GLMs. Specifically, we first consider the case where each data record is i.i.d. sampled from a zero-mean multivariate Gaussian distribution. Motivated by Stein's lemma, we present an $(ε, δ)$-NLDP algorithm for GLMs. Moreover, the sample complexity of public and private data for the algorithm to achieve an $\ell_2$-norm estimation error of $α$ (with high probability) is ${O}(p α^{-2})$ and $\tilde{O}(p^3α^{-2}ε^{-2})$ respectively, where $p$ is the dimension of the feature vector. This is a significant improvement over the previously known exponential or quasi-polynomial in $α^{-1}$, or exponential in $p$ sample complexities of GLMs with no public data. Then we consider a more general setting where each data record is i.i.d. sampled from some sub-Gaussian distribution with bounded $\ell_1$-norm. Based on a variant of Stein's lemma, we propose an $(ε, δ)$-NLDP algorithm for GLMs whose sample complexity of public and private data to achieve an $\ell_\infty$-norm estimation error of $α$ is ${O}(p^2α^{-2})$ and $\tilde{O}(p^2α^{-2}ε^{-2})$ respectively, under some mild assumptions and if $α$ is not too small (i.e., $α\geq Ω(\frac{1}{\sqrt{p}})$). In the second part of the paper, we extend our idea to the problem of estimating non-linear regressions and show similar results as in GLMs for both multivariate Gaussian and sub-Gaussian cases. Finally, we demonstrate the effectiveness of our algorithms through experiments on both synthetic and real-world datasets.
Submitted 20 August, 2022; v1 submitted 1 October, 2019;
originally announced October 2019.
-
Fully-automated patient-level malaria assessment on field-prepared thin blood film microscopy images, including Supplementary Information
Authors:
Charles B. Delahunt,
Mayoore S. Jaiswal,
Matthew P. Horning,
Samantha Janko,
Clay M. Thompson,
Sourabh Kulhare,
Liming Hu,
Travis Ostbye,
Grace Yun,
Roman Gebrehiwot,
Benjamin K. Wilson,
Earl Long,
Stephane Proux,
Dionicia Gamboa,
Peter Chiodini,
Jane Carter,
Mehul Dhorda,
David Isaboke,
Bernhards Ogutu,
Wellington Oyibo,
Elizabeth Villasis,
Kyaw Myo Tun,
Christine Bachman,
David Bell,
Courosh Mehanian
Abstract:
Malaria is a life-threatening disease affecting millions. Microscopy-based assessment of thin blood films is a standard method to (i) determine malaria species and (ii) quantitate high-parasitemia infections. Full automation of malaria microscopy by machine learning (ML) is a challenging task because field-prepared slides vary widely in quality and presentation, and artifacts often heavily outnumber relatively rare parasites. In this work, we describe a complete, fully-automated framework for thin film malaria analysis that applies ML methods, including convolutional neural nets (CNNs), trained on a large and diverse dataset of field-prepared thin blood films. Quantitation and species identification results are close to sufficiently accurate for the concrete needs of drug resistance monitoring and clinical use-cases on field-prepared samples. We focus our methods and our performance metrics on the field use-case requirements. We discuss key issues and important metrics for the application of ML methods to malaria microscopy.
Submitted 11 September, 2022; v1 submitted 5 August, 2019;
originally announced August 2019.