Spatio-temporal Joint Analysis of PM2.5 and Ozone in California with INLA

Authors: Jianan Pan, Kunyang He, Kai Wang, Qing Mu, Chengxiu Ling

Abstract: The substantial threat of concurrent air pollutants to public health is increasingly severe under climate change. To identify the common drivers and extent of spatio-temporal similarity of PM2.5 and ozone, this paper proposed a log Gaussian-Gumbel Bayesian hierarchical model allowing for sharing a SPDE-AR(1) spatio-temporal interaction structure. The proposed model outperforms in terms of estimation accuracy and prediction capacity for its increased parsimony and reduced uncertainty, especially for the shared ozone sub-model. Besides the consistently significant influence of temperature (positive), extreme drought (positive), fire burnt area (positive), and wind speed (negative) on both PM2.5 and ozone, surface pressure and GDP per capita (precipitation) demonstrate only positive associations with PM2.5 (ozone), while population density relates to neither. In addition, our results show the distinct spatio-temporal interactions and different seasonal patterns of PM2.5 and ozone, with peaks of PM2.5 and ozone in cold and hot seasons, respectively. Finally, with the aid of the excursion function, we see that the areas around the intersection of San Luis Obispo and Santa Barbara counties are likely to exceed the unhealthy ozone level for sensitive groups throughout the year. Our findings provide new insights for regional and seasonal strategies in the co-control of PM2.5 and ozone. Our methodology is expected to be utilized when interest lies in multiple interrelated processes in the fields of environment and epidemiology.

Submitted 20 April, 2024; originally announced April 2024.

arXiv:2402.05438 [pdf, other]

Penalized spline estimation of principal components for sparse functional data: rates of convergence

Authors: Shiyuan He, Jianhua Z. Huang, Kejun He

Abstract: This paper gives a comprehensive treatment of the convergence rates of penalized spline estimators for simultaneously estimating several leading principal component functions, when the functional data is sparsely observed. The penalized spline estimators are defined as the solution of a penalized empirical risk minimization problem, where the loss function belongs to a general class of loss functions motivated by the matrix Bregman divergence, and the penalty term is the integrated squared derivative. The theory reveals that the asymptotic behavior of penalized spline estimators depends on the interesting interplay between several factors, i.e., the smoothness of the unknown functions, the spline degree, the spline knot number, the penalty order, and the penalty parameter. The theory also classifies the asymptotic behavior into seven scenarios and characterizes whether and how the minimax optimal rates of convergence are achievable in each scenario.

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2401.02650 [pdf, other]

Improving sample efficiency of high dimensional Bayesian optimization with MCMC

Authors: Zeji Yi, Yunyue Wei, Chu Xin Cheng, Kaibo He, Yanan Sui

Abstract: Sequential optimization methods are often confronted with the curse of dimensionality in high-dimensional spaces. Current approaches under the Gaussian process framework are still burdened by the computational complexity of tracking Gaussian process posteriors and need to partition the optimization problem into small regions to ensure exploration or assume an underlying low-dimensional structure. With the idea of transiting the candidate points towards more promising positions, we propose a new method based on Markov Chain Monte Carlo to efficiently sample from an approximated posterior. We provide theoretical guarantees of its convergence in the Gaussian process Thompson sampling setting. We also show experimentally that both the Metropolis-Hastings and the Langevin Dynamics version of our algorithm outperform state-of-the-art methods in high-dimensional sequential optimization and reinforcement learning benchmarks.

Submitted 5 January, 2024; originally announced January 2024.

arXiv:2310.01753 [pdf, other]

CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery

Authors: Yuxiao Cheng, Ziqian Wang, Tingxiong Xiao, Qin Zhong, Jinli Suo, Kunlun He

Abstract: Time-series causal discovery (TSCD) is a fundamental problem of machine learning. However, existing synthetic datasets cannot properly evaluate or predict the algorithms' performance on real data. This study introduces the CausalTime pipeline to generate time-series that highly resemble the real data and with ground truth causal graphs for quantitative performance evaluation. The pipeline starts from real observations in a specific scenario and produces a matching benchmark dataset. Firstly, we harness deep neural networks along with normalizing flow to accurately capture realistic dynamics. Secondly, we extract hypothesized causal graphs by performing importance analysis on the neural network or leveraging prior knowledge. Thirdly, we derive the ground truth causal graphs by splitting the causal model into causal term, residual term, and noise term. Lastly, using the fitted network and the derived causal graph, we generate corresponding versatile time-series proper for algorithm assessment. In the experiments, we validate the fidelity of the generated data through qualitative and quantitative experiments, followed by a benchmarking of existing TSCD algorithms using these generated datasets. CausalTime offers a feasible solution to evaluating TSCD algorithms in real applications and can be generalized to a wide range of fields. For easy use of the proposed approach, we also provide a user-friendly website, hosted on www.causaltime.cc.

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2305.06208 [pdf, other]

Robust Privacy-Preserving Models for Cluster-Level Confounding: Recognizing Disparities in Access to Transplantation

Authors: Nicholas Hartman, Kevin He

Abstract: In applications where the study data are collected within cluster units (e.g., patients within transplant centers), it is often of interest to estimate and perform inference on the treatment effects of the cluster units. However, it is well-established that cluster-level confounding variables can bias these assessments, and many of these confounding factors may be unobservable. In healthcare settings, data sharing restrictions often make it impossible to directly fit conventional risk-adjustment models on patient-level data, and existing privacy-preserving approaches cannot adequately adjust for both observed and unobserved cluster-level confounding factors. In this paper, we propose a privacy-preserving model for cluster-level confounding that only depends on publicly-available summary statistics, can be fit using a single optimization routine, and is robust to outlying cluster unit effects. In addition, we develop a Pseudo-Bayesian inference procedure that accounts for the estimated cluster-level confounding effects and corrects for the impact of unobservable factors. Simulations show that our estimates are robust and accurate, and the proposed inference approach has better Frequentist properties than existing methods. Motivated by efforts to improve equity in transplant care, we apply these methods to evaluate transplant centers while adjusting for observed geographic disparities in donor organ availability and unobservable confounders.

Submitted 10 May, 2023; originally announced May 2023.

arXiv:2305.05890 [pdf, other]

CUTS+: High-dimensional Causal Discovery from Irregular Time-series

Authors: Yuxiao Cheng, Lianglong Li, Tingxiong Xiao, Zongren Li, Qin Zhong, Jinli Suo, Kunlun He

Abstract: Causal discovery in time-series is a fundamental problem in the machine learning community, enabling causal reasoning and decision-making in complex scenarios. Recently, researchers successfully discover causality by combining neural networks with Granger causality, but their performances degrade largely when encountering high-dimensional data because of the highly redundant network design and huge causal graphs. Moreover, the missing entries in the observations further hamper the causal structural learning. To overcome these limitations, we propose CUTS+, which is built on the Granger-causality-based causal discovery method CUTS and raises the scalability by introducing a technique called Coarse-to-fine-discovery (C2FD) and leveraging a message-passing-based graph neural network (MPGNN). Compared to previous methods on simulated, quasi-real, and real datasets, we show that CUTS+ largely improves the causal discovery performance on high-dimensional data with different types of irregular sampling.

Submitted 16 August, 2023; v1 submitted 10 May, 2023; originally announced May 2023.

Comments: Submit to AAAI-24

arXiv:2304.10866 [pdf, other]

Joint Mirror Procedure: Controlling False Discovery Rate for Identifying Simultaneous Signals

Authors: Linsui Deng, Kejun He, Xianyang Zhang

Abstract: In many applications, the process of identifying a specific feature of interest often involves testing multiple hypotheses for their joint statistical significance. Examples include mediation analysis which simultaneously examines the existence of the exposure-mediator and the mediator-outcome effects, and replicability analysis aiming to identify simultaneous signals that exhibit statistical significance across multiple independent experiments. In this study, we present a new approach called joint mirror (JM) procedure that effectively detects such features while maintaining false discovery rate (FDR) control in finite samples. The JM procedure employs an iterative method that gradually shrinks the rejection region based on progressively revealed information until a conservative estimate of the false discovery proportion (FDP) is below the target FDR level. Additionally, we introduce a more stringent error measure, known as the modified FDR (mFDR), which assigns weights to each false discovery based on its number of null components. We demonstrate that, under appropriate assumptions, the JM procedure controls the mFDR in finite samples. To implement the JM procedure, we propose an efficient algorithm that can incorporate partial ordering information. Through extensive simulations, we demonstrate that our procedure effectively controls the mFDR and enhances statistical power across various scenarios. Finally, we showcase the utility of our method by applying it to real-world mediation and replicability analyses.

Submitted 27 May, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

arXiv:2303.04408 [pdf, other]

doi 10.1007/s13253-023-00585-8

Principal Component Analysis of Two-dimensional Functional Data with Serial Correlation

Authors: Shirun Shen, Huiya Zhou, Kejun He, Lan Zhou

Abstract: In this paper, we propose a novel model to analyze serially correlated two-dimensional functional data observed sparsely and irregularly on a domain which may not be a rectangle. Our approach employs a mixed effects model that specifies the principal component functions as bivariate splines on triangulations and the principal component scores as random effects which follow an auto-regressive model. We apply the thin-plate penalty for regularizing the bivariate function estimation and develop an effective EM algorithm along with Kalman filter and smoother for calculating the penalized likelihood estimates of the parameters. Our approach was applied on simulated datasets and on Texas monthly average temperature data from January 1915 to December 2014.

Submitted 7 December, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

arXiv:2302.11123 [pdf, other]

Incorporating External Risk Information with the Cox Model under Population Heterogeneity: Applications to Trans-Ancestry Polygenic Hazard Scores

Authors: Di Wang, Wen Ye, Ji Zhu, Gongjun Xu, Weijing Tang, Matthew Zawistowski, Lars G. Fritsche, Kevin He

Abstract: Polygenic hazard score (PHS) models designed for European ancestry (EUR) individuals provide ample information regarding survival risk discrimination. Incorporating such information can improve the performance of risk discrimination in an internal small-sized non-EUR cohort. However, given that external EUR-based model and internal individual-level data come from different populations, ignoring population heterogeneity can introduce substantial bias. In this paper, we develop a Kullback-Leibler-based Cox model (CoxKL) to integrate internal individual-level time-to-event data with external risk scores derived from published prediction models, accounting for population heterogeneity. Partial-likelihood-based KL information is utilized to measure the discrepancy between the external risk information and the internal data. We establish the asymptotic properties of the CoxKL estimator. Simulation studies show that the integration model by the proposed CoxKL method achieves improved estimation efficiency and prediction accuracy. We applied the proposed method to develop a trans-ancestry PHS model for prostate cancer and found that integrating a previously published EUR-based PHS with an internal genotype data of African ancestry (AFR) males yielded considerable improvement on the prostate cancer risk discrimination.

Submitted 21 February, 2023; originally announced February 2023.

arXiv:2302.08439 [pdf, other]

doi 10.1080/00401706.2023.2197471

Bayesian Nonlinear Tensor Regression with Functional Fused Elastic Net Prior

Authors: Shuoli Chen, Kejun He, Shiyuan He, Yang Ni, Raymond K. W. Wong

Abstract: Tensor regression methods have been widely used to predict a scalar response from covariates in the form of a multiway array. In many applications, the regions of tensor covariates used for prediction are often spatially connected with unknown shapes and discontinuous jumps on the boundaries. Moreover, the relationship between the response and the tensor covariates can be nonlinear. In this article, we develop a nonlinear Bayesian tensor additive regression model to accommodate such spatial structure. A functional fused elastic net prior is proposed over the additive component functions to comprehensively model the nonlinearity and spatial smoothness, detect the discontinuous jumps, and simultaneously identify the active regions. The great flexibility and interpretability of the proposed method against the alternatives are demonstrated by a simulation study and an analysis on facial feature data.

Submitted 16 February, 2023; originally announced February 2023.

Journal ref: Technometrics, 65:4, 524-536 (2023)

arXiv:2302.07458 [pdf, other]

CUTS: Neural Causal Discovery from Irregular Time-Series Data

Authors: Yuxiao Cheng, Runzhao Yang, Tingxiong Xiao, Zongren Li, Jinli Suo, Kunlun He, Qionghai Dai

Abstract: Causal discovery from time-series data has been a central task in machine learning. Recently, Granger causality inference is gaining momentum due to its good explainability and high compatibility with emerging deep neural networks. However, most existing methods assume structured input data and degenerate greatly when encountering data with randomly missing entries or non-uniform sampling frequencies, which hampers their applications in real scenarios. To address this issue, here we present CUTS, a neural Granger causal discovery algorithm to jointly impute unobserved data points and build causal graphs, via plugging in two mutually boosting modules in an iterative framework: (i) Latent data prediction stage: designs a Delayed Supervision Graph Neural Network (DSGNN) to hallucinate and register unstructured data which might be of high dimension and with complex distribution; (ii) Causal graph fitting stage: builds a causal adjacency matrix with imputed data under sparse penalty. Experiments show that CUTS effectively infers causal graphs from unstructured time-series data, with significantly superior performance to existing methods. Our approach constitutes a promising step towards applying causal discovery to real applications with non-ideal observations.

Submitted 14 February, 2023; originally announced February 2023.

Comments: https://openreview.net/forum?id=UG8bQcD3Emv

Journal ref: The Eleventh International Conference on Learning Representations, Feb. 2023

arXiv:2301.01107 [pdf]

Computing the Performance of A New Adaptive Sampling Algorithm Based on The Gittins Index in Experiments with Exponential Rewards

Authors: James K. He, Sofía S. Villar, Lida Mavrogonatou

Abstract: Designing experiments often requires balancing between learning about the true treatment effects and earning from allocating more samples to the superior treatment. While optimal algorithms for the Multi-Armed Bandit Problem (MABP) provide allocation policies that optimally balance learning and earning, they tend to be computationally expensive. The Gittins Index (GI) is a solution to the MABP that can simultaneously attain optimality and computational efficiency goals, and it has been recently used in experiments with Bernoulli and Gaussian rewards. For the first time, we present a modification of the GI rule that can be used in experiments with exponentially-distributed rewards. We report its performance in simulated 2-armed and 3-armed experiments. Compared to traditional non-adaptive designs, our novel GI modified design shows operating characteristics comparable in learning (e.g. statistical power) but substantially better in earning (e.g. direct benefits). This illustrates the potential that designs using a GI approach to allocate participants have to improve participant benefits, increase efficiencies, and reduce experimental costs in adaptive multi-armed experiments with exponential rewards.

Submitted 3 January, 2023; originally announced January 2023.

Comments: Accepted by Computing Conference, London 2023

arXiv:2211.14752 [pdf, other]

Differentiable Meta Multigraph Search with Partial Message Propagation on Heterogeneous Information Networks

Authors: Chao Li, Hao Xu, Kun He

Abstract: Heterogeneous information networks (HINs) are widely employed for describing real-world data with intricate entities and relationships. To automatically utilize their semantic information, graph neural architecture search has recently been developed on various tasks of HINs. Existing works, on the other hand, show weaknesses in instability and inflexibility. To address these issues, we propose a novel method called Partial Message Meta Multigraph search (PMMM) to automatically optimize the neural architecture design on HINs. Specifically, to learn how graph neural networks (GNNs) propagate messages along various types of edges, PMMM adopts an efficient differentiable framework to search for a meaningful meta multigraph, which can capture more flexible and complex semantic relations than a meta graph. The differentiable search typically suffers from performance instability, so we further propose a stable algorithm called partial message search to ensure that the searched meta multigraph consistently surpasses the manually designed meta-structures, i.e., meta-paths. Extensive experiments on six benchmark datasets over two representative tasks, including node classification and recommendation, demonstrate the effectiveness of the proposed method. Our approach outperforms the state-of-the-art heterogeneous GNNs, finds out meaningful meta multigraphs, and is significantly more stable.

Submitted 27 November, 2022; originally announced November 2022.

Comments: 12 pages, 7 figures, 8 tables, accepted by AAAI 2023 conference

arXiv:2211.04874 [pdf, other]

A Unified Analysis of Multi-task Functional Linear Regression Models with Manifold Constraint and Composite Quadratic Penalty

Authors: Shiyuan He, Hanxuan Ye, Kejun He

Abstract: This work studies the multi-task functional linear regression models where both the covariates and the unknown regression coefficients (called slope functions) are curves. For slope function estimation, we employ penalized splines to balance bias, variance, and computational complexity. The power of multi-task learning is brought in by imposing additional structures over the slope functions. We propose a general model with double regularization over the spline coefficient matrix: i) a matrix manifold constraint, and ii) a composite penalty as a summation of quadratic terms. Many multi-task learning approaches can be treated as special cases of this proposed model, such as a reduced-rank model and a graph Laplacian regularized model. We show the composite penalty induces a specific norm, which helps to quantify the manifold curvature and determine the corresponding proper subset in the manifold tangent space. The complexity of tangent space subset is then bridged to the complexity of geodesic neighbor via generic chaining. A unified convergence upper bound is obtained and specifically applied to the reduced-rank model and the graph Laplacian regularized model. The phase transition behaviors for the estimators are examined as we vary the configurations of model parameters.

Submitted 31 July, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

arXiv:2211.04784 [pdf, other]

Spline Estimation of Functional Principal Components via Manifold Conjugate Gradient Algorithm

Authors: Shiyuan He, Hanxuan Ye, Kejun He

Abstract: Functional principal component analysis has become the most important dimension reduction technique in functional data analysis. Based on B-spline approximation, functional principal components (FPCs) can be efficiently estimated by the expectation-maximization (EM) and the geometric restricted maximum likelihood (REML) algorithms under the strong assumption of Gaussianity on the principal component scores and observational errors. When computing the solution, the EM algorithm does not exploit the underlying geometric manifold structure, while the performance of REML is known to be unstable. In this article, we propose a conjugate gradient algorithm over the product manifold to estimate FPCs. This algorithm exploits the manifold geometry structure of the overall parameter space, thus improving its search efficiency and estimation accuracy. In addition, a distribution-free interpretation of the loss function is provided from the viewpoint of matrix Bregman divergence, which explains why the proposed method works well under general distribution settings. We also show that a roughness penalization can be easily incorporated into our algorithm with a potentially better fit. The appealing numerical performance of the proposed method is demonstrated by simulation studies and the analysis of a Type Ia supernova light curve dataset.

Submitted 9 November, 2022; originally announced November 2022.

arXiv:2210.17121 [pdf, other]

Powerful Spatial Multiple Testing via Borrowing Neighboring Information

Authors: Linsui Deng, Kejun He, Xianyang Zhang

Abstract: Clustered effects are often encountered in multiple hypothesis testing of spatial signals. In this paper, we propose a new method, termed two-dimensional spatial multiple testing (2d-SMT) procedure, to control the false discovery rate (FDR) and improve the detection power by exploiting the spatial information encoded in neighboring observations. The proposed method provides a novel perspective of utilizing spatial information by gathering signal patterns and spatial dependence into an auxiliary statistic. 2d-SMT rejects the null when a primary statistic at the location of interest and the auxiliary statistic constructed based on nearby observations are greater than their corresponding thresholds. 2d-SMT can also be combined with different variants of the weighted BH procedures to improve the detection power further. A fast step-down algorithm is developed to accelerate the search for optimal thresholds in 2d-SMT. In theory, we establish the asymptotical FDR control of 2d-SMT under weak spatial dependence. Extensive numerical experiments demonstrate that the 2d-SMT method combined with various weighted BH procedures achieves the most competitive performance in FDR and power trade-off.

Submitted 31 October, 2022; originally announced October 2022.

Comments: 35 pages, 10 figures

arXiv:2210.12832 [pdf, other]

Functional Bayesian Networks for Discovering Causality from Multivariate Functional Data

Authors: Fangting Zhou, Kejun He, Kunbo Wang, Yanxun Xu, Yang Ni

Abstract: Multivariate functional data arise in a wide range of applications. One fundamental task is to understand the causal relationships among these functional objects of interest, which has not yet been fully explored. In this article, we develop a novel Bayesian network model for multivariate functional data where the conditional independence and causal structure are both encoded by a directed acyclic graph. Specifically, we allow the functional objects to deviate from Gaussian process, which is adopted by most existing functional data analysis models. The more reasonable non-Gaussian assumption is the key for unique causal structure identification even when the functions are measured with noises. A fully Bayesian framework is designed to infer the functional Bayesian network model with natural uncertainty quantification through posterior summaries. Simulation studies and real data examples are used to demonstrate the practical utility of the proposed model.

Submitted 23 October, 2022; originally announced October 2022.

arXiv:2210.06025 [pdf, other]

Bregman Divergence-Based Data Integration with Application to Polygenic Risk Score (PRS) Heterogeneity Adjustment

Authors: Qinmengge Li, Matthew T. Patrick, Haihan Zhang, Chachrit Khunsriraksakul, Philip E. Stuart, Johann E. Gudjonsson, Rajan Nair, James T. Elder, Dajiang J. Liu, Jian Kang, Lam C. Tsoi, Kevin He

Abstract: Polygenic risk scores (PRS) have recently received much attention for genetics risk prediction. While successful for the Caucasian population, the PRS based on the minority population suffer from small sample sizes, high dimensionality and low signal-to-noise ratios, exacerbating already severe health disparities. Due to population heterogeneity, direct trans-ethnic prediction by utilizing the Caucasian model for the minority population also has limited performance. In addition, due to data privacy, the individual genotype data is not accessible for either the Caucasian population or the minority population. To address these challenges, we propose a Bregman divergence-based estimation procedure to measure and optimally balance the information from different populations. The proposed method only requires the use of encrypted summary statistics and improves the PRS performance for ethnic minority groups by incorporating additional information. We provide the asymptotic consistency and weak oracle property for the proposed method. Simulations and real data analyses also show its advantages in prediction and variable selection.

Submitted 12 October, 2022; originally announced October 2022.

Comments: 35 pages, 6 figures

arXiv:2209.00181 [pdf, other]

Understanding the dynamic impact of COVID-19 through competing risk modeling with bivariate varying coefficients

Authors: Wenbo Wu, John D. Kalbfleisch, Jeremy M. G. Taylor, Jian Kang, Kevin He

Abstract: The coronavirus disease 2019 (COVID-19) pandemic has exerted a profound impact on patients with end-stage renal disease relying on kidney dialysis to sustain their lives. Motivated by a request by the U.S. Centers for Medicare & Medicaid Services, our analysis of their postdischarge hospital readmissions and deaths in 2020 revealed that the COVID-19 effect has varied significantly with postdischarge time and time since the onset of the pandemic. However, the complex dynamics of the COVID-19 effect trajectories cannot be characterized by existing varying coefficient models. To address this issue, we propose a bivariate varying coefficient model for competing risks within a cause-specific hazard framework, where tensor-product B-splines are used to estimate the surface of the COVID-19 effect. An efficient proximal Newton algorithm is developed to facilitate the fitting of the new model to the massive Medicare data for dialysis patients. Difference-based anisotropic penalization is introduced to mitigate model overfitting and the wiggliness of the estimated trajectories; various cross-validation methods are considered in the determination of optimal tuning parameters. Hypothesis testing procedures are designed to examine whether the COVID-19 effect varies significantly with postdischarge time and the time since pandemic onset, either jointly or separately. Simulation experiments are conducted to evaluate the estimation accuracy, type I error rate, statistical power, and model selection procedures. Applications to Medicare dialysis patients demonstrate the real-world performance of the proposed methods.

Submitted 31 August, 2022; originally announced September 2022.

Comments: 40 pages, 8 figures, 1 table

arXiv:2208.05100

KL-divergence Based Deep Learning for Discrete Time Model

Authors: Li Liu, Xiangeng Fang, Di Wang, Weijing Tang, Kevin He

Abstract: Neural Network (Deep Learning) is a modern model in Artificial Intelligence and it has been exploited in Survival Analysis. Although several improvements have been shown by previous works, training an excellent deep learning model requires a huge amount of data, which may not hold in practice. To address this challenge, we develop a Kullback-Leibler-based (KL) deep learning procedure to integrate external survival prediction models with newly collected time-to-event data. Time-dependent KL discrimination information is utilized to measure the discrepancy between the external and internal data. This is the first work considering using prior information to deal with short data problem in Survival Analysis for deep learning. Simulation and real data results show that the proposed model achieves better performance and higher robustness compared with previous works.

Submitted 11 April, 2023; v1 submitted 9 August, 2022; originally announced August 2022.

Comments: This paper is not complete and the results are not qualified to be public. Therefore we decided to withdraw the paper and plan to submit a newer version in the future

arXiv:2207.07602 [pdf, other]

Composite Scores for Transplant Center Evaluation: A New Individualized Empirical Null Method

Authors: Nicholas Hartman, Joseph M. Messana, Jian Kang, Abhijit S. Naik, Tempie H. Shearon, Kevin He

Abstract: Risk-adjusted quality measures are used to evaluate healthcare providers while controlling for factors beyond their control. Existing healthcare provider profiling approaches typically assume that the risk adjustment is perfect and the between-provider variation in quality measures is entirely due to the quality of care. However, in practice, even with very good models for risk adjustment, some between-provider variation will be due to incomplete risk adjustment, which should be recognized in assessing and monitoring providers. Otherwise, conventional methods disproportionately identify larger providers as outliers, even though their provider effects need not be "extreme." Motivated by efforts to evaluate the quality of care provided by transplant centers, we develop a composite evaluation score based on a novel individualized empirical null method, which robustly accounts for overdispersion due to unobserved risk factors, models the marginal variance of standardized scores as a function of the effective center size, and only requires the use of publicly-available center-level statistics. The evaluations of United States kidney transplant centers based on the proposed composite score are substantially different from those based on conventional methods. Simulations show that the proposed empirical null approach more accurately classifies centers in terms of quality of care, compared to existing methods.

Submitted 23 July, 2022; v1 submitted 15 July, 2022; originally announced July 2022.

arXiv:2206.03718 [pdf, other]

Learning Interpretable Decision Rule Sets: A Submodular Optimization Approach

Authors: Fan Yang, Kai He, Linxiao Yang, Hongxia Du, Jingbang Yang, Bo Yang, Liang Sun

Abstract: Rule sets are highly interpretable logical models in which the predicates for decision are expressed in disjunctive normal form (DNF, OR-of-ANDs), or, equivalently, the overall model comprises an unordered collection of if-then decision rules. In this paper, we consider a submodular optimization based approach for learning rule sets. The learning problem is framed as a subset selection task in which a subset of all possible rules needs to be selected to form an accurate and interpretable rule set. We employ an objective function that exhibits submodularity and thus is amenable to submodular optimization techniques. To overcome the difficulty arose from dealing with the exponential-sized ground set of rules, the subproblem of searching a rule is casted as another subset selection task that asks for a subset of features. We show it is possible to write the induced objective function for the subproblem as a difference of two submodular (DS) functions to make it approximately solvable by DS optimization algorithms. Overall, the proposed approach is simple, scalable, and likely to be benefited from further research on submodular optimization. Experiments on real datasets demonstrate the effectiveness of our method.

Submitted 8 June, 2022; originally announced June 2022.

Comments: NeurIPS 2021 (Spotlight)

arXiv:2202.11269 [pdf, other]

NetRCA: An Effective Network Fault Cause Localization Algorithm

Authors: Chaoli Zhang, Zhiqiang Zhou, Yingying Zhang, Linxiao Yang, Kai He, Qingsong Wen, Liang Sun

Abstract: Localizing the root cause of network faults is crucial to network operation and maintenance. However, due to the complicated network architectures and wireless environments, as well as limited labeled data, accurately localizing the true root cause is challenging. In this paper, we propose a novel algorithm named NetRCA to deal with this problem. Firstly, we extract effective derived features from the original raw data by considering temporal, directional, attribution, and interaction characteristics. Secondly, we adopt multivariate time series similarity and label propagation to generate new training data from both labeled and unlabeled data to overcome the lack of labeled samples. Thirdly, we design an ensemble model which combines XGBoost, rule set learning, attribution model, and graph algorithm, to fully utilize all data information and enhance performance. Finally, experiments and analysis are conducted on the real-world dataset from ICASSP 2022 AIOps Challenge to demonstrate the superiority and effectiveness of our approach.

Submitted 6 March, 2022; v1 submitted 22 February, 2022; originally announced February 2022.

Comments: Accepted by ICASSP 2022. NetRCA is the solution of the First Place of 2022 ICASSP AIOps Challenge. All authors are contributed equally, and Qingsong Wen is the team leader (Team Name: MindOps). The website of 2022 ICASSP AIOps Challenge is https://www.aiops.sribd.cn/home/introduction

arXiv:2201.12392 [pdf, other]

Causal Discovery with Heterogeneous Observational Data

Authors: Fangting Zhou, Kejun He, Yang Ni

Abstract: We consider the problem of causal discovery (structure learning) from heterogeneous observational data. Most existing methods assume a homogeneous sampling scheme, which leads to misleading conclusions when violated in many applications. To this end, we propose a novel approach that exploits data heterogeneity to infer possibly cyclic causal structures from causally insufficient systems. The core idea is to model the direct causal effects as functions of exogenous covariates that properly explain data heterogeneity. We investigate structure identifiability properties of the proposed model. Structure learning is carried out in a fully Bayesian fashion, which provides natural uncertainty quantification. We demonstrate its utility through extensive simulations and a real-world application.

Submitted 28 January, 2022; originally announced January 2022.

arXiv:2108.00127 [pdf, other]

Structure Amplification on Multi-layer Stochastic Block Models

Authors: Xiaodong Xin, Kun He, Jialu Bao, Bart Selman, John E. Hopcroft

Abstract: Much of the complexity of social, biological, and engineered systems arises from a network of complex interactions connecting many basic components. Network analysis tools have been successful at uncovering latent structure termed communities in such networks. However, some of the most interesting structure can be difficult to uncover because it is obscured by the more dominant structure. Our previous work proposes a general structure amplification technique called HICODE that uncovers many layers of functional hidden structure in complex networks. HICODE incrementally weakens dominant structure through randomization allowing the hidden functionality to emerge, and uncovers these hidden structure in real-world networks that previous methods rarely uncover. In this work, we conduct a comprehensive and systematic theoretical analysis on the hidden community structure. In what follows, we define multi-layer stochastic block model, and provide theoretical support using the model on why the existence of hidden structure will make the detection of dominant structure harder compared with equivalent random noise. We then provide theoretical proofs that the iterative reducing methods could help promote the uncovering of hidden structure as well as boosting the detection quality of dominant structure.

Submitted 30 July, 2021; originally announced August 2021.

Comments: 27 pages, 6 figures, 1 table, submitted to a journal

arXiv:2104.00242 [pdf, other]

LinDA: linear models for differential abundance analysis of microbiome compositional data

Authors: Huijuan Zhou, Kejun He, Jun Chen, Xianyang Zhang

Abstract: Differential abundance analysis is at the core of statistical analysis of microbiome data. The compositional nature of microbiome sequencing data makes false positive control challenging. Here, we show that the compositional effects can be addressed by a simple, yet highly flexible and scalable, approach. The proposed method, LinDA, only requires fitting linear regression models on the centered log-ratio transformed data, and correcting the bias due to compositional effects. We show that LinDA enjoys asymptotic FDR control and can be extended to mixed-effect models for correlated microbiome data. Using simulations and real examples, we demonstrate the effectiveness of LinDA.

Submitted 12 March, 2022; v1 submitted 1 April, 2021; originally announced April 2021.

arXiv:2101.02354 [pdf, other]

Kullback-Leibler-Based Discrete Failure Time Models for Integration of Published Prediction Models with New Time-To-Event Dataset

Authors: Di Wang, Wen Ye, Randall Sung, Hui Jiang, Jeremy M. G. Taylor, Lisa Ly, Kevin He

Abstract: Prediction of time-to-event data often suffers from rare event rates, small sample sizes, high dimensionality and low signal-to-noise ratios. Incorporating published prediction models from large-scale studies is expected to improve the performance of prognosis prediction on internal individual-level time-to-event data. However, existing integration approaches typically assume that underlying distributions from the external and internal data sources are similar, which is often invalid. To account for challenges including heterogeneity, data sharing, and privacy constraints, we propose a discrete failure time modeling procedure, which utilizes a discrete hazard-based Kullback-Leibler discriminatory information measuring the discrepancy between the published models and the internal dataset. Simulations show the advantage of the proposed method compared with those solely based on the internal data or published models. We apply the proposed method to improve prediction performance on a kidney transplant dataset from a local hospital by integrating this small-scale dataset with published survival models obtained from the national transplant registry.

Submitted 28 July, 2022; v1 submitted 6 January, 2021; originally announced January 2021.

arXiv:2010.13568 [pdf, other]

doi 10.1109/ACCESS.2021.3049494

CP Degeneracy in Tensor Regression

Authors: Ya Zhou, Raymond K. W. Wong, Kejun He

Abstract: Tensor linear regression is an important and useful tool for analyzing tensor data. To deal with high dimensionality, CANDECOMP/PARAFAC (CP) low-rank constraints are often imposed on the coefficient tensor parameter in the (penalized) $M$-estimation. However, we show that the corresponding optimization may not be attainable, and when this happens, the estimator is not well-defined. This is closely related to a phenomenon, called CP degeneracy, in low-rank tensor approximation problems. In this article, we provide useful results of CP degeneracy in tensor regression problems. In addition, we provide a general penalized strategy as a solution to overcome CP degeneracy. The asymptotic properties of the resulting estimation are also studied. Numerical experiments are conducted to illustrate our findings.

Submitted 22 October, 2020; originally announced October 2020.

Journal ref: IEEE Access, 9:1, 7775-7788 (2021)

arXiv:2010.08766 [pdf, ps, other]

Tight Lower Complexity Bounds for Strongly Convex Finite-Sum Optimization

Authors: Min Zhang, Yao Shu, Kun He

Abstract: Finite-sum optimization plays an important role in the area of machine learning, and hence has triggered a surge of interest in recent years. To address this optimization problem, various randomized incremental gradient methods have been proposed with guaranteed upper and lower complexity bounds for their convergence. Nonetheless, these lower bounds rely on certain conditions: deterministic optimization algorithm, or fixed probability distribution for the selection of component functions. Meanwhile, some lower bounds even do not match the upper bounds of the best known methods in certain cases. To break these limitations, we derive tight lower complexity bounds of randomized incremental gradient methods, including SAG, SAGA, SVRG, and SARAH, for two typical cases of finite-sum optimization. Specifically, our results tightly match the upper complexity of Katyusha or VRADA when each component function is strongly convex and smooth, and tightly match the upper complexity of SDCA without duality and of KatyushaX when the finite-sum function is strongly convex and the component functions are average smooth.

Submitted 19 June, 2022; v1 submitted 17 October, 2020; originally announced October 2020.

arXiv:2009.03449 [pdf, other]

Survival Analysis via Ordinary Differential Equations

Authors: Weijing Tang, Kevin He, Gongjun Xu, Ji Zhu

Abstract: This paper introduces an Ordinary Differential Equation (ODE) notion for survival analysis. The ODE notion not only provides a unified modeling framework, but more importantly, also enables the development of a widely applicable, scalable, and easy-to-implement procedure for estimation and inference. Specifically, the ODE modeling framework unifies many existing survival models, such as the proportional hazards model, the linear transformation model, the accelerated failure time model, and the time-varying coefficient model as special cases. The generality of the proposed framework serves as the foundation of a widely applicable estimation procedure. As an illustrative example, we develop a sieve maximum likelihood estimator for a general semi-parametric class of ODE models. In comparison to existing estimation methods, the proposed procedure has advantages in terms of computational scalability and numerical stability. Moreover, to address unique theoretical challenges induced by the ODE notion, we establish a new general sieve M-theorem for bundled parameters and show that the proposed sieve estimator is consistent and asymptotically normal, and achieves the semi-parametric efficiency bound. The finite sample performance of the proposed estimator is examined in simulation studies and a real-world data example.

Submitted 5 December, 2021; v1 submitted 7 September, 2020; originally announced September 2020.

arXiv:2008.12927 [pdf, other]

doi 10.1093/jrsssb/qkae027

Broadcasted Nonparametric Tensor Regression

Authors: Ya Zhou, Raymond K. W. Wong, Kejun He

Abstract: We propose a novel use of a broadcasting operation, which distributes univariate functions to all entries of the tensor covariate, to model the nonlinearity in tensor regression nonparametrically. A penalized estimation and the corresponding algorithm are proposed. Our theoretical investigation, which allows the dimensions of the tensor covariate to diverge, indicates that the proposed estimation yields a desirable convergence rate. We also provide a minimax lower bound, which characterizes the optimality of the proposed estimator for a wide range of scenarios. Numerical experiments are conducted to confirm the theoretical findings, and they show that the proposed model has advantages over its existing linear counterparts.

Submitted 23 March, 2024; v1 submitted 29 August, 2020; originally announced August 2020.

arXiv:2007.06559 [pdf, other]

Graph Structure of Neural Networks

Authors: Jiaxuan You, Jure Leskovec, Kaiming He, Saining Xie

Abstract: Neural networks are often represented as graphs of connections between neurons. However, despite their wide use, there is currently little understanding of the relationship between the graph structure of the neural network and its predictive performance. Here we systematically investigate how does the graph structure of neural networks affect their predictive performance. To this end, we develop a novel graph-based representation of neural networks called relational graph, where layers of neural network computation correspond to rounds of message exchange along the graph structure. Using this representation we show that: (1) a "sweet spot" of relational graphs leads to neural networks with significantly improved predictive performance; (2) neural network's performance is approximately a smooth function of the clustering coefficient and average path length of its relational graph; (3) our findings are consistent across many different tasks and datasets; (4) the sweet spot can be identified efficiently; (5) top-performing neural networks have graph structure surprisingly similar to those of real biological neural networks. Our work opens new directions for the design of neural architectures and the understanding on neural networks in general.

Submitted 27 August, 2020; v1 submitted 13 July, 2020; originally announced July 2020.

Comments: ICML 2020, with open-source code

arXiv:2005.09738 [pdf, other]

Matching methods for obtaining survival functions to estimate the effect of a time-dependent treatment

Authors: Yun Li, Douglas E. Schaubel, Kevin He

Abstract: In observational studies of survival time featuring a binary time-dependent treatment, the hazard ratio (an instantaneous measure) is often used to represent the treatment effect. However, investigators are often more interested in the difference in survival functions. We propose semiparametric methods to estimate the causal effect of treatment among the treated with respect to survival probability. The objective is to compare post-treatment survival with the survival function that would have been observed in the absence of treatment. For each patient, we compute a prognostic score (based on the pre-treatment death hazard) and a propensity score (based on the treatment hazard). Each treated patient is then matched with an alive, uncensored and not-yet-treated patient with similar prognostic and/or propensity scores. The experience of each treated and matched patient is weighted using a variant of Inverse Probability of Censoring Weighting to account for the impact of censoring. We propose estimators of the treatment-specific survival functions (and their difference), computed through weighted Nelson-Aalen estimators. Closed-form variance estimators are proposed which take into consideration the potential replication of subjects across matched sets. The proposed methods are evaluated through simulation, then applied to estimate the effect of kidney transplantation on survival among end-stage renal disease patients using data from a national organ failure registry.

Submitted 19 May, 2020; originally announced May 2020.

arXiv:2005.08361 [pdf, other]

Bayesian biclustering for microbial metagenomic sequencing data via multinomial matrix factorization

Authors: Fangting Zhou, Kejun He, Qiwei Li, Robert S. Chapkin, Yang Ni

Abstract: High-throughput sequencing technology provides unprecedented opportunities to quantitatively explore human gut microbiome and its relation to diseases. Microbiome data are compositional, sparse, noisy, and heterogeneous, which pose serious challenges for statistical modeling. We propose an identifiable Bayesian multinomial matrix factorization model to infer overlapping clusters on both microbes and hosts. The proposed method represents the observed over-dispersed zero-inflated count matrix as Dirichlet-multinomial mixtures on which latent cluster structures are built hierarchically. Under the Bayesian framework, the number of clusters is automatically determined and available information from a taxonomic rank tree of microbes is naturally incorporated, which greatly improves the interpretability of our findings. We demonstrate the utility of the proposed approach by comparing to alternative methods in simulations. An application to a human gut microbiome dataset involving patients with inflammatory bowel disease reveals interesting clusters, which contain bacteria families Bacteroidaceae, Bifidobacteriaceae, Enterobacteriaceae, Fusobacteriaceae, Lachnospiraceae, Ruminococcaceae, Pasteurellaceae, and Porphyromonadaceae that are known to be related to the inflammatory bowel disease and its subtypes according to biological literature. Our findings can help generate potential hypotheses for future investigation of the heterogeneity of the human gut microbiome.

Submitted 8 October, 2020; v1 submitted 17 May, 2020; originally announced May 2020.

arXiv:2002.09535 [pdf, other]

doi 10.1145/3448016.3452779

RobustPeriod: Time-Frequency Mining for Robust Multiple Periodicity Detection

Authors: Qingsong Wen, Kai He, Liang Sun, Yingying Zhang, Min Ke, Huan Xu

Abstract: Periodicity detection is a crucial step in time series tasks, including monitoring and forecasting of metrics in many areas, such as IoT applications and self-driving database management system. In many of these applications, multiple periodic components exist and are often interlaced with each other. Such dynamic and complicated periodic patterns make the accurate periodicity detection difficult. In addition, other components in the time series, such as trend, outliers and noises, also pose additional challenges for accurate periodicity detection. In this paper, we propose a robust and general framework for multiple periodicity detection. Our algorithm applies maximal overlap discrete wavelet transform to transform the time series into multiple temporal-frequency scales such that different periodic components can be isolated. We rank them by wavelet variance, and then at each scale detect single periodicity by our proposed Huber-periodogram and Huber-ACF robustly. We rigorously prove the theoretical properties of Huber-periodogram and justify the use of Fisher's test on Huber-periodogram for periodicity detection. To further refine the detected periods, we compute unbiased autocorrelation function based on Wiener-Khinchin theorem from Huber-periodogram for improved robustness and efficiency. Experiments on synthetic and real-world datasets show that our algorithm outperforms other popular ones for both single and multiple periodicity detection.

Submitted 7 March, 2021; v1 submitted 21 February, 2020; originally announced February 2020.

Comments: Accepted by SIGMOD 2021; 10 pages, 6 figures, 8 tables, and 70 referred papers

arXiv:2002.00717 [pdf, other]

doi 10.1016/j.neucom.2021.06.051

Error-feedback stochastic modeling strategy for time series forecasting with convolutional neural networks

Authors: Xinze Zhang, Kun He, Yukun Bao

Abstract: Despite the superiority of convolutional neural networks demonstrated in time series modeling and forecasting, it has not been fully explored on the design of the neural network architecture and the tuning of the hyper-parameters. Inspired by the incremental construction strategy for building a random multilayer perceptron, we propose a novel Error-feedback Stochastic Modeling (ESM) strategy to construct a random Convolutional Neural Network (ESM-CNN) for time series forecasting task, which builds the network architecture adaptively. The ESM strategy suggests that random filters and neurons of the error-feedback fully connected layer are incrementally added to steadily compensate the prediction error during the construction process, and then a filter selection strategy is introduced to enable ESM-CNN to extract the different size of temporal features, providing helpful information at each iterative process for the prediction. The performance of ESM-CNN is justified on its prediction accuracy of one-step-ahead and multi-step-ahead forecasting tasks respectively. Comprehensive experiments on both the synthetic and real-world datasets show that the proposed ESM-CNN not only outperforms the state-of-art random neural networks, but also exhibits stronger predictive power and less computing overhead in comparison to trained state-of-art deep neural network models.

Submitted 11 February, 2022; v1 submitted 3 February, 2020; originally announced February 2020.

Journal ref: Neurocomputing 459 (2021): 234-248

arXiv:1912.12353 [pdf, other]

Minorization-Maximization-based Steepest Ascent for Large-scale Survival Analysis with Time-Varying Effects: Application to the National Kidney Transplant Dataset

Authors: Kevin He, Ji Zhu, Jian Kang, Yi Li

Abstract: The time-varying effects model is a flexible and powerful tool for modeling the dynamic changes of covariate effects. However, in survival analysis, its computational burden increases quickly as the number of sample sizes or predictors grows. Traditional methods that perform well for moderate sample sizes and low-dimensional data do not scale to massive data. Analysis of national kidney transplant data with a massive sample size and large number of predictors defy any existing statistical methods and software. In view of these difficulties, we propose a Minorization-Maximization-based steepest ascent procedure for estimating the time-varying effects. Leveraging the block structure formed by the basis expansions, the proposed procedure iteratively updates the optimal block-wise direction along which the approximate increase in the log-partial likelihood is maximized. The resulting estimates ensure the ascent property and serve as refinements of the previous step. The performance of the proposed method is examined by simulations and applications to the analysis of national kidney transplant data.

Submitted 27 December, 2019; originally announced December 2019.

arXiv:1912.00295

Efficient Estimation of Mixture Cure Frailty Model for Clustered Current Status Data

Authors: Tong Wang, Kejun He, Wei Ma, Dipankar Bandyopadhyay, Samiran Sinha

Abstract: Current status data abounds in the field of epidemiology and public health, where the only observable data for a subject is the random inspection time and the event status at inspection. Motivated by such a current status data from a periodontal study where data are inherently clustered, we propose a unified methodology to analyze such complex data. We allow the time-to-event to follow the semiparametric GOR model with a cure fraction, and develop a unified estimation scheme powered by the EM algorithm. The within-subject correlation is accounted for by a random (frailty) effect, and the non-parametric component of the GOR model is approximated via penalized splines, with a set of knot points that increases with the sample size. Proposed methodology is accompanied by a rigorous asymptotic theory, and the related semiparametric efficiency. The finite sample performance of our model parameters are assessed via simulation studies. Furthermore, the proposed methodology is illustrated via application to the oral health data, accompanied by diagnostic checks to identify influential observations. An easy to use R package CRFCSD is also available for implementation.

Submitted 23 April, 2020; v1 submitted 30 November, 2019; originally announced December 2019.

Comments: Unstable EM algorithm due to limited information in current status data

arXiv:1908.06281 [pdf, other]

Nesterov Accelerated Gradient and Scale Invariance for Adversarial Attacks

Authors: Jiadong Lin, Chuanbiao Song, Kun He, Liwei Wang, John E. Hopcroft

Abstract: Deep learning models are vulnerable to adversarial examples crafted by applying human-imperceptible perturbations on benign inputs. However, under the black-box setting, most existing adversaries often have a poor transferability to attack other defense models. In this work, from the perspective of regarding the adversarial example generation as an optimization process, we propose two new methods to improve the transferability of adversarial examples, namely Nesterov Iterative Fast Gradient Sign Method (NI-FGSM) and Scale-Invariant attack Method (SIM). NI-FGSM aims to adapt Nesterov accelerated gradient into the iterative attacks so as to effectively look ahead and improve the transferability of adversarial examples. While SIM is based on our discovery on the scale-invariant property of deep learning models, for which we leverage to optimize the adversarial perturbations over the scale copies of the input images so as to avoid "overfitting" on the white-box model being attacked and generate more transferable adversarial examples. NI-FGSM and SIM can be naturally integrated to build a robust gradient-based attack to generate more transferable adversarial examples against the defense models. Empirical results on ImageNet dataset demonstrate that our attack methods exhibit higher transferability and achieve higher attack success rates than state-of-the-art gradient-based attacks.

Submitted 2 February, 2020; v1 submitted 17 August, 2019; originally announced August 2019.

Comments: ICLR 2020

arXiv:1907.07809 [pdf, ps, other]

Accounting for total variation and robustness in profiling health care providers

Authors: Lu Xia, Kevin He, Yanming Li, John D. Kalbfleisch

Abstract: Monitoring outcomes of health care providers, such as patient deaths, hospitalizations and hospital readmissions, helps in assessing the quality of health care. We consider a large database on patients being treated at dialysis facilities in the United States, and the problem of identifying facilities with outcomes that are better than or worse than expected. Analyses of such data have been commonly based on random or fixed facility effects, which have shortcomings that can lead to unfair assessments. A primary issue is that they do not appropriately account for variation between providers that is outside the providers' control due, for example, to unobserved patient characteristics that vary between providers. In this article, we propose a smoothed empirical null approach that accounts for the total variation and adapts to different provider sizes. The linear model provides an illustration that extends easily to other nonlinear models for survival or binary outcomes, for example. The empirical null method is generalized to allow for some variation being due to quality of care. These methods are examined with numerical simulations and applied to the monitoring of survival in the dialysis facility data.

Submitted 23 June, 2020; v1 submitted 17 July, 2019; originally announced July 2019.

arXiv:1906.00555 [pdf, ps, other]

Adversarially Robust Generalization Just Requires More Unlabeled Data

Authors: Runtian Zhai, Tianle Cai, Di He, Chen Dan, Kun He, John Hopcroft, Liwei Wang

Abstract: Neural network robustness has recently been highlighted by the existence of adversarial examples. Many previous works show that the learned networks do not perform well on perturbed test data, and significantly more labeled data is required to achieve adversarially robust generalization. In this paper, we theoretically and empirically show that with just more unlabeled data, we can learn a model with better adversarially robust generalization. The key insight of our results is based on a risk decomposition theorem, in which the expected robust risk is separated into two parts: the stability part which measures the prediction stability in the presence of perturbations, and the accuracy part which evaluates the standard classification accuracy. As the stability part does not depend on any label information, we can optimize this part using unlabeled data. We further prove that for a specific Gaussian mixture problem, adversarially robust generalization can be almost as easy as the standard generalization in supervised learning if a sufficiently large amount of unlabeled data is provided. Inspired by the theoretical findings, we further show that a practical adversarial training algorithm that leverages unlabeled data can improve adversarial robust generalization on MNIST and Cifar-10.

Submitted 25 September, 2019; v1 submitted 2 June, 2019; originally announced June 2019.

Comments: 16 pages. Submitted to ICLR 2020

arXiv:1905.06109 [pdf, ps, other]

A New Anchor Word Selection Method for the Separable Topic Discovery

Authors: Kun He, Wu Wang, Xiaosen Wang, John E. Hopcroft

Abstract: Separable Non-negative Matrix Factorization (SNMF) is an important method for topic modeling, where "separable" assumes every topic contains at least one anchor word, defined as a word that has non-zero probability only on that topic. SNMF focuses on the word co-occurrence patterns to reveal topics by two steps: anchor word selection and topic recovery. The quality of the anchor words strongly influences the quality of the extracted topics. Existing anchor word selection algorithm is to greedily find an approximate convex hull in a high-dimensional word co-occurrence space. In this work, we propose a new method for the anchor word selection by associating the word co-occurrence probability with the words similarity and assuming that the most different words on semantic are potential candidates for the anchor words. Therefore, if the similarity of a word-pair is very low, then the two words are very likely to be the anchor words. According to the statistical information of text corpora, we can get the similarity of all word-pairs. We build the word similarity graph where the nodes correspond to words and weights on edges stand for the word-pair similarity. Following this way, we design a greedy method to find a minimum edge-weight anchor clique of a given size in the graph for the anchor word selection. Extensive experiments on real-world corpus demonstrate the effectiveness of the proposed anchor word selection method that outperforms the common convex hull-based methods on the revealed topic quality. Meanwhile, our method is much faster than typical SNMF based method.

Submitted 10 May, 2019; originally announced May 2019.

Comments: 18 pages, 4 figures

arXiv:1905.05840 [pdf, other]

A Learning based Branch and Bound for Maximum Common Subgraph Problems

Authors: Yan-li Liu, Chu-min Li, Hua Jiang, Kun He

Abstract: Branch-and-bound (BnB) algorithms are widely used to solve combinatorial problems, and the performance crucially depends on its branching heuristic. In this work, we consider a typical problem of maximum common subgraph (MCS), and propose a branching heuristic inspired from reinforcement learning with a goal of reaching a tree leaf as early as possible to greatly reduce the search tree size. Extensive experiments show that our method is beneficial and outperforms current best BnB algorithm for the MCS.

Submitted 21 May, 2019; v1 submitted 14 May, 2019; originally announced May 2019.

Comments: 6 pages, 4 figures, uses ijcai19.sty

ACM Class: I.5.2; F.2.2

arXiv:1810.11750 [pdf, other]

Towards Understanding Learning Representations: To What Extent Do Different Neural Networks Learn the Same Representation

Authors: Liwei Wang, Lunjia Hu, Jiayuan Gu, Yue Wu, Zhiqiang Hu, Kun He, John Hopcroft

Abstract: It is widely believed that learning good representations is one of the main reasons for the success of deep neural networks. Although highly intuitive, there is a lack of theory and systematic approach quantitatively characterizing what representations do deep neural networks learn. In this work, we move a tiny step towards a theory and better understanding of the representations. Specifically, we study a simpler problem: How similar are the representations learned by two networks with identical architecture but trained from different initializations. We develop a rigorous theory based on the neuron activation subspace match model. The theory gives a complete characterization of the structure of neuron activation subspace matches, where the core concepts are maximum match and simple match which describe the overall and the finest similarity between sets of neurons in two networks respectively. We also propose efficient algorithms to find the maximum match and simple matches. Finally, we conduct extensive experiments using our algorithms. Experimental results suggest that, surprisingly, representations learned by the same convolutional layers of networks trained from different initializations are not as similar as prevalently expected, at least in terms of subspace match.

Submitted 28 November, 2018; v1 submitted 27 October, 2018; originally announced October 2018.

Comments: 17 pages, 6 figures

arXiv:1810.00740 [pdf, other]

Improving the Generalization of Adversarial Training with Domain Adaptation

Authors: Chuanbiao Song, Kun He, Liwei Wang, John E. Hopcroft

Abstract: By injecting adversarial examples into training data, adversarial training is promising for improving the robustness of deep learning models. However, most existing adversarial training approaches are based on a specific type of adversarial attack. It may not provide sufficiently representative samples from the adversarial domain, leading to a weak generalization ability on adversarial examples from other attacks. Moreover, during the adversarial training, adversarial perturbations on inputs are usually crafted by fast single-step adversaries so as to scale to large datasets. This work is mainly focused on the adversarial training yet efficient FGSM adversary. In this scenario, it is difficult to train a model with great generalization due to the lack of representative adversarial samples, aka the samples are unable to accurately reflect the adversarial domain. To alleviate this problem, we propose a novel Adversarial Training with Domain Adaptation (ATDA) method. Our intuition is to regard the adversarial training on FGSM adversary as a domain adaption task with limited number of target domain samples. The main idea is to learn a representation that is semantically meaningful and domain invariant on the clean domain as well as the adversarial domain. Empirical evaluations on Fashion-MNIST, SVHN, CIFAR-10 and CIFAR-100 demonstrate that ATDA can greatly improve the generalization of adversarial training and the smoothness of the learned models, and outperforms state-of-the-art methods on standard benchmark datasets. To show the transfer ability of our method, we also extend ATDA to the adversarial training on iterative attacks such as PGD-Adversarial Training (PAT) and the defense performance is improved considerably.

Submitted 15 March, 2019; v1 submitted 1 October, 2018; originally announced October 2018.

Comments: ICLR 2019

arXiv:1808.01990 [pdf, other]

Hashing with Binary Matrix Pursuit

Authors: Fatih Cakir, Kun He, Stan Sclaroff

Abstract: We propose theoretical and empirical improvements for two-stage hashing methods. We first provide a theoretical analysis on the quality of the binary codes and show that, under mild assumptions, a residual learning scheme can construct binary codes that fit any neighborhood structure with arbitrary accuracy. Secondly, we show that with high-capacity hash functions such as CNNs, binary code inference can be greatly simplified for many standard neighborhood definitions, yielding smaller optimization problems and more robust codes. Incorporating our findings, we propose a novel two-stage hashing method that significantly outperforms previous hashing studies on widely used image retrieval benchmarks.

Submitted 6 August, 2018; originally announced August 2018.

Comments: 23 pages, 4 figures. In Proceedings of European Conference on Computer Vision (ECCV), 2018

arXiv:1806.05662 [pdf, other]

GLoMo: Unsupervisedly Learned Relational Graphs as Transferable Representations

Authors: Zhilin Yang, Jake Zhao, Bhuwan Dhingra, Kaiming He, William W. Cohen, Ruslan Salakhutdinov, Yann LeCun

Abstract: Modern deep transfer learning approaches have mainly focused on learning generic feature vectors from one task that are transferable to other tasks, such as word embeddings in language and pretrained convolutional features in vision. However, these approaches usually transfer unary features and largely ignore more structured graphical representations. This work explores the possibility of learning generic latent relational graphs that capture dependencies between pairs of data units (e.g., words or pixels) from large-scale unlabeled data and transferring the graphs to downstream tasks. Our proposed transfer learning framework improves performance on various tasks including question answering, natural language inference, sentiment analysis, and image classification. We also show that the learned graphs are generic enough to be transferred to different embeddings on which the graphs have not been trained (including GloVe embeddings, ELMo embeddings, and task-specific RNN hidden unit), or embedding-free units such as image pixels.

Submitted 2 July, 2018; v1 submitted 14 June, 2018; originally announced June 2018.

arXiv:1805.09267 [pdf, other]

doi 10.1016/j.neucom.2020.08.054

Reinforcement Learning for Heterogeneous Teams with PALO Bounds

Authors: Roi Ceren, Prashant Doshi, Keyang He

Abstract: We introduce reinforcement learning for heterogeneous teams in which rewards for an agent are additively factored into local costs, stimuli unique to each agent, and global rewards, those shared by all agents in the domain. Motivating domains include coordination of varied robotic platforms, which incur different costs for the same action, but share an overall goal. We present two templates for learning in this setting with factored rewards: a generalization of Perkins' Monte Carlo exploring starts for POMDPs to canonical MPOMDPs, with a single policy mapping joint observations of all agents to joint actions (MCES-MP); and another with each agent individually mapping joint observations to their own action (MCES-FMP). We use probably approximately local optimal (PALO) bounds to analyze sample complexity, instantiating these templates to PALO learning. We promote sample efficiency by including a policy space pruning technique, and evaluate the approaches on three domains of heterogeneous agents demonstrating that MCES-FMP yields improved policies in less samples compared to MCES-MP and a previous benchmark.

Submitted 23 May, 2018; originally announced May 2018.

Journal ref: Neurocomputing, Volume 420, 8 January 2021, Pages 36-56

arXiv:1805.06595 [pdf, ps, other]

Covariance-Insured Screening

Authors: Kevin He, Jian Kang, Hyokyoung Grace Hong, Ji Zhu, Yanming Li, Huazhen Lin, Han Xu, Yi Li

Abstract: Modern bio-technologies have produced a vast amount of high-throughput data with the number of predictors far greater than the sample size. In order to identify more novel biomarkers and understand biological mechanisms, it is vital to detect signals weakly associated with outcomes among ultrahigh-dimensional predictors. However, existing screening methods, which typically ignore correlation information, are likely to miss these weak signals. By incorporating the inter-feature dependence, we propose a covariance-insured screening methodology to identify predictors that are jointly informative but only marginally weakly associated with outcomes. The validity of the method is examined via extensive simulations and real data studies for selecting potential genetic factors related to the onset of cancer.

Submitted 16 May, 2018; originally announced May 2018.

arXiv:1804.08222 [pdf, other]

Null-free False Discovery Rate Control Using Decoy Permutations

Authors: Kun He, Mengjie Li, Yan Fu, Fuzhou Gong, Xiaoming Sun

Abstract: The traditional approaches to false discovery rate (FDR) control in multiple hypothesis testing are usually based on the null distribution of a test statistic. However, all types of null distributions, including the theoretical, permutation-based and empirical ones, have some inherent drawbacks. For example, the theoretical null might fail because of improper assumptions on the sample distribution. Here, we propose a null distribution-free approach to FDR control for multiple hypothesis testing. This approach, named target-decoy procedure, simply builds on the ordering of tests by some statistic or score, the null distribution of which is not required to be known. Competitive decoy tests are constructed from permutations of original samples and are used to estimate the false target discoveries. We prove that this approach controls the FDR when the statistics are independent between different tests. Simulation demonstrates that it is more stable and powerful than two existing popular approaches. Evaluation is also made on a real dataset.

Submitted 12 April, 2021; v1 submitted 22 April, 2018; originally announced April 2018.

Comments: 23 pages

Showing 1–50 of 57 results for author: He, K