Subgroup Identification with Latent Factor Structure

Authors: Yong He, Dong Liu, Fuxin Wang, Mingjuan Zhang, Wen-Xin Zhou

Abstract: Subgroup analysis has attracted growing attention due to its ability to identify meaningful subgroups from a heterogeneous population and thereby improve predictive power. However, in many scenarios such as social science and biology, the covariates are possibly highly correlated due to the existence of common factors, which brings great challenges for group identification and is neglected in the existing literature. In this paper, we aim to fill this gap in the ``diverging dimension'' regime and propose a center-augmented subgroup identification method under the Factor Augmented (sparse) Linear Model framework, which bridges dimension reduction and sparse regression. The proposed method is flexible to the possibly high cross-sectional dependence among covariates and inherits the computational advantage with complexity $O(nK)$, in contrast to the $O(n^2)$ complexity of the conventional pairwise fusion penalty method in the literature, where $n$ is the sample size and $K$ is the number of subgroups. We also investigate the asymptotic properties of its oracle estimators under conditions on the minimal distance between group centroids. To implement the proposed approach, we introduce a Difference of Convex functions based Alternating Direction Method of Multipliers (DC-ADMM) algorithm and demonstrate its convergence to a local minimizer in finite steps. We illustrate the superiority of the proposed method through extensive numerical experiments and a real macroeconomic data example. An \texttt{R} package \texttt{SILFS} implementing the method is also available on CRAN.

Submitted 30 June, 2024; originally announced July 2024.

arXiv:2406.15762

Rethinking the Diffusion Models for Numerical Tabular Data Imputation from the Perspective of Wasserstein Gradient Flow

Authors: Zhichao Chen, Haoxuan Li, Fangyikang Wang, Odin Zhang, Hu Xu, Xiaoyu Jiang, Zhihuan Song, Eric H. Wang

Abstract: Diffusion models (DMs) have gained attention in Missing Data Imputation (MDI), but two long-neglected issues remain to be addressed: (1) Inaccurate Imputation, which arises from the inherently sample-diversification-pursuing generative process of DMs, and (2) Difficult Training, which stems from the intricate design required for the mask matrix in the model training stage. To address these concerns within the realm of numerical tabular datasets, we introduce a novel principled approach termed Kernelized Negative Entropy-regularized Wasserstein gradient flow Imputation (KnewImp). Specifically, based on the Wasserstein gradient flow (WGF) framework, we first prove that issue (1) arises because the cost functionals implicitly maximized in DM-based MDI are equivalent to the MDI's objective plus diversification-promoting non-negative terms. Based on this, we then design a novel cost functional with diversification-discouraging negative entropy and derive our KnewImp approach within the WGF framework and reproducing kernel Hilbert space. After that, we prove that the imputation procedure of KnewImp can be derived from another cost functional related to the joint distribution, eliminating the need for the mask matrix and hence naturally addressing issue (2). Extensive experiments demonstrate that our proposed KnewImp approach significantly outperforms existing state-of-the-art methods.

Submitted 22 June, 2024; originally announced June 2024.

arXiv:2406.00701

Profiled Transfer Learning for High Dimensional Linear Model

Authors: Ziqian Lin, Junlong Zhao, Fang Wang, Hansheng Wang

Abstract: We develop here a novel transfer learning methodology called Profiled Transfer Learning (PTL). The method is based on the \textit{approximate-linear} assumption between the source and target parameters. Compared with the commonly assumed \textit{vanishing-difference} assumption and \textit{low-rank} assumption in the literature, the \textit{approximate-linear} assumption is more flexible and less stringent. Specifically, the PTL estimator is constructed in two major steps. First, we regress the response on the transferred feature, leading to the profiled responses. Subsequently, we learn the regression relationship between profiled responses and the covariates on the target data. The final estimator is then assembled based on the \textit{approximate-linear} relationship. To theoretically support the PTL estimator, we derive the non-asymptotic upper bound and minimax lower bound. We find that the PTL estimator is minimax optimal under appropriate regularity conditions. Extensive simulation studies are presented to demonstrate the finite sample performance of the new method. A real data example about sentence prediction is also presented with very encouraging results.

Submitted 5 June, 2024; v1 submitted 2 June, 2024; originally announced June 2024.

arXiv:2405.16413

Augmented Risk Prediction for the Onset of Alzheimer's Disease from Electronic Health Records with Large Language Models

Authors: Jiankun Wang, Sumyeong Ahn, Taykhoom Dalal, Xiaodan Zhang, Weishen Pan, Qiannan Zhang, Bin Chen, Hiroko H. Dodge, Fei Wang, Jiayu Zhou

Abstract: Alzheimer's disease (AD) is the fifth-leading cause of death among Americans aged 65 and older. Screening and early detection of AD and related dementias (ADRD) are critical for timely intervention and for identifying clinical trial participants. The widespread adoption of electronic health records (EHRs) offers an important resource for developing ADRD screening tools such as machine learning based predictive models. Recent advancements in large language models (LLMs) demonstrate their unprecedented capability of encoding knowledge and performing reasoning, which offers them strong potential for enhancing risk prediction. This paper proposes a novel pipeline that augments risk prediction by leveraging the few-shot inference power of LLMs to make predictions on cases where traditional supervised learning methods (SLs) may not excel. Specifically, we develop a collaborative pipeline that combines SLs and LLMs via a confidence-driven decision-making mechanism, leveraging the strengths of SLs in clear-cut cases and LLMs in more complex scenarios. We evaluate this pipeline using a real-world EHR data warehouse from Oregon Health \& Science University (OHSU) Hospital, encompassing EHRs from over 2.5 million patients and more than 20 million patient encounters. Our results show that our proposed approach effectively combines the power of SLs and LLMs, offering significant improvements in predictive performance. This advancement holds promise for revolutionizing ADRD screening and early detection practices, with potential implications for better strategies of patient management and thus improving healthcare.

Submitted 25 May, 2024; originally announced May 2024.

arXiv:2405.14848

Local Causal Discovery for Structural Evidence of Direct Discrimination

Authors: Jacqueline Maasch, Kyra Gan, Violet Chen, Agni Orfanoudaki, Nil-Jana Akpinar, Fei Wang

Abstract: Fairness is a critical objective in policy design and algorithmic decision-making. Identifying the causal pathways of unfairness requires knowledge of the underlying structural causal model, which may be incomplete or unavailable. This limits the practicality of causal fairness analysis in complex or low-knowledge domains. To mitigate this practicality gap, we advocate for developing efficient causal discovery methods for fairness applications. To this end, we introduce local discovery for direct discrimination (LD3): a polynomial-time algorithm that recovers structural evidence of direct discrimination. LD3 performs a linear number of conditional independence tests with respect to variable set size. Moreover, we propose a graphical criterion for identifying the weighted controlled direct effect (CDE), a qualitative measure of direct discrimination. We prove that this criterion is satisfied by the knowledge returned by LD3, increasing the accessibility of the weighted CDE as a causal fairness measure. Taking liver transplant allocation as a case study, we highlight the potential impact of LD3 for modeling fairness in complex decision systems. Results on real-world data demonstrate more plausible causal relations than baselines, which took 197x to 5870x longer to execute.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.10329

Causal inference approach to appraise long-term effects of maintenance policy on functional performance of asphalt pavements

Authors: Lingyun You, Nanning Guo, Zhengwu Long, Fusong Wang, Chundi Si, Aboelkasim Diab

Abstract: Asphalt pavements, as the most prevalent transportation infrastructure, are prone to serious traffic safety problems due to functional or structural damage caused by stresses or strains imposed through repeated traffic loads and continuous climatic cycles. The good quality or high serviceability of infrastructure networks is vital to the urbanization and industrial development of nations. In order to maintain good functional pavement performance and extend the service life of asphalt pavements, the long-term performance of pavements under maintenance policies needs to be evaluated and favorable options selected based on the condition of the pavement. A major challenge in evaluating maintenance policies is to produce valid treatments for the outcome assessment under the control of uncertainty of vehicle loads and the disturbance of freeze-thaw cycles in the climatic environment. In this study, a novel causal inference approach combining a classical causal structural model and a potential outcome model framework is proposed to appraise the long-term effects of four preventive maintenance treatments for longitudinal cracking over a 5-year period of upkeep. Three fundamental issues were brought to our attention: 1) detection of causal relationships prior to variables under environmental loading (identification of causal structure); 2) obtaining direct causal effects of treatment on outcomes excluding covariates (identification of causal effects); and 3) sensitivity analysis of causal relationships. The results show that the method can accurately evaluate the effect of preventive maintenance treatments and assess the maintenance time to cater well for the functional performance of different preventive maintenance approaches. This framework could help policymakers to develop appropriate maintenance strategies for pavements.

Submitted 2 July, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

Comments: The arXiv version needs to be withdrawn since the model needs to be validated and updated with advanced machine learning technologies to enhance the accuracy of the model, and there are some crucial definition errors of symbols in the arXiv version

arXiv:2403.11163

doi 10.1080/24754269.2024.2343151

A Selective Review on Statistical Methods for Massive Data Computation: Distributed Computing, Subsampling, and Minibatch Techniques

Authors: Xuetong Li, Yuan Gao, Hong Chang, Danyang Huang, Yingying Ma, Rui Pan, Haobo Qi, Feifei Wang, Shuyuan Wu, Ke Xu, Jing Zhou, Xuening Zhu, Yingqiu Zhu, Hansheng Wang

Abstract: This paper presents a selective review of statistical computation methods for massive data analysis. A large number of statistical methods for massive data computation have been rapidly developed in the past decades. In this work, we focus on three categories of statistical computation methods: (1) distributed computing, (2) subsampling methods, and (3) minibatch gradient techniques. The first class of literature is about distributed computing and focuses on the situation where the dataset is too large to be comfortably handled by one single computer. In this case, a distributed computation system with multiple computers has to be utilized. The second class of literature is about subsampling methods and concerns the situation where the dataset is small enough to be placed on one single computer but too large to be easily processed by its memory as a whole. The last class of literature studies minibatch gradient related optimization techniques, which have been extensively used for optimizing various deep learning models.

Submitted 17 March, 2024; originally announced March 2024.

arXiv:2403.07185

Uncertainty in Graph Neural Networks: A Survey

Authors: Fangxin Wang, Yuqing Liu, Kay Liu, Yibo Wang, Sourav Medya, Philip S. Yu

Abstract: Graph Neural Networks (GNNs) have been extensively used in various real-world applications. However, the predictive uncertainty of GNNs stemming from diverse sources such as inherent randomness in data and model training errors can lead to unstable and erroneous predictions. Therefore, identifying, quantifying, and utilizing uncertainty are essential to enhance the performance of the model for the downstream tasks as well as the reliability of the GNN predictions. This survey aims to provide a comprehensive overview of GNNs from the perspective of uncertainty, with an emphasis on its integration in graph learning. We compare and summarize existing graph uncertainty theory and methods, alongside the corresponding downstream tasks. Thereby, we bridge the gap between theory and practice while connecting different GNN communities. Moreover, our work provides valuable insights into promising directions in this field.

Submitted 11 March, 2024; originally announced March 2024.

Comments: 13 main pages, 3 figures, 1 table. Under review

arXiv:2402.09970

Accelerating Parallel Sampling of Diffusion Models

Authors: Zhiwei Tang, Jiasheng Tang, Hao Luo, Fan Wang, Tsung-Hui Chang

Abstract: Diffusion models have emerged as state-of-the-art generative models for image generation. However, sampling from diffusion models is usually time-consuming due to the inherent autoregressive nature of their sampling process. In this work, we propose a novel approach that accelerates the sampling of diffusion models by parallelizing the autoregressive process. Specifically, we reformulate the sampling process as solving a system of triangular nonlinear equations through fixed-point iteration. With this innovative formulation, we explore several systematic techniques to further reduce the iteration steps required by the solving process. Applying these techniques, we introduce ParaTAA, a universal and training-free parallel sampling algorithm that can leverage extra computational and memory resources to increase the sampling speed. Our experiments demonstrate that ParaTAA can decrease the inference steps required by common sequential sampling algorithms such as DDIM and DDPM by a factor of 4$\sim$14. Notably, when applying ParaTAA with 100-step DDIM for Stable Diffusion, a widely-used text-to-image diffusion model, it can produce the same images as the sequential sampling in only 7 inference steps. The code is available at https://github.com/TZW1998/ParaTAA-Diffusion.

Submitted 27 May, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

Comments: ICML 2024
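
Illustrative note: the abstract above describes recasting a sequential sampling recurrence as a triangular system solved by fixed-point iteration. The following is a minimal sketch of that general idea on a toy scalar recurrence. The function `step`, the horizon `T`, and the tolerance are hypothetical stand-ins, not the paper's denoiser or the ParaTAA algorithm, and none of the acceleration techniques mentioned in the abstract are included.

```python
import numpy as np

def step(x, t):
    # Hypothetical toy update x_{t+1} = f(x_t, t); stands in for one sampler step.
    return 0.9 * x + 0.1 * np.cos(x + t)

def sequential(x0, T):
    # Standard sequential evaluation: T dependent steps, one after another.
    xs = [x0]
    for t in range(T):
        xs.append(step(xs[-1], t))
    return np.array(xs)

def parallel_fixed_point(x0, T, tol=1e-10):
    # Treat the whole trajectory as the unknown and update every time step
    # at once from the previous iterate (vectorized over t).
    # Because the system is triangular, entry t is exact after t iterations,
    # so at most T iterations are needed; early stopping may use fewer.
    xs = np.full(T + 1, x0, dtype=float)
    ts = np.arange(T)
    for k in range(1, T + 1):
        new = xs.copy()
        new[1:] = step(xs[:-1], ts)
        if np.max(np.abs(new - xs)) < tol:
            return new, k
        xs = new
    return xs, T

T = 50
xs_seq = sequential(1.0, T)
xs_par, n_iters = parallel_fixed_point(1.0, T)
print(n_iters, np.max(np.abs(xs_par - xs_seq)))
```

Each iteration of the parallel version evaluates all time steps together, which is where extra compute and memory could be traded for fewer dependent rounds.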

arXiv:2312.04281

Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data

Authors: Feifei Wang, Huiyun Tang, Yang Li

Abstract: Federated learning is an emerging distributed machine learning framework aiming at protecting data privacy. Data heterogeneity is one of the core challenges in federated learning, which could severely degrade the convergence rate and prediction performance of deep neural networks. To address this issue, we develop a novel personalized federated learning framework for heterogeneous data, which we refer to as FedSplit. This modeling framework is motivated by the finding that data in different clients contain both common knowledge and personalized knowledge. The hidden elements in each neural layer can then be split into shared and personalized groups. With this decomposition, a novel objective function is established and optimized. We demonstrate that FedSplit enjoys a faster convergence speed than the standard federated learning method, both theoretically and empirically. The generalization bound of the FedSplit method is also studied. To practically implement the proposed method on real datasets, factor analysis is introduced to facilitate the decoupling of hidden elements. This leads to a practically implemented model for FedSplit, which we refer to as FedFac. We demonstrate by simulation studies that using factor analysis can well recover the underlying shared/personalized decomposition. The superior prediction performance of FedFac is further verified empirically by comparison with various state-of-the-art federated learning methods on several real datasets.

Submitted 7 December, 2023; originally announced December 2023.

Comments: 29 pages, 10 figures

arXiv:2311.07906

Mixture Conditional Regression with Ultrahigh Dimensional Text Data for Estimating Extralegal Factor Effects

Authors: Jiaxin Shi, Fang Wang, Yuan Gao, Xiaojun Song, Hansheng Wang

Abstract: Testing judicial impartiality is a problem of fundamental importance in empirical legal studies, for which standard regression methods have been popularly used to estimate the extralegal factor effects. However, those methods cannot handle control variables with ultrahigh dimensionality, such as found in judgment documents recorded in text format. To solve this problem, we develop a novel mixture conditional regression (MCR) approach, assuming that the whole sample can be classified into a number of latent classes. Within each latent class, a standard linear regression model can be used to model the relationship between the response and a key feature vector, which is assumed to be of a fixed dimension. Meanwhile, ultrahigh dimensional control variables are used to determine the latent class membership, where a Naïve Bayes type model is used to describe the relationship. Hence, the dimension of control variables is allowed to be arbitrarily high. A novel expectation-maximization algorithm is developed for model estimation. Therefore, we are able to estimate the key parameters of interest as efficiently as if the true class membership were known in advance. Simulation studies are presented to demonstrate the proposed MCR method. A real dataset of Chinese burglary offenses is analyzed for illustration.

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2310.17816

Local Discovery by Partitioning: Polynomial-Time Causal Discovery Around Exposure-Outcome Pairs

Authors: Jacqueline Maasch, Weishen Pan, Shantanu Gupta, Volodymyr Kuleshov, Kyra Gan, Fei Wang

Abstract: Causal discovery is crucial for causal inference in observational studies, as it can enable the identification of valid adjustment sets (VAS) for unbiased effect estimation. However, global causal discovery is notoriously hard in the nonparametric setting, with exponential time and sample complexity in the worst case. To address this, we propose local discovery by partitioning (LDP): a local causal discovery method that is tailored for downstream inference tasks without requiring parametric and pretreatment assumptions. LDP is a constraint-based procedure that returns a VAS for an exposure-outcome pair under latent confounding, given sufficient conditions. The total number of independence tests performed is worst-case quadratic with respect to the cardinality of the variable set. Asymptotic theoretical guarantees are numerically validated on synthetic graphs. Adjustment sets from LDP yield less biased and more precise average treatment effect estimates than baseline discovery algorithms, with LDP outperforming on confounder recall, runtime, and test count for VAS discovery. Notably, LDP ran at least 1300x faster than baselines on a benchmark.

Submitted 1 June, 2024; v1 submitted 25 October, 2023; originally announced October 2023.

Journal ref: Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence (2024)

arXiv:2310.17760

Novel Models for Multiple Dependent Heteroskedastic Time Series

Authors: Fangyijie Wang, Michael Salter-Townshend

Abstract: Functional magnetic resonance imaging or functional MRI (fMRI) is a very popular tool used for differentiating brain regions by measuring brain activity. It is affected by physiological noise, such as head and brain movement in the scanner from breathing, heart beats, or the subject fidgeting. The purpose of this paper is to propose a novel approach to handling fMRI data for infants with high volatility caused by sudden head movements. Another purpose is to evaluate the volatility modelling performance of multiple dependent fMRI time series data. The models examined in this paper are AR and GARCH, and the modelling performance is evaluated by several statistical performance measures. The conclusions of this paper are that multiple dependent fMRI series data can be fitted with an AR + GARCH model if the multiple fMRI data have many sudden head movements. The GARCH model can capture the shared volatility clustering caused by head movements across brain regions. However, for multiple fMRI data without many head movements, the fitted AR + GARCH model shows different performance. The conclusions are supported by statistical tests and measures. This paper highlights the difference between the proposed approach and traditional approaches when estimating model parameters and modelling conditional variances on multiple dependent time series. In the future, the proposed approach can be applied to other research fields, such as financial economics and signal processing. Code is available at \url{https://github.com/13204942/STAT40710}.

Submitted 26 October, 2023; originally announced October 2023.

Comments: 18 pages

arXiv:2310.05646

Transfer learning for piecewise-constant mean estimation: Optimality, $\ell_1$- and $\ell_0$-penalisation

Authors: Fan Wang, Yi Yu

Abstract: We study transfer learning for estimating piecewise-constant signals when source data, which may be relevant but disparate, are available in addition to the target data. We first investigate transfer learning estimators that respectively employ $\ell_1$- and $\ell_0$-penalties for unisource data scenarios and then generalise these estimators to accommodate multisources. To further reduce estimation errors, especially when some sources significantly differ from the target, we introduce an informative source selection algorithm. We then examine these estimators with multisource selection and establish their minimax optimality. Unlike the common narrative in the transfer learning literature that the performance is enhanced through large source sample sizes, our approaches leverage higher observation frequencies and accommodate diverse frequencies across multiple sources. Our theoretical findings are supported by extensive numerical experiments, with the code available online, see https://github.com/chrisfanwang/transferlearning

Submitted 29 October, 2023; v1 submitted 9 October, 2023; originally announced October 2023.

arXiv:2310.05019

Compressed online Sinkhorn

Authors: Fengpei Wang, Clarice Poon, Tony Shardlow

Abstract: The use of optimal transport (OT) distances, and in particular entropic-regularised OT distances, is an increasingly popular evaluation metric in many areas of machine learning and data science. Their use has largely been driven by the availability of efficient algorithms such as the Sinkhorn algorithm. One of the drawbacks of the Sinkhorn algorithm for large-scale data processing is that it is a two-phase method, where one first draws a large stream of data from the probability distributions, before applying the Sinkhorn algorithm to the discrete probability measures. More recently, there have been several works developing stochastic versions of Sinkhorn that directly handle continuous streams of data. In this work, we revisit the recently introduced online Sinkhorn algorithm of [Mensch and Peyré, 2020]. Our contributions are twofold. First, we improve the convergence analysis for the online Sinkhorn algorithm; the new rate that we obtain is faster than the previous rate under certain parameter choices. We also present numerical results to verify the sharpness of our result. Secondly, we propose the compressed online Sinkhorn algorithm, which combines measure compression techniques with the online Sinkhorn algorithm. We provide numerical experiments to show practical numerical gains, as well as theoretical guarantees on the efficiency of our approach.

Submitted 8 October, 2023; originally announced October 2023.
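
Illustrative note: the abstract above contrasts online variants with the classical two-phase procedure, in which samples are drawn first and the Sinkhorn algorithm is then run on the resulting discrete measures. The following is a minimal sketch of that classical batch Sinkhorn iteration only, assuming uniform weights, a squared-Euclidean cost rescaled to [0, 1], and hypothetical values for the regularisation `eps` and the iteration count. It is not the online or compressed algorithm studied in the paper.

```python
import numpy as np

def sinkhorn(a, b, cost, eps=0.5, n_iters=500):
    # Entropic-regularised OT between discrete measures a (size n) and b (size m).
    # K is the Gibbs kernel; u and v are the scaling vectors updated alternately.
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)      # rescale so that row sums match a
        v = b / (K.T @ u)    # rescale so that column sums match b
    # Transport plan diag(u) K diag(v)
    return u[:, None] * K * v[None, :]

# Phase 1: draw samples (hypothetical toy data).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
y = rng.normal(size=(6, 2))

# Phase 2: run Sinkhorn on the discrete measures.
cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()          # rescale for numerical stability
a = np.full(5, 1 / 5)
b = np.full(6, 1 / 6)
plan = sinkhorn(a, b, cost)
print(plan.sum(axis=1), plan.sum(axis=0))
```

The marginals of the returned plan can be compared against `a` and `b` as a quick convergence check.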

arXiv:2306.15286

Multilayer random dot product graphs: Estimation and online change point detection

Authors: Fan Wang, Wanshan Li, Oscar Hernan Madrid Padilla, Yi Yu, Alessandro Rinaldo

Abstract: We study the multilayer random dot product graph (MRDPG) model, an extension of the random dot product graph to multilayer networks. To estimate the edge probabilities, we deploy a tensor-based methodology and demonstrate its superiority over existing approaches. Moving to dynamic MRDPGs, we formulate and analyse an online change point detection framework. At every time point, we observe a realization from an MRDPG. Across layers, we assume fixed shared common node sets and latent positions but allow for different connectivity matrices. We propose efficient tensor algorithms under both fixed and random latent position cases to minimize the detection delay while controlling false alarms. Notably, in the random latent position case, we devise a novel nonparametric change point detection algorithm based on density kernel estimation that is applicable to a wide range of scenarios, including stochastic block models as special cases. Our theoretical findings are supported by extensive numerical experiments, with the code available online at https://github.com/MountLee/MRDPG.

Submitted 10 June, 2024; v1 submitted 27 June, 2023; originally announced June 2023.

arXiv:2306.04093 [pdf, other]

Subnetwork Estimation for Spatial Autoregressive Models in Large-scale Networks

Authors: Xuetong Li, Feifei Wang, Wei Lan, Hansheng Wang

Abstract: Large-scale networks are commonly encountered in practice (e.g., Facebook and Twitter) by researchers. In order to study the network interaction between different nodes of large-scale networks, the spatial autoregressive (SAR) model has been popularly employed. Despite its popularity, the estimation of a SAR model on large-scale networks remains very challenging. On the one hand, due to policy limitations or high collection costs, it is often impossible for independent researchers to observe or collect all network information. On the other hand, even if the entire network is accessible, estimating the SAR model using the quasi-maximum likelihood estimator (QMLE) could be computationally infeasible due to its high computational cost. To address these challenges, we propose here a subnetwork estimation method based on QMLE for the SAR model. By using appropriate sampling methods, a subnetwork, consisting of a much-reduced number of nodes, can be constructed. Subsequently, the standard QMLE can be computed by treating the sampled subnetwork as if it were the entire network. This leads to a significant reduction in information collection and model computation costs, which increases the practical feasibility of the effort. Theoretically, we show that the subnetwork-based QMLE is consistent and asymptotically normal under appropriate regularity conditions. Extensive simulation studies, based on both simulated and real network structures, are presented.

Submitted 8 June, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

arXiv:2305.08172 [pdf, other]

Fast Signal Region Detection with Application to Whole Genome Association Studies

Authors: Wei Zhang, Fan Wang, Fang Yao

Abstract: Research on the localization of the genetic basis associated with diseases or traits has been widely conducted in the last few decades. Scan methods have been developed for region-based analysis in whole-genome association studies, helping us better understand how genetics influences human diseases or traits, especially when the aggregated effects of multiple causal variants are present. In this paper, we propose a fast and effective algorithm coupled with a high-dimensional test for simultaneously detecting multiple signal regions, which is distinct from existing methods using scan or knockoff statistics. The idea is to conduct binary splitting with re-search and arrangement based on a sequence of dynamic critical values to increase detection accuracy and reduce computation. Theoretical and empirical studies demonstrate that our approach enjoys favorable theoretical guarantees with fewer restrictions and exhibits superior numerical performance with faster computation. Utilizing the UK Biobank data to identify the genetic regions related to breast cancer, we confirm previous findings and, meanwhile, identify a number of new regions which suggest strong association with risk of breast cancer and deserve further investigation.

Submitted 8 February, 2024; v1 submitted 14 May, 2023; originally announced May 2023.

arXiv:2305.05722 [pdf]

Enhancing Clinical Predictive Modeling through Model Complexity-Driven Class Proportion Tuning for Class Imbalanced Data: An Empirical Study on Opioid Overdose Prediction

Authors: Yinan Liu, Xinyu Dong, Weimin Lyu, Richard N. Rosenthal, Rachel Wong, Tengfei Ma, Fusheng Wang

Abstract: Class imbalance problems widely exist in the medical field and heavily deteriorate the performance of clinical predictive models. Most techniques to alleviate the problem rebalance class proportions, and they predominantly assume the rebalanced proportions should be a function of the original data and oblivious to the model one uses. This work challenges this prevailing assumption and proposes a framework that links the optimal class proportions to the model complexity, thereby tuning the class proportions per model. Our experiments on the opioid overdose prediction problem highlight the performance gain of tuning class proportions. Rigorous regression analysis also confirms the advantages of the theoretical framework proposed and the statistically significant correlation between the hyperparameters controlling the model complexity and the optimal class proportions.

Submitted 9 May, 2023; originally announced May 2023.

arXiv:2305.03555 [pdf, other]

Contrastive Graph Clustering in Curvature Spaces

Authors: Li Sun, Feiyang Wang, Junda Ye, Hao Peng, Philip S. Yu

Abstract: Graph clustering is a longstanding research topic, and has achieved remarkable success with the deep learning methods in recent years. Nevertheless, we observe that several important issues largely remain open. On the one hand, graph clustering from the geometric perspective is appealing but has rarely been touched before, as it lacks a promising space for geometric clustering. On the other hand, contrastive learning boosts the deep graph clustering but usually struggles in either graph augmentation or hard sample mining. To bridge this gap, we rethink the problem of graph clustering from a geometric perspective and, to the best of our knowledge, make the first attempt to introduce a heterogeneous curvature space to the graph clustering problem. Correspondingly, we present a novel end-to-end contrastive graph clustering model named CONGREGATE, addressing geometric graph clustering with Ricci curvatures. To support geometric clustering, we construct a theoretically grounded Heterogeneous Curvature Space where deep representations are generated via the product of the proposed fully Riemannian graph convolutional nets. Thereafter, we train the graph clusters by an augmentation-free reweighted contrastive approach where we pay more attention to both hard negatives and hard positives in our curvature space. Empirical results on real-world graphs show that our model outperforms the state-of-the-art competitors.

Submitted 5 May, 2023; originally announced May 2023.

Comments: Accepted by IJCAI'23

arXiv:2304.06564 [pdf, other]

Statistical Analysis of Fixed Mini-Batch Gradient Descent Estimator

Authors: Haobo Qi, Feifei Wang, Hansheng Wang

Abstract: We study here a fixed mini-batch gradient descent (FMGD) algorithm to solve optimization problems with massive datasets. In FMGD, the whole sample is split into multiple non-overlapping partitions. Once the partitions are formed, they are then fixed throughout the rest of the algorithm. For convenience, we refer to the fixed partitions as fixed mini-batches. Then for each computation iteration, the gradients are sequentially calculated on each fixed mini-batch. Because the size of fixed mini-batches is typically much smaller than the whole sample size, it can be easily computed. This leads to much reduced computation cost for each computational iteration. It makes FMGD computationally efficient and practically more feasible. To demonstrate the theoretical properties of FMGD, we start with a linear regression model with a constant learning rate. We study its numerical convergence and statistical efficiency properties. We find that sufficiently small learning rates are necessarily required for both numerical convergence and statistical efficiency. Nevertheless, an extremely small learning rate might lead to painfully slow numerical convergence. To solve the problem, a diminishing learning rate scheduling strategy can be used. This leads to the FMGD estimator with faster numerical convergence and better statistical efficiency. Finally, the FMGD algorithms with random shuffling and a general loss function are also studied.

Submitted 13 April, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

arXiv:2304.06292 [pdf, ps, other]

Improved Naive Bayes with Mislabeled Data

Authors: Qianhan Zeng, Yingqiu Zhu, Xuening Zhu, Feifei Wang, Weichen Zhao, Shuning Sun, Meng Su, Hansheng Wang

Abstract: Labeling mistakes are frequently encountered in real-world applications. If not treated well, the labeling mistakes can deteriorate the classification performances of a model seriously. To address this issue, we propose an improved Naive Bayes method for text classification. It is analytically simple and free of subjective judgements on the correct and incorrect labels. By specifying the generating mechanism of incorrect labels, we optimize the corresponding log-likelihood function iteratively by using an EM algorithm. Our simulation and experiment results show that the improved Naive Bayes method greatly improves the performances of the Naive Bayes method with mislabeled data.

Submitted 13 April, 2023; originally announced April 2023.

arXiv:2304.05636 [pdf, other]

Testing Sufficiency for Transfer Learning

Authors: Ziqian Lin, Yuan Gao, Feifei Wang, Hansheng Wang

Abstract: Modern statistical analysis often encounters high dimensional models but with limited sample sizes. This makes the target data based statistical estimation very difficult. Then how to borrow information from another large sized source data for more accurate target model estimation becomes an interesting problem. This leads to the useful idea of transfer learning. Various estimation methods in this regard have been developed recently. In this work, we study transfer learning from a different perspective. Specifically, we consider here the problem of testing for transfer learning sufficiency. By transfer learning sufficiency (denoted as the null hypothesis), we mean that, with the help of the source data, the useful information contained in the feature vectors of the target data can be sufficiently extracted for predicting the interested target response. Therefore, the rejection of the null hypothesis implies that information useful for prediction remains in the feature vectors of the target data and thus calls for further exploration. To this end, we develop a novel testing procedure and a centralized and standardized test statistic, whose asymptotic null distribution is analytically derived. Simulation studies are presented to demonstrate the finite sample performance of the proposed method. A deep learning related real data example is presented for illustration purposes.

Submitted 12 April, 2023; originally announced April 2023.

arXiv:2302.02768 [pdf, other]

Network Autoregression for Incomplete Matrix-Valued Time Series

Authors: Xuening Zhu, Feifei Wang, Zeng Li, Yanyuan Ma

Abstract: We study the dynamics of matrix-valued time series with observed network structures by proposing a matrix network autoregression model with row and column networks of the subjects. We incorporate covariate information and a low rank intercept matrix. We allow incomplete observations in the matrices and the missing mechanism can be covariate dependent. To estimate the model, a two-step estimation procedure is proposed. The first step aims to estimate the network autoregression coefficients, and the second step aims to estimate the regression parameters, which are matrices themselves. Theoretically, we first separately establish the asymptotic properties of the autoregression coefficients and the error bounds of the regression parameters. Subsequently, a bias reduction procedure is proposed to reduce the asymptotic bias and the theoretical property of the debiased estimator is studied. Lastly, we illustrate the usefulness of the proposed method through a number of numerical studies and an analysis of a Yelp data set.

Submitted 6 February, 2023; originally announced February 2023.

arXiv:2302.00107 [pdf, ps, other]

Distributed sequential federated learning

Authors: Z. F. Wang, X. Y. Zhang, Y-c I. Chang

Abstract: The analysis of data stored in multiple sites has become more popular, raising new concerns about the security of data storage and communication. Federated learning, which does not require centralizing data, is a common approach to preventing heavy data transportation, securing valued data, and protecting personal information. Therefore, determining how to aggregate the information obtained from the analysis of data in separate local sites has become an important statistical issue. The commonly used averaging methods may not be suitable due to data nonhomogeneity and incomparable results among individual sites, and applying them may result in the loss of information obtained from the individual analyses. Using a sequential method in federated learning with distributed computing can facilitate the integration and accelerate the analysis process. We develop a data-driven method for efficiently and effectively aggregating valued information by analyzing local data without encountering potential issues such as information security and heavy transportation due to data communication. In addition, the proposed method can preserve the properties of classical sequential adaptive design, such as data-driven sample size and estimation precision when applied to generalized linear models. We use numerical studies of simulated data and an application to COVID-19 data collected from 32 hospitals in Mexico to illustrate the proposed method.

Submitted 31 January, 2023; originally announced February 2023.

Comments: 22 pages

MSC Class: 62L10; 62L12

arXiv:2301.03747 [pdf, other]

Semiparametric Regression for Spatial Data via Deep Learning

Authors: Kexuan Li, Jun Zhu, Anthony R. Ives, Volker C. Radeloff, Fangfang Wang

Abstract: In this work, we propose a deep learning-based method to perform semiparametric regression analysis for spatially dependent data. To be specific, we use a sparsely connected deep neural network with rectified linear unit (ReLU) activation function to estimate the unknown regression function that describes the relationship between response and covariates in the presence of spatial dependence. Under some mild conditions, the estimator is proven to be consistent, and the rate of convergence is determined by three factors: (1) the architecture of neural network class, (2) the smoothness and (intrinsic) dimension of true mean function, and (3) the magnitude of spatial dependence. Our method can handle large data sets well owing to the stochastic gradient descent optimization algorithm. Simulation studies on synthetic data are conducted to assess the finite sample performance, the results of which indicate that the proposed method is capable of picking up the intricate relationship between response and covariates. Finally, a real data analysis is provided to demonstrate the validity and effectiveness of the proposed method.

Submitted 16 December, 2023; v1 submitted 9 January, 2023; originally announced January 2023.

arXiv:2211.16473 [pdf]

doi 10.1177/0962280220909969

Semiparametric integrative interaction analysis for non-small-cell lung cancer

Authors: Yang Li, Fan Wang, Rong Li, Yifan Sun

Abstract: In genomic analysis, it is important yet challenging to identify markers associated with cancer outcomes or phenotypes. Based on the biological mechanisms of cancers and the characteristics of the datasets, this paper proposes a novel integrative interaction approach under the semiparametric model, in which the genetic factors and environmental factors are included as the parametric and nonparametric components, respectively. The goal of this approach is to identify the genetic factors and gene-gene interactions associated with cancer outcomes, and meanwhile, estimate the nonlinear effects of environmental factors. The proposed approach is based on the threshold gradient directed regularization (TGDR) technique. Simulation studies indicate that the proposed approach outperforms the alternative methods in the identification of main effects and interactions, and has favorable estimation and prediction accuracy. An analysis of non-small-cell lung carcinomas (NSCLC) datasets from The Cancer Genome Atlas (TCGA) is conducted, showing that the proposed approach can identify markers with important implications and has favorable performance in prediction accuracy, identification stability, and computation cost.

Submitted 28 November, 2022; originally announced November 2022.

Comments: 16 pages, 4 figures

Journal ref: Statistical Methods in Medical Research, 29: 2865- 2880, 2020

arXiv:2208.14123 [pdf, other]

Catalytic Priors: Using Synthetic Data to Specify Prior Distributions in Bayesian Analysis

Authors: Dongming Huang, Feicheng Wang, Donald B. Rubin, S. C. Kou

Abstract: Catalytic prior distributions provide general, easy-to-use, and interpretable specifications of prior distributions for Bayesian analysis. They are particularly beneficial when the observed data are inadequate to stably estimate a complex target model. A catalytic prior distribution is constructed by augmenting the observed data with synthetic data that are sampled from the predictive distribution of a simpler model estimated from the observed data. We illustrate the usefulness of the catalytic prior approach using an example from labor economics. In the example, the resulting Bayesian inference reflects many important aspects of the observed data, and the estimation accuracy and predictive performance of the inference based on the catalytic prior are superior to, or comparable to, that of other commonly used prior distributions. We further explore the connection between the catalytic prior approach and a few popular regularization methods. We expect the catalytic prior approach to be useful in many applications.

Submitted 22 September, 2023; v1 submitted 30 August, 2022; originally announced August 2022.

arXiv:2207.05471 [pdf, other]

Uncertainty-Aware Learning Against Label Noise on Imbalanced Datasets

Authors: Yingsong Huang, Bing Bai, Shengwei Zhao, Kun Bai, Fei Wang

Abstract: Learning against label noise is a vital topic to guarantee a reliable performance for deep neural networks. Recent research usually refers to dynamic noise modeling with model output probabilities and loss values, and then separates clean and noisy samples. These methods have gained notable success. However, unlike cherry-picked data, existing approaches often cannot perform well when facing imbalanced datasets, a common scenario in the real world. We thoroughly investigate this phenomenon and point out two major issues that hinder the performance, i.e., inter-class loss distribution discrepancy and misleading predictions due to uncertainty. The first issue is that existing methods often perform class-agnostic noise modeling. However, loss distributions show a significant discrepancy among classes under class imbalance, and class-agnostic noise modeling can easily get confused with noisy samples and samples in minority classes. The second issue is that models may output misleading predictions due to epistemic uncertainty and aleatoric uncertainty; thus existing methods that rely solely on the output probabilities may fail to distinguish confident samples. Inspired by our observations, we propose an Uncertainty-aware Label Correction framework (ULC) to handle label noise on imbalanced datasets. First, we perform epistemic uncertainty-aware class-specific noise modeling to identify trustworthy clean samples and refine/discard highly confident true/corrupted labels. Then, we introduce aleatoric uncertainty in the subsequent learning process to prevent noise accumulation in the label noise modeling process. We conduct experiments on several synthetic and real-world datasets. The results demonstrate the effectiveness of the proposed method, especially on imbalanced datasets.

Submitted 12 July, 2022; originally announced July 2022.

arXiv:2206.09107 [pdf, other]

Tree-Guided Rare Feature Selection and Logic Aggregation with Electronic Health Records Data

Authors: Jianmin Chen, Robert H. Aseltine, Fei Wang, Kun Chen

Abstract: Statistical learning with a large number of rare binary features is commonly encountered in analyzing electronic health records (EHR) data, especially in the modeling of disease onset with prior medical diagnoses and procedures. Dealing with the resulting highly sparse and large-scale binary feature matrix is notoriously challenging as conventional methods may suffer from a lack of power in testing and inconsistency in model fitting while machine learning methods may suffer from the inability of producing interpretable results or clinically-meaningful risk factors. To improve EHR-based modeling and utilize the natural hierarchical structure of disease classification, we propose a tree-guided feature selection and logic aggregation approach for large-scale regression with rare binary features, in which dimension reduction is achieved through not only a sparsity pursuit but also an aggregation promoter with the logic operator of "or". We convert the combinatorial problem into a convex linearly-constrained regularized estimation, which enables scalable computation with theoretical guarantees. In a suicide risk study with EHR data, our approach is able to select and aggregate prior mental health diagnoses as guided by the diagnosis hierarchy of the International Classification of Diseases. By balancing the rarity and specificity of the EHR diagnosis records, our strategy improves both prediction and model interpretation. We identify important higher-level categories and subcategories of mental health conditions and simultaneously determine the level of specificity needed for each of them in predicting suicide risk.

Submitted 26 February, 2024; v1 submitted 17 June, 2022; originally announced June 2022.

arXiv:2206.08449 [pdf, ps, other]

Adaptive Algorithm for Quantum Amplitude Estimation

Authors: Yunpeng Zhao, Haiyan Wang, Kuai Xu, Yue Wang, Ji Zhu, Feng Wang

Abstract: Quantum amplitude estimation is a key sub-routine of a number of quantum algorithms with various applications. We propose an adaptive algorithm for interval estimation of amplitudes. The quantum part of the algorithm is based only on Grover's algorithm. The key ingredient is the introduction of an adjustment factor, which adjusts the amplitude of good states such that the amplitude after the adjustment, and the original amplitude, can be estimated without ambiguity in the subsequent step. We show with numerical studies that the proposed algorithm uses a similar number of quantum queries to achieve the same level of precision $ε$ compared to state-of-the-art algorithms, but the classical part, i.e., the non-quantum part, has substantially lower computational complexity. We rigorously prove that the number of oracle queries achieves $O(1/ε)$, i.e., a quadratic speedup over classical Monte Carlo sampling, and the computational complexity of the classical part achieves $O(\log(1/ε))$, both up to a double-logarithmic factor.

Submitted 16 June, 2022; originally announced June 2022.

arXiv:2204.01682 [pdf, other]

Deep Feature Screening: Feature Selection for Ultra High-Dimensional Data via Deep Neural Networks

Authors: Kexuan Li, Fangfang Wang, Lingli Yang, Ruiqi Liu

Abstract: The applications of traditional statistical feature selection methods to high-dimension, low sample-size data often struggle and encounter challenging problems, such as overfitting, curse of dimensionality, computational infeasibility, and strong model assumption. In this paper, we propose a novel two-step nonparametric approach called Deep Feature Screening (DeepFS) that can overcome these problems and identify significant features with high precision for ultra high-dimensional, low-sample-size data. This approach first extracts a low-dimensional representation of input data and then applies feature screening based on multivariate rank distance correlation recently developed by Deb and Sen (2021). This approach combines the strengths of both deep neural networks and feature screening, and thereby has the following appealing features in addition to its ability of handling ultra high-dimensional data with small number of samples: (1) it is model free and distribution free; (2) it can be used for both supervised and unsupervised feature selection; and (3) it is capable of recovering the original input data. The superiority of DeepFS is demonstrated via extensive simulation studies and real data analyses.

Submitted 16 December, 2023; v1 submitted 4 April, 2022; originally announced April 2022.

arXiv:2204.00750 [pdf, other]

Structural randomised selection

Authors: Fan Wang, Sylvia Richardson, Steven M. Hill

Abstract: An important problem in the analysis of high-dimensional omics data is to identify subsets of molecular variables that are associated with a phenotype of interest. This requires addressing the challenges of high dimensionality, strong multicollinearity and model uncertainty. We propose a new ensemble learning approach for improving the performance of sparse penalised regression methods, called STructural RANDomised Selection (STRANDS). The approach, that builds and improves upon the Random Lasso method, consists of two steps. In both steps, we reduce dimensionality by repeated subsampling of variables. We apply a penalised regression method to each subsampled dataset and average the results. In the first step, subsampling is informed by variable correlation structure, and in the second step, by variable importance measures from the first step. STRANDS can be used with any sparse penalised regression approach as the "base learner". Using synthetic data and real biological datasets, we demonstrate that STRANDS typically improves upon its base learner, and that taking account of the correlation structure in the first step can help to improve the efficiency with which the model space may be explored.

Submitted 1 April, 2022; originally announced April 2022.

arXiv:2203.11469 [pdf, other]

A new class of composite GBII regression models with varying threshold for modelling heavy-tailed data

Authors: Zhengxiao Li, Fei Wang, Zhengtang Zhao

Abstract: The four-parameter generalized beta distribution of the second kind (GBII) has been proposed for modelling insurance losses with heavy-tailed features. The aim of this paper is to present a parametric composite GBII regression model obtained by splicing two GBII distributions using the mode matching method. It is designed for simultaneous modelling of small and large claims and for capturing policyholder heterogeneity by introducing covariates into the location parameter. In such cases, the threshold that splits the two GBII distributions varies across individual policyholders based on their risk features. The proposed regression model also contains a wide range of insurance loss distributions as the head and the tail respectively and provides closed-form expressions for parameter estimation and model prediction. A simulation study is conducted to show the accuracy of the proposed estimation method and the flexibility of the regressions. Some illustrations of the applicability of the new class of distributions and regressions are provided with a Danish fire losses data set and a Chinese medical insurance claims data set, in comparison with the results of competing models from the literature.

Submitted 26 January, 2024; v1 submitted 22 March, 2022; originally announced March 2022.

arXiv:2203.11015 [pdf, other]

doi 10.1109/JBHI.2022.3193365

Filter Drug-induced Liver Injury Literature with Natural Language Processing and Ensemble Learning

Authors: Xianghao Zhan, Fanjin Wang, Olivier Gevaert

Abstract: Drug-induced liver injury (DILI) describes the adverse effects of drugs that damage liver. Life-threatening results including liver failure or death were also reported in severe DILI cases. Therefore, DILI-related events are strictly monitored for all approved drugs and the liver toxicity became important assessments for new drug candidates. These DILI-related reports are documented in hospital records, in clinical trial results, and also in research papers that contain preliminary in vitro and in vivo experiments. Conventionally, data extraction from previous publications relies heavily on resource-demanding manual labelling, which considerably decreased the efficiency of the information extraction process. The recent development of artificial intelligence, particularly, the rise of natural language processing (NLP) techniques, enabled the automatic processing of biomedical texts. In this study, based on around 28,000 papers (titles and abstracts) provided by the Critical Assessment of Massive Data Analysis (CAMDA) challenge, we benchmarked model performances on filtering out DILI literature. Among four word vectorization techniques, the model using term frequency-inverse document frequency (TF-IDF) and logistic regression outperformed others with an accuracy of 0.957 with our in-house test set. Furthermore, an ensemble model with similar overall performances was implemented and was fine-tuned to lower the false-negative cases to avoid neglecting potential DILI reports. The ensemble model achieved a high accuracy of 0.954 and an F1 score of 0.955 in the hold-out validation data provided by the CAMDA committee. Moreover, important words in positive/negative predictions were identified via model interpretation. Overall, the ensemble model reached satisfactory classification results, which can be further used by researchers to rapidly filter DILI-related literature.

Submitted 9 March, 2022; originally announced March 2022.

Comments: 8 pages, 4 figures

arXiv:2202.13829 [pdf, ps, other]

How and what to learn: The modes of machine learning

Authors: Sihan Feng, Yong Zhang, Fuming Wang, Hong Zhao

Abstract: Despite their great success, neural networks still remain black boxes due to the lack of interpretability. Here we propose a new analysis method, namely the weight pathway analysis (WPA), to make them transparent. We consider weights in pathways that link neurons longitudinally from input neurons to output neurons, or simply weight pathways, as the basic units for understanding a neural network, and decompose a neural network into a series of subnetworks of such weight pathways. A visualization scheme of the subnetworks is presented that gives longitudinal perspectives of the network like radiographs, making the internal structures of the network visible. Impacts of parameter adjustments or structural changes to the network can be visualized via such radiographs. Characteristic maps are established for subnetworks to characterize the enhancement or suppression of the influence of input samples on each output neuron. Using WPA, we discover that neural networks store and utilize information in a holographic way, that is, subnetworks encode all training samples in a coherent structure and thus only by investigating the weight pathways can one explore samples stored in the network. Furthermore, with WPA, we reveal fundamental learning modes of a neural network: the linear learning mode and the nonlinear learning mode. The former extracts linearly separable features while the latter extracts linearly inseparable features. The hidden-layer neurons self-organize into different classes for establishing learning modes and for reaching the training goal. The finding of learning modes provides us with the theoretical ground for understanding some of the fundamental problems of machine learning, such as the dynamics of the learning process, the role of linear and nonlinear neurons, as well as the role of network width and depth.

Submitted 8 August, 2022; v1 submitted 28 February, 2022; originally announced February 2022.

Comments: 16 pages, 10 figures

arXiv:2112.02792 [pdf, other]

Incentive Compatible Pareto Alignment for Multi-Source Large Graphs

Authors: Jian Liang, Fangrui Lv, Di Liu, Zehui Dai, Xu Tian, Shuang Li, Fei Wang, Han Li

Abstract: In this paper, we focus on learning effective entity matching models over multi-source large-scale data. For real applications, we relax typical assumptions that data distributions/spaces, or entity identities are shared between sources, and propose a Relaxed Multi-source Large-scale Entity-matching (RMLE) problem. Challenges of the problem include 1) how to align large-scale entities between sources to share information and 2) how to mitigate negative transfer from joint learning multi-source data. What's worse, one practical issue is the entanglement between both challenges. Specifically, incorrect alignments may increase negative transfer; while mitigating negative transfer for one source may result in poorly learned representations for other sources and then decrease alignment accuracy. To handle the entangled challenges, we point out that the key is to optimize information sharing first based on Pareto front optimization, by showing that information sharing significantly influences the Pareto front which depicts lower bounds of negative transfer. Consequently, we proposed an Incentive Compatible Pareto Alignment (ICPA) method to first optimize cross-source alignments based on Pareto front optimization, then mitigate negative transfer constrained on the optimized alignments. This mechanism renders each source can learn based on its true preference without worrying about deteriorating representations of other sources. Specifically, the Pareto front optimization encourages minimizing lower bounds of negative transfer, which optimizes whether and which to align. Comprehensive empirical evaluation results on four large-scale datasets are provided to demonstrate the effectiveness and superiority of ICPA. Online A/B test results at a search advertising platform also demonstrate the effectiveness of ICPA in production environments.

Submitted 6 December, 2021; originally announced December 2021.

arXiv:2111.15086 [pdf, other]

Scalable Semiparametric Spatio-temporal Regression for Large Data Analysis

Authors: Ting Fung Ma, Fangfang Wang, Jun Zhu, Anthony R. Ives, Katarzyna E. Lewińska

Abstract: With the rapid advances of data acquisition techniques, spatio-temporal data are becoming increasingly abundant in a diverse array of disciplines. Here we develop spatio-temporal regression methodology for analyzing large amounts of spatially referenced data collected over time, motivated by environmental studies utilizing remotely sensed satellite data. In particular, we specify a semiparametric autoregressive model without the usual Gaussian assumption and devise a computationally scalable procedure that enables the regression analysis of large datasets. We estimate the model parameters by quasi maximum likelihood and show that the computational complexity can be reduced from cubic to linear of the sample size. Asymptotic properties under suitable regularity conditions are further established that inform the computational procedure to be efficient and scalable. A simulation study is conducted to evaluate the finite-sample properties of the parameter estimation and statistical inference. We illustrate our methodology by a dataset with over 2.96 million observations of annual land surface temperature and the comparison with an existing state-of-the-art approach highlights the advantages of our method.

Submitted 29 November, 2021; originally announced November 2021.

arXiv:2111.10846 [pdf, other]

Jointly Dynamic Topic Model for Recognition of Lead-lag Relationship in Two Text Corpora

Authors: Yandi Zhu, Xiaoling Lu, Jingya Hong, Feifei Wang

Abstract: Topic evolution modeling has received significant attentions in recent decades. Although various topic evolution models have been proposed, most studies focus on the single document corpus. However in practice, we can easily access data from multiple sources and also observe relationships between them. Then it is of great interest to recognize the relationship between multiple text corpora and further utilize this relationship to improve topic modeling. In this work, we focus on a special type of relationship between two text corpora, which we define as the "lead-lag relationship". This relationship characterizes the phenomenon that one text corpus would influence the topics to be discussed in the other text corpus in the future. To discover the lead-lag relationship, we propose a jointly dynamic topic model and also develop an embedding extension to address the modeling problem of large-scale text corpus. With the recognized lead-lag relationship, the similarities of the two text corpora can be figured out and the quality of topic learning in both corpora can be improved. We numerically investigate the performance of the jointly dynamic topic modeling approach using synthetic data. Finally, we apply the proposed model on two text corpora consisting of statistical papers and the graduation theses. Results show the proposed model can well recognize the lead-lag relationship between the two corpora, and the specific and shared topic patterns in the two corpora are also discovered.

Submitted 21 November, 2021; originally announced November 2021.

arXiv:2110.14298 [pdf, other]

Denoising and change point localisation in piecewise-constant high-dimensional regression coefficients

Authors: Fan Wang, Oscar Hernan Madrid Padilla, Yi Yu, Alessandro Rinaldo

Abstract: We study the theoretical properties of the fused lasso procedure originally proposed by Tibshirani et al. (2005) in the context of a linear regression model in which the regression coefficients are totally ordered and assumed to be sparse and piecewise constant. Despite its popularity, to the best of our knowledge, estimation error bounds in high-dimensional settings have only been obtained for the simple case in which the design matrix is the identity matrix. We formulate a novel restricted isometry condition on the design matrix that is tailored to the fused lasso estimator and derive estimation bounds for both the constrained version of the fused lasso assuming dense coefficients and for its penalised version. We observe that the estimation error can be dominated by either the lasso or the fused lasso rate, depending on whether the number of non-zero coefficients is larger than the number of piecewise-constant segments. Finally, we devise a post-processing procedure to recover the piecewise-constant pattern of the coefficients. Extensive numerical experiments support our theoretical findings.

Submitted 18 February, 2022; v1 submitted 27 October, 2021; originally announced October 2021.

arXiv:2109.10399 [pdf, other]

SubseasonalClimateUSA: A Dataset for Subseasonal Forecasting and Benchmarking

Authors: Soukayna Mouatadid, Paulo Orenstein, Genevieve Flaspohler, Miruna Oprescu, Judah Cohen, Franklyn Wang, Sean Knight, Maria Geogdzhayeva, Sam Levang, Ernest Fraenkel, Lester Mackey

Abstract: Subseasonal forecasting of the weather two to six weeks in advance is critical for resource allocation and advance disaster notice but poses many challenges for the forecasting community. At this forecast horizon, physics-based dynamical models have limited skill, and the targets for prediction depend in a complex manner on both local weather variables and global climate variables. Recently, machine learning methods have shown promise in advancing the state of the art but only at the cost of complex data curation, integrating expert knowledge with aggregation across multiple relevant data sources, file formats, and temporal and spatial resolutions. To streamline this process and accelerate future development, we introduce SubseasonalClimateUSA, a curated dataset for training and benchmarking subseasonal forecasting models in the United States. We use this dataset to benchmark a diverse suite of models, including operational dynamical models, classical meteorological baselines, and ten state-of-the-art machine learning and deep learning-based methods from the literature. Overall, our benchmarks suggest simple and effective ways to extend the accuracy of current operational models. SubseasonalClimateUSA is regularly updated and accessible via the https://github.com/microsoft/subseasonal_data/ Python package.

Submitted 16 January, 2024; v1 submitted 21 September, 2021; originally announced September 2021.

arXiv:2109.09856 [pdf]

SFFDD: Deep Neural Network with Enriched Features for Failure Prediction with Its Application to Computer Disk Driver

Authors: Lanfa Frank Wang, Danjue Li

Abstract: A classification technique incorporating a novel feature derivation method is proposed for predicting failure of a system or device with multivariate time series sensor data. We treat the multivariate time series sensor data as images for both visualization and computation. Failure follows various patterns which are closely related to the root causes. Different predefined transformations are applied to the original sensor data to better characterize the failure patterns. In addition to feature derivation, an ensemble method is used to further improve the performance. Furthermore, a general deep neural network architecture is proposed to handle multiple types of data with less manual feature engineering. We apply the proposed method to the early prediction of computer disk drive failure in order to improve storage system availability and avoid data loss. The classification accuracy is largely improved with the enriched features, named smart features.

Submitted 20 September, 2021; originally announced September 2021.

Comments: 11 pages, 20 figures

arXiv:2108.07928 [pdf, ps, other]

Implicit Profiling Estimation for Semiparametric Models with Bundled Parameters

Authors: Yucong Lin, Jinhua Su, Yang Liu, Jue Hou, Feifei Wang

Abstract: Solving semiparametric models can be computationally challenging because the dimension of parameter space may grow large with increasing sample size. Classical Newton's method becomes quite slow and unstable with intensive calculation of the large Hessian matrix and its inverse. Iterative methods that separately update parameters for the finite dimensional component and the infinite dimensional component have been developed to speed up a single iteration, but they often take more steps until convergence or even sometimes sacrifice estimation precision due to a sub-optimal update direction. We propose a computationally efficient implicit profiling algorithm that simultaneously achieves the fast iteration step of iterative methods and the optimal update direction of Newton's method by profiling out the infinite dimensional component as a function of the finite dimensional component. We devise a first order approximation when the profiling function has no explicit analytical form. We show that our implicit profiling method always solves any local quadratic programming problem in two steps. In two numerical experiments under semiparametric transformation models and GARCH-M models, we demonstrate the computational efficiency and statistical precision of our implicit profiling method.

Submitted 17 August, 2021; originally announced August 2021.

arXiv:2106.07875 [pdf, other]

doi 10.1145/3447548.3467274

S-LIME: Stabilized-LIME for Model Explanation

Authors: Zhengze Zhou, Giles Hooker, Fei Wang

Abstract: An increasing number of machine learning models have been deployed in domains with high stakes such as finance and healthcare. Despite their superior performances, many models are black boxes in nature which are hard to explain. There are growing efforts for researchers to develop methods to interpret these black-box models. Post hoc explanations based on perturbations, such as LIME, are widely used approaches to interpret a machine learning model after it has been built. This class of methods has been shown to exhibit large instability, posing serious challenges to the effectiveness of the method itself and harming user trust. In this paper, we propose S-LIME, which utilizes a hypothesis testing framework based on central limit theorem for determining the number of perturbation points needed to guarantee stability of the resulting explanation. Experiments on both simulated and real world data sets are provided to demonstrate the effectiveness of our method.

Submitted 15 June, 2021; originally announced June 2021.

Comments: In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14--18, 2021, Virtual Event, Singapore

arXiv:2106.03591 [pdf, other]

Calibrating multi-dimensional complex ODE from noisy data via deep neural networks

Authors: Kexuan Li, Fangfang Wang, Ruiqi Liu, Fan Yang, Zuofeng Shang

Abstract: Ordinary differential equations (ODEs) are widely used to model complex dynamics that arise in biology, chemistry, engineering, finance, physics, etc. Calibration of a complicated ODE system using noisy data is generally very difficult. In this work, we propose a two-stage nonparametric approach to address this problem. We first extract the de-noised data and their higher order derivatives using the boundary kernel method, and then feed them into a sparsely connected deep neural network with ReLU activation function. Our method is able to recover the ODE system without being subject to the curse of dimensionality and complicated ODE structure. When the ODE possesses a general modular structure, with each modular component involving only a few input variables, and the network architecture is properly chosen, our method is proven to be consistent. Theoretical properties are corroborated by an extensive simulation study that demonstrates the validity and effectiveness of the proposed method. Finally, we use our method to simultaneously characterize the growth rate of Covid-19 infection cases from 50 states of the USA.

Submitted 18 September, 2023; v1 submitted 7 June, 2021; originally announced June 2021.

arXiv:2105.09670 [pdf, other]

Ensemble machine learning approach for screening of coronary heart disease based on echocardiography and risk factors

Authors: Jingyi Zhang, Huolan Zhu, Yongkai Chen, Chenguang Yang, Huimin Cheng, Yi Li, Wenxuan Zhong, Fang Wang

Abstract: Background: Extensive clinical evidence suggests that a preventive screening of coronary heart disease (CHD) at an earlier stage can greatly reduce the mortality rate. We use 64 two-dimensional speckle tracking echocardiography (2D-STE) features and seven clinical features to predict whether one has CHD. Methods: We develop a machine learning approach that integrates a number of popular classification methods together by model stacking, and generalize the traditional stacking method to a two-step stacking method to improve the diagnostic performance. Results: By borrowing strengths from multiple classification models through the proposed method, we improve the CHD classification accuracy from around 70% to 87.7% on the testing set. The sensitivity of the proposed method is 0.903 and the specificity is 0.843, with an AUC of 0.904, which is significantly higher than those of the individual classification models. Conclusions: Our work lays a foundation for the deployment of speckle tracking echocardiography-based screening tools for coronary heart disease.

Submitted 20 May, 2021; originally announced May 2021.

Comments: 30 pages, 5 figures, 5 tables

arXiv:2010.05430 [pdf, other]

doi 10.1007/s10618-018-0564-z

Robust Finite Mixture Regression for Heterogeneous Targets

Authors: Jian Liang, Kun Chen, Ming Lin, Changshui Zhang, Fei Wang

Abstract: Finite Mixture Regression (FMR) refers to the mixture modeling scheme which learns multiple regression models from the training data set. Each of them is in charge of a subset. FMR is an effective scheme for handling sample heterogeneity, where a single regression model is not enough for capturing the complexities of the conditional distribution of the observed samples given the features. In this paper, we propose an FMR model that 1) finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously, 2) achieves shared feature selection among tasks and cluster components, and 3) detects anomaly tasks or clustered structure among tasks, and accommodates outlier samples. We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework. The proposed model is evaluated on both synthetic and real-world data sets. The results show that our model can achieve state-of-the-art performance.

Submitted 11 October, 2020; originally announced October 2020.

Journal ref: Data Mining and Knowledge Discovery, volume 32, pages 1509 to 1560, year 2018

arXiv:2010.05250 [pdf, other]

Domain Agnostic Learning for Unbiased Authentication

Authors: Jian Liang, Yuren Cao, Shuang Li, Bing Bai, Hao Li, Fei Wang, Kun Bai

Abstract: Authentication is the task of confirming the matching relationship between a data instance and a given identity. Typical examples of authentication problems include face recognition and person re-identification. Data-driven authentication could be affected by undesired biases, i.e., the models are often trained in one domain (e.g., for people wearing spring outfits) while applied in other domains (e.g., they change the clothes to summer outfits). Previous works have made efforts to eliminate domain-difference. They typically assume domain annotations are provided, and all the domains share classes. However, for authentication, there could be a large number of domains shared by different identities/classes, and it is impossible to annotate these domains exhaustively. It could make domain-difference challenging to model and eliminate. In this paper, we propose a domain-agnostic method that eliminates domain-difference without domain labels. We alternately perform latent domain discovery and domain-difference elimination until our model no longer detects domain-difference. In our approach, the latent domains are discovered by learning the heterogeneous predictive relationships between inputs and outputs. Then domain-difference is eliminated in both class-dependent and class-independent spaces to improve robustness of elimination. We further extend our method to a meta-learning framework to pursue more thorough domain-difference elimination. Comprehensive empirical evaluation results are provided to demonstrate the effectiveness and superiority of our proposed method.

Submitted 23 November, 2020; v1 submitted 11 October, 2020; originally announced October 2020.

arXiv:2010.04589 [pdf]

Identifying Risk of Opioid Use Disorder for Patients Taking Opioid Medications with Deep Learning

Authors: Xinyu Dong, Jianyuan Deng, Sina Rashidian, Kayley Abell-Hart, Wei Hou, Richard N Rosenthal, Mary Saltz, Joel Saltz, Fusheng Wang

Abstract: The United States is experiencing an opioid epidemic, and there were more than 10 million opioid misusers aged 12 or older each year. Identifying patients at high risk of Opioid Use Disorder (OUD) can help to make early clinical interventions to reduce the risk of OUD. Our goal is to predict OUD patients among opioid prescription users through analyzing electronic health records with machine learning and deep learning methods. This will help us to better understand the diagnoses of OUD, providing new insights on the opioid epidemic. Electronic health records of patients who have been prescribed medications containing active opioid ingredients were extracted from the Cerner Health Facts database between January 1, 2008 and December 31, 2017. Long Short-Term Memory (LSTM) models were applied to predict future opioid use disorder risk based on the five most recent encounters, and compared to Logistic Regression, Random Forest, Decision Tree and Dense Neural Network. Prediction performance was assessed using F1 score, precision, recall, and AUROC. Our temporal deep learning model provided promising prediction results which outperformed other methods, with an F1 score of 0.8023 and AUROC of 0.9369. The model can identify OUD related medications and vital signs as important features for the prediction. The LSTM based temporal deep learning model is effective at predicting opioid use disorder using a patient's past history of electronic health records, with minimal domain knowledge. It has potential to improve clinical decision support for early intervention and prevention to combat the opioid epidemic.

Submitted 9 October, 2020; originally announced October 2020.

Comments: 20 pages, 6 figures

arXiv:2010.03757 [pdf, other]

AICov: An Integrative Deep Learning Framework for COVID-19 Forecasting with Population Covariates

Authors: Geoffrey C. Fox, Gregor von Laszewski, Fugang Wang, Saumyadipta Pyne

Abstract: The COVID-19 pandemic has had profound global consequences on health, economic, social, political, and almost every major aspect of human life. Therefore, it is of great importance to model COVID-19 and other pandemics in terms of the broader social contexts in which they take place. We present the architecture of AICov, which provides an integrative deep learning framework for COVID-19 forecasting with population covariates, some of which may serve as putative risk factors. We have integrated multiple different strategies into AICov, including the ability to use deep learning strategies based on LSTM and even modeling. To demonstrate our approach, we have conducted a pilot that integrates population covariates from multiple sources. Thus, AICov not only includes data on COVID-19 cases and deaths but, more importantly, the population's socioeconomic, health and behavioral risk factors at a local level. The compiled data are fed into AICov, and thus we obtain improved prediction by integrating the data into our model, as compared to one that only uses case and death data.

Submitted 8 October, 2020; originally announced October 2020.

Comments: 25 pages, 4 tables, 19 figures

Showing 1–50 of 117 results for author: Wang, F