Showing 1–50 of 117 results for author: Wang, F

Searching in archive stat.
  1. arXiv:2407.00882  [pdf, other]

    stat.ME

    Subgroup Identification with Latent Factor Structure

    Authors: Yong He, Dong Liu, Fuxin Wang, Mingjuan Zhang, Wen-Xin Zhou

    Abstract: Subgroup analysis has attracted growing attention due to its ability to identify meaningful subgroups from a heterogeneous population and thereby improve predictive power. However, in many scenarios such as social science and biology, the covariates are possibly highly correlated due to the existence of common factors, which brings great challenges for group identification and is neglected in th…

    Submitted 30 June, 2024; originally announced July 2024.

  2. arXiv:2406.15762  [pdf, other]

    cs.LG stat.ML

    Rethinking the Diffusion Models for Numerical Tabular Data Imputation from the Perspective of Wasserstein Gradient Flow

    Authors: Zhichao Chen, Haoxuan Li, Fangyikang Wang, Odin Zhang, Hu Xu, Xiaoyu Jiang, Zhihuan Song, Eric H. Wang

    Abstract: Diffusion models (DMs) have gained attention in Missing Data Imputation (MDI), but there remain two long-neglected issues to be addressed: (1) Inaccurate Imputation, which arises from the inherently sample-diversification-pursuing generative process of DMs; (2) Difficult Training, which stems from the intricate design required for the mask matrix in the model training stage. To address these concerns within…

    Submitted 22 June, 2024; originally announced June 2024.

  3. arXiv:2406.00701  [pdf, other]

    math.ST stat.ME

    Profiled Transfer Learning for High Dimensional Linear Model

    Authors: Ziqian Lin, Junlong Zhao, Fang Wang, Hansheng Wang

    Abstract: We develop here a novel transfer learning methodology called Profiled Transfer Learning (PTL). The method is based on the \textit{approximate-linear} assumption between the source and target parameters. Compared with the commonly assumed \textit{vanishing-difference} assumption and \textit{low-rank} assumption in the literature, the \textit{approximate-linear} assumption is more flexible and less…

    Submitted 5 June, 2024; v1 submitted 2 June, 2024; originally announced June 2024.

  4. arXiv:2405.16413  [pdf, other]

    cs.AI cs.CL cs.LG stat.AP

    Augmented Risk Prediction for the Onset of Alzheimer's Disease from Electronic Health Records with Large Language Models

    Authors: Jiankun Wang, Sumyeong Ahn, Taykhoom Dalal, Xiaodan Zhang, Weishen Pan, Qiannan Zhang, Bin Chen, Hiroko H. Dodge, Fei Wang, Jiayu Zhou

    Abstract: Alzheimer's disease (AD) is the fifth-leading cause of death among Americans aged 65 and older. Screening and early detection of AD and related dementias (ADRD) are critical for timely intervention and for identifying clinical trial participants. The widespread adoption of electronic health records (EHRs) offers an important resource for developing ADRD screening tools such as machine learning bas…

    Submitted 25 May, 2024; originally announced May 2024.

  5. arXiv:2405.14848  [pdf, other]

    stat.ML cs.LG

    Local Causal Discovery for Structural Evidence of Direct Discrimination

    Authors: Jacqueline Maasch, Kyra Gan, Violet Chen, Agni Orfanoudaki, Nil-Jana Akpinar, Fei Wang

    Abstract: Fairness is a critical objective in policy design and algorithmic decision-making. Identifying the causal pathways of unfairness requires knowledge of the underlying structural causal model, which may be incomplete or unavailable. This limits the practicality of causal fairness analysis in complex or low-knowledge domains. To mitigate this practicality gap, we advocate for developing efficient cau…

    Submitted 23 May, 2024; originally announced May 2024.

  6. arXiv:2405.10329   

    stat.AP cs.AI

    Causal inference approach to appraise long-term effects of maintenance policy on functional performance of asphalt pavements

    Authors: Lingyun You, Nanning Guo, Zhengwu Long, Fusong Wang, Chundi Si, Aboelkasim Diab

    Abstract: Asphalt pavements, as the most prevalent transportation infrastructure, are prone to serious traffic safety problems due to functional or structural damage caused by stresses or strains imposed through repeated traffic loads and continuous climatic cycles. The good quality or high serviceability of infrastructure networks is vital to the urbanization and industrial development of nations. In order…

    Submitted 2 July, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

    Comments: The arXiv version needs to be withdrawn because the model must be validated and updated with advanced machine learning technologies to improve its accuracy, and because the arXiv version contains some crucial errors in symbol definitions

  7. arXiv:2403.11163  [pdf, ps, other]

    stat.ME cs.LG math.ST stat.CO

    A Selective Review on Statistical Methods for Massive Data Computation: Distributed Computing, Subsampling, and Minibatch Techniques

    Authors: Xuetong Li, Yuan Gao, Hong Chang, Danyang Huang, Yingying Ma, Rui Pan, Haobo Qi, Feifei Wang, Shuyuan Wu, Ke Xu, Jing Zhou, Xuening Zhu, Yingqiu Zhu, Hansheng Wang

    Abstract: This paper presents a selective review of statistical computation methods for massive data analysis. A large number of statistical methods for massive data computation have been rapidly developed in the past decades. In this work, we focus on three categories of statistical computation methods: (1) distributed computing, (2) subsampling methods, and (3) minibatch gradient techniques. The first clas…

    Submitted 17 March, 2024; originally announced March 2024.

  8. arXiv:2403.07185  [pdf, other]

    cs.LG stat.ML

    Uncertainty in Graph Neural Networks: A Survey

    Authors: Fangxin Wang, Yuqing Liu, Kay Liu, Yibo Wang, Sourav Medya, Philip S. Yu

    Abstract: Graph Neural Networks (GNNs) have been extensively used in various real-world applications. However, the predictive uncertainty of GNNs stemming from diverse sources such as inherent randomness in data and model training errors can lead to unstable and erroneous predictions. Therefore, identifying, quantifying, and utilizing uncertainty are essential to enhance the performance of the model for the…

    Submitted 11 March, 2024; originally announced March 2024.

    Comments: 13 main pages, 3 figures, 1 table. Under review

  9. arXiv:2402.09970  [pdf, other]

    cs.LG stat.ML

    Accelerating Parallel Sampling of Diffusion Models

    Authors: Zhiwei Tang, Jiasheng Tang, Hao Luo, Fan Wang, Tsung-Hui Chang

    Abstract: Diffusion models have emerged as state-of-the-art generative models for image generation. However, sampling from diffusion models is usually time-consuming due to the inherent autoregressive nature of their sampling process. In this work, we propose a novel approach that accelerates the sampling of diffusion models by parallelizing the autoregressive process. Specifically, we reformulate the sampl…

    Submitted 27 May, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

    Comments: ICML 2024

  10. arXiv:2312.04281  [pdf, other]

    stat.ML cs.LG

    Factor-Assisted Federated Learning for Personalized Optimization with Heterogeneous Data

    Authors: Feifei Wang, Huiyun Tang, Yang Li

    Abstract: Federated learning is an emerging distributed machine learning framework aiming at protecting data privacy. Data heterogeneity is one of the core challenges in federated learning, which could severely degrade the convergence rate and prediction performance of deep neural networks. To address this issue, we develop a novel personalized federated learning framework for heterogeneous data, which we r…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: 29 pages, 10 figures

  11. arXiv:2311.07906  [pdf, other]

    stat.ME

    Mixture Conditional Regression with Ultrahigh Dimensional Text Data for Estimating Extralegal Factor Effects

    Authors: Jiaxin Shi, Fang Wang, Yuan Gao, Xiaojun Song, Hansheng Wang

    Abstract: Testing judicial impartiality is a problem of fundamental importance in empirical legal studies, for which standard regression methods have been popularly used to estimate the extralegal factor effects. However, those methods cannot handle control variables with ultrahigh dimensionality, such as found in judgment documents recorded in text format. To solve this problem, we develop a novel mixture…

    Submitted 13 November, 2023; originally announced November 2023.

  12. arXiv:2310.17816  [pdf, other]

    stat.ML cs.LG stat.ME

    Local Discovery by Partitioning: Polynomial-Time Causal Discovery Around Exposure-Outcome Pairs

    Authors: Jacqueline Maasch, Weishen Pan, Shantanu Gupta, Volodymyr Kuleshov, Kyra Gan, Fei Wang

    Abstract: Causal discovery is crucial for causal inference in observational studies, as it can enable the identification of valid adjustment sets (VAS) for unbiased effect estimation. However, global causal discovery is notoriously hard in the nonparametric setting, with exponential time and sample complexity in the worst case. To address this, we propose local discovery by partitioning (LDP): a local causa…

    Submitted 1 June, 2024; v1 submitted 25 October, 2023; originally announced October 2023.

    Journal ref: Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence (2024)

  13. arXiv:2310.17760  [pdf, other]

    stat.ME eess.SP

    Novel Models for Multiple Dependent Heteroskedastic Time Series

    Authors: Fangyijie Wang, Michael Salter-Townshend

    Abstract: Functional magnetic resonance imaging or functional MRI (fMRI) is a very popular tool used for differing brain regions by measuring brain activity. It is affected by physiological noise, such as head and brain movement in the scanner from breathing, heart beats, or the subject fidgeting. The purpose of this paper is to propose a novel approach to handling fMRI data for infants with high volatility…

    Submitted 26 October, 2023; originally announced October 2023.

    Comments: 18 pages

  14. arXiv:2310.05646  [pdf, other]

    stat.ME math.ST

    Transfer learning for piecewise-constant mean estimation: Optimality, $\ell_1$- and $\ell_0$-penalisation

    Authors: Fan Wang, Yi Yu

    Abstract: We study transfer learning for estimating piecewise-constant signals when source data, which may be relevant but disparate, are available in addition to the target data. We first investigate transfer learning estimators that respectively employ $\ell_1$- and $\ell_0$-penalties for unisource data scenarios and then generalise these estimators to accommodate multisources. To further reduce estimatio…

    Submitted 29 October, 2023; v1 submitted 9 October, 2023; originally announced October 2023.

  15. arXiv:2310.05019  [pdf, other]

    cs.LG stat.ML

    Compressed online Sinkhorn

    Authors: Fengpei Wang, Clarice Poon, Tony Shardlow

    Abstract: The use of optimal transport (OT) distances, and in particular entropic-regularised OT distances, is an increasingly popular evaluation metric in many areas of machine learning and data science. Their use has largely been driven by the availability of efficient algorithms such as the Sinkhorn algorithm. One of the drawbacks of the Sinkhorn algorithm for large-scale data processing is that it is a…

    Submitted 8 October, 2023; originally announced October 2023.

  16. arXiv:2306.15286  [pdf, other]

    stat.ME

    Multilayer random dot product graphs: Estimation and online change point detection

    Authors: Fan Wang, Wanshan Li, Oscar Hernan Madrid Padilla, Yi Yu, Alessandro Rinaldo

    Abstract: We study the multilayer random dot product graph (MRDPG) model, an extension of the random dot product graph to multilayer networks. To estimate the edge probabilities, we deploy a tensor-based methodology and demonstrate its superiority over existing approaches. Moving to dynamic MRDPGs, we formulate and analyse an online change point detection framework. At every time point, we observe a realiza…

    Submitted 10 June, 2024; v1 submitted 27 June, 2023; originally announced June 2023.

  17. arXiv:2306.04093  [pdf, other]

    stat.CO

    Subnetwork Estimation for Spatial Autoregressive Models in Large-scale Networks

    Authors: Xuetong Li, Feifei Wang, Wei Lan, Hansheng Wang

    Abstract: Large-scale networks are commonly encountered in practice (e.g., Facebook and Twitter) by researchers. In order to study the network interaction between different nodes of large-scale networks, the spatial autoregressive (SAR) model has been popularly employed. Despite its popularity, the estimation of a SAR model on large-scale networks remains very challenging. On the one hand, due to policy lim…

    Submitted 8 June, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

  18. arXiv:2305.08172  [pdf, other]

    stat.ME

    Fast Signal Region Detection with Application to Whole Genome Association Studies

    Authors: Wei Zhang, Fan Wang, Fang Yao

    Abstract: Research on the localization of the genetic basis associated with diseases or traits has been widely conducted in the last few decades. Scan methods have been developed for region-based analysis in whole-genome association studies, helping us better understand how genetics influences human diseases or traits, especially when the aggregated effects of multiple causal variants are present. In this…

    Submitted 8 February, 2024; v1 submitted 14 May, 2023; originally announced May 2023.

  19. arXiv:2305.05722  [pdf]

    cs.LG stat.AP

    Enhancing Clinical Predictive Modeling through Model Complexity-Driven Class Proportion Tuning for Class Imbalanced Data: An Empirical Study on Opioid Overdose Prediction

    Authors: Yinan Liu, Xinyu Dong, Weimin Lyu, Richard N. Rosenthal, Rachel Wong, Tengfei Ma, Fusheng Wang

    Abstract: Class imbalance problems widely exist in the medical field and heavily deteriorate the performance of clinical predictive models. Most techniques to alleviate the problem rebalance class proportions, and they predominantly assume the rebalanced proportions should be a function of the original data and oblivious to the model one uses. This work challenges this prevailing assumption and proposes that li…

    Submitted 9 May, 2023; originally announced May 2023.

  20. arXiv:2305.03555  [pdf, other]

    cs.LG stat.ML

    Contrastive Graph Clustering in Curvature Spaces

    Authors: Li Sun, Feiyang Wang, Junda Ye, Hao Peng, Philip S. Yu

    Abstract: Graph clustering is a longstanding research topic, and has achieved remarkable success with the deep learning methods in recent years. Nevertheless, we observe that several important issues largely remain open. On the one hand, graph clustering from the geometric perspective is appealing but has rarely been touched before, as it lacks a promising space for geometric clustering. On the other hand,…

    Submitted 5 May, 2023; originally announced May 2023.

    Comments: Accepted by IJCAI'23

  21. arXiv:2304.06564  [pdf, other]

    stat.CO

    Statistical Analysis of Fixed Mini-Batch Gradient Descent Estimator

    Authors: Haobo Qi, Feifei Wang, Hansheng Wang

    Abstract: We study here a fixed mini-batch gradient descent (FMGD) algorithm to solve optimization problems with massive datasets. In FMGD, the whole sample is split into multiple non-overlapping partitions. Once the partitions are formed, they are then fixed throughout the rest of the algorithm. For convenience, we refer to the fixed partitions as fixed mini-batches. Then for each computation iteration, the…

    Submitted 13 April, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

  22. arXiv:2304.06292  [pdf, ps, other]

    cs.LG stat.AP stat.ME

    Improved Naive Bayes with Mislabeled Data

    Authors: Qianhan Zeng, Yingqiu Zhu, Xuening Zhu, Feifei Wang, Weichen Zhao, Shuning Sun, Meng Su, Hansheng Wang

    Abstract: Labeling mistakes are frequently encountered in real-world applications. If not treated well, the labeling mistakes can seriously deteriorate the classification performance of a model. To address this issue, we propose an improved Naive Bayes method for text classification. It is analytically simple and free of subjective judgements on the correct and incorrect labels. By specifying the generatin…

    Submitted 13 April, 2023; originally announced April 2023.

  23. arXiv:2304.05636  [pdf, other]

    stat.ME

    Testing Sufficiency for Transfer Learning

    Authors: Ziqian Lin, Yuan Gao, Feifei Wang, Hansheng Wang

    Abstract: Modern statistical analysis often encounters high dimensional models but with limited sample sizes. This makes statistical estimation based on the target data alone very difficult. How to borrow information from another, larger source dataset for more accurate target model estimation then becomes an interesting problem. This leads to the useful idea of transfer learning. Various estimation methods in this…

    Submitted 12 April, 2023; originally announced April 2023.

  24. arXiv:2302.02768  [pdf, other]

    stat.ME

    Network Autoregression for Incomplete Matrix-Valued Time Series

    Authors: Xuening Zhu, Feifei Wang, Zeng Li, Yanyuan Ma

    Abstract: We study the dynamics of matrix-valued time series with observed network structures by proposing a matrix network autoregression model with row and column networks of the subjects. We incorporate covariate information and a low rank intercept matrix. We allow incomplete observations in the matrices and the missing mechanism can be covariate dependent. To estimate the model, a two-step estimation p…

    Submitted 6 February, 2023; originally announced February 2023.

  25. arXiv:2302.00107  [pdf, ps, other]

    stat.ML cs.LG stat.ME

    Distributed sequential federated learning

    Authors: Z. F. Wang, X. Y. Zhang, Y-c I. Chang

    Abstract: The analysis of data stored in multiple sites has become more popular, raising new concerns about the security of data storage and communication. Federated learning, which does not require centralizing data, is a common approach to preventing heavy data transportation, securing valued data, and protecting personal information. Therefore, determining how to aggregate the information obta…

    Submitted 31 January, 2023; originally announced February 2023.

    Comments: 22 pages

    MSC Class: 62L10; 62L12

  26. arXiv:2301.03747  [pdf, other]

    stat.ML cs.LG stat.ME

    Semiparametric Regression for Spatial Data via Deep Learning

    Authors: Kexuan Li, Jun Zhu, Anthony R. Ives, Volker C. Radeloff, Fangfang Wang

    Abstract: In this work, we propose a deep learning-based method to perform semiparametric regression analysis for spatially dependent data. To be specific, we use a sparsely connected deep neural network with rectified linear unit (ReLU) activation function to estimate the unknown regression function that describes the relationship between response and covariates in the presence of spatial dependence. Under…

    Submitted 16 December, 2023; v1 submitted 9 January, 2023; originally announced January 2023.

  27. arXiv:2211.16473  [pdf]

    stat.ME q-bio.GN stat.AP

    Semiparametric integrative interaction analysis for non-small-cell lung cancer

    Authors: Yang Li, Fan Wang, Rong Li, Yifan Sun

    Abstract: In genomic analysis, it is important yet challenging to identify markers associated with cancer outcomes or phenotypes. Based on the biological mechanisms of cancers and the characteristics of the datasets, this paper proposes a novel integrative interaction approach under the semiparametric model, in which the genetic factors and environmental factors are included as the parametric an…

    Submitted 28 November, 2022; originally announced November 2022.

    Comments: 16 pages, 4 figures

    Journal ref: Statistical Methods in Medical Research, 29: 2865-2880, 2020

  28. arXiv:2208.14123  [pdf, other]

    stat.ME

    Catalytic Priors: Using Synthetic Data to Specify Prior Distributions in Bayesian Analysis

    Authors: Dongming Huang, Feicheng Wang, Donald B. Rubin, S. C. Kou

    Abstract: Catalytic prior distributions provide general, easy-to-use, and interpretable specifications of prior distributions for Bayesian analysis. They are particularly beneficial when the observed data are inadequate to stably estimate a complex target model. A catalytic prior distribution is constructed by augmenting the observed data with synthetic data that are sampled from the predictive distribution…

    Submitted 22 September, 2023; v1 submitted 30 August, 2022; originally announced August 2022.

  29. arXiv:2207.05471  [pdf, other]

    stat.ML cs.LG

    Uncertainty-Aware Learning Against Label Noise on Imbalanced Datasets

    Authors: Yingsong Huang, Bing Bai, Shengwei Zhao, Kun Bai, Fei Wang

    Abstract: Learning against label noise is a vital topic to guarantee a reliable performance for deep neural networks. Recent research usually refers to dynamic noise modeling with model output probabilities and loss values, and then separates clean and noisy samples. These methods have gained notable success. However, unlike cherry-picked data, existing approaches often cannot perform well when facing imbal…

    Submitted 12 July, 2022; originally announced July 2022.

  30. arXiv:2206.09107  [pdf, other]

    cs.LG stat.AP stat.ME stat.ML

    Tree-Guided Rare Feature Selection and Logic Aggregation with Electronic Health Records Data

    Authors: Jianmin Chen, Robert H. Aseltine, Fei Wang, Kun Chen

    Abstract: Statistical learning with a large number of rare binary features is commonly encountered in analyzing electronic health records (EHR) data, especially in the modeling of disease onset with prior medical diagnoses and procedures. Dealing with the resulting highly sparse and large-scale binary feature matrix is notoriously challenging as conventional methods may suffer from a lack of power in testin…

    Submitted 26 February, 2024; v1 submitted 17 June, 2022; originally announced June 2022.

  31. arXiv:2206.08449  [pdf, ps, other]

    quant-ph stat.ME

    Adaptive Algorithm for Quantum Amplitude Estimation

    Authors: Yunpeng Zhao, Haiyan Wang, Kuai Xu, Yue Wang, Ji Zhu, Feng Wang

    Abstract: Quantum amplitude estimation is a key sub-routine of a number of quantum algorithms with various applications. We propose an adaptive algorithm for interval estimation of amplitudes. The quantum part of the algorithm is based only on Grover's algorithm. The key ingredient is the introduction of an adjustment factor, which adjusts the amplitude of good states such that the amplitude after the adjus…

    Submitted 16 June, 2022; originally announced June 2022.

  32. arXiv:2204.01682  [pdf, other]

    stat.ML cs.LG

    Deep Feature Screening: Feature Selection for Ultra High-Dimensional Data via Deep Neural Networks

    Authors: Kexuan Li, Fangfang Wang, Lingli Yang, Ruiqi Liu

    Abstract: Traditional statistical feature selection methods often struggle when applied to high-dimension, low sample-size data and encounter challenging problems, such as overfitting, curse of dimensionality, computational infeasibility, and strong model assumption. In this paper, we propose a novel two-step nonparametric approach called Deep Feature Screening (DeepFS) that can overcome these proble…

    Submitted 16 December, 2023; v1 submitted 4 April, 2022; originally announced April 2022.

  33. arXiv:2204.00750  [pdf, other]

    stat.ME

    Structural randomised selection

    Authors: Fan Wang, Sylvia Richardson, Steven M. Hill

    Abstract: An important problem in the analysis of high-dimensional omics data is to identify subsets of molecular variables that are associated with a phenotype of interest. This requires addressing the challenges of high dimensionality, strong multicollinearity and model uncertainty. We propose a new ensemble learning approach for improving the performance of sparse penalised regression methods, called STr…

    Submitted 1 April, 2022; originally announced April 2022.

  34. arXiv:2203.11469  [pdf, other]

    stat.ME

    A new class of composite GBII regression models with varying threshold for modelling heavy-tailed data

    Authors: Zhengxiao Li, Fei Wang, Zhengtang Zhao

    Abstract: The four-parameter generalized beta distribution of the second kind (GBII) has been proposed for modelling insurance losses with heavy-tailed features. The aim of this paper is to present a parametric composite GBII regression model obtained by splicing two GBII distributions using the mode matching method. It is designed for simultaneous modeling of small and large claims and capturing the policyholder he…

    Submitted 26 January, 2024; v1 submitted 22 March, 2022; originally announced March 2022.

  35. arXiv:2203.11015  [pdf, other]

    cs.IR cs.LG stat.AP

    Filter Drug-induced Liver Injury Literature with Natural Language Processing and Ensemble Learning

    Authors: Xianghao Zhan, Fanjin Wang, Olivier Gevaert

    Abstract: Drug-induced liver injury (DILI) describes the adverse effects of drugs that damage the liver. Life-threatening results including liver failure or death were also reported in severe DILI cases. Therefore, DILI-related events are strictly monitored for all approved drugs, and liver toxicity has become an important assessment for new drug candidates. These DILI-related reports are documented in hospital re…

    Submitted 9 March, 2022; originally announced March 2022.

    Comments: 8 pages, 4 figures

  36. arXiv:2202.13829  [pdf, ps, other]

    cs.LG cond-mat.dis-nn physics.data-an stat.ML

    How and what to learn:The modes of machine learning

    Authors: Sihan Feng, Yong Zhang, Fuming Wang, Hong Zhao

    Abstract: Despite their great success, neural networks still remain black boxes due to the lack of interpretability. Here we propose a new analysis method, namely the weight pathway analysis (WPA), to make them transparent. We consider weights in pathways that link neurons longitudinally from input neurons to output neurons, or simply weight pathways, as the basic units for understanding a neural networ…

    Submitted 8 August, 2022; v1 submitted 28 February, 2022; originally announced February 2022.

    Comments: 16 pages, 10 figures

  37. arXiv:2112.02792  [pdf, other]

    stat.ML cs.GT cs.LG

    Incentive Compatible Pareto Alignment for Multi-Source Large Graphs

    Authors: Jian Liang, Fangrui Lv, Di Liu, Zehui Dai, Xu Tian, Shuang Li, Fei Wang, Han Li

    Abstract: In this paper, we focus on learning effective entity matching models over multi-source large-scale data. For real applications, we relax typical assumptions that data distributions/spaces, or entity identities are shared between sources, and propose a Relaxed Multi-source Large-scale Entity-matching (RMLE) problem. Challenges of the problem include 1) how to align large-scale entities between sour…

    Submitted 6 December, 2021; originally announced December 2021.

  38. arXiv:2111.15086  [pdf, other]

    stat.ME

    Scalable Semiparametric Spatio-temporal Regression for Large Data Analysis

    Authors: Ting Fung Ma, Fangfang Wang, Jun Zhu, Anthony R. Ives, Katarzyna E. Lewińska

    Abstract: With the rapid advances of data acquisition techniques, spatio-temporal data are becoming increasingly abundant in a diverse array of disciplines. Here we develop spatio-temporal regression methodology for analyzing large amounts of spatially referenced data collected over time, motivated by environmental studies utilizing remotely sensed satellite data. In particular, we specify a semiparametric…

    Submitted 29 November, 2021; originally announced November 2021.

  39. arXiv:2111.10846  [pdf, other]

    cs.CL stat.ME stat.ML

    Jointly Dynamic Topic Model for Recognition of Lead-lag Relationship in Two Text Corpora

    Authors: Yandi Zhu, Xiaoling Lu, Jingya Hong, Feifei Wang

    Abstract: Topic evolution modeling has received significant attention in recent decades. Although various topic evolution models have been proposed, most studies focus on a single document corpus. However, in practice, we can easily access data from multiple sources and also observe relationships between them. It is then of great interest to recognize the relationship between multiple text corpora and fur…

    Submitted 21 November, 2021; originally announced November 2021.

  40. arXiv:2110.14298  [pdf, other]

    math.ST stat.ML

    Denoising and change point localisation in piecewise-constant high-dimensional regression coefficients

    Authors: Fan Wang, Oscar Hernan Madrid Padilla, Yi Yu, Alessandro Rinaldo

    Abstract: We study the theoretical properties of the fused lasso procedure originally proposed by \cite{tibshirani2005sparsity} in the context of a linear regression model in which the regression coefficients are totally ordered and assumed to be sparse and piecewise constant. Despite its popularity, to the best of our knowledge, estimation error bounds in high-dimensional settings have only been obtained fo…

    Submitted 18 February, 2022; v1 submitted 27 October, 2021; originally announced October 2021.

  41. arXiv:2109.10399  [pdf, other]

    physics.ao-ph cs.LG stat.ML

    SubseasonalClimateUSA: A Dataset for Subseasonal Forecasting and Benchmarking

    Authors: Soukayna Mouatadid, Paulo Orenstein, Genevieve Flaspohler, Miruna Oprescu, Judah Cohen, Franklyn Wang, Sean Knight, Maria Geogdzhayeva, Sam Levang, Ernest Fraenkel, Lester Mackey

    Abstract: Subseasonal forecasting of the weather two to six weeks in advance is critical for resource allocation and advance disaster notice but poses many challenges for the forecasting community. At this forecast horizon, physics-based dynamical models have limited skill, and the targets for prediction depend in a complex manner on both local weather variables and global climate variables. Recently, machi…

    Submitted 16 January, 2024; v1 submitted 21 September, 2021; originally announced September 2021.

  42. arXiv:2109.09856  [pdf]

    cs.LG stat.ML

    SFFDD: Deep Neural Network with Enriched Features for Failure Prediction with Its Application to Computer Disk Driver

    Authors: Lanfa Frank Wang, Danjue Li

    Abstract: A classification technique incorporating a novel feature derivation method is proposed for predicting failure of a system or device with multivariate time series sensor data. We treat the multivariate time series sensor data as images for both visualization and computation. Failure follows various patterns which are closely related to the root causes. Different predefined transformations are appli…

    Submitted 20 September, 2021; originally announced September 2021.

    Comments: 11 pages, 20 figures

  43. arXiv:2108.07928  [pdf, ps, other]

    stat.CO

    Implicit Profiling Estimation for Semiparametric Models with Bundled Parameters

    Authors: Yucong Lin, Jinhua Su, Yang Liu, Jue Hou, Feifei Wang

    Abstract: Solving semiparametric models can be computationally challenging because the dimension of the parameter space may grow large with increasing sample size. The classical Newton's method becomes quite slow and unstable with intensive calculation of the large Hessian matrix and its inverse. Iterative methods separately update parameters for finite dimensional component and infinite dimensional component have…

    Submitted 17 August, 2021; originally announced August 2021.

  44. S-LIME: Stabilized-LIME for Model Explanation

    Authors: Zhengze Zhou, Giles Hooker, Fei Wang

    Abstract: An increasing number of machine learning models have been deployed in domains with high stakes such as finance and healthcare. Despite their superior performances, many models are black boxes in nature which are hard to explain. There are growing efforts for researchers to develop methods to interpret these black-box models. Post hoc explanations based on perturbations, such as LIME, are widely us…

    Submitted 15 June, 2021; originally announced June 2021.

    Comments: In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14--18, 2021, Virtual Event, Singapore

  45. arXiv:2106.03591  [pdf, other

    stat.ML cs.LG stat.ME

    Calibrating multi-dimensional complex ODE from noisy data via deep neural networks

    Authors: Kexuan Li, Fangfang Wang, Ruiqi Liu, Fan Yang, Zuofeng Shang

    Abstract: Ordinary differential equations (ODEs) are widely used to model complex dynamics that arises in biology, chemistry, engineering, finance, physics, etc. Calibration of a complicated ODE system using noisy data is generally very difficult. In this work, we propose a two-stage nonparametric approach to address this problem. We first extract the de-noised data and their higher order derivatives using…

    Submitted 18 September, 2023; v1 submitted 7 June, 2021; originally announced June 2021.

  46. arXiv:2105.09670  [pdf, other

    stat.ML cs.LG

    Ensemble machine learning approach for screening of coronary heart disease based on echocardiography and risk factors

    Authors: Jingyi Zhang, Huolan Zhu, Yongkai Chen, Chenguang Yang, Huimin Cheng, Yi Li, Wenxuan Zhong, Fang Wang

    Abstract: Background: Extensive clinical evidence suggests that a preventive screening of coronary heart disease (CHD) at an earlier stage can greatly reduce the mortality rate. We use 64 two-dimensional speckle tracking echocardiography (2D-STE) features and seven clinical features to predict whether one has CHD. Methods: We develop a machine learning approach that integrates a number of popular classifica…

    Submitted 20 May, 2021; originally announced May 2021.

    Comments: 30 pages, 5 figures, 5 tables

  47. Robust Finite Mixture Regression for Heterogeneous Targets

    Authors: Jian Liang, Kun Chen, Ming Lin, Changshui Zhang, Fei Wang

    Abstract: Finite Mixture Regression (FMR) refers to the mixture modeling scheme which learns multiple regression models from the training data set. Each of them is in charge of a subset. FMR is an effective scheme for handling sample heterogeneity, where a single regression model is not enough for capturing the complexities of the conditional distribution of the observed samples given the features. In this…

    Submitted 11 October, 2020; originally announced October 2020.

    Journal ref: Data Mining and Knowledge Discovery, volume 32, pages 1509 to 1560, year 2018

  48. arXiv:2010.05250  [pdf, other

    stat.ML cs.CV cs.LG

    Domain Agnostic Learning for Unbiased Authentication

    Authors: Jian Liang, Yuren Cao, Shuang Li, Bing Bai, Hao Li, Fei Wang, Kun Bai

    Abstract: Authentication is the task of confirming the matching relationship between a data instance and a given identity. Typical examples of authentication problems include face recognition and person re-identification. Data-driven authentication could be affected by undesired biases, i.e., the models are often trained in one domain (e.g., for people wearing spring outfits) while applied in other domains…

    Submitted 23 November, 2020; v1 submitted 11 October, 2020; originally announced October 2020.

  49. arXiv:2010.04589  [pdf

    cs.LG cs.CY stat.ML

    Identifying Risk of Opioid Use Disorder for Patients Taking Opioid Medications with Deep Learning

    Authors: Xinyu Dong, Jianyuan Deng, Sina Rashidian, Kayley Abell-Hart, Wei Hou, Richard N Rosenthal, Mary Saltz, Joel Saltz, Fusheng Wang

    Abstract: The United States is experiencing an opioid epidemic, and there were more than 10 million opioid misusers aged 12 or older each year. Identifying patients at high risk of Opioid Use Disorder (OUD) can help to make early clinical interventions to reduce the risk of OUD. Our goal is to predict OUD patients among opioid prescription users through analyzing electronic health records with machine learn…

    Submitted 9 October, 2020; originally announced October 2020.

    Comments: 20 pages, 6 figures

  50. arXiv:2010.03757  [pdf, other

    cs.LG stat.ML

    AICov: An Integrative Deep Learning Framework for COVID-19 Forecasting with Population Covariates

    Authors: Geoffrey C. Fox, Gregor von Laszewski, Fugang Wang, Saumyadipta Pyne

    Abstract: The COVID-19 pandemic has profound global consequences on health, economic, social, political, and almost every major aspect of human life. Therefore, it is of great importance to model COVID-19 and other pandemics in terms of the broader social contexts in which they take place. We present the architecture of AICov, which provides an integrative deep learning framework for COVID-19 forecasting wi…

    Submitted 8 October, 2020; originally announced October 2020.

    Comments: 25 pages, 4 tables, 19 figures