Search | arXiv e-print repository

Representation Transfer Learning for Semiparametric Regression

Authors: Baihua He, Huihang Liu, Xinyu Zhang, Jian Huang

Abstract: We propose a transfer learning method that utilizes data representations in a semiparametric regression model. Our aim is to perform statistical inference on the parameter of primary interest in the target model while accounting for potential nonlinear effects of confounding variables. We leverage knowledge from source domains, assuming that the sample size of the source data is substantially larger than that of the target data. This knowledge transfer is carried out by the sharing of data representations, predicated on the idea that there exists a set of latent representations transferable from the source to the target domain. We address model heterogeneity between the source and target domains by incorporating domain-specific parameters in their respective models. We establish sufficient conditions for the identifiability of the models and demonstrate that the estimator for the primary parameter in the target model is both consistent and asymptotically normal. These results lay the theoretical groundwork for making statistical inferences about the main effects. Our simulation studies highlight the benefits of our method, and we further illustrate its practical applications using real-world data.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 42 pages, 11 figures, 5 tables

MSC Class: 62F99

arXiv:2406.13060 [pdf, other]

Scale-Translation Equivariant Network for Oceanic Internal Solitary Wave Localization

Authors: Zhang Wan, Shuo Wang, Xudong Zhang

Abstract: Internal solitary waves (ISWs) are gravity waves that are often observed in the interior ocean rather than the surface. They hold significant importance due to their capacity to carry substantial energy, thus influencing pollutant transport, oil platform operations, submarine navigation, etc. Researchers have studied ISWs through optical images, synthetic aperture radar (SAR) images, and altimeter data from remote sensing instruments. However, cloud cover in optical remote sensing images variably obscures ground information, leading to blurred or missing surface observations. As such, this paper aims at altimeter-based machine learning solutions to automatically locate ISWs. The challenges, however, lie in the following two aspects: 1) the altimeter data has low resolution, which requires a strong machine learner; 2) labeling data is extremely labor-intensive, leading to very limited data for training. In recent years, the grand progress of deep learning demonstrates strong learning capacity given abundant data. Besides, more recent studies on efficient learning and self-supervised learning laid solid foundations to tackle the aforementioned challenges. In this paper, we propose to inject prior knowledge to achieve a strong and efficient learner. Specifically, intrinsic patterns in altimetry data are efficiently captured using a scale-translation equivariant convolutional neural network (ST-ECNN). By considering inherent symmetries in neural network design, ST-ECNN achieves higher efficiency and better performance than baseline models. Furthermore, we also introduce prior knowledge from massive unsupervised data to enhance our solution using the SimCLR framework for pre-training. Our final solution achieves an overall better performance than baselines on our handcrafted altimetry dataset. Data and codes are available at https://github.com/ZhangWan-byte/Internal_Solitary_Wave_Localization .

Submitted 18 June, 2024; originally announced June 2024.

Comments: 29 pages, 5 figures

arXiv:2406.11043 [pdf, other]

Statistical Considerations for Evaluating Treatment Effect under Various Non-proportional Hazard Scenarios

Authors: Xinyu Zhang, Erich J. Greene, Ondrej Blaha, Wei Wei

Abstract: We conducted a systematic comparison of statistical methods used for the analysis of time-to-event outcomes under various proportional and nonproportional hazard (NPH) scenarios. Our study used data from recently published oncology trials to compare the Log-rank test, still by far the most widely used option, against some available alternatives, including the MaxCombo test, the Restricted Mean Survival Time Difference (dRMST) test, the Generalized Gamma Model (GGM) and the Generalized F Model (GFM). Power, type I error rate, and time-dependent bias with respect to the RMST difference, survival probability difference, and median survival time were used to evaluate and compare the performance of these methods. In addition to the real data, we simulated three hypothetical scenarios with crossing hazards chosen so that the early and late effects 'cancel out' and used them to evaluate the ability of the aforementioned methods to detect time-specific and overall treatment effects. We implemented novel metrics for assessing the time-dependent bias in treatment effect estimates to provide a more comprehensive evaluation in NPH scenarios. Recommendations under each NPH scenario are provided by examining the type I error rate, power, and time-dependent bias associated with each statistical approach.
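For readers unfamiliar with the RMST difference named in this abstract, the quantity can be illustrated as the area under a Kaplan-Meier curve up to a truncation time. The sketch below is a minimal illustration only, not the authors' code: the function names `km_rmst` and `rmst_difference` and the truncation argument `tau` are hypothetical, and no variance estimate or hypothesis test is included.

```python
def km_rmst(times, events, tau):
    """Restricted mean survival time: area under the Kaplan-Meier curve on [0, tau].

    times  : follow-up times (event or censoring time for each subject)
    events : 1 if the event was observed, 0 if the subject was censored
    tau    : truncation time (an assumption of this sketch, chosen by the analyst)
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv, area, prev_t = 1.0, 0.0, 0.0
    i = 0
    while i < len(data) and data[i][0] <= tau:
        t = data[i][0]
        deaths = removed = 0
        # Group subjects that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        # The survival curve is flat between distinct times: add a rectangle.
        area += surv * (t - prev_t)
        if deaths:
            surv *= 1.0 - deaths / at_risk  # Kaplan-Meier product-limit step
        at_risk -= removed
        prev_t = t
    # Final rectangle from the last observed time up to tau.
    area += surv * (tau - prev_t)
    return area


def rmst_difference(arm_a, arm_b, tau):
    """Difference in RMST between two arms, each given as (times, events)."""
    return km_rmst(*arm_a, tau) - km_rmst(*arm_b, tau)
```

With no censoring, `km_rmst` reduces to the sample mean of min(T, tau), which gives a quick sanity check on small inputs.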

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.09665 [pdf, other]

New algorithms for sampling and diffusion models

Authors: Xicheng Zhang

Abstract: Drawing from the theory of stochastic differential equations, we introduce a novel sampling method for known distributions and a new algorithm for diffusion generative models with unknown distributions. Our approach is inspired by the concept of the reverse diffusion process, widely adopted in diffusion generative models. Additionally, we derive the explicit convergence rate based on the smooth ODE flow. For diffusion generative models and sampling, we establish a dimension-free particle approximation convergence result. Numerical experiments demonstrate the effectiveness of our method. Notably, unlike the traditional Langevin method, our sampling method does not require any regularity assumptions about the density function of the target distribution. Furthermore, we also apply our method to optimization problems.

Submitted 13 June, 2024; originally announced June 2024.

Comments: 24 pages

MSC Class: 60H10

arXiv:2406.08180 [pdf, other]

Stochastic Process-based Method for Degree-Degree Correlation of Evolving Networks

Authors: Yue Xiao, Xiaojun Zhang

Abstract: Existing studies on the degree correlation of evolving networks typically rely on differential equations and statistical analysis, resulting in only approximate solutions due to inherent randomness. To address this limitation, we propose an improved Markov chain method for modeling degree correlation in evolving networks. By redesigning the network evolution rules to reflect actual network dynamics more accurately, we achieve a topological structure that closely matches real-world network evolution. Our method models the degree correlation evolution process for both directed and undirected networks and provides theoretical results that are verified through simulations. This work offers the first theoretical solution for the steady-state degree correlation in evolving network models and is applicable to more complex evolution mechanisms and networks with directional attributes. Additionally, it supports the study of dynamic characteristic control based on network structure at any given time, offering a new tool for researchers in the field.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2405.19649 [pdf, ps, other]

doi 10.1145/3589334.3645663

Towards Deeper Understanding of PPR-based Embedding Approaches: A Topological Perspective

Authors: Xingyi Zhang, Zixuan Weng, Sibo Wang

Abstract: Node embedding learns low-dimensional vectors for nodes in the graph. Recent state-of-the-art embedding approaches take Personalized PageRank (PPR) as the proximity measure and factorize the PPR matrix or its adaptation to generate embeddings. However, little previous work analyzes what information is encoded by these approaches, and how the information correlates with their superb performance in downstream tasks. In this work, we first show that state-of-the-art embedding approaches that factorize a PPR-related matrix can be unified into a closed-form framework. Then, we study whether the embeddings generated by this strategy can be inverted to better recover the graph topology information than random-walk based embeddings. To achieve this, we propose two methods for recovering graph topology via PPR-based embeddings, including the analytical method and the optimization method. Extensive experimental results demonstrate that the embeddings generated by factorizing a PPR-related matrix maintain more topological information, such as common edges and community structures, than those generated by random walks, paving a new way to systematically comprehend why PPR-based node embedding approaches outperform random walk-based alternatives in various downstream tasks. To the best of our knowledge, this is the first work that focuses on the interpretability of PPR-based node embedding approaches.
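As background for the PPR proximity measure mentioned in this abstract, one row of a PPR matrix can be computed by power iteration on a random walk with restart. The sketch below is a generic textbook-style illustration under stated assumptions, not the method of the paper: the function name `personalized_pagerank`, the adjacency-dict input, and the convention that `alpha` is the restart (teleport) probability are all choices made here for illustration.

```python
def personalized_pagerank(adj, source, alpha=0.15, tol=1e-12, max_iter=10_000):
    """Personalized PageRank scores with respect to `source`.

    adj    : dict mapping each node to a non-empty list of out-neighbours;
             every neighbour must itself be a key of `adj` (assumption of this sketch)
    alpha  : restart probability, i.e. the chance of jumping back to `source`
    Iterates pi <- alpha * e_source + (1 - alpha) * pi P until the L1 change < tol.
    """
    nodes = list(adj)
    pi = {v: 0.0 for v in nodes}
    pi[source] = 1.0
    for _ in range(max_iter):
        nxt = {v: 0.0 for v in nodes}
        nxt[source] = alpha  # restart mass
        for u in nodes:
            # Spread the remaining mass of u evenly over its out-neighbours.
            share = (1.0 - alpha) * pi[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        delta = sum(abs(nxt[v] - pi[v]) for v in nodes)
        pi = nxt
        if delta < tol:
            break
    return pi
```

Stacking these score vectors for every source node gives a dense PPR matrix, which is the kind of object the abstract describes as being factorized to produce embeddings.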

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.16577 [pdf, other]

Reflected Flow Matching

Authors: Tianyu Xie, Yu Zhu, Longlin Yu, Tong Yang, Ziheng Cheng, Shiyue Zhang, Xiangyu Zhang, Cheng Zhang

Abstract: Continuous normalizing flows (CNFs) learn an ordinary differential equation to transform prior samples into data. Flow matching (FM) has recently emerged as a simulation-free approach for training CNFs by regressing a velocity model towards the conditional velocity field. However, on constrained domains, the learned velocity model may lead to undesirable flows that result in highly unnatural samples, e.g., oversaturated images, due to both flow matching error and simulation error. To address this, we add a boundary constraint term to CNFs, which leads to reflected CNFs that keep trajectories within the constrained domains. We propose reflected flow matching (RFM) to train the velocity model in reflected CNFs by matching the conditional velocity fields in a simulation-free manner, similar to the vanilla FM. Moreover, the analytical form of conditional velocity fields in RFM avoids potentially biased approximations, making it superior to existing score-based generative models on constrained domains. We demonstrate that RFM achieves comparable or better results on standard image benchmarks and produces high-quality class-conditioned samples under high guidance weight.

Submitted 26 May, 2024; originally announced May 2024.

Comments: ICML 2024 camera-ready

arXiv:2405.16413 [pdf, other]

Augmented Risk Prediction for the Onset of Alzheimer's Disease from Electronic Health Records with Large Language Models

Authors: Jiankun Wang, Sumyeong Ahn, Taykhoom Dalal, Xiaodan Zhang, Weishen Pan, Qiannan Zhang, Bin Chen, Hiroko H. Dodge, Fei Wang, Jiayu Zhou

Abstract: Alzheimer's disease (AD) is the fifth-leading cause of death among Americans aged 65 and older. Screening and early detection of AD and related dementias (ADRD) are critical for timely intervention and for identifying clinical trial participants. The widespread adoption of electronic health records (EHRs) offers an important resource for developing ADRD screening tools such as machine learning based predictive models. Recent advancements in large language models (LLMs) demonstrate their unprecedented capability of encoding knowledge and performing reasoning, which offers them strong potential for enhancing risk prediction. This paper proposes a novel pipeline that augments risk prediction by leveraging the few-shot inference power of LLMs to make predictions on cases where traditional supervised learning methods (SLs) may not excel. Specifically, we develop a collaborative pipeline that combines SLs and LLMs via a confidence-driven decision-making mechanism, leveraging the strengths of SLs in clear-cut cases and LLMs in more complex scenarios. We evaluate this pipeline using a real-world EHR data warehouse from Oregon Health & Science University (OHSU) Hospital, encompassing EHRs from over 2.5 million patients and more than 20 million patient encounters. Our results show that our proposed approach effectively combines the power of SLs and LLMs, offering significant improvements in predictive performance. This advancement holds promise for revolutionizing ADRD screening and early detection practices, with potential implications for better strategies of patient management and thus improving healthcare.

Submitted 25 May, 2024; originally announced May 2024.

arXiv:2405.14892 [pdf, other]

Parallel Approximations for High-Dimensional Multivariate Normal Probability Computation in Confidence Region Detection Applications

Authors: Xiran Zhang, Sameh Abdulah, Jian Cao, Hatem Ltaief, Ying Sun, Marc G. Genton, David E. Keyes

Abstract: Addressing the statistical challenge of computing the multivariate normal (MVN) probability in high dimensions holds significant potential for enhancing various applications. One common way to compute high-dimensional MVN probabilities is the Separation-of-Variables (SOV) algorithm. This algorithm is known for its high computational complexity of O(n^3) and space complexity of O(n^2), mainly due to a Cholesky factorization operation for an n × n covariance matrix, where n represents the dimensionality of the MVN problem. This work proposes a high-performance computing framework that allows scaling the SOV algorithm and, subsequently, the confidence region detection algorithm. The framework leverages parallel linear algebra algorithms with a task-based programming model to achieve performance scalability in computing process probabilities, especially on large-scale systems. In addition, we enhance our implementation by incorporating Tile Low-Rank (TLR) approximation techniques to reduce algorithmic complexity without compromising the necessary accuracy. To evaluate the performance and accuracy of our framework, we conduct assessments using simulated data and a wind speed dataset. Our proposed implementation effectively handles high-dimensional multivariate normal (MVN) probability computations on shared and distributed-memory systems using finite precision arithmetics and TLR approximation computation. Performance results show a significant speedup of up to 20X in solving the MVN problem using TLR approximation compared to the reference dense solution without sacrificing the application's accuracy. The qualitative results on synthetic and real datasets demonstrate how we maintain high accuracy in detecting confidence regions even when relying on TLR approximation to perform the underlying linear algebra operations.
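To make the role of the Cholesky factor in the SOV approach concrete, here is a small serial sketch of a Monte Carlo separation-of-variables estimator for P(a ≤ X ≤ b) with X ~ N(0, Σ). It follows the commonly described sequential-conditioning scheme and is only an illustration: it is dense, single-threaded, pure Python, and does not reflect the parallel or Tile Low-Rank implementation the abstract describes. The names `cholesky` and `mvn_prob_sov`, the sample count, and the seed are assumptions made for this example.

```python
import math
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for its cdf and inverse cdf


def cholesky(sigma):
    """Dense lower-triangular Cholesky factor (O(n^3) time, O(n^2) space)."""
    n = len(sigma)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L


def mvn_prob_sov(a, b, sigma, n_samples=20_000, seed=0):
    """Monte Carlo SOV estimate of P(a <= X <= b) for X ~ N(0, sigma)."""
    rng = random.Random(seed)
    L = cholesky(sigma)
    n = len(a)
    total = 0.0
    for _ in range(n_samples):
        y = [0.0] * n
        weight = 1.0
        for i in range(n):
            # Condition on the coordinates already sampled.
            shift = sum(L[i][j] * y[j] for j in range(i))
            d = _STD.cdf((a[i] - shift) / L[i][i])
            e = _STD.cdf((b[i] - shift) / L[i][i])
            weight *= e - d
            if weight == 0.0:
                break
            # Sample y[i] from the standard normal truncated to the feasible interval.
            u = d + rng.random() * (e - d)
            u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf inside (0, 1)
            y[i] = _STD.inv_cdf(u)
        total += weight
    return total / n_samples
```

For example, `mvn_prob_sov([-math.inf, -math.inf], [0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]])` estimates the lower-left orthant probability, whose closed form for correlation 0.5 is 1/4 + arcsin(0.5)/(2π) = 1/3.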

Submitted 18 May, 2024; originally announced May 2024.

arXiv:2405.11431 [pdf, other]

Review of deep learning models for crypto price prediction: implementation and evaluation

Authors: Jingyang Wu, Xinyi Zhang, Fangyixuan Huang, Haochen Zhou, Rohitash Chandra

Abstract: There has been much interest in accurate cryptocurrency price forecast models by investors and researchers. Deep Learning models are prominent machine learning techniques that have transformed various fields and have shown potential for finance and economics. Although various deep learning models have been explored for cryptocurrency price forecasting, it is not clear which models are suitable due to high market volatility. In this study, we review the literature about deep learning for cryptocurrency price forecasting and evaluate novel deep learning models for cryptocurrency stock price prediction. Our deep learning models include variants of long short-term memory (LSTM) recurrent neural networks, variants of convolutional neural networks (CNNs), and the Transformer model. We evaluate univariate and multivariate approaches for multi-step-ahead prediction of cryptocurrency close prices. We also carry out volatility analysis on the four cryptocurrencies which reveals significant fluctuations in their prices throughout the COVID-19 pandemic. Additionally, we investigate the prediction accuracy of two scenarios identified by different training sets for the models. First, we use the pre-COVID-19 datasets to model cryptocurrency close-price forecasting during the early period of COVID-19. Secondly, we utilise data from the COVID-19 period to predict prices for 2023 to 2024. Our results show that the convolutional LSTM with a multivariate approach provides the best prediction accuracy in two major experimental settings. Our results also indicate that the multivariate deep learning models exhibit better performance in forecasting four different cryptocurrencies when compared to the univariate models.

Submitted 2 June, 2024; v1 submitted 18 May, 2024; originally announced May 2024.

arXiv:2405.10991 [pdf, other]

Relative Counterfactual Contrastive Learning for Mitigating Pretrained Stance Bias in Stance Detection

Authors: Jiarui Zhang, Shaojuan Wu, Xiaowang Zhang, Zhiyong Feng

Abstract: Stance detection classifies stance relations (namely, Favor, Against, or Neither) between comments and targets. Pretrained language models (PLMs) are widely used to mine the stance relation to improve the performance of stance detection through pretrained knowledge. However, PLMs also embed "bad" pretrained knowledge concerning stance into the extracted stance relation semantics, resulting in pretrained stance bias. It is not trivial to measure pretrained stance bias due to its weak quantifiability. In this paper, we propose Relative Counterfactual Contrastive Learning (RCCL), in which pretrained stance bias is mitigated as relative stance bias instead of absolute stance bias to overcome the difficulty of measuring bias. Firstly, we present a new structural causal model for characterizing complicated relationships among context, PLMs and stance relations to locate pretrained stance bias. Then, based on masked language model prediction, we present a target-aware relative stance sample generation method for obtaining relative bias. Finally, we use contrastive learning based on counterfactual theory to mitigate pretrained stance bias and preserve context stance relation. Experiments show that the proposed method is superior to stance detection and debiasing baselines.

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2405.08699 [pdf]

Weakly-supervised causal discovery based on fuzzy knowledge and complex data complementarity

Authors: Wenrui Li, Wei Zhang, Qinghao Zhang, Xuegong Zhang, Xiaowo Wang

Abstract: Causal discovery based on observational data is important for deciphering the causal mechanism behind complex systems. However, the effectiveness of existing causal discovery methods is limited due to inferior prior knowledge, domain inconsistencies, and the challenges of high-dimensional datasets with small sample sizes. To address this gap, we propose a novel weakly-supervised fuzzy knowledge and data co-driven causal discovery method named KEEL. KEEL adopts a fuzzy causal knowledge schema to encapsulate diverse types of fuzzy knowledge, and forms corresponding weakened constraints. This schema not only lessens the dependency on expertise but also allows various types of limited and error-prone fuzzy knowledge to guide causal discovery. It can enhance the generalization and robustness of causal discovery, especially in high-dimensional and small-sample scenarios. In addition, we integrate the extended linear causal model (ELCM) into KEEL for dealing with multi-distribution and incomplete data. Extensive experiments with different datasets demonstrate the superiority of KEEL over several state-of-the-art methods in accuracy, robustness and computational efficiency. For causal discovery in real protein signal transduction processes, KEEL outperforms the benchmark method with limited data. In summary, KEEL effectively tackles causal discovery tasks with higher accuracy while alleviating the requirement for extensive domain expertise.

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2405.06613 [pdf, other]

Simultaneously detecting spatiotemporal changes with penalized Poisson regression models

Authors: Zerui Zhang, Xin Wang, Xin Zhang, Jing Zhang

Abstract: In the realm of large-scale spatiotemporal data, abrupt changes commonly occur across both spatial and temporal domains. This study aims to address the concurrent challenges of detecting change points and identifying spatial clusters within spatiotemporal count data. We introduce an innovative method based on the Poisson regression model, employing doubly fused penalization to unveil the underlying spatiotemporal change patterns. To efficiently estimate the model, we present an iterative shrinkage and threshold based algorithm to minimize the doubly penalized likelihood function. We establish the statistical consistency properties of the proposed estimator, confirming its reliability and accuracy. Furthermore, we conduct extensive numerical experiments to validate our theoretical findings, thereby highlighting the superior performance of our method when compared to existing competitive approaches.

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2404.19118 [pdf, other]

Identification and estimation of causal effects using non-concurrent controls in platform trials

Authors: Michele Santacatterina, Federico Macchiavelli Giron, Xinyi Zhang, Ivan Diaz

Abstract: Platform trials are multi-arm designs that simultaneously evaluate multiple treatments for a single disease within the same overall trial structure. Unlike traditional randomized controlled trials, they allow treatment arms to enter and exit the trial at distinct times while maintaining a control arm throughout. This control arm comprises both concurrent controls, where participants are randomized concurrently to either the treatment or control arm, and non-concurrent controls, who enter the trial when the treatment arm under study is unavailable. While flexible, platform trials introduce a unique challenge with the use of non-concurrent controls, raising questions about how to efficiently utilize their data to estimate treatment effects. Specifically, what estimands should be used to evaluate the causal effect of a treatment versus control? Under what assumptions can these estimands be identified and estimated? Do we achieve any efficiency gains? In this paper, we use structural causal models and counterfactuals to clarify estimands and formalize their identification in the presence of non-concurrent controls in platform trials. We also provide outcome regression, inverse probability weighting, and doubly robust estimators for their estimation. We discuss efficiency gains, demonstrate their performance in a simulation study, and apply them to the ACTT platform trial, resulting in a 20% improvement in precision.

Submitted 29 April, 2024; originally announced April 2024.

MSC Class: 62P10

arXiv:2404.11579 [pdf, other]

Spatial Heterogeneous Additive Partial Linear Model: A Joint Approach of Bivariate Spline and Forest Lasso

Authors: Xin Zhang, Shan Yu, Zhengyuan Zhu, Xin Wang

Abstract: Identifying spatial heterogeneous patterns has attracted a surge of research interest in recent years, due to its important applications in various scientific and engineering fields. In practice the spatially heterogeneous components are often mixed with components which are spatially smooth, making the task of identifying the heterogeneous regions more challenging. In this paper, we develop an efficient clustering approach to identify the model heterogeneity of the spatial additive partial linear model. Specifically, we aim to detect the spatially contiguous clusters based on the regression coefficients while introducing a spatially varying intercept to deal with the smooth spatial effect. On the one hand, to approximate the spatially varying intercept, we use the method of bivariate spline over triangulation, which can effectively handle the data from a complex domain. On the other hand, a novel fusion penalty termed the forest lasso is proposed to reveal the spatial clustering pattern. Our proposed fusion penalty has advantages in both the estimation and computation efficiencies when dealing with large spatial data. Theoretical properties of our estimator are established, and simulation results show that our approach can achieve more accurate estimation with a limited computation cost compared with the existing approaches. To illustrate its practical use, we apply our approach to analyze the spatial pattern of the relationship between land surface temperature measured by satellites and air temperature measured by ground stations in the United States.

Submitted 3 May, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.05976 [pdf, other]

A Cyber Manufacturing IoT System for Adaptive Machine Learning Model Deployment by Interactive Causality Enabled Self-Labeling

Authors: Yutian Ren, Yuqi He, Xuyin Zhang, Aaron Yen, G. P. Li

Abstract: Machine Learning (ML) has been demonstrated to improve productivity in many manufacturing applications. To host these ML applications, several software and Industrial Internet of Things (IIoT) systems have been proposed for manufacturing applications to deploy ML applications and provide real-time intelligence. Recently, an interactive causality enabled self-labeling method has been proposed to advance adaptive ML applications in cyber-physical systems, especially manufacturing, by automatically adapting and personalizing ML models after deployment to counter data distribution shifts. The unique features of the self-labeling method require a novel software system to support dynamism at various levels. This paper proposes the AdaptIoT system, comprised of an end-to-end data streaming pipeline, ML service integration, and an automated self-labeling service. The self-labeling service consists of causal knowledge bases and automated full-cycle self-labeling workflows to adapt multiple ML models simultaneously. AdaptIoT employs a containerized microservice architecture to deliver a scalable and portable solution for small and medium-sized manufacturers. A field demonstration of a self-labeling adaptive ML application is conducted with a makerspace and shows reliable performance.

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2404.05933 [pdf, other]

fastcpd: Fast Change Point Detection in R

Authors: Xingchi Li, Xianyang Zhang

Abstract: Change point analysis is concerned with detecting and locating structure breaks in the underlying model of a sequence of observations ordered by time, space or other variables. A widely adopted approach for change point analysis is to minimize an objective function with a penalty term on the number of change points. This framework includes several well-established procedures, such as the penalized log-likelihood using the (modified) Bayesian information criterion (BIC) or the minimum description length (MDL). The resulting optimization problem can be solved in polynomial time by dynamic programming or its improved version, such as the Pruned Exact Linear Time (PELT) algorithm (Killick, Fearnhead, and Eckley 2012). However, existing computational methods often suffer from two primary limitations: (1) methods based on direct implementation of dynamic programming or PELT are often time-consuming for long data sequences due to repeated computation of the cost value over different segments of the data sequence; (2) state-of-the-art R packages do not provide enough flexibility for users to handle different change point settings and models. In this work, we present the fastcpd package, aiming to provide an efficient and versatile framework for change point detection in several commonly encountered settings. The core of our algorithm is built upon PELT and the sequential gradient descent method recently proposed by Zhang and Dawn (2023). We illustrate the usage of the fastcpd package through several examples, including mean/variance changes in a (multivariate) Gaussian sequence, parameter changes in regression models, structural breaks in ARMA/GARCH/VAR models, and changes in user-specified models.

Submitted 8 April, 2024; originally announced April 2024.

Comments: 53 pages, 16 figures
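
The penalized-cost framework described in the abstract above can be illustrated with a minimal sketch. This is not the fastcpd implementation and not PELT: it is plain optimal partitioning by dynamic programming (quadratic time, no pruning), restricted to mean changes with a squared-error cost. The function name `optimal_partition` and the penalty argument `beta` are hypothetical names chosen for illustration.

```python
import numpy as np


def optimal_partition(y, beta):
    """Minimize (sum of segment costs) + beta * (number of change points).

    Illustrative O(n^2) dynamic program for mean changes only;
    `beta` is a user-chosen penalty per change point (an assumption here).
    Returns the sorted list of change point indices (segment start positions).
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Prefix sums let us evaluate any segment cost in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(i, j):
        # Sum of squared deviations from the segment mean over y[i:j].
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

    # best[t] = minimal penalized cost of segmenting y[:t].
    best = np.full(n + 1, np.inf)
    best[0] = -beta  # so the first segment is not charged a penalty
    last = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        candidates = [best[s] + cost(s, t) + beta for s in range(t)]
        s_star = int(np.argmin(candidates))
        best[t] = candidates[s_star]
        last[t] = s_star

    # Backtrack through the stored segment starts.
    change_points = []
    t = n
    while t > 0:
        s = int(last[t])
        if s > 0:
            change_points.append(s)
        t = s
    return sorted(change_points)


if __name__ == "__main__":
    series = [0.0] * 50 + [5.0] * 50
    print(optimal_partition(series, beta=10.0))
```

The inner loop over all earlier positions `s` is the repeated cost evaluation that the abstract identifies as the bottleneck for long sequences; PELT reduces it by pruning candidate positions that can no longer be optimal.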

arXiv:2404.05808 [pdf, other]

Replicability analysis of high dimensional data accounting for dependence

Authors: Pengfei Lyu, Xianyang Zhang, Hongyuan Cao

Abstract: Replicability is the cornerstone of scientific research. We study the replicability of data from high-throughput experiments, where tens of thousands of features are examined simultaneously. Existing replicability analysis methods either ignore the dependence among features or impose strong modelling assumptions, producing overly conservative or overly liberal results. Based on $p$-values from two studies, we use a four-state hidden Markov model to capture the structure of local dependence. Our method effectively borrows information from different features and studies while accounting for dependence among features and heterogeneity across studies. We show that the proposed method has better power than competing methods while controlling the false discovery rate, both empirically and theoretically. Analyzing datasets from genome-wide association studies reveals new biological insights that otherwise cannot be obtained by using existing methods.

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2403.18540 [pdf, other]

skscope: Fast Sparsity-Constrained Optimization in Python

Authors: Zezhi Wang, ** Zhu, Peng Chen, Huiyang Peng, Xiaoke Zhang, Anran Wang, Yu Zheng, Junxian Zhu, Xueqin Wang

Abstract: Applying iterative solvers on sparsity-constrained optimization (SCO) requires tedious mathematical deduction and careful programming/debugging that hinders these solvers' broad impact. In the paper, the library skscope is introduced to overcome such an obstacle. With skscope, users can solve the SCO by just programming the objective function. The convenience of skscope is demonstrated through two examples in the paper, where sparse linear regression and trend filtering are addressed with just four lines of code. More importantly, skscope's efficient implementation allows state-of-the-art solvers to quickly attain the sparse solution regardless of the high dimensionality of parameter space. Numerical experiments reveal the available solvers in skscope can achieve up to 80x speedup on the competing relaxation solutions obtained via the benchmarked convex solver. skscope is published on the Python Package Index (PyPI) and Conda, and its source code is available at: https://github.com/abess-team/skscope.

Submitted 27 March, 2024; originally announced March 2024.

Comments: 4 pages

arXiv:2403.13260 [pdf, other]

A Bayesian Approach for Selecting Relevant External Data (BASE): Application to a study of Long-Term Outcomes in a Hemophilia Gene Therapy Trial

Authors: Tianyu Pan, Xiang Zhang, Weining Shen, Ting Ye

Abstract: Gene therapies aim to address the root causes of diseases, particularly those stemming from rare genetic defects that can be life-threatening or severely debilitating. While there has been notable progress in the development of gene therapies in recent years, understanding their long-term effectiveness remains challenging due to a lack of data on long-term outcomes, especially during the early stages of their introduction to the market. To address the critical question of estimating long-term efficacy without waiting for the completion of lengthy clinical trials, we propose a novel Bayesian framework. This framework selects pertinent data from external sources, often early-phase clinical trials with more comprehensive longitudinal efficacy data that could lead to an improved inference of the long-term efficacy outcome. We apply this methodology to predict the long-term factor IX (FIX) levels of HEMGENIX (etranacogene dezaparvovec), the first FDA-approved gene therapy to treat adults with severe Hemophilia B, in a phase 3 study. Our application showcases the capability of the framework to estimate the 5-year FIX levels following HEMGENIX therapy, demonstrating sustained FIX levels induced by HEMGENIX infusion. Additionally, we provide theoretical insights into the methodology by establishing its posterior convergence properties.

Submitted 9 April, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.13081 [pdf, other]

Parameter Estimation from Single Patient, Single Time-Point Sequencing Data of Recurrent Tumors

Authors: Kevin Leder, Ru** Sun, Zicheng Wang, Xuanming Zhang

Abstract: In this study, we develop consistent estimators for key parameters that govern the dynamics of tumor cell populations when subjected to pharmacological treatments. While these treatments often lead to an initial reduction in the abundance of drug-sensitive cells, a population of drug-resistant cells frequently emerges over time, resulting in cancer recurrence. Samples from recurrent tumors present as an invaluable data source that can offer crucial insights into the ability of cancer cells to adapt and withstand treatment interventions. To effectively utilize the data obtained from recurrent tumors, we derive several large number limit theorems, specifically focusing on the metrics that quantify the clonal diversity of cancer cell populations at the time of cancer recurrence. These theorems then serve as the foundation for constructing our estimators. A distinguishing feature of our approach is that our estimators only require a single time-point sequencing data from a single tumor, thereby enhancing the practicality of our approach and enabling the understanding of cancer recurrence at the individual level.

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.07431 [pdf, other]

Knowledge Transfer across Multiple Principal Component Analysis Studies

Authors: Zeyu Li, Kangxiang Qin, Yong He, Wang Zhou, Xinsheng Zhang

Abstract: Transfer learning has aroused great interest in the statistical community. In this article, we focus on knowledge transfer for unsupervised learning tasks in contrast to the supervised learning tasks in the literature. Given the transferable source populations, we propose a two-step transfer learning algorithm to extract useful information from multiple source principal component analysis (PCA) studies, thereby enhancing estimation accuracy for the target PCA task. In the first step, we integrate the shared subspace information across multiple studies by a proposed method named as Grassmannian barycenter, instead of directly performing PCA on the pooled dataset. The proposed Grassmannian barycenter method enjoys robustness and computational advantages in more general cases. Then the resulting estimator for the shared subspace from the first step is further utilized to estimate the target private subspace in the second step. Our theoretical analysis credits the gain of knowledge transfer between PCA studies to the enlarged eigenvalue gap, which is different from the existing supervised transfer learning tasks where sparsity plays the central role. In addition, we prove that the bilinear forms of the empirical spectral projectors have asymptotic normality under weaker eigenvalue gap conditions after knowledge transfer. When the set of informative sources is unknown, we endow our algorithm with the capability of useful dataset selection by solving a rectified optimization problem on the Grassmann manifold, which in turn leads to a computationally friendly rectified Grassmannian K-means procedure. In the end, extensive numerical simulation results and a real data case concerning activity recognition are reported to support our theoretical claims and to illustrate the empirical usefulness of the proposed transfer learning methods.

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.06783 [pdf]

A doubly robust estimator for the Mann Whitney Wilcoxon Rank Sum Test when applied for causal inference in observational studies

Authors: Ruohui Chen, Tuo Lin, Lin Liu, **yuan Liu, Ruifeng Chen, **g**g Zou, Chenyu Liu, Loki Natarajan, Tang Wang, Xinlian Zhang, Xin Tu

Abstract: The Mann-Whitney-Wilcoxon rank sum test (MWWRST) is a widely used method for comparing two treatment groups in randomized control trials, particularly when dealing with highly skewed data. However, when applied to observational study data, the MWWRST often yields invalid results for causal inference. To address this limitation, Wu et al. (2014) introduced an approach that incorporates inverse probability weighting (IPW) into this rank-based statistic to mitigate confounding effects. Subsequently, Mao (2018), Zhang et al. (2019), and Ai et al. (2020) extended this IPW estimator to develop doubly robust estimators. Nevertheless, each of these approaches has notable limitations. Mao's method imposes stringent assumptions that may not align with real-world study data. Zhang et al.'s (2019) estimators rely on bootstrap inference, which suffers from computational inefficiency and lacks known asymptotic properties. Meanwhile, Ai et al. (2020) primarily focus on testing the null hypothesis of equal distributions between two groups, which is a more stringent assumption that may not be well-suited to the primary practical application of MWWRST. In this paper, we aim to address these limitations by leveraging functional response models (FRM) to develop doubly robust estimators. We demonstrate the performance of our proposed approach using both simulated and real study data.

Submitted 11 March, 2024; originally announced March 2024.

arXiv:2403.05647 [pdf, other]

Minor Issues Escalated to Critical Levels in Large Samples: A Permutation-Based Fix

Authors: Xuekui Zhang, Li Xing, **g Zhang, Soojeong Kim

Abstract: In the big data era, the need to reevaluate traditional statistical methods is paramount due to the challenges posed by vast datasets. While larger samples theoretically enhance accuracy and hypothesis testing power without increasing false positives, practical concerns about inflated Type-I errors persist. The prevalent belief is that larger samples can uncover subtle effects, necessitating dual consideration of p-value and effect size. Yet, the reliability of p-values from large samples remains debated. This paper warns that larger samples can exacerbate minor issues into significant errors, leading to false conclusions. Through our simulation study, we demonstrate how growing sample sizes amplify issues arising from two commonly encountered violations of model assumptions in real-world data and lead to incorrect decisions. This underscores the need for vigilant analytical approaches in the era of big data. In response, we introduce a permutation-based test to counterbalance the effects of sample size and assumption discrepancies by neutralizing them between actual and permuted data. We demonstrate that this approach effectively stabilizes nominal Type I error rates across various sample sizes, thereby ensuring robust statistical inferences even amidst breached conventional assumptions in big data. For reproducibility, our R codes are publicly available at: https://github.com/ubcxzhang/bigDataIssue.

Submitted 8 March, 2024; originally announced March 2024.
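
For readers unfamiliar with permutation testing, the general idea behind the entry above can be sketched as follows. This is a generic two-sample permutation test on the absolute difference in means, not the specific procedure proposed in the paper (whose code is in R at the URL given in the abstract). The function name `permutation_pvalue` and its parameters are hypothetical.

```python
import numpy as np


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in means.

    Generic illustration only: group labels are shuffled `n_perm` times and
    the observed statistic is compared with the shuffled statistics.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    n_x = len(x)

    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)
        stat = abs(shuffled[:n_x].mean() - shuffled[n_x:].mean())
        if stat >= observed:
            count += 1
    # The +1 terms count the observed labeling itself, so p is never zero.
    return (count + 1) / (n_perm + 1)


if __name__ == "__main__":
    a = np.arange(20.0)
    b = np.arange(20.0) + 100.0
    print(permutation_pvalue(a, b))
```

Because the reference distribution is built from the data themselves, the same departures from model assumptions affect both the observed and the permuted statistics, which is the intuition the abstract appeals to.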

arXiv:2403.04873 [pdf, other]

The SIDO Performance Model for League of Legends

Authors: Amy X. Zhang, Parth Naidu

Abstract: League of Legends (LoL) has been a dominant esport for a decade, yet the inherent complexity of the game has stymied the creation of analytical measures of player skill and performance. Current industry standards are limited to easy-to-procure individual player statistics that are incomplete and lacking context as they do not take into account teamplay or game state. We present a unified performance model for League of Legends which blends together measures of a player's contribution within the context of their team, insights from traditional sports metrics such as the Plus-Minus model, and the intricacies of LoL as a complex team invasion sport. Using hierarchical Bayesian models, we outline the use of gold and damage dealt as a measure of skill, detailing players' impact on their own-, their allies'- and their enemies' statistics throughout the course of the game. Our results showcase the model's increased efficacy in separating professional players when compared to a Plus-Minus model and to current esports industry standards, while metric quality is rigorously assessed for discrimination, independence, and stability. Readers might also find additional qualitative analytics which explore champion proficiency and the impact of collaborative team-play. Future work is proposed to refine and expand the SIDO performance model, offering a comprehensive framework for esports analytics in team performance management, scouting and research realms.

Submitted 6 May, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

arXiv:2403.01717 [pdf, other]

Soft-constrained Schrodinger Bridge: a Stochastic Control Approach

Authors: Jhanvi Garg, Xianyang Zhang, Quan Zhou

Abstract: Schrödinger bridge can be viewed as a continuous-time stochastic control problem where the goal is to find an optimally controlled diffusion process whose terminal distribution coincides with a pre-specified target distribution. We propose to generalize this problem by allowing the terminal distribution to differ from the target but penalizing the Kullback-Leibler divergence between the two distributions. We call this new control problem soft-constrained Schrödinger bridge (SSB). The main contribution of this work is a theoretical derivation of the solution to SSB, which shows that the terminal distribution of the optimally controlled process is a geometric mixture of the target and some other distribution. This result is further extended to a time series setting. One application is the development of robust generative diffusion models. We propose a score matching-based algorithm for sampling from geometric mixtures and showcase its use via a numerical example for the MNIST data set.

Submitted 22 April, 2024; v1 submitted 3 March, 2024; originally announced March 2024.

Comments: Made minor changes about the references. 38 pages, 7 figures. Accepted by AISTATS 2024

MSC Class: 60J60; 60J70; 93E20

arXiv:2402.13933 [pdf, other]

Powerful Large-scale Inference in High Dimensional Mediation Analysis

Authors: Asmita Roy, Xianyang Zhang

Abstract: In genome-wide epigenetic studies, exposures (e.g., Single Nucleotide Polymorphisms) affect outcomes (e.g., gene expression) through intermediate variables such as DNA methylation. Mediation analysis offers a way to study these intermediate variables and identify the presence or absence of causal mediation effects. Testing for mediation effects leads to a composite null hypothesis. Existing methods like Sobel's test or the Max-P test are often underpowered because 1) statistical inference is often conducted based on distributions determined under a subset of the null and 2) they are not designed to shoulder the multiple testing burden. To tackle these issues, we introduce a technique called MLFDR (Mediation Analysis using Local False Discovery Rates) for high dimensional mediation analysis, which uses the local False Discovery Rates based on the coefficients of the structural equation model specifying the mediation relationship to construct a rejection region. We have shown theoretically as well as through simulation studies that in the high-dimensional setting, the new method of identifying the mediating variables controls the FDR asymptotically and performs better with respect to power than several existing methods such as DACT (Liu et al.) and JS-mixture (Dai et al.).

Submitted 26 February, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

arXiv:2401.17473 [pdf, other]

Adaptive Matrix Change Point Detection: Leveraging Structured Mean Shifts

Authors: Xinyu Zhang, Kung-Sik Chan

Abstract: In high-dimensional time series, the component processes are often assembled into a matrix to display their interrelationship. We focus on detecting mean shifts with unknown change point locations in these matrix time series. Series that are activated by a change may cluster along certain rows (columns), which forms mode-specific change point alignment. Leveraging mode-specific change point alignments may substantially enhance the power for change point detection. Yet, there may be no mode-specific alignments in the change point structure. We propose a powerful test to detect mode-specific change points, yet robust to non-mode-specific changes. We show the validity of using the multiplier bootstrap to compute the p-value of the proposed methods, and derive non-asymptotic bounds on the size and power of the tests. We also propose a parallel bootstrap, a computationally efficient approach for computing the p-value of the proposed adaptive test. In particular, we show the consistency of the proposed test, under mild regularity conditions. To obtain the theoretical results, we derive new, sharp bounds on Gaussian approximation and multiplier bootstrap approximation, which are of independent interest for high dimensional problems with diverging sparsity.

Submitted 30 January, 2024; originally announced January 2024.

arXiv:2401.16410 [pdf, other]

ReTaSA: A Nonparametric Functional Estimation Approach for Addressing Continuous Target Shift

Authors: Hwanwoo Kim, Xin Zhang, Jiwei Zhao, Qinglong Tian

Abstract: The presence of distribution shifts poses a significant challenge for deploying modern machine learning models in real-world applications. This work focuses on the target shift problem in a regression setting (Zhang et al., 2013; Nguyen et al., 2016). More specifically, the target variable y (also known as the response variable), which is continuous, has different marginal distributions in the training source and testing domain, while the conditional distribution of features x given y remains the same. While most literature focuses on classification tasks with finite target space, the regression problem has an infinite dimensional target space, which makes many of the existing methods inapplicable. In this work, we show that the continuous target shift problem can be addressed by estimating the importance weight function from an ill-posed integral equation. We propose a nonparametric regularized approach named ReTaSA to solve the ill-posed integral equation and provide theoretical justification for the estimated importance weight function. The effectiveness of the proposed method has been demonstrated with extensive numerical studies on synthetic and real-world datasets.

Submitted 29 January, 2024; originally announced January 2024.

Comments: Accepted by ICLR 2024

arXiv:2401.14655 [pdf, other]

Distributionally Robust Optimization and Robust Statistics

Authors: Jose Blanchet, Jia** Li, Sirui Lin, Xuhui Zhang

Abstract: We review distributionally robust optimization (DRO), a principled approach for constructing statistical estimators that hedge against the impact of deviations in the expected loss between the training and deployment environments. Many well-known estimators in statistics and machine learning (e.g. AdaBoost, LASSO, ridge regression, dropout training, etc.) are distributionally robust in a precise sense. We hope that by discussing the DRO interpretation of well-known estimators, statisticians who may not be too familiar with DRO may find a way to access the DRO literature through the bridge between classical results and their DRO equivalent formulation. On the other hand, the topic of robustness in statistics has a rich tradition associated with removing the impact of contamination. Thus, another objective of this paper is to clarify the difference between DRO and classical statistical robustness. As we will see, these are two fundamentally different philosophies leading to completely different types of estimators. In DRO, the statistician hedges against an environment shift that occurs after the decision is made; thus DRO estimators tend to be pessimistic in an adversarial setting, leading to a min-max type formulation. In classical robust statistics, the statistician seeks to correct contamination that occurred before a decision is made; thus robust statistical estimators tend to be optimistic leading to a min-min type formulation.

Submitted 26 January, 2024; originally announced January 2024.

arXiv:2401.14343 [pdf, other]

Class-attribute Priors: Adapting Optimization to Heterogeneity and Fairness Objective

Authors: Xuechen Zhang, Mingchen Li, Jiasi Chen, Christos Thrampoulidis, Samet Oymak

Abstract: Modern classification problems exhibit heterogeneities across individual classes: Each class may have unique attributes, such as sample size, label quality, or predictability (easy vs difficult), and variable importance at test-time. Without care, these heterogeneities impede the learning process, most notably, when optimizing fairness objectives. Confirming this, under a Gaussian mixture setting, we show that the optimal SVM classifier for balanced accuracy needs to be adaptive to the class attributes. This motivates us to propose CAP: An effective and general method that generates a class-specific learning strategy (e.g. hyperparameter) based on the attributes of that class. This way, the optimization process better adapts to heterogeneities. CAP leads to substantial improvements over the naive approach of assigning separate hyperparameters to each class. We instantiate CAP for loss function design and post-hoc logit adjustment, with emphasis on label-imbalanced problems. We show that CAP is competitive with prior art and its flexibility unlocks clear benefits for fairness objectives beyond balanced accuracy. Finally, we evaluate CAP on problems with label noise as well as weighted test objectives to showcase how CAP can jointly adapt to different heterogeneities.

Submitted 25 January, 2024; originally announced January 2024.

Comments: 15 pages, 8 figures

arXiv:2401.08173 [pdf, other]

Simultaneous Change Point Detection and Identification for High Dimensional Linear Models

Authors: Bin Liu, Xinsheng Zhang, Yufeng Liu

Abstract: In this article, we consider change point inference for high dimensional linear models. For change point detection, given any subgroup of variables, we propose a new method for testing the homogeneity of corresponding regression coefficients across the observations. Under some regularity conditions, the proposed new testing procedure controls the type I error asymptotically, is powerful against sparse alternatives, and enjoys certain optimality. For change point identification, an argmax based change point estimator is proposed which is shown to be consistent for the true change point location. Moreover, combining with the binary segmentation technique, we further extend our new method for detecting and identifying multiple change points. Extensive numerical studies justify the validity of our new method and an application to the Alzheimer's disease data analysis further demonstrates its competitive performance.

Submitted 16 January, 2024; originally announced January 2024.

arXiv:2401.03987 [pdf, other]

Navigating the Congestion Maze: Geospatial Analysis and Travel Behavior Insights for Dockless Bike-Sharing Systems in Xiamen

Authors: Xuxilu Zhang, Lingqi Gu, Nan Zhao

Abstract: Shared bicycles have emerged as a transformative force in urban transportation, effectively addressing the perennial 'last mile' challenge faced by commuters. The limitations of station-based bike-sharing systems, constrained by point-to-point travel, have spurred the popularity of the dockless model, offering flexible rentals and eliminating docking infrastructure constraints. However, the rapid growth of the sharing economy has introduced new challenges, notably an imbalance between supply and demand, leading to issues like the unavailability of bicycles and insufficient parking spaces during peak hours. To address these challenges, this study introduces a novel variable, Congestion Density (C), to quantitatively measure dynamic congestion levels in dockless bicycle-sharing systems. Leveraging real-time shared bike information from Xiamen, China, we present a sophisticated clustering framework for congested spots, identifying 563 congested spots categorized into Over-crowded, Semi-crowded, and Light-crowded clusters. Strikingly, these clusters align with established subway lines and bus stops, revealing a prevalent trend of integration between subway/bus services and bike-sharing. Overall, this study proposes parking lot management plans and policy recommendations based on the dynamics of crowded parking spaces, geographical characteristics, and land functional attributes. Our findings provide crucial insights for implementing bike-sharing electric fences and understanding urban mobility patterns, contributing to sustainable urban transportation.

Submitted 8 January, 2024; originally announced January 2024.

Comments: 17 pages, 8 figures

arXiv:2401.01294 [pdf, other]

doi 10.1109/TIFS.2023.3349054

Efficient Sparse Least Absolute Deviation Regression with Differential Privacy

Authors: Weidong Liu, Xiaojun Mao, Xiaofei Zhang, Xin Zhang

Abstract: In recent years, privacy-preserving machine learning algorithms have attracted increasing attention because of their important applications in many scientific fields. However, in the literature, most privacy-preserving algorithms demand learning objectives to be strongly convex and Lipschitz smooth, which thus cannot cover a wide class of robust loss functions (e.g., quantile/least absolute loss). In this work, we aim to develop a fast privacy-preserving learning solution for a sparse robust regression problem. Our learning loss consists of a robust least absolute loss and an $\ell_1$ sparse penalty term. To solve the non-smooth loss quickly under a given privacy budget, we develop a Fast Robust And Privacy-Preserving Estimation (FRAPPE) algorithm for least absolute deviation regression. Our algorithm achieves fast estimation by reformulating the sparse LAD problem as a penalized least squares estimation problem and adopts a three-stage noise injection to guarantee $(ε,δ)$-differential privacy. We show that our algorithm can achieve a better trade-off between privacy and statistical accuracy than state-of-the-art privacy-preserving regression algorithms. Finally, we conduct experiments to verify the efficiency of our proposed FRAPPE algorithm.

Submitted 2 January, 2024; originally announced January 2024.

Comments: IEEE Transactions on Information Forensics and Security, 2024

MSC Class: 62J07

arXiv:2312.15566 [pdf, other]

Deep Copula-Based Survival Analysis for Dependent Censoring with Identifiability Guarantees

Authors: Weijia Zhang, Chun Kai Ling, Xuanhui Zhang

Abstract: Censoring is the central problem in survival analysis where either the time-to-event (for instance, death), or the time-to-censoring (such as loss of follow-up) is observed for each sample. The majority of existing machine learning-based survival analysis methods assume that survival is conditionally independent of censoring given a set of covariates; an assumption that cannot be verified since only marginal distributions are available from the data. The existence of dependent censoring, along with the inherent bias in current estimators, has been demonstrated in a variety of applications, accentuating the need for a more nuanced approach. However, existing methods that adjust for dependent censoring require practitioners to specify the ground truth copula. This requirement poses a significant challenge for practical applications, as model misspecification can lead to substantial bias. In this work, we propose a flexible deep learning-based survival analysis method that simultaneously accommodates dependent censoring and eliminates the requirement for specifying the ground truth copula. We theoretically prove the identifiability of our model under a broad family of copulas and survival distributions. Experimental results from a wide range of datasets demonstrate that our approach successfully discerns the underlying dependency structure and significantly reduces survival estimation bias when compared to existing methods.

Submitted 27 December, 2023; v1 submitted 24 December, 2023; originally announced December 2023.

Comments: To appear in AAAI 2024

arXiv:2312.15373 [pdf, other]

A Multi-day Needs-based Modeling Approach for Activity and Travel Demand Analysis

Authors: Kexin Chen, **** Guan, Ravi Seshadri, Varun Pattabhiraman, Youssef Medhat Aboutaleb, Ali Shamshiripour, Chen Liang, Xiaochun Zhang, Moshe Ben-Akiva

Abstract: This paper proposes a multi-day needs-based model for activity and travel demand analysis. The model captures the multi-day dynamics in activity generation, which enables the modeling of activities with increased flexibility in time and space (e.g., e-commerce and remote working). As an enhancement to activity-based models, the proposed model captures the underlying decision-making process of activity generation by accounting for psychological needs as the drivers of activities. The level of need satisfaction is modeled as a psychological inventory, whose utility is optimized via decisions on activity participation, location, and duration. The utility includes both the benefit in the inventory gained and the cost in time, monetary expense as well as maintenance of safety stock. The model includes two sub-models, a Deterministic Model that optimizes the utility of the inventory, and an Empirical Model that accounts for heterogeneity and stochasticity. Numerical experiments are conducted to demonstrate model scalability. A maximum likelihood estimator is proposed, the properties of the log-likelihood function are examined and the recovery of true parameters is tested. This research contributes to the literature on transportation demand models in the following three aspects. First, it is arguably better grounded in psychological theory than traditional models and allows the generation of activity patterns to be policy-sensitive (while avoiding the need for ad hoc utility definitions). Second, it contributes to the development of needs-based models with a non-myopic approach to model multi-day activity patterns. Third, it proposes a tractable model formulation via problem reformulation and computational enhancements, which allows for maximum likelihood parameter estimation.

Submitted 23 December, 2023; originally announced December 2023.

Comments: 38 pages, 11 figures

arXiv:2312.09862 [pdf, other]

Wasserstein-based Minimax Estimation of Dependence in Multivariate Regularly Varying Extremes

Authors: Xuhui Zhang, Jose Blanchet, Youssef Marzouk, Viet Anh Nguyen, Sven Wang

Abstract: We study minimax risk bounds for estimators of the spectral measure in multivariate linear factor models, where observations are linear combinations of regularly varying latent factors. Non-asymptotic convergence rates are derived for the multivariate Peak-over-Threshold estimator in terms of the $p$-th order Wasserstein distance, and information-theoretic lower bounds for the minimax risks are established. The convergence rate of the estimator is shown to be minimax optimal under a class of Pareto-type models analogous to the standard class used in the setting of one-dimensional observations known as the Hall-Welsh class. When the estimator is minimax inefficient, a novel two-step estimator is introduced and demonstrated to attain the minimax lower bound. Our analysis bridges the gaps in understanding trade-offs between estimation bias and variance in multivariate extreme value theory.

Submitted 15 December, 2023; originally announced December 2023.

arXiv:2312.04026 [pdf, other]

Independent-Set Design of Experiments for Estimating Treatment and Spillover Effects under Network Interference

Authors: Chencheng Cai, Xu Zhang, Edoardo M. Airoldi

Abstract: Interference is ubiquitous when conducting causal experiments over networks. Except for certain network structures, causal inference on the network in the presence of interference is difficult due to the entanglement between the treatment assignments and the interference levels. In this article, we conduct causal inference under interference on an observed, sparse but connected network, and we propose a novel design of experiments based on an independent set. Compared to conventional designs, the independent-set design focuses on an independent subset of data and controls their interference exposures through the assignments to the rest (auxiliary set). We provide a lower bound on the size of the independent set from a greedy algorithm, and justify the theoretical performance of estimators under the proposed design. Our approach is capable of estimating both spillover effects and treatment effects. We justify its superiority over conventional methods and illustrate the empirical performance through simulations.

Submitted 6 December, 2023; originally announced December 2023.

arXiv:2312.02905 [pdf, other]

E-values, Multiple Testing and Beyond

Authors: Guanxun Li, Xianyang Zhang

Abstract: We discover a connection between the Benjamini-Hochberg (BH) procedure and the recently proposed e-BH procedure [Wang and Ramdas, 2022] with a suitably defined set of e-values. This insight extends to a generalized version of the BH procedure and the model-free multiple testing procedure in Barber and Candès [2015] (BC) with a general form of rejection rules. The connection provides an effective way of developing new multiple testing procedures by aggregating or assembling e-values resulting from the BH and BC procedures and their use in different subsets of the data. In particular, we propose new multiple testing methodologies in three applications, including a hybrid approach that integrates the BH and BC procedures, a multiple testing procedure aimed at ensuring a new notion of fairness by controlling both the group-wise and overall false discovery rates (FDR), and a structure adaptive multiple testing procedure that can incorporate external covariate information to boost detection power. One notable feature of the proposed methods is that we use a data-dependent approach for assigning weights to e-values, significantly enhancing the efficiency of the resulting e-BH procedure. The construction of the weights is non-trivial and is motivated by the leave-one-out analysis for the BH and BC procedures. In theory, we prove that the proposed e-BH procedures with data-dependent weights in the three applications ensure finite sample FDR control. Furthermore, we demonstrate the efficiency of the proposed methods through numerical studies in the three applications.

Submitted 5 December, 2023; originally announced December 2023.
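
For readers unfamiliar with the two procedures named in this abstract, the following is a minimal sketch of the textbook BH step-up rule and the base e-BH rule. It is not the weighted or hybrid method the paper proposes. The function names `benjamini_hochberg` and `e_bh` and the sample inputs are hypothetical and chosen only for illustration.

```python
def benjamini_hochberg(p_values, alpha):
    """Textbook BH step-up rule: reject the k* smallest p-values, where
    k* is the largest rank k with p_(k) <= alpha * k / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    k_star = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k_star = rank
    return sorted(order[:k_star])  # indices of rejected hypotheses


def e_bh(e_values, alpha):
    """Base e-BH rule: reject the k* largest e-values, where k* is the
    largest rank k with k * e_[k] / m >= 1 / alpha."""
    m = len(e_values)
    order = sorted(range(m), key=lambda i: e_values[i], reverse=True)
    k_star = 0
    for rank, idx in enumerate(order, start=1):
        if rank * e_values[idx] / m >= 1.0 / alpha:
            k_star = rank
    return sorted(order[:k_star])  # indices of rejected hypotheses


# Illustrative inputs only (not data from the paper).
p_demo = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
e_demo = [200.0, 50.0, 3.0, 1.0]
bh_rejections = benjamini_hochberg(p_demo, alpha=0.05)
ebh_rejections = e_bh(e_demo, alpha=0.1)
```

The two rules share the same step-up shape: substituting 1/p for e turns the e-BH threshold into the BH threshold algebraically. That substitution only sanity-checks the code; 1/p is not a valid e-value in general, and the e-values the abstract refers to are constructed differently.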

arXiv:2312.01386 [pdf, ps, other]

Regret Optimality of GP-UCB

Authors: Wenjia Wang, Xiaowei Zhang, Lu Zou

Abstract: Gaussian Process Upper Confidence Bound (GP-UCB) is one of the most popular methods for optimizing black-box functions with noisy observations, due to its simple structure and superior performance. Its empirical successes lead to a natural, yet unresolved question: Is GP-UCB regret optimal? In this paper, we offer the first generally affirmative answer to this important open question in the Bayesian optimization literature. We establish new upper bounds on both the simple and cumulative regret of GP-UCB when the objective function to optimize admits a certain smoothness property. These upper bounds match the known minimax lower bounds (up to logarithmic factors independent of the feasible region's dimensionality) for optimizing functions with the same smoothness. Intriguingly, our findings indicate that, with the same level of exploration, GP-UCB can simultaneously achieve optimality in both simple and cumulative regret. The crux of our analysis hinges on a refined uniform error bound for online estimation of functions in reproducing kernel Hilbert spaces. This error bound, which we derive from empirical process theory, is of independent interest, and its potential applications may reach beyond the scope of this study.

Submitted 3 December, 2023; originally announced December 2023.

Comments: 23 pages
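
As a reminder of the quantities the abstract refers to, the block below writes out the standard GP-UCB selection rule and the two regret notions in their usual textbook form. The notation is an assumption and is not taken from the paper: $\mu_{t-1}$ and $\sigma_{t-1}$ are the GP posterior mean and standard deviation after $t-1$ observations, $\beta_t$ is the exploration parameter, $\mathcal{X}$ is the feasible region, $x^*$ is a maximizer of $f$, and $\hat{x}_T$ is the point reported after $T$ rounds. The paper's exact definitions, in particular of simple regret, may differ.

```latex
x_t \in \arg\max_{x \in \mathcal{X}} \; \mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x),
\qquad
R_T = \sum_{t=1}^{T} \bigl( f(x^*) - f(x_t) \bigr),
\qquad
r_T = f(x^*) - f(\hat{x}_T)
```

Here $R_T$ is the cumulative regret and $r_T$ the simple regret. The abstract's "same level of exploration" presumably refers to a single schedule for $\beta_t$ serving both criteria.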

arXiv:2311.17797 [pdf, other]

Learning to Simulate: Generative Metamodeling via Quantile Regression

Authors: L. Jeff Hong, Yanxi Hou, Qingkai Zhang, Xiaowei Zhang

Abstract: Stochastic simulation models, while effective in capturing the dynamics of complex systems, are often too slow to run for real-time decision-making. Metamodeling techniques are widely used to learn the relationship between a summary statistic of the outputs (e.g., the mean or quantile) and the inputs of the simulator, so that it can be used in real time. However, this methodology requires the knowledge of an appropriate summary statistic in advance, making it inflexible for many practical situations. In this paper, we propose a new metamodeling concept, called generative metamodeling, which aims to construct a "fast simulator of the simulator". This technique can generate random outputs substantially faster than the original simulation model, while retaining an approximately equal conditional distribution given the same inputs. Once constructed, a generative metamodel can instantaneously generate a large amount of random outputs as soon as the inputs are specified, thereby facilitating the immediate computation of any summary statistic for real-time decision-making. Furthermore, we propose a new algorithm -- quantile-regression-based generative metamodeling (QRGMM) -- and study its convergence and rate of convergence. Extensive numerical experiments are conducted to investigate the empirical performance of QRGMM, compare it with other state-of-the-art generative algorithms, and demonstrate its usefulness in practical real-time decision-making.

Submitted 29 November, 2023; originally announced November 2023.

Comments: Main body: 36 pages, 7 figures; supplemental material: 12 pages

arXiv:2311.17303 [pdf, other]

Enhancing the Performance of Neural Networks Through Causal Discovery and Integration of Domain Knowledge

Authors: Xiaoge Zhang, Xiao-Lin Wang, Fenglei Fan, Yiu-Ming Cheung, Indranil Bose

Abstract: In this paper, we develop a generic methodology to encode hierarchical causality structure among observed variables into a neural network in order to improve its predictive performance. The proposed methodology, called causality-informed neural network (CINN), leverages three coherent steps to systematically map the structural causal knowledge into the layer-to-layer design of neural network while strictly preserving the orientation of every causal relationship. In the first step, CINN discovers causal relationships from observational data via directed acyclic graph (DAG) learning, where causal discovery is recast as a continuous optimization problem to avoid the combinatorial nature. In the second step, the discovered hierarchical causality structure among observed variables is systematically encoded into neural network through a dedicated architecture and customized loss function. By categorizing variables in the causal DAG as root, intermediate, and leaf nodes, the hierarchical causal DAG is translated into CINN with a one-to-one correspondence between nodes in the causal DAG and units in the CINN while maintaining the relative order among these nodes. Regarding the loss function, both intermediate and leaf nodes in the DAG graph are treated as target outputs during CINN training so as to drive co-learning of causal relationships among different types of nodes. As multiple loss components emerge in CINN, we leverage the projection of conflicting gradients to mitigate gradient interference among the multiple learning tasks. Computational experiments across a broad spectrum of UCI data sets demonstrate substantial advantages of CINN in predictive performance over other state-of-the-art methods. In addition, an ablation study underscores the value of integrating structural and quantitative causal knowledge in enhancing the neural network's predictive performance incrementally.

Submitted 30 November, 2023; v1 submitted 28 November, 2023; originally announced November 2023.

arXiv:2311.13958 [pdf, other]

Handling The Non-Smooth Challenge in Tensor SVD: A Multi-Objective Tensor Recovery Framework

Authors: **g**g Zheng, Wanglong Lu, Wenzhe Wang, Yankai Cao, Xiaoqin Zhang, Xianta Jiang

Abstract: Recently, numerous tensor singular value decomposition (t-SVD)-based tensor recovery methods have shown promise in processing visual data, such as color images and videos. However, these methods often suffer from severe performance degradation when confronted with tensor data exhibiting non-smooth changes. Such non-smoothness is commonly observed in real-world scenarios but ignored by the traditional t-SVD-based methods. In this work, we introduce a novel tensor recovery model with a learnable tensor nuclear norm to address such a challenge. We develop a new optimization algorithm named the Alternating Proximal Multiplier Method (APMM) to iteratively solve the proposed tensor completion model. Theoretical analysis demonstrates the convergence of the proposed APMM to the Karush-Kuhn-Tucker (KKT) point of the optimization problem. In addition, we propose a multi-objective tensor recovery framework based on APMM to efficiently explore the correlations of tensor data across its various dimensions, providing a new perspective on extending the t-SVD-based method to higher-order tensor cases. Numerical experiments demonstrate the effectiveness of the proposed method in tensor completion.

Submitted 31 March, 2024; v1 submitted 23 November, 2023; originally announced November 2023.

arXiv:2311.08504 [pdf, ps, other]

On semi-supervised estimation using exponential tilt mixture models

Authors: Ye Tian, Xinwei Zhang, Zhiqiang Tan

Abstract: Consider a semi-supervised setting with a labeled dataset of binary responses and predictors and an unlabeled dataset with only the predictors. Logistic regression is equivalent to an exponential tilt model in the labeled population. For semi-supervised estimation, we develop further analysis and understanding of a statistical approach using exponential tilt mixture (ETM) models and maximum nonparametric likelihood estimation, while allowing that the class proportions may differ between the unlabeled and labeled data. We derive asymptotic properties of ETM-based estimation and demonstrate improved efficiency over supervised logistic regression in a random sampling setup and an outcome-stratified sampling setup previously used. Moreover, we reconcile such efficiency improvement with the existing semiparametric efficiency theory when the class proportions in the unlabeled and labeled data are restricted to be the same. We also provide a simulation study to numerically illustrate our theoretical findings.

Submitted 14 November, 2023; originally announced November 2023.

arXiv:2311.07876 [pdf, ps, other]

Learning Adversarial Low-rank Markov Decision Processes with Unknown Transition and Full-information Feedback

Authors: Canzhe Zhao, Ruofeng Yang, Baoxiang Wang, Xuezhou Zhang, Shuai Li

Abstract: In this work, we study the low-rank MDPs with adversarially changed losses in the full-information feedback setting. In particular, the unknown transition probability kernel admits a low-rank matrix decomposition \citep{REPUCB22}, and the loss functions may change adversarially but are revealed to the learner at the end of each episode. We propose a policy optimization-based algorithm POLO, and we prove that it attains the $\widetilde{O}(K^{\frac{5}{6}}A^{\frac{1}{2}}d\ln(1+M)/(1-γ)^2)$ regret guarantee, where $d$ is the rank of the transition kernel (and hence the dimension of the unknown representations), $A$ is the cardinality of the action space, $M$ is the cardinality of the model class, and $γ$ is the discount factor. Notably, our algorithm is oracle-efficient and has a regret guarantee with no dependence on the size of the potentially arbitrarily large state space. Furthermore, we also prove an $Ω(\frac{γ^2}{1-γ} \sqrt{d A K})$ regret lower bound for this problem, showing that low-rank MDPs are statistically more difficult to learn than linear MDPs in the regret minimization setting. To the best of our knowledge, we present the first algorithm that interleaves representation learning, exploration, and exploitation to achieve the sublinear regret guarantee for RL with nonlinear function approximation and adversarial losses.

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2311.06968 [pdf, other]

Physics-Informed Data Denoising for Real-Life Sensing Systems

Authors: Xiyuan Zhang, Xiaohan Fu, Diyan Teng, Chengyu Dong, Keerthivasan Vijayakumar, Jiayun Zhang, Ranak Roy Chowdhury, Junsheng Han, Dezhi Hong, Rashmi Kulkarni, **gbo Shang, Rajesh Gupta

Abstract: Sensors measuring real-life physical processes are ubiquitous in today's interconnected world. These sensors inherently bear noise that often adversely affects performance and reliability of the systems they support. Classic filtering-based approaches introduce strong assumptions on the time or frequency characteristics of sensory measurements, while learning-based denoising approaches typically rely on using ground truth clean data to train a denoising model, which is often challenging or prohibitive to obtain for many real-world applications. We observe that in many scenarios, the relationships between different sensor measurements (e.g., location and acceleration) are analytically described by laws of physics (e.g., second-order differential equation). By incorporating such physics constraints, we can guide the denoising process to improve even in the absence of ground truth data. In light of this, we design a physics-informed denoising model that leverages the inherent algebraic relationships between different measurements governed by the underlying physics. By obviating the need for ground truth clean data, our method offers a practical denoising solution for real-world applications. We conducted experiments in various domains, including inertial navigation, CO2 monitoring, and HVAC control, and achieved state-of-the-art performance compared with existing denoising methods. Our method can denoise data in real time (4ms for a sequence of 1s) for low-cost noisy sensors and produces results that closely align with those from high-precision, high-cost alternatives, leading to an efficient, cost-effective approach for more accurate sensor-based systems.

Submitted 12 November, 2023; originally announced November 2023.

Comments: SenSys 2023

arXiv:2311.03289 [pdf, other]

Batch effect correction with sample remeasurement in highly confounded case-control studies

Authors: Hanxuan Ye, Xianyang Zhang, Chen Wang, Ellen L. Goode, Jun Chen

Abstract: Batch effects are pervasive in biomedical studies. One approach to address the batch effects is repeatedly measuring a subset of samples in each batch. These remeasured samples are used to estimate and correct the batch effects. However, rigorous statistical methods for batch effect correction with remeasured samples are severely under-developed. In this study, we developed a framework for batch effect correction using remeasured samples in highly confounded case-control studies. We provided theoretical analyses of the proposed procedure, evaluated its power characteristics, and provided a power calculation tool to aid in the study design. We found that the number of samples that need to be remeasured depends strongly on the between-batch correlation. When the correlation is high, remeasuring a small subset of samples can rescue most of the power.

Submitted 6 November, 2023; originally announced November 2023.

Comments: 45 pages

arXiv:2310.17153 [pdf, other]

Hierarchical Semi-Implicit Variational Inference with Application to Diffusion Model Acceleration

Authors: Longlin Yu, Tianyu Xie, Yu Zhu, Tong Yang, Xiangyu Zhang, Cheng Zhang

Abstract: Semi-implicit variational inference (SIVI) has been introduced to expand the analytical variational families by defining expressive semi-implicit distributions in a hierarchical manner. However, the single-layer architecture commonly used in current SIVI methods can be insufficient when the target posterior has complicated structures. In this paper, we propose hierarchical semi-implicit variational inference, called HSIVI, which generalizes SIVI to allow more expressive multi-layer construction of semi-implicit distributions. By introducing auxiliary distributions that interpolate between a simple base distribution and the target distribution, the conditional layers can be trained by progressively matching these auxiliary distributions one layer after another. Moreover, given pre-trained score networks, HSIVI can be used to accelerate the sampling process of diffusion models with the score matching objective. We show that HSIVI significantly enhances the expressiveness of SIVI on several Bayesian inference problems with complicated target distributions. When used for diffusion model acceleration, we show that HSIVI can produce high quality samples comparable to or better than the existing fast diffusion model based samplers with a small number of function evaluations on various datasets.

Submitted 26 October, 2023; originally announced October 2023.

Comments: 25 pages, 13 figures, NeurIPS 2023

arXiv:2310.07990 [pdf]

Multi-View Variational Autoencoder for Missing Value Imputation in Untargeted Metabolomics

Authors: Chen Zhao, Kuan-Jui Su, Chong Wu, Xuewei Cao, Qiuying Sha, Wu Li, Zhe Luo, Tian Qin, Chuan Qiu, Lan Juan Zhao, Anqi Liu, Lindong Jiang, Xiao Zhang, Hui Shen, Weihua Zhou, Hong-Wen Deng

Abstract: Background: Missing data is a common challenge in mass spectrometry-based metabolomics, which can lead to biased and incomplete analyses. The integration of whole-genome sequencing (WGS) data with metabolomics data has emerged as a promising approach to enhance the accuracy of data imputation in metabolomics studies. Method: In this study, we propose a novel method that leverages the information from WGS data and reference metabolites to impute unknown metabolites. Our approach utilizes a multi-view variational autoencoder to jointly model the burden score, polygenic risk score (PGS), and linkage disequilibrium (LD) pruned single nucleotide polymorphisms (SNPs) for feature extraction and missing metabolomics data imputation. By learning the latent representations of both omics data, our method can effectively impute missing metabolomics values based on genomic information. Results: We evaluate the performance of our method on empirical metabolomics datasets with missing values and demonstrate its superiority compared to conventional imputation techniques. Using 35 template metabolites derived burden scores, PGS and LD-pruned SNPs, the proposed methods achieved R^2-scores > 0.01 for 71.55% of metabolites. Conclusion: The integration of WGS data in metabolomics imputation not only improves data completeness but also enhances downstream analyses, paving the way for more comprehensive and accurate investigations of metabolic pathways and disease associations. Our findings offer valuable insights into the potential benefits of utilizing WGS data for metabolomics data imputation and underscore the importance of leveraging multi-modal data integration in precision medicine research.

Submitted 12 March, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: 19 pages, 3 figures

arXiv:2310.04457 [pdf, other]

ProGO: Probabilistic Global Optimizer

Authors: Xinyu Zhang, Sujit Ghosh

Abstract: In the field of global optimization, many existing algorithms face challenges posed by non-convex target functions and high computational complexity or unavailability of gradient information. These limitations, exacerbated by sensitivity to initial conditions, often lead to suboptimal solutions or failed convergence. This is true even for metaheuristic algorithms designed to amalgamate different optimization techniques to improve their efficiency and robustness. To address these challenges, we develop a sequence of multidimensional integration-based methods that we show to converge to the global optima under some mild regularity conditions. Our probabilistic approach does not require the use of gradients and is underpinned by a mathematically rigorous convergence framework anchored in the nuanced properties of the nascent optima distribution. To alleviate the problem of multidimensional integration, we develop a latent slice sampler that enjoys a geometric rate of convergence in generating samples from the nascent optima distribution, which is used to approximate the global optima. The proposed Probabilistic Global Optimizer (ProGO) provides a scalable unified framework to approximate the global optima of any continuous function defined on a domain of arbitrary dimension.
Empirical illustrations of ProGO across a variety of popular non-convex test functions (having finite global optima) reveal that the proposed algorithm outperforms, by an order of magnitude, many existing state-of-the-art methods, including gradient-based, zeroth-order gradient-free, and some Bayesian Optimization methods, in terms of regret value and speed of convergence. Note, however, that our approach may not be suitable for functions that are expensive to compute.

Submitted 12 October, 2023; v1 submitted 4 October, 2023; originally announced October 2023.
