Search | arXiv e-print repository

Consistent community detection in multi-layer networks with heterogeneous differential privacy

Authors: Yaoming Zhen, Shirong Xu, Junhui Wang

Abstract: As network data has become increasingly prevalent, a substantial amount of attention has been paid to the privacy issue in publishing network data. One of the critical challenges for data publishers is to preserve the topological structures of the original network while protecting sensitive information. In this paper, we propose a personalized edge-flipping mechanism that allows data publishers to protect edge information based on each node's privacy preference. It can achieve differential privacy while preserving the community structure under the multi-layer degree-corrected stochastic block model after appropriately debiasing, and thus consistent community detection in the privatized multi-layer networks is achievable. Theoretically, we establish the consistency of community detection in the privatized multi-layer network and show that better privacy protection of edges can be obtained for a proportion of nodes while allowing other nodes to give up their privacy. Furthermore, the advantage of the proposed personalized edge-flipping mechanism is also supported by its numerical performance on various synthetic networks and a real-life multi-layer network.

Submitted 20 June, 2024; originally announced June 2024.
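The flip-then-debias idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' mechanism: the per-node flip rates `node_q`, the rule that averages the two endpoint rates into an edge flip probability, and all function names are assumptions made for illustration only. The one general fact it relies on is that if an edge indicator A is flipped with probability q, then E[flipped] = q + A(1 - 2q), so (flipped - q) / (1 - 2q) is an unbiased estimate of A.

```python
import random


def edge_flip_probs(node_q):
    """Hypothetical rule: an edge's flip probability is the mean of its endpoints' rates."""
    n = len(node_q)
    probs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q = 0.5 * (node_q[i] + node_q[j])
                if not 0.0 <= q < 0.5:
                    raise ValueError("flip probabilities must lie in [0, 0.5)")
                probs[i][j] = q
    return probs


def flip_edges(adj, probs, rng):
    """Flip each undirected edge indicator independently with its own probability."""
    n = len(adj)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            bit = adj[i][j]
            if rng.random() < probs[i][j]:
                bit = 1 - bit
            out[i][j] = out[j][i] = bit
    return out


def debias(flipped, probs):
    """Entrywise unbiased estimate of the original adjacency matrix."""
    n = len(flipped)
    est = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q = probs[i][j]
                est[i][j] = (flipped[i][j] - q) / (1.0 - 2.0 * q)
    return est


def density(mat):
    """Average of the entries above the diagonal."""
    n = len(mat)
    total = sum(mat[i][j] for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)
```

Individual debiased entries are noisy and can fall outside [0, 1]; only aggregates such as the edge density are recovered accurately, which is why a downstream method would work with the debiased matrix as a whole.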

arXiv:2406.12212

Identifying Genetic Variants for Obesity Incorporating Prior Insights: Quantile Regression with Insight Fusion for Ultra-high Dimensional Data

Authors: Jiantong Wang, Heng Lian, Yan Yu, Heping Zhang

Abstract: Obesity is widely recognized as a critical and pervasive health concern. We strive to identify important genetic risk factors from hundreds of thousands of single nucleotide polymorphisms (SNPs) for obesity. We propose and apply a novel Quantile Regression with Insight Fusion (QRIF) approach that can integrate insights from established studies or domain knowledge to simultaneously select variables and modeling for ultra-high dimensional genetic data, focusing on high conditional quantiles of body mass index (BMI) that are of most interest. We discover interesting new SNPs and shed new light on a comprehensive view of the underlying genetic risk factors for different levels of BMI. This may potentially pave the way for more precise and targeted treatment strategies. The QRIF approach intends to balance the trade-off between the prior insights and the observed data while being robust to potential false information. We further establish the desirable asymptotic properties under the challenging non-differentiable check loss functions via Huber loss approximation and nonconvex SCAD penalty via local linear approximation. Finally, we develop an efficient algorithm for the QRIF approach. Our simulation studies further demonstrate its effectiveness.

Submitted 17 June, 2024; originally announced June 2024.

Comments: This article is submitted to Journal of the American Statistical Association

arXiv:2406.11011

Data Shapley in One Training Run

Authors: Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia

Abstract: Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for any models produced by running the learning algorithm, meaning they cannot perform targeted attribution towards a specific model obtained from a single run of the algorithm. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage for the first time. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.

Submitted 29 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.09557

Measure This, Not That: Optimizing the Cost and Model-Based Information Content of Measurements

Authors: Jialu Wang, Zedong Peng, Ryan Hughes, Debangsu Bhattacharyya, David E. Bernal Neira, Alexander W. Dowling

Abstract: Model-based design of experiments (MBDoE) is a powerful framework for selecting and calibrating science-based mathematical models from data. This work extends popular MBDoE workflows by proposing a convex mixed integer (non)linear programming (MINLP) problem to optimize the selection of measurements. The solver MindtPy is modified to support calculating the D-optimality objective and its gradient via an external package, \texttt{SciPy}, using the grey-box module in Pyomo. The new approach is demonstrated in two case studies: estimating highly correlated kinetics from a batch reactor and estimating transport parameters in a large-scale rotary packed bed for CO$_2$ capture. Both case studies show how examining the Pareto-optimal trade-offs between information content measured by A- and D-optimality versus measurement budget offers practical guidance for selecting measurements for scientific experiments.

Submitted 13 June, 2024; originally announced June 2024.

MSC Class: 90C25; 90C11; 90C30; 90C90; 62K05

arXiv:2406.06980

Sensitivity Analysis for the Test-Negative Design

Authors: Soumyabrata Kundu, Peng Ding, Xinran Li, Jingshu Wang

Abstract: The test-negative design has become popular for evaluating the effectiveness of post-licensure vaccines using observational data. In addition to its logistical convenience on data collection, the design is also believed to control for the differential health-care-seeking behavior between vaccinated and unvaccinated individuals, which is an important while often unmeasured confounder between the vaccination and infection. Hence, the design has been employed routinely to monitor seasonal flu vaccines and more recently to measure the COVID-19 vaccine effectiveness. Despite its popularity, the design has been questioned, in particular about its ability to fully control for the unmeasured confounding. In this paper, we explore deviations from a perfect test-negative design, and propose various sensitivity analysis methods for estimating the effect of vaccination measured by the causal odds ratio on the subpopulation of individuals with good health-care-seeking behavior. We start with point identification of the causal odds ratio under a test-negative design, considering two forms of assumptions on the unmeasured confounder. These assumptions then lead to two approaches for conducting sensitivity analysis, addressing the influence of the unmeasured confounding in different ways. Specifically, one approach investigates partial control for unmeasured confounder in the test-negative design, while the other examines the impact of unmeasured confounder on both vaccination and infection. Furthermore, these approaches can be combined to provide narrower bounds on the true causal odds ratio, and can be further extended to sharpen the bounds by restricting the treatment effect heterogeneity. Finally, we apply the proposed methods to evaluate the effectiveness of COVID-19 vaccines using observational data from test-negative designs.

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.01461

Hardness of Learning Neural Networks under the Manifold Hypothesis

Authors: Bobak T. Kiani, Jason Wang, Melanie Weber

Abstract: The manifold hypothesis presumes that high-dimensional data lies on or near a low-dimensional manifold. While the utility of encoding geometric structure has been demonstrated empirically, rigorous analysis of its impact on the learnability of neural networks is largely missing. Several recent results have established hardness results for learning feedforward and equivariant neural networks under i.i.d. Gaussian or uniform Boolean data distributions. In this paper, we investigate the hardness of learning under the manifold hypothesis. We ask which minimal assumptions on the curvature and regularity of the manifold, if any, render the learning problem efficiently learnable. We prove that learning is hard under input manifolds of bounded curvature by extending proofs of hardness in the SQ and cryptographic settings for Boolean data inputs to the geometric setting. On the other hand, we show that additional assumptions on the volume of the data manifold alleviate these fundamental limitations and guarantee learnability via a simple interpolation argument. Notable instances of this regime are manifolds which can be reliably reconstructed via manifold learning. Looking forward, we comment on and empirically explore intermediate regimes of manifolds, which have heterogeneous features commonly found in real world data.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2405.20763

Improving Generalization and Convergence by Enhancing Implicit Regularization

Authors: Mingze Wang, Haotian He, Jinbo Wang, Zilin Wang, Guanhua Huang, Feiyu Xiong, Zhiyu Li, Weinan E, Lei Wu

Abstract: In this work, we propose an Implicit Regularization Enhancement (IRE) framework to accelerate the discovery of flat solutions in deep learning, thereby improving generalization and convergence. Specifically, IRE decouples the dynamics of flat and sharp directions, which boosts the sharpness reduction along flat directions while maintaining the training stability in sharp directions. We show that IRE can be practically incorporated with {\em generic base optimizers} without introducing significant computational overload. Experiments show that IRE consistently improves the generalization performance for image classification tasks across a variety of benchmark datasets (CIFAR-10/100, ImageNet) and models (ResNets and ViTs). Surprisingly, IRE also achieves a $2\times$ {\em speed-up} compared to AdamW in the pre-training of Llama models (of sizes ranging from 60M to 229M) on datasets including Wikitext-103, Minipile, and Openwebtext. Moreover, we provide theoretical guarantees, showing that IRE can substantially accelerate the convergence towards flat minima in Sharpness-aware Minimization (SAM).

Submitted 31 May, 2024; originally announced May 2024.

Comments: 35 pages

arXiv:2405.16413

Augmented Risk Prediction for the Onset of Alzheimer's Disease from Electronic Health Records with Large Language Models

Authors: Jiankun Wang, Sumyeong Ahn, Taykhoom Dalal, Xiaodan Zhang, Weishen Pan, Qiannan Zhang, Bin Chen, Hiroko H. Dodge, Fei Wang, Jiayu Zhou

Abstract: Alzheimer's disease (AD) is the fifth-leading cause of death among Americans aged 65 and older. Screening and early detection of AD and related dementias (ADRD) are critical for timely intervention and for identifying clinical trial participants. The widespread adoption of electronic health records (EHRs) offers an important resource for developing ADRD screening tools such as machine learning based predictive models. Recent advancements in large language models (LLMs) demonstrate their unprecedented capability of encoding knowledge and performing reasoning, which offers them strong potential for enhancing risk prediction. This paper proposes a novel pipeline that augments risk prediction by leveraging the few-shot inference power of LLMs to make predictions on cases where traditional supervised learning methods (SLs) may not excel. Specifically, we develop a collaborative pipeline that combines SLs and LLMs via a confidence-driven decision-making mechanism, leveraging the strengths of SLs in clear-cut cases and LLMs in more complex scenarios. We evaluate this pipeline using a real-world EHR data warehouse from Oregon Health \& Science University (OHSU) Hospital, encompassing EHRs from over 2.5 million patients and more than 20 million patient encounters. Our results show that our proposed approach effectively combines the power of SLs and LLMs, offering significant improvements in predictive performance. This advancement holds promise for revolutionizing ADRD screening and early detection practices, with potential implications for better strategies of patient management and thus improving healthcare.

Submitted 25 May, 2024; originally announced May 2024.

arXiv:2405.15441

Statistical and Computational Guarantees of Kernel Max-Sliced Wasserstein Distances

Authors: Jie Wang, March Boedihardjo, Yao Xie

Abstract: Optimal transport has been very successful for various machine learning tasks; however, it is known to suffer from the curse of dimensionality. Hence, dimensionality reduction is desirable when applied to high-dimensional data with low-dimensional structures. The kernel max-sliced (KMS) Wasserstein distance is developed for this purpose by finding an optimal nonlinear mapping that reduces data into $1$ dimensions before computing the Wasserstein distance. However, its theoretical properties have not yet been fully developed. In this paper, we provide sharp finite-sample guarantees under milder technical assumptions compared with state-of-the-art for the KMS $p$-Wasserstein distance between two empirical distributions with $n$ samples for general $p\in[1,\infty)$. Algorithm-wise, we show that computing the KMS $2$-Wasserstein distance is NP-hard, and then we further propose a semidefinite relaxation (SDR) formulation (which can be solved efficiently in polynomial time) and provide a relaxation gap for the SDP solution. We provide numerical examples to demonstrate the good performance of our scheme for high-dimensional two-sample testing.

Submitted 29 May, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

Comments: 34 pages, 7 figures, 4 tables
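The "project to one dimension, then compute the Wasserstein distance" step mentioned in the abstract above can be sketched as follows. This is a simplified stand-in, not the paper's method: it replaces the optimal nonlinear (kernel) mapping with a maximum over a small, user-supplied list of linear directions, and the function names are hypothetical. The one-dimensional part uses the standard fact that, for two empirical samples of equal size, the p-Wasserstein distance is obtained by matching sorted values.

```python
def wasserstein_1d(xs, ys, p=1.0):
    """p-Wasserstein distance between two equal-size 1-D empirical samples."""
    if not xs or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and of equal size")
    if p < 1:
        raise ValueError("p must be at least 1")
    a, b = sorted(xs), sorted(ys)
    # In one dimension the optimal coupling pairs the i-th smallest values.
    cost = sum(abs(u - v) ** p for u, v in zip(a, b)) / len(a)
    return cost ** (1.0 / p)


def project(points, direction):
    """Linear projection of each point onto a direction (dot product)."""
    return [sum(c * d for c, d in zip(pt, direction)) for pt in points]


def max_sliced_linear(xs, ys, directions, p=1.0):
    """Largest 1-D Wasserstein distance over a finite set of candidate directions.

    Assumption: a finite candidate list replaces the optimization over
    projections that a real max-sliced distance would perform.
    """
    return max(
        wasserstein_1d(project(xs, d), project(ys, d), p) for d in directions
    )
```

For example, two point clouds that differ only in their second coordinate are indistinguishable along the first axis, and the maximum picks out the second axis as the most discriminating direction among the candidates.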

arXiv:2405.08759

Optimal Sequential Procedure for Early Detection of Multiple Side Effects

Authors: Jiayue Wang, Ben Boukai

Abstract: In this paper, we propose an optimal sequential procedure for the early detection of potential side effects resulting from the administration of some treatment (e.g. a vaccine, say). The results presented here extend previous results obtained in Wang and Boukai (2024) who study the single side effect case to the case of two (or more) side effects. While the sequential procedure we employ, simultaneously monitors several of the treatment's side effects, the $(α, β)$-optimal test we propose does not require any information about the inter-correlation between these potential side effects. However, in all of the subsequent analyses, including the derivations of the exact expressions of the Average Sample Number (ASN), the Power function, and the properties of the post-test (or post-detection) estimators, we accounted specifically, for the correlation between the potential side effects. In the real-life application (such as post-marketing surveillance), the number of available observations is large enough to justify asymptotic analyses of the sequential procedure (testing and post-detection estimation) properties. Accordingly, we also derive the consistency and asymptotic normality of our post-test estimators; results which enable us to also provide (asymptotic, post-detection) confidence intervals for the probabilities of various side-effects. Moreover, to compare two specific side effects, their relative risk plays an important role. We derive the distribution of the estimated relative risk in the asymptotic framework to provide appropriate inference. To illustrate the theoretical results presented, we provide two detailed examples based on the data of side effects on COVID-19 vaccine collected in Nigeria (see Ilori et al. (2022)).

Submitted 14 May, 2024; originally announced May 2024.

Comments: A total of 31 with 6 Tables and 8 Figures

MSC Class: 62L10; 62L12

arXiv:2405.03875

Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits

Authors: Jiachen T. Wang, Tianji Yang, James Zou, Yongchan Kwon, Ruoxi Jia

Abstract: Data Shapley provides a principled approach to data valuation and plays a crucial role in data-centric machine learning (ML) research. Data selection is considered a standard application of Data Shapley. However, its data selection performance has shown to be inconsistent across settings in the literature. This study aims to deepen our understanding of this phenomenon. We introduce a hypothesis testing framework and show that Data Shapley's performance can be no better than random selection without specific constraints on utility functions. We identify a class of utility functions, monotonically transformed modular functions, within which Data Shapley optimally selects data. Based on this insight, we propose a heuristic for predicting Data Shapley's effectiveness in data selection tasks. Our experiments corroborate these findings, adding new insights into when Data Shapley may or may not succeed.

Submitted 6 May, 2024; originally announced May 2024.

Comments: ICML 2024
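As a small illustration of the "monotonically transformed modular" utilities named in the abstract above, the sketch below computes exact Shapley values by enumerating all orderings of a four-point toy dataset. The weights, the square-root transform, and the function names are illustrative assumptions, and the paper's formal definitions may differ. The utility of a subset is a monotone function of the sum of per-point weights, and in this toy case ranking points by Shapley value reproduces the ranking by weight.

```python
from itertools import permutations
from math import sqrt


def shapley_values(n, utility):
    """Exact Shapley values by averaging marginal gains over all n! orderings."""
    totals = [0.0] * n
    count = 0
    for order in permutations(range(n)):
        chosen = frozenset()
        prev = utility(chosen)
        for i in order:
            chosen = chosen | {i}
            cur = utility(chosen)
            totals[i] += cur - prev  # marginal contribution of point i
            prev = cur
        count += 1
    return [t / count for t in totals]


# Hypothetical per-point weights; the modular part of the utility is their sum.
WEIGHTS = [4.0, 1.0, 3.0, 2.0]


def toy_utility(subset):
    """Monotone transform (square root) of a modular function (sum of weights)."""
    return sqrt(sum(WEIGHTS[i] for i in subset))


def rank_by_value(values):
    """Indices sorted from highest to lowest value."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)
```

Full enumeration costs n! utility evaluations, so it is only feasible for a handful of points; it is used here purely so that the values are exact.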

arXiv:2404.15760

Debiasing Machine Unlearning with Counterfactual Examples

Authors: Ziheng Chen, Jia Wang, Jun Zhuang, Abbavaram Gowtham Reddy, Fabrizio Silvestri, Jin Huang, Kaushiki Nag, Kun Kuang, Xin Ning, Gabriele Tolomei

Abstract: The right to be forgotten (RTBF) seeks to safeguard individuals from the enduring effects of their historical actions by implementing machine-learning techniques. These techniques facilitate the deletion of previously acquired knowledge without requiring extensive model retraining. However, they often overlook a critical issue: unlearning processes bias. This bias emerges from two main sources: (1) data-level bias, characterized by uneven data removal, and (2) algorithm-level bias, which leads to the contamination of the remaining dataset, thereby degrading model accuracy. In this work, we analyze the causal factors behind the unlearning process and mitigate biases at both data and algorithmic levels. Typically, we introduce an intervention-based approach, where knowledge to forget is erased with a debiased dataset. Besides, we guide the forgetting procedure by leveraging counterfactual examples, as they maintain semantic data consistency without hurting performance on the remaining dataset. Experimental results demonstrate that our method outperforms existing machine unlearning baselines on evaluation metrics.

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2404.14786

RealTCD: Temporal Causal Discovery from Interventional Data with Large Language Model

Authors: Peiwen Li, Xin Wang, Zeyang Zhang, Yuan Meng, Fang Shen, Yue Li, Jialong Wang, Yang Li, Wenweu Zhu

Abstract: In the field of Artificial Intelligence for Information Technology Operations, causal discovery is pivotal for operation and maintenance of graph construction, facilitating downstream industrial tasks such as root cause analysis. Temporal causal discovery, as an emerging method, aims to identify temporal causal relationships between variables directly from observations by utilizing interventional data. However, existing methods mainly focus on synthetic datasets with heavy reliance on intervention targets and ignore the textual information hidden in real-world systems, failing to conduct causal discovery for real industrial scenarios. To tackle this problem, in this paper we propose to investigate temporal causal discovery in industrial scenarios, which faces two critical challenges: 1) how to discover causal relationships without the interventional targets that are costly to obtain in practice, and 2) how to discover causal relations via leveraging the textual information in systems which can be complex yet abundant in industrial contexts. To address these challenges, we propose the RealTCD framework, which is able to leverage domain knowledge to discover temporal causal relationships without interventional targets. Specifically, we first develop a score-based temporal causal discovery method capable of discovering causal relations for root cause analysis without relying on interventional targets through strategic masking and regularization. Furthermore, by employing Large Language Models (LLMs) to handle texts and integrate domain knowledge, we introduce LLM-guided meta-initialization to extract the meta-knowledge from textual information hidden in systems to boost the quality of discovery. We conduct extensive experiments on simulation and real-world datasets to show the superiority of our proposed RealTCD framework over existing baselines in discovering temporal causal structures.

Submitted 26 May, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

arXiv:2404.13964

An Economic Solution to Copyright Challenges of Generative AI

Authors: Jiachen T. Wang, Zhun Deng, Hiroaki Chiba-Okabe, Boaz Barak, Weijie J. Su

Abstract: Generative artificial intelligence (AI) systems are trained on large data corpora to generate new pieces of text, images, videos, and other media. There is growing concern that such systems may infringe on the copyright interests of training data contributors. To address the copyright challenges of generative AI, we propose a framework that compensates copyright owners proportionally to their contributions to the creation of AI-generated content. The metric for contributions is quantitatively determined by leveraging the probabilistic nature of modern generative AI models and using techniques from cooperative game theory in economics. This framework enables a platform where AI developers benefit from access to high-quality training data, thus improving model performance. Meanwhile, copyright owners receive fair compensation, driving the continued provision of relevant data for generative model training. Experiments demonstrate that our framework successfully identifies the most relevant data sources used in artwork generation, ensuring a fair and interpretable distribution of revenues among copyright owners.

Submitted 24 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.10561

HiGraphDTI: Hierarchical Graph Representation Learning for Drug-Target Interaction Prediction

Authors: Bin Liu, Siqi Wu, Jin Wang, Xin Deng, Ao Zhou

Abstract: The discovery of drug-target interactions (DTIs) plays a crucial role in pharmaceutical development. The deep learning model achieves more accurate results in DTI prediction due to its ability to extract robust and expressive features from drug and target chemical structures. However, existing deep learning methods typically generate drug features via aggregating molecular atom representations, ignoring the chemical properties carried by motifs, i.e., substructures of the molecular graph. The atom-drug double-level molecular representation learning can not fully exploit structure information and fails to interpret the DTI mechanism from the motif perspective. In addition, sequential model-based target feature extraction either fuses limited contextual information or requires expensive computational resources. To tackle the above issues, we propose a hierarchical graph representation learning-based DTI prediction method (HiGraphDTI). Specifically, HiGraphDTI learns hierarchical drug representations from triple-level molecular graphs to thoroughly exploit chemical information embedded in atoms, motifs, and molecules. Then, an attentional feature fusion module incorporates information from different receptive fields to extract expressive target features. Last, the hierarchical attention mechanism identifies crucial molecular segments, which offers complementary views for interpreting interaction mechanisms. The experiment results not only demonstrate the superiority of HiGraphDTI to the state-of-the-art methods, but also confirm the practical ability of our model in interaction interpretation and new DTI discovery.

Submitted 16 April, 2024; originally announced April 2024.

arXiv:2404.10207

HELLINGER-UCB: A novel algorithm for stochastic multi-armed bandit problem and cold start problem in recommender system

Authors: Ruibo Yang, Jiazhou Wang, Andrew Mullhaupt

Abstract: In this paper, we study the stochastic multi-armed bandit problem, where the reward is driven by an unknown random variable. We propose a new variant of the Upper Confidence Bound (UCB) algorithm called Hellinger-UCB, which leverages the squared Hellinger distance to build the upper confidence bound. We prove that the Hellinger-UCB reaches the theoretical lower bound. We also show that the Hellinger-UCB has a solid statistical interpretation. We show that Hellinger-UCB is effective in finite time horizons with numerical experiments between Hellinger-UCB and other variants of the UCB algorithm. As a real-world example, we apply the Hellinger-UCB algorithm to solve the cold-start problem for a content recommender system of a financial app. With reasonable assumption, the Hellinger-UCB algorithm has a convenient but important lower latency feature. The online experiment also illustrates that the Hellinger-UCB outperforms both KL-UCB and UCB1 in the sense of a higher click-through rate (CTR).

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.04992 [pdf, other]

Efficient Surgical Tool Recognition via HMM-Stabilized Deep Learning

Authors: Haifeng Wang, Hao Xu, Jun Wang, Jian Zhou, Ke Deng

Abstract: Recognizing various surgical tools, actions and phases from surgery videos is an important problem in computer vision with exciting clinical applications. Existing deep-learning-based methods for this problem either process each surgical video as a series of independent images without considering their dependence, or rely on complicated deep learning models to account for dependence of video frames. In this study, we revealed from exploratory data analysis that surgical videos enjoy relatively simple semantic structure, where the presence of surgical phases and tools can be well modeled by a compact hidden Markov model (HMM). Based on this observation, we propose an HMM-stabilized deep learning method for tool presence detection. A wide range of experiments confirm that the proposed approaches achieve better performance with lower training and running costs, and support more flexible ways to construct and utilize training data in scenarios where not all surgery videos of interest are extensively labelled. These results suggest that popular deep learning approaches with over-complicated model structures may suffer from inefficient utilization of data, and integrating ingredients of deep learning and statistical learning wisely may lead to more powerful algorithms that enjoy competitive performance, transparent interpretation and convenient model training simultaneously.

Submitted 7 April, 2024; originally announced April 2024.

arXiv:2404.01466 [pdf, other]

TS-CausalNN: Learning Temporal Causal Relations from Non-linear Non-stationary Time Series Data

Authors: Omar Faruque, Sahara Ali, Xue Zheng, Jianwu Wang

Abstract: The growing availability and importance of time series data across various domains, including environmental science, epidemiology, and economics, has led to an increasing need for time-series causal discovery methods that can identify the intricate relationships in the non-stationary, non-linear, and often noisy real world data. However, the majority of current time series causal discovery methods assume stationarity and linear relations in data, making them infeasible for the task. Further, the recent deep learning-based methods rely on the traditional causal structure learning approaches making them computationally expensive. In this paper, we propose a Time-Series Causal Neural Network (TS-CausalNN) - a deep learning technique to discover contemporaneous and lagged causal relations simultaneously. Our proposed architecture comprises (i) convolutional blocks comprising parallel custom causal layers, (ii) acyclicity constraint, and (iii) optimization techniques using the augmented Lagrangian approach. In addition to the simple parallel design, an advantage of the proposed model is that it naturally handles the non-stationarity and non-linearity of the data. Through experiments on multiple synthetic and real world datasets, we demonstrate the empirical proficiency of our proposed approach as compared to several state-of-the-art methods. The inferred graphs for the real world dataset are in good agreement with the domain understanding.

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2403.14822 [pdf, other]

Non-Convex Robust Hypothesis Testing using Sinkhorn Uncertainty Sets

Authors: Jie Wang, Rui Gao, Yao Xie

Abstract: We present a new framework to address the non-convex robust hypothesis testing problem, wherein the goal is to seek the optimal detector that minimizes the maximum of worst-case type-I and type-II risk functions. The distributional uncertainty sets are constructed to center around the empirical distribution derived from samples based on Sinkhorn discrepancy. Given that the objective involves non-convex, non-smooth probabilistic functions that are often intractable to optimize, existing methods resort to approximations rather than exact solutions. To tackle the challenge, we introduce an exact mixed-integer exponential conic reformulation of the problem, which can be solved into a global optimum with a moderate amount of input data. Subsequently, we propose a convex approximation, demonstrating its superiority over current state-of-the-art methodologies in literature. Furthermore, we establish connections between robust hypothesis testing and regularized formulations of non-robust risk functions, offering insightful interpretations. Our numerical study highlights the satisfactory testing performance and computational efficiency of the proposed framework.

Submitted 21 March, 2024; originally announced March 2024.

Comments: 26 pages, 2 figures

arXiv:2402.17366 [pdf]

The risks of risk assessment: causal blind spots when using prediction models for treatment decisions

Authors: Nan van Geloven, Ruth H Keogh, Wouter van Amsterdam, Giovanni Cinà, Jesse H. Krijthe, Niels Peek, Kim Luijken, Sara Magliacane, Paweł Morzywołek, Thijs van Ommen, Hein Putter, Matthew Sperrin, Junfeng Wang, Daniala L. Weir, Vanessa Didelez

Abstract: Prediction models are increasingly proposed for guiding treatment decisions, but most fail to address the special role of treatments, leading to inappropriate use. This paper highlights the limitations of using standard prediction models for treatment decision support. We identify 'causal blind spots' in three common approaches to handling treatments in prediction modelling: including treatment as a predictor, restricting data based on treatment status and ignoring treatments. When predictions are used to inform treatment decisions, confounders, colliders and mediators, as well as changes in treatment protocols over time may lead to misinformed decision-making. We illustrate potential harmful consequences in several medical applications. We advocate for an extension of guidelines for development, reporting and evaluation of prediction models to ensure that the intended use of the model is matched to an appropriate risk estimand. When prediction models are intended to inform treatment decisions, prediction models should specify upfront the treatment decisions they aim to support and target a prediction estimand in line with that goal. This requires a shift towards developing predictions under the specific treatment options under consideration ('predictions under interventions'). Predictions under interventions need causal reasoning and inference techniques during development and validation. We argue that this will improve the efficacy of prediction models in guiding treatment decisions and prevent potential negative effects on patient outcomes.

Submitted 6 May, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

arXiv:2402.11948 [pdf]

Mini-Hes: A Parallelizable Second-order Latent Factor Analysis Model

Authors: Jialiang Wang, Weiling Li, Yurong Zhong, Xin Luo

Abstract: Interactions among a large number of entities are naturally high-dimensional and incomplete (HDI) in many big data related tasks. Behavioral characteristics of users are hidden in these interactions; hence, effective representation of the HDI data is a fundamental task for understanding user behaviors. The latent factor analysis (LFA) model has proven to be effective in representing HDI data. The performance of an LFA model relies heavily on its training process, which is a non-convex optimization. It has been proven that incorporating local curvature and preprocessing gradients during its training process can lead to superior performance compared to LFA models built with first-order family methods. However, with the escalation of data volume, the feasibility of second-order algorithms encounters challenges. To address this pivotal issue, this paper proposes a mini-block diagonal Hessian-free (Mini-Hes) optimization for building an LFA model. It leverages the dominant diagonal blocks in the generalized Gauss-Newton matrix based on the analysis of the Hessian matrix of the LFA model and serves as an intermediary strategy bridging the gap between first-order and second-order optimization methods. Experiment results indicate that, with Mini-Hes, the LFA model outperforms several state-of-the-art models in addressing the missing data estimation task on multiple real HDI datasets from recommender systems. (The source code of Mini-Hes is available at https://github.com/Goallow/Mini-Hes)

Submitted 19 February, 2024; originally announced February 2024.

Comments: 6 pages

arXiv:2402.02368 [pdf, other]

Timer: Generative Pre-trained Transformers Are Large Time Series Models

Authors: Yong Liu, Haoran Zhang, Chenyu Li, Xiangdong Huang, Jianmin Wang, Mingsheng Long

Abstract: Deep learning has contributed remarkably to the advancement of time series analysis. Still, deep models can encounter performance bottlenecks in real-world data-scarce scenarios, which can be concealed due to the performance saturation with small models on current benchmarks. Meanwhile, large models have demonstrated great powers in these scenarios through large-scale pre-training. Continuous progress has been achieved with the emergence of large language models, exhibiting unprecedented abilities such as few-shot generalization, scalability, and task generality, which are however absent in small deep models. To change the status quo of training scenario-specific small models from scratch, this paper aims at the early development of large time series models (LTSM). During pre-training, we curate large-scale datasets with up to 1 billion time points, unify heterogeneous time series into single-series sequence (S3) format, and develop the GPT-style architecture toward LTSMs. To meet diverse application needs, we convert forecasting, imputation, and anomaly detection of time series into a unified generative task. The outcome of this study is a Time Series Transformer (Timer), which is generative pre-trained by next token prediction and adapted to various downstream tasks with promising capabilities as an LTSM. Code and datasets are available at: https://github.com/thuml/Large-Time-Series-Model.

Submitted 4 June, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

arXiv:2401.13760 [pdf, other]

Early Detection of Treatments Side Effect: A Sequential Approach

Authors: Jiayue Wang, Ben Boukai

Abstract: With the emergence and spread of infectious diseases with pandemic potential, such as COVID-19, the urgency for vaccine development has led to unprecedented compressed and accelerated schedules that shortened the standard development timeline. In a relatively short time, the leading pharmaceutical companies received an Emergency Use Authorization (EUA) for the vaccine's en-masse deployment. To monitor the potential side effect(s) of the vaccine during the (initial) vaccination campaign, we developed an optimal sequential test that allows for the early detection of potential side effect(s). This test employs a rule to stop the vaccination process once the observed number of side effect incidents exceeds a certain (pre-determined) threshold. The optimality of the proposed sequential test is justified when compared with the (α, β) optimality of the non-randomized fixed-sample Uniformly Most Powerful (UMP) test. In the case of a single side effect, we study the properties of the sequential test and derive the exact expressions of the Average Sample Number (ASN) curve of the stopping time (and its variance) via the regularized incomplete beta function. Additionally, we derive the asymptotic distribution of the relative savings in ASN as compared to maximal sample size. Moreover, we construct the post-test parameter estimate and study its sampling properties, including its asymptotic behavior under local-type alternatives. These limiting behavior results are the consistency and asymptotic normality of the post-test parameter estimator. We conclude the paper with a small simulation study illustrating the asymptotic performance of the point and interval estimation and provide a detailed example, based on COVID-19 side effect data (see Beatty et al. (2021)), of our suggested testing procedure.

Submitted 24 January, 2024; originally announced January 2024.

Comments: There are 21 pages, 8 pictures and 4 tables

MSC Class: 62L10; 62L12

arXiv:2401.13335 [pdf, other]

Full Bayesian Significance Testing for Neural Networks

Authors: Zehua Liu, Zimeng Li, Jingyuan Wang, Yue He

Abstract: Significance testing aims to determine whether a proposition about the population distribution is the truth or not given observations. However, traditional significance testing often needs to derive the distribution of the testing statistic, failing to deal with complex nonlinear relationships. In this paper, we propose to conduct Full Bayesian Significance Testing for neural networks, called nFBST, to overcome the limitation in relationship characterization of traditional approaches. A Bayesian neural network is utilized to fit the nonlinear and multi-dimensional relationships with small errors and avoid hard theoretical derivation by computing the evidence value. Besides, nFBST can test not only global significance but also local and instance-wise significance, which previous testing methods don't focus on. Moreover, nFBST is a general framework that can be extended based on the measures selected, such as Grad-nFBST, LRP-nFBST, DeepLIFT-nFBST, LIME-nFBST. A range of experiments on both simulated and real data are conducted to show the advantages of our method.

Submitted 24 January, 2024; originally announced January 2024.

Comments: Published as a conference paper at AAAI 2024

arXiv:2401.11103 [pdf, other]

Efficient Data Shapley for Weighted Nearest Neighbor Algorithms

Authors: Jiachen T. Wang, Prateek Mittal, Ruoxi Jia

Abstract: This work aims to address an open problem in data valuation literature concerning the efficient computation of Data Shapley for weighted $K$ nearest neighbor algorithm (WKNN-Shapley). By considering the accuracy of hard-label KNN with discretized weights as the utility function, we reframe the computation of WKNN-Shapley into a counting problem and introduce a quadratic-time algorithm, presenting a notable improvement from $O(N^K)$, the best result from existing literature. We develop a deterministic approximation algorithm that further improves computational efficiency while maintaining the key fairness properties of the Shapley value. Through extensive experiments, we demonstrate WKNN-Shapley's computational efficiency and its superior performance in discerning data quality compared to its unweighted counterpart.

Submitted 19 January, 2024; originally announced January 2024.

Comments: AISTATS 2024 Oral

arXiv:2401.09125 [pdf, other]

Understanding Heterophily for Graph Neural Networks

Authors: Junfu Wang, Yuanfang Guo, Liang Yang, Yunhong Wang

Abstract: Graphs with heterophily have been regarded as challenging scenarios for Graph Neural Networks (GNNs), where nodes are connected with dissimilar neighbors through various patterns. In this paper, we present theoretical understandings of the impacts of different heterophily patterns for GNNs by incorporating the graph convolution (GC) operations into fully connected networks via the proposed Heterophilous Stochastic Block Models (HSBM), a general random graph model that can accommodate diverse heterophily patterns. Firstly, we show that by applying a GC operation, the separability gains are determined by two factors, i.e., the Euclidean distance of the neighborhood distributions and $\sqrt{\mathbb{E}\left[\operatorname{deg}\right]}$, where $\mathbb{E}\left[\operatorname{deg}\right]$ is the averaged node degree. It reveals that the impact of heterophily on classification needs to be evaluated alongside the averaged node degree. Secondly, we show that the topological noise has a detrimental impact on separability, which is equivalent to degrading $\mathbb{E}\left[\operatorname{deg}\right]$. Finally, when applying multiple GC operations, we show that the separability gains are determined by the normalized distance of the $l$-powered neighborhood distributions. It indicates that the nodes still possess separability as $l$ goes to infinity in a wide range of regimes. Extensive experiments on both synthetic and real-world data verify the effectiveness of our theory.

Submitted 4 June, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

Comments: ICML 2024

arXiv:2401.04693 [pdf, other]

Co-Clustering Multi-View Data Using the Latent Block Model

Authors: Joshua Tobin, Michaela Black, James Ng, Debbie Rankin, Jonathan Wallace, Catherine Hughes, Leane Hoey, Adrian Moore, Jinling Wang, Geraldine Horigan, Paul Carlin, Helene McNulty, Anne M Molloy, Mimi Zhang

Abstract: The Latent Block Model (LBM) is a prominent model-based co-clustering method, returning parametric representations of each block cluster and allowing the use of well-grounded model selection methods. The LBM, while adapted in literature to handle different feature types, cannot be applied to datasets consisting of multiple disjoint sets of features, termed views, for a common set of observations. In this work, we introduce the multi-view LBM, extending the LBM method to multi-view data, where each view marginally follows an LBM. In the case of two views, the dependence between them is captured by a cluster membership matrix, and we aim to learn the structure of this matrix. We develop a likelihood-based approach in which parameter estimation uses a stochastic EM algorithm integrating a Gibbs sampler, and an ICL criterion is derived to determine the number of row and column clusters in each view. To motivate the application of multi-view methods, we extend recent work developing hypothesis tests for the null hypothesis that clusters of observations in each view are independent of each other. The testing procedure is integrated into the model estimation strategy. Furthermore, we introduce a penalty scheme to generate sparse row clusterings. We verify the performance of the developed algorithm using synthetic datasets, and provide guidance for optimal parameter selection. Finally, the multi-view co-clustering method is applied to a complex genomics dataset, and is shown to provide new insights for high-dimension multi-view problems.

Submitted 9 January, 2024; originally announced January 2024.

arXiv:2312.17122 [pdf, other]

Large Language Model for Causal Decision Making

Authors: Haitao Jiang, Lin Ge, Yuhe Gao, Jianian Wang, Rui Song

Abstract: Large Language Models (LLMs) have shown their success in language understanding and reasoning on general topics. However, their capability to perform inference based on user-specified structured data and knowledge in corpus-rare concepts, such as causal decision-making is still limited. In this work, we explore the possibility of fine-tuning an open-sourced LLM into LLM4Causal, which can identify the causal task, execute a corresponding function, and interpret its numerical results based on users' queries and the provided dataset. Meanwhile, we propose a data generation process for more controllable GPT prompting and present two instruction-tuning datasets: (1) Causal-Retrieval-Bench for causal problem identification and input parameter extraction for causal function calling and (2) Causal-Interpret-Bench for in-context causal interpretation. By conducting end-to-end evaluations and two ablation studies, we showed that LLM4Causal can deliver end-to-end solutions for causal problems and provide easy-to-understand answers, which significantly outperforms the baselines.

Submitted 11 April, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

arXiv:2312.10563 [pdf, other]

Mediation Analysis with Mendelian Randomization and Efficient Multiple GWAS Integration

Authors: Rita Qiuran Lyu, Chong Wu, Xinwei Ma, Jingshen Wang

Abstract: Mediation analysis is a powerful tool for studying causal pathways between exposure, mediator, and outcome variables of interest. While classical mediation analysis using observational data often requires strong and sometimes unrealistic assumptions, such as unconfoundedness, Mendelian Randomization (MR) avoids unmeasured confounding bias by employing genetic variations as instrumental variables. We develop a novel MR framework for mediation analysis with genome-wide association study (GWAS) summary data, and provide solid statistical guarantees. Our framework employs carefully crafted estimating equations, allowing for different sets of genetic variations to instrument the exposure and the mediator, to efficiently integrate information stored in three independent GWAS. As part of this endeavor, we demonstrate that in mediation analysis, the challenge raised by instrument selection goes beyond the well-known winner's curse issue, and therefore, addressing it requires special treatment. We then develop bias correction techniques to address the instrument selection issue and commonly encountered measurement error bias issue. Collectively, through our theoretical investigations, we show that our framework provides valid statistical inference for both direct and mediation effects with enhanced statistical efficiency compared to existing methods. We further illustrate the finite-sample performance of our approach through simulation experiments and a case study.

Submitted 17 May, 2024; v1 submitted 16 December, 2023; originally announced December 2023.

arXiv:2312.06883 [pdf, other]

Adaptive Experiments Toward Learning Treatment Effect Heterogeneity

Authors: Waverly Wei, Xinwei Ma, Jingshen Wang

Abstract: Understanding treatment effect heterogeneity has become an increasingly popular task in various fields, as it helps design personalized advertisements in e-commerce or targeted treatment in biomedical studies. However, most of the existing work in this research area focused on either analyzing observational data based on strong causal assumptions or conducting post hoc analyses of randomized controlled trial data, and there has been limited effort dedicated to the design of randomized experiments specifically for uncovering treatment effect heterogeneity. In the manuscript, we develop a framework for designing and analyzing response adaptive experiments toward better learning treatment effect heterogeneity. Concretely, we provide response adaptive experimental design frameworks that sequentially revise the data collection mechanism according to the accrued evidence during the experiment. Such design strategies allow for the identification of subgroups with the largest treatment effects with enhanced statistical efficiency. The proposed frameworks not only unify adaptive enrichment designs and response-adaptive randomization designs but also complement A/B test designs in e-commerce and randomized trial designs in clinical settings. We demonstrate the merit of our design with theoretical justifications and in simulation studies with synthetic e-commerce and clinical trial data.

Submitted 13 December, 2023; v1 submitted 11 December, 2023; originally announced December 2023.

arXiv:2312.05771 [pdf, other]

Hacking Task Confounder in Meta-Learning

Authors: Jingyao Wang, Yi Ren, Zeen Song, Jianqi Zhang, Changwen Zheng, Wenwen Qiang

Abstract: Meta-learning enables rapid generalization to new tasks by learning knowledge from various tasks. It is intuitively assumed that as the training progresses, a model will acquire richer knowledge, leading to better generalization performance. However, our experiments reveal an unexpected result: there is negative knowledge transfer between tasks, affecting generalization performance. To explain this phenomenon, we conduct Structural Causal Models (SCMs) for causal analysis. Our investigation uncovers the presence of spurious correlations between task-specific causal factors and labels in meta-learning. Furthermore, the confounding factors differ across different batches. We refer to these confounding factors as "Task Confounders". Based on these findings, we propose a plug-and-play Meta-learning Causal Representation Learner (MetaCRL) to eliminate task confounders. It encodes decoupled generating factors from multiple tasks and utilizes an invariant-based bi-level optimization mechanism to ensure their causality for meta-learning. Extensive experiments on various benchmark datasets demonstrate that our work achieves state-of-the-art (SOTA) performance.

Submitted 29 May, 2024; v1 submitted 10 December, 2023; originally announced December 2023.

Comments: Accepted by IJCAI 2024, 9 pages, 5 figures, 4 tables

arXiv:2312.05549 [pdf, other]

Multi-granularity Causal Structure Learning

Authors: Jiaxuan Liang, Jun Wang, Guoxian Yu, Shuyin Xia, Guoyin Wang

Abstract: Unveiling, modeling, and comprehending the causal mechanisms underpinning natural phenomena stand as fundamental endeavors across myriad scientific disciplines. Meanwhile, new knowledge emerges when discovering causal relationships from data. Existing causal learning algorithms predominantly focus on the isolated effects of variables and overlook the intricate interplay of multiple variables and their collective behavioral patterns. Furthermore, the ubiquity of high-dimensional data exacts a substantial temporal cost for causal algorithms. In this paper, we develop a novel method called MgCSL (Multi-granularity Causal Structure Learning), which first leverages a sparse auto-encoder to explore coarse-graining strategies and causal abstractions from micro-variables to macro-ones. MgCSL then takes multi-granularity variables as inputs to train multilayer perceptrons and to delve into the causality between variables. To enhance the efficacy on high-dimensional data, MgCSL introduces a simplified acyclicity constraint to adeptly search the directed acyclic graph among variables. Experimental results show that MgCSL outperforms competitive baselines, and finds explainable causal connections on fMRI datasets.

Submitted 12 December, 2023; v1 submitted 9 December, 2023; originally announced December 2023.

Comments: Accepted by the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI2024)

arXiv:2312.03438 [pdf, ps, other]

On the Estimation Performance of Generalized Power Method for Heteroscedastic Probabilistic PCA

Authors: Jinxin Wang, Chonghe Jiang, Huikang Liu, Anthony Man-Cho So

Abstract: The heteroscedastic probabilistic principal component analysis (PCA) technique, a variant of the classic PCA that considers data heterogeneity, is receiving more and more attention in the data science and signal processing communities. In this paper, to estimate the underlying low-dimensional linear subspace (simply called the ground truth) from available heterogeneous data samples, we consider the associated non-convex maximum-likelihood estimation problem, which involves maximizing a sum of heterogeneous quadratic forms over an orthogonality constraint (HQPOC). We propose a first-order method, the generalized power method (GPM), to tackle the problem and establish its estimation performance guarantee. Specifically, we show that, given a suitable initialization, the distances between the iterates generated by GPM and the ground truth decrease at least geometrically to some threshold associated with the residual part of certain "population-residual decomposition". In establishing the estimation performance result, we prove a novel local error bound property of another closely related optimization problem, namely quadratic optimization with orthogonality constraint (QPOC), which is new and can be of independent interest. Numerical experiments are conducted to demonstrate the superior performance of GPM in both Gaussian noise and sub-Gaussian noise settings.

Submitted 6 December, 2023; originally announced December 2023.

Comments: 22 pages

arXiv:2311.17547 [pdf, other]

Risk-based decision making: estimands for sequential prediction under interventions

Authors: Kim Luijken, Paweł Morzywołek, Wouter van Amsterdam, Giovanni Cinà, Jeroen Hoogland, Ruth Keogh, Jesse Krijthe, Sara Magliacane, Thijs van Ommen, Niels Peek, Hein Putter, Maarten van Smeden, Matthew Sperrin, Junfeng Wang, Daniala Weir, Vanessa Didelez, Nan van Geloven

Abstract: Prediction models are used amongst others to inform medical decisions on interventions. Typically, individuals with high risks of adverse outcomes are advised to undergo an intervention while those at low risk are advised to refrain from it. Standard prediction models do not always provide risks that are relevant to inform such decisions: e.g., an individual may be estimated to be at low risk because similar individuals in the past received an intervention which lowered their risk. Therefore, prediction models supporting decisions should target risks belonging to defined intervention strategies. Previous works on prediction under interventions assumed that the prediction model was used only at one time point to make an intervention decision. In clinical practice, intervention decisions are rarely made only once: they might be repeated, deferred and re-evaluated. This requires estimated risks under interventions that can be reconsidered at several potential decision moments. In the current work, we highlight key considerations for formulating estimands in sequential prediction under interventions that can inform such intervention decisions. We illustrate these considerations by giving examples of estimands for a case study about choosing between vaginal delivery and cesarean section for women giving birth. Our formalization of prediction tasks in a sequential, causal, and estimand context provides guidance for future studies to ensure that the right question is answered and appropriate causal estimation approaches are chosen to develop sequential prediction models that can inform intervention decisions.

Submitted 29 November, 2023; originally announced November 2023.

Comments: 32 pages, 2 figures

arXiv:2311.16856 [pdf, other]

Attentional Graph Neural Networks for Robust Massive Network Localization

Authors: Wenzhong Yan, Juntao Wang, Feng Yin, Yang Tian, Abdelhak M. Zoubir

Abstract: In recent years, graph neural networks (GNNs) have emerged as a prominent tool for classification tasks in machine learning. However, their application in regression tasks remains underexplored. To tap the potential of GNNs in regression, this paper integrates GNNs with attention mechanism, a technique that revolutionized sequential learning tasks with its adaptability and robustness, to tackle a challenging nonlinear regression problem: network localization. We first introduce a novel network localization method based on graph convolutional network (GCN), which exhibits exceptional precision even under severe non-line-of-sight (NLOS) conditions, thereby diminishing the need for laborious offline calibration or NLOS identification. We further propose an attentional graph neural network (AGNN) model, aimed at improving the limited flexibility and mitigating the high sensitivity to the hyperparameter of the GCN-based method. The AGNN comprises two crucial modules, each designed with distinct attention architectures to address specific issues associated with the GCN-based method, rendering it more practical in real-world scenarios. Experimental results substantiate the efficacy of our proposed GCN-based method and AGNN model, as well as the enhancements of AGNN model. Additionally, we delve into the performance improvements of AGNN model by analyzing it from the perspectives of dynamic attention and computational complexity.

Submitted 14 February, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

arXiv:2311.13825 [pdf, other]

Online Prediction of Extreme Conditional Quantiles via B-Spline Interpolation

Authors: Zhengpin Li, Jian Wang, Yanxi Hou

Abstract: Extreme quantiles are critical for understanding the behavior of data in the tail region of a distribution. It is challenging to estimate extreme quantiles, particularly when dealing with limited data in the tail. In such cases, extreme value theory offers a solution by approximating the tail distribution using the Generalized Pareto Distribution (GPD). This allows for the extrapolation beyond the range of observed data, making it a valuable tool for various applications. However, when it comes to conditional cases, where estimation relies on covariates, existing methods may require computationally expensive GPD fitting for different observations. This computational burden becomes even more problematic as the volume of observations increases, sometimes approaching infinity. To address this issue, we propose an interpolation-based algorithm named EMI. EMI facilitates the online prediction of extreme conditional quantiles with finite offline observations. Combining quantile regression and GPD-based extrapolation, EMI formulates as a bilevel programming problem, efficiently solvable using classic optimization methods. Once estimates for offline observations are obtained, EMI employs B-spline interpolation for covariate-dependent variables, enabling estimation for online observations with finite GPD fitting. Simulations and real data analysis demonstrate the effectiveness of EMI across various scenarios.

Submitted 23 November, 2023; originally announced November 2023.

Comments: 22 pages, 16 figures

arXiv:2311.13196 [pdf, other]

Optimal Time of Arrival Estimation for MIMO Backscatter Channels

Authors: Chen He, Luyang Han, Z. Jane Wang

Abstract: In this paper, we propose a novel time of arrival (TOA) estimator for multiple-input-multiple-output (MIMO) backscatter channels in closed form. The proposed estimator refines the estimation precision from the topological structure of the MIMO backscatter channels, and can considerably enhance the estimation accuracy. Particularly, we show that for the general $M \times N$ bistatic topology, the mean square error (MSE) is $\frac{M+N-1}{MN}\sigma_0^2$, and for the general $M \times M$ monostatic topology, it is $\frac{2M-1}{M^2}\sigma_0^2$ for the diagonal subchannels, and $\frac{M-1}{M^2}\sigma_0^2$ for the off-diagonal subchannels, where $\sigma_0^2$ is the MSE of the conventional least square estimator. In addition, we derive the Cramer-Rao lower bound (CRLB) for MIMO backscatter TOA estimation which indicates that the proposed estimator is optimal. Simulation results verify that the proposed TOA estimator can considerably improve both estimation and positioning accuracy, especially when the MIMO scale is large.

Submitted 22 November, 2023; originally announced November 2023.
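
Note: the closed-form MSE expressions quoted in the abstract above can be evaluated directly to see how the gain over the least square baseline grows with array size. The following is only an arithmetic sketch of those three ratios (MSE divided by the baseline MSE); the function names are hypothetical and nothing here reproduces the estimator itself.

```python
from fractions import Fraction


def bistatic_mse_factor(m, n):
    """Ratio MSE / sigma_0^2 for an M x N bistatic topology: (M + N - 1) / (M N)."""
    return Fraction(m + n - 1, m * n)


def monostatic_mse_factors(m):
    """Ratios MSE / sigma_0^2 for an M x M monostatic topology.

    Returns (diagonal, off_diagonal) = ((2M - 1) / M^2, (M - 1) / M^2),
    as stated in the abstract.
    """
    return Fraction(2 * m - 1, m * m), Fraction(m - 1, m * m)


if __name__ == "__main__":
    for size in (2, 4, 8):
        diag, off_diag = monostatic_mse_factors(size)
        print(size, bistatic_mse_factor(size, size), diag, off_diag)
```

Under these formulas every ratio shrinks as the array grows, which matches the abstract's remark that the improvement is largest at large MIMO scale.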

arXiv:2311.12379 [pdf, other]

Infinite forecast combinations based on Dirichlet process

Authors: Yinuo Ren, Feng Li, Yanfei Kang, Jue Wang

Abstract: Forecast combination integrates information from various sources by consolidating multiple forecast results from the target time series. Instead of the need to select a single optimal forecasting model, this paper introduces a deep learning ensemble forecasting model based on the Dirichlet process. Initially, the learning rate is sampled with three basis distributions as hyperparameters to convert the infinite mixture into a finite one. All checkpoints are collected to establish a deep learning sub-model pool, and weight adjustment and diversity strategies are developed during the combination process. The main advantage of this method is its ability to generate the required base learners through a single training process, utilizing the decaying strategy to tackle the challenge posed by the stochastic nature of gradient descent in determining the optimal learning rate. To ensure the method's generalizability and competitiveness, this paper conducts an empirical analysis using the weekly dataset from the M4 competition and explores sensitivity to the number of models to be combined. The results demonstrate that the ensemble model proposed offers substantial improvements in prediction accuracy and stability compared to a single benchmark model.

Submitted 24 November, 2023; v1 submitted 21 November, 2023; originally announced November 2023.

arXiv:2311.02532 [pdf, other]

Optimal Treatment Allocation for Efficient Policy Evaluation in Sequential Decision Making

Authors: Ting Li, Chengchun Shi, Jianing Wang, Fan Zhou, Hongtu Zhu

Abstract: A/B testing is critical for modern technological companies to evaluate the effectiveness of newly developed products against standard baselines. This paper studies optimal designs that aim to maximize the amount of information obtained from online experiments to estimate treatment effects accurately. We propose three optimal allocation strategies in a dynamic setting where treatments are sequentially assigned over time. These strategies are designed to minimize the variance of the treatment effect estimator when data follow a non-Markov decision process or a (time-varying) Markov decision process. We further develop estimation procedures based on existing off-policy evaluation (OPE) methods and conduct extensive experiments in various environments to demonstrate the effectiveness of the proposed methodologies. In theory, we prove the optimality of the proposed treatment allocation design and establish upper bounds for the mean squared errors of the resulting treatment effect estimators.

Submitted 4 November, 2023; originally announced November 2023.

arXiv:2310.20460 [pdf, other]

Aggregating Dependent Signals with Heavy-Tailed Combination Tests

Authors: Lin Gui, Yuchao Jiang, Jingshu Wang

Abstract: Combining dependent p-values to evaluate the global null hypothesis presents a longstanding challenge in statistical inference, particularly when aggregating results from diverse methods to boost signal detection. P-value combination tests using heavy-tailed distribution based transformations, such as the Cauchy combination test and the harmonic mean p-value, have recently garnered significant interest for their potential to efficiently handle arbitrary p-value dependencies. Despite their growing popularity in practical applications, there is a gap in comprehensive theoretical and empirical evaluations of these methods. This paper conducts an extensive investigation, revealing that, theoretically, while these combination tests are asymptotically valid for pairwise quasi-asymptotically independent test statistics, such as bivariate normal variables, they are also asymptotically equivalent to the Bonferroni test under the same conditions. However, extensive simulations unveil their practical utility, especially in scenarios where stringent type-I error control is not necessary and signals are dense. Both the heaviness of the distribution and its support substantially impact the tests' non-asymptotic validity and power, and we recommend using a truncated Cauchy distribution in practice. Moreover, we show that under the violation of quasi-asymptotic independence among test statistics, these tests remain valid and, in fact, can be considerably less conservative than the Bonferroni test. We also present two case studies in genetics and genomics, showcasing the potential of the combination tests to significantly enhance statistical power while effectively controlling type-I errors.

Submitted 31 October, 2023; originally announced October 2023.
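
Note: for readers unfamiliar with the tests compared in the abstract above, the following is a minimal sketch of the standard (untruncated) Cauchy combination test next to the Bonferroni test. It assumes the usual textbook definitions and equal weights by default; the function names are hypothetical, and it does not implement the truncated Cauchy variant that the authors recommend.

```python
import math


def cauchy_combination(pvalues, weights=None):
    """Standard Cauchy combination test.

    Each p-value is mapped to a Cauchy quantile tan((0.5 - p) * pi), the
    quantiles are averaged with non-negative weights summing to one, and the
    average is mapped back through the standard Cauchy survival function.
    """
    n = len(pvalues)
    if weights is None:
        weights = [1.0 / n] * n  # equal weights (an assumption of this sketch)
    statistic = sum(w * math.tan((0.5 - p) * math.pi)
                    for w, p in zip(weights, pvalues))
    return 0.5 - math.atan(statistic) / math.pi


def bonferroni(pvalues):
    """Bonferroni global p-value: n times the smallest p-value, capped at 1."""
    return min(1.0, len(pvalues) * min(pvalues))


if __name__ == "__main__":
    dense = [0.02, 0.03, 0.04, 0.05]  # several moderately small p-values
    print(cauchy_combination(dense), bonferroni(dense))
```

With several moderately small p-values, as in the `dense` list, the Cauchy combination is less conservative than Bonferroni, which illustrates the dense-signal setting the abstract mentions.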

arXiv:2310.16290 [pdf, other]

Fair Adaptive Experiments

Authors: Waverly Wei, Xinwei Ma, Jingshen Wang

Abstract: Randomized experiments have been the gold standard for assessing the effectiveness of a treatment or policy. The classical complete randomization approach assigns treatments based on a prespecified probability and may lead to inefficient use of data. Adaptive experiments improve upon complete randomization by sequentially learning and updating treatment assignment probabilities. However, their application can also raise fairness and equity concerns, as assignment probabilities may vary drastically across groups of participants. Furthermore, when treatment is expected to be extremely beneficial to certain groups of participants, it is more appropriate to expose many of these participants to favorable treatment. In response to these challenges, we propose a fair adaptive experiment strategy that simultaneously enhances data use efficiency, achieves an envy-free treatment assignment guarantee, and improves the overall welfare of participants. An important feature of our proposed strategy is that we do not impose parametric modeling assumptions on the outcome variables, making it more versatile and applicable to a wider array of applications. Through our theoretical investigation, we characterize the convergence rate of the estimated treatment effects and the associated standard deviations at the group level and further prove that our adaptive treatment assignment algorithm, despite not having a closed-form expression, approaches the optimal allocation rule asymptotically. Our proof strategy takes into account the fact that the allocation decisions in our design depend on sequentially accumulated data, which poses a significant challenge in characterizing the properties and conducting statistical inference of our method. We further provide simulation evidence to showcase the performance of our fair adaptive experiment strategy.

Submitted 24 October, 2023; originally announced October 2023.

arXiv:2310.16203 [pdf, other]

Multivariate Dynamic Mediation Analysis under a Reinforcement Learning Framework

Authors: Lan Luo, Chengchun Shi, Jitao Wang, Zhenke Wu, Lexin Li

Abstract: Mediation analysis is an important analytic tool commonly used in a broad range of scientific applications. In this article, we study the problem of mediation analysis when there are multivariate and conditionally dependent mediators, and when the variables are observed over multiple time points. The problem is challenging, because the effect of a mediator involves not only the path from the treatment to this mediator itself at the current time point, but also all possible paths pointed to this mediator from its upstream mediators, as well as the carryover effects from all previous time points. We propose a novel multivariate dynamic mediation analysis approach. Drawing inspiration from the Markov decision process model that is frequently employed in reinforcement learning, we introduce a Markov mediation process paired with a system of time-varying linear structural equation models to formulate the problem. We then formally define the individual mediation effect, built upon the idea of simultaneous interventions and intervention calculus. We next derive the closed-form expression and propose an iterative estimation procedure under the Markov mediation process model. We study both the asymptotic property and the empirical performance of the proposed estimator, and further illustrate our method with a mobile health application.

Submitted 24 October, 2023; originally announced October 2023.

arXiv:2310.10239 [pdf, other]

Structural transfer learning of non-Gaussian DAG

Authors: Mingyang Ren, Xin He, Junhui Wang

Abstract: Directed acyclic graph (DAG) has been widely employed to represent directional relationships among a set of collected nodes. Yet, the available data in one single study is often limited for accurate DAG reconstruction, whereas heterogeneous data may be collected from multiple relevant studies. It remains an open question how to pool the heterogeneous data together for better DAG structure reconstruction in the target study. In this paper, we first introduce a novel set of structural similarity measures for DAG and then present a transfer DAG learning framework by effectively leveraging information from auxiliary DAGs of different levels of similarities. Our theoretical analysis shows substantial improvement in terms of DAG reconstruction in the target study, even when no auxiliary DAG is overall similar to the target DAG, which is in sharp contrast to most existing transfer learning methods. The advantage of the proposed transfer DAG learning is also supported by extensive numerical experiments on both synthetic data and multi-site brain functional connectivity network data.

Submitted 16 October, 2023; originally announced October 2023.

Comments: 35 pages, 3 figures, 3 tables

arXiv:2310.09583 [pdf, other]

Two Sides of The Same Coin: Bridging Deep Equilibrium Models and Neural ODEs via Homotopy Continuation

Authors: Shutong Ding, Tianyu Cui, Jingya Wang, Ye Shi

Abstract: Deep Equilibrium Models (DEQs) and Neural Ordinary Differential Equations (Neural ODEs) are two branches of implicit models that have achieved remarkable success owing to their superior performance and low memory consumption. While both are implicit models, DEQs and Neural ODEs are derived from different mathematical formulations. Inspired by homotopy continuation, we establish a connection between these two models and illustrate that they are actually two sides of the same coin. Homotopy continuation is a classical method of solving nonlinear equations based on a corresponding ODE. Given this connection, we proposed a new implicit model called HomoODE that inherits the property of high accuracy from DEQs and the property of stability from Neural ODEs. Unlike DEQs, which explicitly solve an equilibrium-point-finding problem via Newton's methods in the forward pass, HomoODE solves the equilibrium-point-finding problem implicitly using a modified Neural ODE via homotopy continuation. Further, we developed an acceleration method for HomoODE with a shared learnable initial point. It is worth noting that our model also provides a better understanding of why Augmented Neural ODEs work as long as the augmented part is regarded as the equilibrium point to find. Comprehensive experiments with several image classification tasks demonstrate that HomoODE surpasses existing implicit models in terms of both accuracy and memory consumption.

Submitted 21 December, 2023; v1 submitted 14 October, 2023; originally announced October 2023.

Comments: Accepted by NeurIPS2023

arXiv:2310.08268 [pdf, other]

Change point detection in dynamic heterogeneous networks via subspace tracking

Authors: Yuzhao Zhang, Jingnan Zhang, Yifan Sun, Junhui Wang

Abstract: Dynamic networks consist of a sequence of time-varying networks, and it is of great importance to detect the network change points. Most existing methods focus on detecting abrupt change points, necessitating the assumption that the underlying network probability matrix remains constant between adjacent change points. This paper introduces a new model that allows the network probability matrix to undergo continuous shifting, while the latent network structure, represented via the embedding subspace, only changes at certain time points. Two novel statistics are proposed to jointly detect these network subspace change points, followed by a carefully refined detection procedure. Theoretically, we show that the proposed method is asymptotically consistent in terms of change point detection, and also establish the impossibility region for detecting these network subspace change points. The advantage of the proposed method is also supported by extensive numerical experiments on both synthetic networks and a UK politician social network.

Submitted 12 October, 2023; originally announced October 2023.

arXiv:2310.00646 [pdf, other]

WASA: WAtermark-based Source Attribution for Large Language Model-Generated Data

Authors: Jingtan Wang, Xinyang Lu, Zitong Zhao, Zhongxiang Dai, Chuan-Sheng Foo, See-Kiong Ng, Bryan Kian Hsiang Low

Abstract: The impressive performances of large language models (LLMs) and their immense potential for commercialization have given rise to serious concerns over the intellectual property (IP) of their training data. In particular, the synthetic texts generated by LLMs may infringe the IP of the data being used to train the LLMs. To this end, it is imperative to be able to (a) identify the data provider who contributed to the generation of a synthetic text by an LLM (source attribution) and (b) verify whether the text data from a data provider has been used to train an LLM (data provenance). In this paper, we show that both problems can be solved by watermarking, i.e., by enabling an LLM to generate synthetic texts with embedded watermarks that contain information about their source(s). We identify the key properties of such watermarking frameworks (e.g., source attribution accuracy, robustness against adversaries), and propose a WAtermarking for Source Attribution (WASA) framework that satisfies these key properties due to our algorithmic designs. Our WASA framework enables an LLM to learn an accurate mapping from the texts of different data providers to their corresponding unique watermarks, which sets the foundation for effective source attribution (and hence data provenance). Extensive empirical evaluations show that our WASA framework achieves effective source attribution and data provenance.

Submitted 1 October, 2023; originally announced October 2023.

arXiv:2309.08039 [pdf, other]

Flexible Functional Treatment Effect Estimation

Authors: Jiayi Wang, Raymond K. W. Wong, Xiaoke Zhang, Kwun Chuen Gary Chan

Abstract: We study treatment effect estimation with functional treatments where the average potential outcome functional is a function of functions, in contrast to continuous treatment effect estimation where the target is a function of real numbers. By considering a flexible scalar-on-function marginal structural model, a weight-modified kernel ridge regression (WMKRR) is adopted for estimation. The weights are constructed by directly minimizing the uniform balancing error resulting from a decomposition of the WMKRR estimator, instead of being estimated under a particular treatment selection model. Despite the complex structure of the uniform balancing error derived under WMKRR, finite-dimensional convex algorithms can be applied to efficiently solve for the proposed weights thanks to a representer theorem. The optimal convergence rate is shown to be attainable by the proposed WMKRR estimator without any smoothness assumption on the true weight function. Corresponding empirical performance is demonstrated by a simulation study and a real data application.

Submitted 14 September, 2023; originally announced September 2023.

arXiv:2309.06991 [pdf, other]

Unsupervised Contrast-Consistent Ranking with Language Models

Authors: Niklas Stoehr, Pengxiang Cheng, Jing Wang, Daniel Preotiuc-Pietro, Rajarshi Bhowmik

Abstract: Language models contain ranking-based knowledge and are powerful solvers of in-context ranking tasks. For instance, they may have parametric knowledge about the ordering of countries by size or may be able to rank product reviews by sentiment. We compare pairwise, pointwise and listwise prompting techniques to elicit a language model's ranking knowledge. However, we find that even with careful calibration and constrained decoding, prompting-based techniques may not always be self-consistent in the rankings they produce. This motivates us to explore an alternative approach that is inspired by an unsupervised probing method called Contrast-Consistent Search (CCS). The idea is to train a probe guided by a logical constraint: a language model's representation of a statement and its negation must be mapped to contrastive true-false poles consistently across multiple statements. We hypothesize that similar constraints apply to ranking tasks where all items are related via consistent, pairwise or listwise comparisons. To this end, we extend the binary CCS method to Contrast-Consistent Ranking (CCR) by adapting existing ranking methods such as the Max-Margin Loss, Triplet Loss and an Ordinal Regression objective. Across different models and datasets, our results confirm that CCR probing performs better or, at least, on a par with prompting.

Submitted 3 February, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

Comments: Long Paper at EACL 2024

arXiv:2309.04957 [pdf, other]

Winner's Curse Free Robust Mendelian Randomization with Summary Data

Authors: Zhongming Xie, Wanheng Zhang, Jingshen Wang, Chong Wu

Abstract: In the past decade, the increased availability of genome-wide association studies summary data has popularized Mendelian Randomization (MR) for conducting causal inference. MR analyses, incorporating genetic variants as instrumental variables, are known for their robustness against reverse causation bias and unmeasured confounders. Nevertheless, classical MR analyses utilizing summary data may still produce biased causal effect estimates due to the winner's curse and pleiotropic issues. To address these two issues and establish valid causal conclusions, we propose a unified robust Mendelian Randomization framework with summary data, which systematically removes the winner's curse and screens out invalid genetic instruments with pleiotropic effects. Different from existing robust MR literature, our framework delivers valid statistical inference on the causal effect neither requiring the genetic pleiotropy effects to follow any parametric distribution nor relying on perfect instrument screening property. Under appropriate conditions, we show that our proposed estimator converges to a normal distribution and its variance can be well estimated. We demonstrate the performance of our proposed estimator through Monte Carlo simulations and two case studies. The codes implementing the procedures are available at https://github.com/ChongWuLab/CARE/.

Submitted 10 September, 2023; originally announced September 2023.

arXiv:2309.04626 [pdf, other]

Perceptual adjustment queries and an inverted measurement paradigm for low-rank metric learning

Authors: Austin Xu, Andrew D. McRae, Jingyan Wang, Mark A. Davenport, Ashwin Pananjady

Abstract: We introduce a new type of query mechanism for collecting human feedback, called the perceptual adjustment query (PAQ). Being both informative and cognitively lightweight, the PAQ adopts an inverted measurement scheme, and combines advantages from both cardinal and ordinal queries. We showcase the PAQ in the metric learning problem, where we collect PAQ measurements to learn an unknown Mahalanobis distance. This gives rise to a high-dimensional, low-rank matrix estimation problem to which standard matrix estimators cannot be applied. Consequently, we develop a two-stage estimator for metric learning from PAQs, and provide sample complexity guarantees for this estimator. We present numerical simulations demonstrating the performance of the estimator and its notable properties.

Submitted 8 September, 2023; originally announced September 2023.

Comments: 42 pages, 6 figures

Showing 1–50 of 620 results for author: Wang, J