-
Interventional Imbalanced Multi-Modal Representation Learning via $β$-Generalization Front-Door Criterion
Authors:
Yi Li,
Jiangmeng Li,
Fei Song,
Qingmeng Zhu,
Changwen Zheng,
Wenwen Qiang
Abstract:
Multi-modal methods establish comprehensive superiority over uni-modal methods. However, the imbalanced contributions of different modalities to task-dependent predictions constantly degrade the discriminative performance of canonical multi-modal methods. Based on the contribution to task-dependent predictions, modalities can be identified as predominant and auxiliary modalities. Benchmark methods raise a tractable solution: augmenting the auxiliary modality with a minor contribution during training. However, our empirical explorations challenge the fundamental idea behind such behavior, and we further conclude that benchmark approaches suffer from certain defects: insufficient theoretical interpretability and limited exploration capability of discriminative knowledge. To this end, we revisit multi-modal representation learning from a causal perspective and build the Structural Causal Model. Following the empirical explorations, we determine to capture the true causality between the discriminative knowledge of predominant modality and predictive label while considering the auxiliary modality. Thus, we introduce the $β$-generalization front-door criterion. Furthermore, we propose a novel network for sufficiently exploring multi-modal discriminative knowledge. Rigorous theoretical analyses and various empirical evaluations are provided to support the effectiveness of the innate mechanism behind our proposed method.
Submitted 17 June, 2024;
originally announced June 2024.
-
Introducing Diminutive Causal Structure into Graph Representation Learning
Authors:
Hang Gao,
Peng Qiao,
Yifan Jin,
Fengge Wu,
Jiangmeng Li,
Changwen Zheng
Abstract:
When engaging in end-to-end graph representation learning with Graph Neural Networks (GNNs), the intricate causal relationships and rules inherent in graph data pose a formidable challenge for the model in accurately capturing authentic data relationships. A proposed mitigating strategy involves the direct integration of rules or relationships corresponding to the graph data into the model. However, within the domain of graph representation learning, the inherent complexity of graph data obstructs the derivation of a comprehensive causal structure that encapsulates universal rules or relationships governing the entire dataset. Instead, only specialized diminutive causal structures, delineating specific causal relationships within constrained subsets of graph data, emerge as discernible. Motivated by empirical insights, it is observed that GNN models exhibit a tendency to converge towards such specialized causal structures during the training process. Consequently, we posit that the introduction of these specific causal structures is advantageous for the training of GNN models. Building upon this proposition, we introduce a novel method that enables GNN models to glean insights from these specialized diminutive causal structures, thereby enhancing overall performance. Our method specifically extracts causal knowledge from the model representation of these diminutive causal structures and incorporates interchange intervention to optimize the learning process. Theoretical analysis serves to corroborate the efficacy of our proposed method. Furthermore, empirical experiments consistently demonstrate significant performance improvements across diverse datasets.
Submitted 12 June, 2024;
originally announced June 2024.
-
On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability
Authors:
Chenyu Zheng,
Wei Huang,
Rongzhen Wang,
Guoqiang Wu,
Jun Zhu,
Chongxuan Li
Abstract:
Autoregressively trained transformers have brought a profound revolution to the world, especially with their in-context learning (ICL) ability to address downstream tasks. Recently, several studies suggest that transformers learn a mesa-optimizer during autoregressive (AR) pretraining to implement ICL. Namely, the forward pass of the trained transformer is equivalent to optimizing an inner objective function in-context. However, whether the practical non-convex training dynamics will converge to the ideal mesa-optimizer is still unclear. Towards filling this gap, we investigate the non-convex dynamics of a one-layer linear causal self-attention model autoregressively trained by gradient flow, where the sequences are generated by an AR process $x_{t+1} = W x_t$. First, under a certain condition of data distribution, we prove that an autoregressively trained transformer learns $W$ by implementing one step of gradient descent to minimize an ordinary least squares (OLS) problem in-context. It then applies the learned $\widehat{W}$ for next-token prediction, thereby verifying the mesa-optimization hypothesis. Next, under the same data conditions, we explore the capability limitations of the obtained mesa-optimizer. We show that a stronger assumption related to the moments of data is the sufficient and necessary condition that the learned mesa-optimizer recovers the distribution. Besides, we conduct exploratory analyses beyond the first data condition and prove that generally, the trained transformer will not perform vanilla gradient descent for the OLS problem. Finally, our simulation results verify the theoretical results.
Submitted 27 May, 2024;
originally announced May 2024.
-
Be Aware of the Neighborhood Effect: Modeling Selection Bias under Interference
Authors:
Haoxuan Li,
Chunyuan Zheng,
Sihao Ding,
Peng Wu,
Zhi Geng,
Fuli Feng,
Xiangnan He
Abstract:
Selection bias in recommender systems arises from the recommendation process of system filtering and the interactive process of user selection. Many previous studies have focused on addressing selection bias to achieve unbiased learning of the prediction model, but ignore the fact that potential outcomes for a given user-item pair may vary with the treatments assigned to other user-item pairs, a phenomenon known as the neighborhood effect. To fill the gap, this paper formalizes the neighborhood effect as an interference problem from the perspective of causal inference and introduces a treatment representation to capture the neighborhood effect. On this basis, we propose a novel ideal loss that can be used to deal with selection bias in the presence of the neighborhood effect. We further develop two new estimators for estimating the proposed ideal loss. We theoretically establish the connection between the proposed methods and previous debiasing methods that ignore the neighborhood effect, showing that the proposed methods can achieve unbiased learning when both selection bias and the neighborhood effect are present, while the existing methods are biased. Extensive semi-synthetic and real-world experiments are conducted to demonstrate the effectiveness of the proposed methods.
Submitted 30 April, 2024;
originally announced April 2024.
-
Rethinking Causal Relationships Learning in Graph Neural Networks
Authors:
Hang Gao,
Chengyu Yao,
Jiangmeng Li,
Lingyu Si,
Yifan Jin,
Fengge Wu,
Changwen Zheng,
Huaping Liu
Abstract:
Graph Neural Networks (GNNs) demonstrate their significance by effectively modeling complex interrelationships within graph-structured data. To enhance the credibility and robustness of GNNs, it becomes exceptionally crucial to bolster their ability to capture causal relationships. However, despite recent advancements that have indeed strengthened GNNs from a causal learning perspective, conducting an in-depth analysis specifically targeting the causal modeling prowess of GNNs remains an unresolved issue. In order to comprehensively analyze various GNN models from a causal learning perspective, we constructed an artificially synthesized dataset with known and controllable causal relationships between data and labels. The rationality of the generated data is further ensured through theoretical foundations. Drawing insights from analyses conducted using our dataset, we introduce a lightweight and highly adaptable GNN module designed to strengthen GNNs' causal learning capabilities across a diverse range of tasks. Through a series of experiments conducted on both synthetic datasets and other real-world datasets, we empirically validate the effectiveness of the proposed module.
Submitted 15 December, 2023;
originally announced December 2023.
-
Hacking Task Confounder in Meta-Learning
Authors:
Jingyao Wang,
Yi Ren,
Zeen Song,
Jianqi Zhang,
Changwen Zheng,
Wenwen Qiang
Abstract:
Meta-learning enables rapid generalization to new tasks by learning knowledge from various tasks. It is intuitively assumed that as training progresses, a model will acquire richer knowledge, leading to better generalization performance. However, our experiments reveal an unexpected result: there is negative knowledge transfer between tasks, affecting generalization performance. To explain this phenomenon, we conduct a causal analysis using Structural Causal Models (SCMs). Our investigation uncovers the presence of spurious correlations between task-specific causal factors and labels in meta-learning. Furthermore, the confounding factors differ across batches. We refer to these confounding factors as "Task Confounders". Based on these findings, we propose a plug-and-play Meta-learning Causal Representation Learner (MetaCRL) to eliminate task confounders. It encodes decoupled generating factors from multiple tasks and utilizes an invariant-based bi-level optimization mechanism to ensure their causality for meta-learning. Extensive experiments on various benchmark datasets demonstrate that our work achieves state-of-the-art (SOTA) performance.
Submitted 29 May, 2024; v1 submitted 10 December, 2023;
originally announced December 2023.
-
Variable selection with FDR control for noisy data -- an application to screening metabolites that are associated with breast and colorectal cancer
Authors:
Runqiu Wang,
Ran Dai,
Ying Huang,
Marian L. Neuhouser,
Johanna W. Lampe,
Daniel Raftery,
Fred K. Tabung,
Cheng Zheng
Abstract:
The rapidly expanding field of metabolomics presents an invaluable resource for understanding the associations between metabolites and various diseases. However, the high dimensionality, presence of missing values, and measurement errors associated with metabolomics data can present challenges in developing reliable and reproducible methodologies for disease association studies. Therefore, there is a compelling need to develop robust statistical methods that can navigate these complexities to achieve reliable and reproducible disease association studies. In this paper, we focus on developing such a methodology with an emphasis on controlling the False Discovery Rate during the screening of mutual metabolomic signals for multiple disease outcomes. We illustrate the versatility and performance of this procedure in a variety of scenarios, dealing with missing data and measurement errors. As a specific application of this novel methodology, we target two of the most prevalent cancers among US women: breast cancer and colorectal cancer. By applying our method to the Women's Health Initiative data, we successfully identify metabolites that are associated with either or both of these cancers, demonstrating the practical utility and potential of our method in identifying consistent risk factors and understanding shared mechanisms between diseases.
Submitted 10 October, 2023;
originally announced October 2023.
-
Toward Understanding Generative Data Augmentation
Authors:
Chenyu Zheng,
Guoqiang Wu,
Chongxuan Li
Abstract:
Generative data augmentation, which scales datasets by obtaining fake labeled examples from a trained conditional generative model, boosts classification performance in various learning tasks including (semi-)supervised learning, few-shot learning, and adversarially robust learning. However, little work has theoretically investigated the effect of generative data augmentation. To fill this gap, we establish a general stability bound in this not independently and identically distributed (non-i.i.d.) setting, where the learned distribution is dependent on the original train set and generally not the same as the true distribution. Our theoretical result includes the divergence between the learned distribution and the true distribution. It shows that generative data augmentation can enjoy a faster learning rate when the order of divergence term is $o\left(\max\left(\log(m)β_m, 1/\sqrt{m}\right)\right)$, where $m$ is the train set size and $β_m$ is the corresponding stability constant. We further specify the learning setup to the Gaussian mixture model and generative adversarial nets. We prove that in both cases, though generative data augmentation does not enjoy a faster learning rate, it can improve the learning guarantees at a constant level when the train set is small, which is significant when the awful overfitting occurs. Simulation results on the Gaussian mixture model and empirical results on generative adversarial nets support our theoretical conclusions. Our code is available at https://github.com/ML-GSAI/Understanding-GDA.
Submitted 27 May, 2023;
originally announced May 2023.
-
MFAI: A Scalable Bayesian Matrix Factorization Approach to Leveraging Auxiliary Information
Authors:
Zhiwei Wang,
Fa Zhang,
Cong Zheng,
Xianghong Hu,
Mingxuan Cai,
Can Yang
Abstract:
In various practical situations, matrix factorization methods suffer from poor data quality, such as high data sparsity and low signal-to-noise ratio (SNR). Here, we consider a matrix factorization problem by utilizing auxiliary information, which is massively available in real-world applications, to overcome the challenges caused by poor data quality. Unlike existing methods that mainly rely on simple linear models to combine auxiliary information with the main data matrix, we propose to integrate gradient boosted trees in the probabilistic matrix factorization framework to effectively leverage auxiliary information (MFAI). Thus, MFAI naturally inherits several salient features of gradient boosted trees, such as the capability of flexibly modeling nonlinear relationships and robustness to irrelevant features and missing values in auxiliary information. The parameters in MFAI can be automatically determined under the empirical Bayes framework, making it adaptive to the utilization of auxiliary information and immune to overfitting. Moreover, MFAI is computationally efficient and scalable to large datasets by exploiting variational inference. We demonstrate the advantages of MFAI through comprehensive numerical results from simulation studies and real data analyses. Our approach is implemented in the R package mfair available at https://github.com/YangLabHKUST/mfair.
Submitted 12 February, 2024; v1 submitted 4 March, 2023;
originally announced March 2023.
-
Controlling FDR in selecting group-level simultaneous signals from multiple data sources with application to the National Covid Collaborative Cohort data
Authors:
Runqiu Wang,
Ran Dai,
Cheng Zheng
Abstract:
One challenge in exploratory association studies using observational data is that the signals are potentially weak and the features have complex correlation structures. False discovery rate (FDR) controlling procedures can provide important statistical guarantees for replicability in risk factor identification in exploratory research. In the recently established National COVID Collaborative Cohort (N3C), electronic health record (EHR) data on the same set of candidate features are independently collected in multiple different sites, offering opportunities to identify signals by combining information from different sources. This paper presents a general knockoff-based variable selection algorithm to identify mutual signals from unions of group-level conditional independence tests with exact FDR control guarantees under finite sample settings. This algorithm can work with general regression settings, allowing heterogeneity of both the predictors and the outcomes across multiple data sources. We demonstrate the performance of this method with extensive numerical studies and an application to the N3C data.
Submitted 2 March, 2023;
originally announced March 2023.
-
Revisiting Discriminative vs. Generative Classifiers: Theory and Implications
Authors:
Chenyu Zheng,
Guoqiang Wu,
Fan Bao,
Yue Cao,
Chongxuan Li,
Jun Zhu
Abstract:
A large-scale deep model pre-trained on massive labeled or unlabeled data transfers well to downstream tasks. Linear evaluation freezes parameters in the pre-trained model and trains a linear classifier separately, which is efficient and attractive for transfer. However, little work has investigated the classifier in linear evaluation except for the default logistic regression. Inspired by the statistical efficiency of naive Bayes, the paper revisits the classical topic on discriminative vs. generative classifiers. Theoretically, the paper considers the surrogate loss instead of the zero-one loss in analyses and generalizes the classical results from binary cases to multiclass ones. We show that, under mild assumptions, multiclass naive Bayes requires $O(\log n)$ samples to approach its asymptotic error while the corresponding multiclass logistic regression requires $O(n)$ samples, where $n$ is the feature dimension. To establish it, we present a multiclass $\mathcal{H}$-consistency bound framework and an explicit bound for logistic loss, which are of independent interests. Simulation results on a mixture of Gaussian validate our theoretical findings. Experiments on various pre-trained deep vision models show that naive Bayes consistently converges faster as the number of data increases. Besides, naive Bayes shows promise in few-shot cases and we observe the "two regimes" phenomenon in pre-trained supervised models. Our code is available at https://github.com/ML-GSAI/Revisiting-Dis-vs-Gen-Classifiers.
Submitted 29 May, 2023; v1 submitted 5 February, 2023;
originally announced February 2023.
-
Testing for context-dependent changes in neural encoding in naturalistic experiments
Authors:
Yenho Chen,
Carl W. Harris,
Xiaoyu Ma,
Zheng Li,
Francisco Pereira,
Charles Y. Zheng
Abstract:
We propose a decoding-based approach to detect context effects on neural codes in longitudinal neural recording data. The approach is agnostic to how information is encoded in neural activity, and can control for a variety of possible confounding factors present in the data. We demonstrate our approach by determining whether it is possible to decode location encoding from prefrontal cortex in the mouse and, further, testing whether the encoding changes due to task engagement.
Submitted 16 November, 2022;
originally announced November 2022.
-
Robust Causal Graph Representation Learning against Confounding Effects
Authors:
Hang Gao,
Jiangmeng Li,
Wenwen Qiang,
Lingyu Si,
Bing Xu,
Changwen Zheng,
Fuchun Sun
Abstract:
The prevailing graph neural network models have achieved significant progress in graph representation learning. However, in this paper, we uncover an ever-overlooked phenomenon: the pre-trained graph representation learning model tested with full graphs underperforms the model tested with well-pruned graphs. This observation reveals that there exist confounders in graphs, which may interfere with the model learning semantic information, and current graph representation learning methods have not eliminated their influence. To tackle this issue, we propose Robust Causal Graph Representation Learning (RCGRL) to learn robust graph representations against confounding effects. RCGRL introduces an active approach to generate instrumental variables under unconditional moment restrictions, which empowers the graph representation learning model to eliminate confounders, thereby capturing discriminative information that is causally related to downstream predictions. We offer theorems and proofs to guarantee the theoretical effectiveness of the proposed approach. Empirically, we conduct extensive experiments on a synthetic dataset and multiple benchmark datasets. The results demonstrate that compared with state-of-the-art methods, RCGRL achieves better prediction performance and generalization ability.
Submitted 10 February, 2023; v1 submitted 17 August, 2022;
originally announced August 2022.
-
Quantification of follow-up time in oncology clinical trials with a time-to-event endpoint: Asking the right questions
Authors:
Kaspar Rufibach,
Lynda Grinsted,
Jiang Li,
Hans-Jochen Weber,
Cheng Zheng,
Jiangxiu Zhou
Abstract:
For the analysis of a time-to-event endpoint in a single-arm or randomized clinical trial it is generally perceived that interpretation of a given estimate of the survival function, or the comparison between two groups, hinges on some quantification of the amount of follow-up. Typically, a median of some loosely defined quantity is reported. However, whatever median is reported, is typically not answering the question(s) trialists actually have in terms of follow-up quantification. In this paper, inspired by the estimand framework, we formulate a comprehensive list of relevant scientific questions that trialists have when reporting time-to-event data. We illustrate how these questions should be answered, and that reference to an unclearly defined follow-up quantity is not needed at all. In drug development, key decisions are made based on randomized controlled trials, and we therefore also discuss relevant scientific questions not only when looking at a time-to-event endpoint in one group, but also for comparisons. We find that different thinking about some of the relevant scientific questions around follow-up is required depending on whether a proportional hazards assumption can be made or other patterns of survival functions are anticipated, e.g. delayed separation, crossing survival functions, or the potential for cure. We conclude the paper with practical recommendations.
Submitted 13 March, 2023; v1 submitted 10 June, 2022;
originally announced June 2022.
-
StableDR: Stabilized Doubly Robust Learning for Recommendation on Data Missing Not at Random
Authors:
Haoxuan Li,
Chunyuan Zheng,
Peng Wu
Abstract:
In recommender systems, users always choose the favorite items to rate, which leads to data missing not at random and poses a great challenge for unbiased evaluation and learning of prediction models. Currently, the doubly robust (DR) methods have been widely studied and demonstrate superior performance. However, in this paper, we show that DR methods are unstable and have unbounded bias, variance, and generalization bounds to extremely small propensities. Moreover, the fact that DR relies more on extrapolation will lead to suboptimal performance. To address the above limitations while retaining double robustness, we propose a stabilized doubly robust (StableDR) learning approach with a weaker reliance on extrapolation. Theoretical analysis shows that StableDR has bounded bias, variance, and generalization error bound simultaneously under inaccurate imputed errors and arbitrarily small propensities. In addition, we propose a novel learning approach for StableDR that updates the imputation, propensity, and prediction models cyclically, achieving more stable and accurate predictions. Extensive experiments show that our approaches significantly outperform the existing methods.
Submitted 23 August, 2023; v1 submitted 10 May, 2022;
originally announced May 2022.
-
VICE: Variational Interpretable Concept Embeddings
Authors:
Lukas Muttenthaler,
Charles Y. Zheng,
Patrick McClure,
Robert A. Vandermeulen,
Martin N. Hebart,
Francisco Pereira
Abstract:
A central goal in the cognitive sciences is the development of numerical models for mental representations of object concepts. This paper introduces Variational Interpretable Concept Embeddings (VICE), an approximate Bayesian method for embedding object concepts in a vector space using data collected from humans in a triplet odd-one-out task. VICE uses variational inference to obtain sparse, non-negative representations of object concepts with uncertainty estimates for the embedding values. These estimates are used to automatically select the dimensions that best explain the data. We derive a PAC learning bound for VICE that can be used to estimate generalization performance or determine a sufficient sample size for experimental design. VICE rivals or outperforms its predecessor, SPoSE, at predicting human behavior in the triplet odd-one-out task. Furthermore, VICE's object representations are more reproducible and consistent across random initializations, highlighting the unique advantage of using VICE for deriving interpretable embeddings from human behavior.
Submitted 6 October, 2022; v1 submitted 2 May, 2022;
originally announced May 2022.
-
GCF: Generalized Causal Forest for Heterogeneous Treatment Effect Estimation in Online Marketplace
Authors:
Shu Wan,
Chen Zheng,
Zhonggen Sun,
Mengfan Xu,
Xiaoqing Yang,
Hongtu Zhu,
Jiecheng Guo
Abstract:
Uplift modeling is a rapidly growing approach that utilizes causal inference and machine learning methods to directly estimate the heterogeneous treatment effects, which has been widely applied to various online marketplaces to assist large-scale decision-making in recent years. The existing popular models, like causal forest (CF), are limited to either discrete treatments or posing parametric assumptions on the outcome-treatment relationship that may suffer model misspecification. However, continuous treatments (e.g., price, duration) often arise in marketplaces. To alleviate these restrictions, we use a kernel-based doubly robust estimator to recover the non-parametric dose-response functions that can flexibly model continuous treatment effects. Moreover, we propose a generic distance-based splitting criterion to capture the heterogeneity for the continuous treatments. We call the proposed algorithm generalized causal forest (GCF) as it generalizes the use case of CF to a much broader setting. We show the effectiveness of GCF by deriving the asymptotic property of the estimator and comparing it to popular uplift modeling methods on both synthetic and real-world datasets. We implement GCF on Spark and successfully deploy it into a large-scale online pricing system at a leading ride-sharing company. Online A/B testing results further validate the superiority of GCF.
Submitted 23 September, 2022; v1 submitted 21 March, 2022;
originally announced March 2022.
-
TDR-CL: Targeted Doubly Robust Collaborative Learning for Debiased Recommendations
Authors:
Haoxuan Li,
Yan Lyu,
Chunyuan Zheng,
Peng Wu
Abstract:
Bias is a common problem inherent in recommender systems, which is entangled with users' preferences and poses a great challenge to unbiased learning. For debiasing tasks, the doubly robust (DR) method and its variants show superior performance due to the double robustness property, that is, DR is unbiased when either imputed errors or learned propensities are accurate. However, our theoretical analysis reveals that DR usually has a large variance. Meanwhile, DR would suffer unexpectedly large bias and poor generalization caused by inaccurate imputed errors and learned propensities, which usually occur in practice. In this paper, we propose a principled approach that can effectively reduce bias and variance simultaneously for existing DR approaches when the error imputation model is misspecified. In addition, we further propose a novel semi-parametric collaborative learning approach that decomposes imputed errors into parametric and nonparametric parts and updates them collaboratively, resulting in more accurate predictions. Both theoretical analysis and experiments demonstrate the superiority of the proposed methods compared with existing debiasing methods.
Submitted 2 March, 2023; v1 submitted 19 March, 2022;
originally announced March 2022.
-
FDR Controlled Multiple Testing for Union Null Hypotheses: A Knockoff-based Approach
Authors:
Ran Dai,
Cheng Zheng
Abstract:
False discovery rate (FDR) controlling procedures provide important statistical guarantees for the replicability in signal identification based on multiple hypotheses testing. In many fields of study, FDR controlling procedures are used in high-dimensional (HD) analyses to discover features that are truly associated with the outcome. In some recent applications, data on the same set of candidate features are independently collected in multiple different studies. For example, gene expression data are collected at different facilities and with different cohorts, to identify the genetic biomarkers of multiple types of cancers. These studies provide us opportunities to identify signals by considering information from different sources (with potential heterogeneity) jointly. This paper is about how to provide FDR control guarantees for the tests of union null hypotheses of conditional independence. We present a knockoff-based variable selection method (\textit{Simultaneous knockoffs}) to identify mutual signals from multiple independent data sets, providing exact FDR control guarantees under finite sample settings. This method can work with very general model settings and test statistics. We demonstrate the performance of this method with extensive numerical studies and two real data examples.
Submitted 3 October, 2022; v1 submitted 23 June, 2021;
originally announced June 2021.
-
Surprise: Result List Truncation via Extreme Value Theory
Authors:
Dara Bahri,
Che Zheng,
Yi Tay,
Donald Metzler,
Andrew Tomkins
Abstract:
Work in information retrieval has largely been centered around ranking and relevance: given a query, return some number of results ordered by relevance to the user. The problem of result list truncation, or where to truncate the ranked list of results, however, has received less attention despite being crucial in a variety of applications. Such truncation is a balancing act between the overall relevance, or usefulness of the results, with the user cost of processing more results. Result list truncation can be challenging because relevance scores are often not well-calibrated. This is particularly true in large-scale IR systems where documents and queries are embedded in the same metric space and a query's nearest document neighbors are returned during inference. Here, relevance is inversely proportional to the distance between the query and candidate document, but what distance constitutes relevance varies from query to query and changes dynamically as more documents are added to the index. In this work, we propose Surprise scoring, a statistical method that leverages the Generalized Pareto distribution that arises in extreme value theory to produce interpretable and calibrated relevance scores at query time using nothing more than the ranked scores. We demonstrate its effectiveness on the result list truncation task across image, text, and IR datasets and compare it to both classical and recent baselines. We draw connections to hypothesis testing and $p$-values.
Submitted 19 October, 2020;
originally announced October 2020.
-
Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study
Authors:
Dara Bahri,
Yi Tay,
Che Zheng,
Donald Metzler,
Cliff Brunk,
Andrew Tomkins
Abstract:
Large generative language models such as GPT-2 are well-known for their ability to generate text as well as their utility in supervised downstream tasks via fine-tuning. Our work is twofold: firstly we demonstrate via human evaluation that classifiers trained to discriminate between human and machine-generated text emerge as unsupervised predictors of "page quality", able to detect low quality content without any training. This enables fast bootstrapping of quality indicators in a low-resource setting. Secondly, curious to understand the prevalence and nature of low quality pages in the wild, we conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.
Submitted 17 August, 2020;
originally announced August 2020.
-
Twitter discussions and emotions about COVID-19 pandemic: a machine learning approach
Authors:
Jia Xue,
Junxiang Chen,
Ran Hu,
Chen Chen,
ChengDa Zheng,
Xiaoqian Liu,
Tingshao Zhu
Abstract:
The objective of the study is to examine coronavirus disease (COVID-19) related discussions, concerns, and sentiments that emerged from tweets posted by Twitter users. We analyze 4 million Twitter messages related to the COVID-19 pandemic using a list of 25 hashtags such as "coronavirus," "COVID-19," "quarantine" from March 1 to April 21 in 2020. We use a machine learning approach, Latent Dirichlet Allocation (LDA), to identify popular unigrams, bigrams, salient topics and themes, and sentiments in the collected Tweets. Popular unigrams include "virus," "lockdown," and "quarantine." Popular bigrams include "COVID-19," "stay home," "corona virus," "social distancing," and "new cases." We identify 13 discussion topics and categorize them into five different themes, such as "public health measures to slow the spread of COVID-19," "social stigma associated with COVID-19," "coronavirus news cases and deaths," "COVID-19 in the United States," and "coronavirus cases in the rest of the world". Across all identified topics, the dominant sentiment about the spread of coronavirus is anticipation that measures can be taken, followed by mixed feelings of trust, anger, and fear for different topics. The public reveals a stronger feeling of fear when discussing new coronavirus cases and deaths than when discussing other topics. The study shows that Twitter data and machine learning approaches can be leveraged for infodemiology studies by tracking the evolving public discussions and sentiments during the COVID-19 pandemic. Real-time monitoring and assessment of Twitter discussions and concerns can be promising for public health emergency responses and planning. Pandemic fear, stigma, and mental health concerns that have already emerged may continue to influence public trust when a second wave of COVID-19 or a new surge of the imminent pandemic occurs.
Submitted 18 June, 2020; v1 submitted 26 May, 2020;
originally announced May 2020.
-
Choppy: Cut Transformer For Ranked List Truncation
Authors:
Dara Bahri,
Yi Tay,
Che Zheng,
Donald Metzler,
Andrew Tomkins
Abstract:
Work in information retrieval has traditionally focused on ranking and relevance: given a query, return some number of results ordered by relevance to the user. However, the problem of determining how many results to return, i.e. how to optimally truncate the ranked result list, has received less attention despite being of critical importance in a range of applications. Such truncation is a balancing act between the overall relevance, or usefulness of the results, with the user cost of processing more results. In this work, we propose Choppy, an assumption-free model based on the widely successful Transformer architecture, to the ranked list truncation problem. Needing nothing more than the relevance scores of the results, the model uses a powerful multi-head attention mechanism to directly optimize any user-defined IR metric. We show Choppy improves upon recent state-of-the-art methods.
Submitted 25 April, 2020;
originally announced April 2020.
-
Multi-Lead ECG Classification via an Information-Based Attention Convolutional Neural Network
Authors:
Hao Tung,
Chao Zheng,
Xinsheng Mao,
Dahong Qian
Abstract:
Objective: A novel structure based on channel-wise attention mechanism is presented in this paper. Embedding with the proposed structure, an efficient classification model that accepts multi-lead electrocardiogram (ECG) as input is constructed.
Methods: One-dimensional convolutional neural networks (CNN) have proven to be effective in pervasive classification tasks, enabling the automatic extraction of features while classifying targets. We implement the Residual connection and design a structure which can learn the weights from the information contained in different channels in the input feature map during the training process. An indicator named mean square deviation is introduced to monitor the performance of a particular model segment in the classification task on the two out of the five ECG classes. The data in the MIT-BIH arrhythmia database is used and a series of control experiments is conducted.
Results: Utilizing both leads of the ECG signals as input to the neural network classifier can achieve better classification results than those from using single channel inputs in different application scenarios. Models embedded with the channel-wise attention structure always achieve better scores on sensitivity and precision than the plain Resnet models. The proposed model exceeds the performance of most of the state-of-the-art models in ventricular ectopic beats (VEB) classification, and achieves competitive scores for supraventricular ectopic beats (SVEB).
Conclusion: Adopting more lead ECG signals as input can increase the dimensions of the input feature maps, helping to improve both the performance and generalization of the network model.
Significance: Due to its end-to-end design and its inherent extensibility to multi-lead heart disease diagnosis, the proposed model can be used for real-time tracking of ECG waveforms on Holter or wearable devices.
Submitted 24 March, 2020;
originally announced March 2020.
-
One Man's Trash is Another Man's Treasure: Resisting Adversarial Examples by Adversarial Examples
Authors:
Chang Xiao,
Changxi Zheng
Abstract:
Modern image classification systems are often built on deep neural networks, which suffer from adversarial examples--images with deliberately crafted, imperceptible noise to mislead the network's classification. To defend against adversarial examples, a plausible idea is to obfuscate the network's gradient with respect to the input image. This general idea has inspired a long line of defense methods. Yet, almost all of them have proven vulnerable. We revisit this seemingly flawed idea from a radically different perspective. We embrace the omnipresence of adversarial examples and the numerical procedure of crafting them, and turn this harmful attacking process into a useful defense mechanism. Our defense method is conceptually simple: before feeding an input image for classification, transform it by finding an adversarial example on a pre-trained external model. We evaluate our method against a wide range of possible attacks. On both CIFAR-10 and Tiny ImageNet datasets, our method is significantly more robust than state-of-the-art methods. Particularly, in comparison to adversarial training, our method offers lower training cost as well as stronger robustness.
Submitted 27 November, 2019; v1 submitted 25 November, 2019;
originally announced November 2019.
-
On Data Enriched Logistic Regression
Authors:
Cheng Zheng,
Sayan Dasgupta,
Yuxiang Xie,
Asad Haris,
Ying Qing Chen
Abstract:
Biomedical researchers usually study the effects of certain exposures on disease risks among a well-defined population. To achieve this goal, the gold standard is to design a trial with an appropriate sample from that population. Due to the high cost of such trials, the sample size collected is usually limited and is not enough to accurately estimate the effects of some exposures. In this paper, we discuss how to leverage the information from external `big data' (data with much larger sample size) to improve the estimation accuracy at the risk of introducing small bias. We propose a family of weighted estimators to balance the bias increase and variance reduction when including the big data. We connect our proposed estimator to the established penalized regression estimators. We derive the optimal weights using both second order and higher order asymptotic expansions. Using extensive simulation studies, we show that the improvement in terms of mean square error (MSE) for the regression coefficient can be substantial even with finite sample sizes and that our weighted method outperforms existing methods such as penalized regression and the James-Stein approach. We also provide a theoretical guarantee that, in general, the proposed estimators will never lead to an asymptotic MSE larger than that of the maximum likelihood estimator using small data only. We apply our proposed methods to the Asia Cohort Consortium China cohort data to estimate the relationships between age, BMI, smoking, alcohol use and mortality.
Submitted 14 November, 2019;
originally announced November 2019.
-
A Simulation-free Group Sequential Design with Max-combo Tests in the Presence of Non-proportional Hazards
Authors:
Lili Wang,
Xiaodong Luo,
Cheng Zheng
Abstract:
Non-proportional hazards (NPH) have been observed recently in many immuno-oncology clinical trials. Weighted log-rank tests (WLRT) with suitably chosen weights can be used to improve the power of detecting the difference between two survival curves in the presence of NPH. However, it is not easy to choose a proper WLRT in practice when both robustness and efficiency are considered. A versatile maxcombo test was proposed to balance robustness and efficiency and has received increasing attention in both methodology development and application. However, survival trials often warrant interim analyses due to their high cost and long duration. The integration and application of maxcombo tests in interim analyses often require extensive simulation studies. In this paper, we propose a simulation-free approach for group sequential design with the maxcombo test in survival trials. The simulation results support that the proposed approaches successfully control the type I error rate and offer great accuracy and flexibility in estimating sample sizes, at the expense of a light computational burden. Notably, our methods display strong robustness towards various model misspecifications, and have been implemented in an R package freely accessible online.
Submitted 16 January, 2023; v1 submitted 13 November, 2019;
originally announced November 2019.
-
Consistency of a range of penalised cost approaches for detecting multiple changepoints
Authors:
Chao Zheng,
Idris A. Eckley,
Paul Fearnhead
Abstract:
A common approach to detect multiple changepoints is to minimise a measure of data fit plus a penalty that is linear in the number of changepoints. This paper shows that the general finite sample behaviour of such a method can be related to its behaviour when analysing data with either none or one changepoint. This results in simpler conditions for verifying whether the method will consistently estimate the number and locations of the changepoints. We apply and demonstrate the usefulness of this result for a range of changepoint problems. Our new results include a weaker condition on the choice of penalty required to have consistency in a change-in-slope model; and the first results for the accuracy of recently-proposed methods for detecting spikes.
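The penalised cost approach described in this abstract can be illustrated with a minimal sketch. It assumes a change-in-mean model with a squared-error cost and uses the standard O(n^2) optimal-partitioning dynamic programme; the function name `penalised_changepoints` and the parameter `beta` are hypothetical names chosen for this illustration, not taken from the paper.

```python
import numpy as np

def penalised_changepoints(y, beta):
    """Minimise (sum of segment squared-error costs) + beta * (number of changepoints).

    Assumption: change-in-mean model; each segment is fitted by its sample mean.
    Returns the sorted list of changepoint positions (indices where a new segment starts).
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Prefix sums let us evaluate any segment cost in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(i, j):
        # Residual sum of squares of y[i:j] around its mean.
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

    # best[t] = optimal penalised cost of y[:t]; best[0] = -beta so that
    # the first segment is not charged a penalty.
    best = np.full(n + 1, np.inf)
    best[0] = -beta
    last = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        cands = [best[s] + cost(s, t) + beta for s in range(t)]
        last[t] = int(np.argmin(cands))
        best[t] = cands[last[t]]

    # Backtrack through the stored segment starts.
    cps, t = [], n
    while t > 0:
        if last[t] > 0:
            cps.append(int(last[t]))
        t = last[t]
    return sorted(cps)

print(penalised_changepoints([0.0] * 10 + [5.0] * 10, beta=1.0))
```

A larger `beta` yields fewer changepoints; the consistency results in the abstract concern how this penalty must be chosen, which the sketch does not address.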
Submitted 12 August, 2022; v1 submitted 5 November, 2019;
originally announced November 2019.
-
Learning Nearly Decomposable Value Functions Via Communication Minimization
Authors:
Tonghan Wang,
Jianhao Wang,
Chongyi Zheng,
Chongjie Zhang
Abstract:
Reinforcement learning encounters major challenges in multi-agent settings, such as scalability and non-stationarity. Recently, value function factorization learning emerges as a promising way to address these challenges in collaborative multi-agent systems. However, existing methods have been focusing on learning fully decentralized value functions, which are not efficient for tasks requiring communication. To address this limitation, this paper presents a novel framework for learning nearly decomposable Q-functions (NDQ) via communication minimization, with which agents act on their own most of the time but occasionally send messages to other agents in order for effective coordination. This framework hybridizes value function factorization learning and communication learning by introducing two information-theoretic regularizers. These regularizers are maximizing mutual information between agents' action selection and communication messages while minimizing the entropy of messages between agents. We show how to optimize these regularizers in a way that is easily integrated with existing value function factorization methods such as QMIX. Finally, we demonstrate that, on the StarCraft unit micromanagement benchmark, our framework significantly outperforms baseline methods and allows us to cut off more than $80\%$ of communication without sacrificing the performance. The videos of our experiments are available at https://sites.google.com/view/ndq.
Submitted 18 July, 2020; v1 submitted 11 October, 2019;
originally announced October 2019.
-
Supervised Discriminative Sparse PCA for Com-Characteristic Gene Selection and Tumor Classification on Multiview Biological Data
Authors:
Chun-Mei Feng,
Yong Xu,
Jin-Xing Liu,
Ying-Lian Gao,
Chun-Hou Zheng
Abstract:
Principal Component Analysis (PCA) has been used to study the pathogenesis of diseases. To enhance the interpretability of classical PCA, various improved PCA methods have been proposed to date. Among these, a typical method is the so-called sparse PCA, which focuses on seeking sparse loadings. However, the performance of these methods is still far from satisfactory due to their limitation of using unsupervised learning methods; moreover, the class ambiguity within the sample is high. To overcome this problem, this study developed a new PCA method, which is named the Supervised Discriminative Sparse PCA (SDSPCA). The main innovation of this method is the incorporation of discriminative information and sparsity into the PCA model. Specifically, in contrast to the traditional sparse PCA, which imposes sparsity on the loadings, here, sparse components are obtained to represent the data. Furthermore, via linear transformation, the sparse components approximate the given label information. On the one hand, sparse components improve interpretability over traditional PCA, while on the other hand, they have discriminative abilities suitable for classification purposes. A simple algorithm is developed and its convergence proof is provided. The SDSPCA has been applied to common characteristic gene selection (com-characteristic gene) and tumor classification on multi-view biological data. The sparsity and classification performance of the SDSPCA are empirically verified via abundant, reasonable, and effective experiments, and the obtained results demonstrate that SDSPCA outperforms other state-of-the-art methods.
Submitted 28 May, 2019;
originally announced May 2019.
-
Enhancing Adversarial Defense by k-Winners-Take-All
Authors:
Chang Xiao,
Peilin Zhong,
Changxi Zheng
Abstract:
We propose a simple change to existing neural network structures for better defending against gradient-based adversarial attacks. Instead of using popular activation functions (such as ReLU), we advocate the use of k-Winners-Take-All (k-WTA) activation, a C0 discontinuous function that purposely invalidates the neural network model's gradient at densely distributed input data points. The proposed k-WTA activation can be readily used in nearly all existing networks and training methods with no significant overhead. Our proposal is theoretically rationalized. We analyze why the discontinuities in k-WTA networks can largely prevent gradient-based search of adversarial examples and why they at the same time remain innocuous to the network training. This understanding is also empirically backed. We test k-WTA activation on various network structures optimized by a training method, be it adversarial training or not. In all cases, the robustness of k-WTA networks outperforms that of traditional networks under white-box attacks.
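As a rough illustration of the k-WTA activation named in this abstract, the sketch below keeps the k largest activations in each row and zeroes the rest. It is a NumPy forward pass only, written under the assumption that this is the intended definition; the function name `kwta` is hypothetical, and ties at the threshold may let more than k entries through.

```python
import numpy as np

def kwta(x, k):
    """k-Winners-Take-All: keep the k largest entries along the last axis, zero the others.

    Assumption: forward pass only; ties at the k-th largest value are all kept.
    """
    x = np.asarray(x, dtype=float)
    if not 1 <= k <= x.shape[-1]:
        raise ValueError("k must be between 1 and the size of the last axis")
    # After partitioning, position -k along the last axis holds the k-th largest value.
    thresh = np.partition(x, -k, axis=-1)[..., -k][..., None]
    return np.where(x >= thresh, x, 0.0)

print(kwta([[1.0, 5.0, 3.0, 2.0]], k=2))
```

Because the set of winners changes abruptly as the input moves, the output is discontinuous in the input, which is the property the abstract relies on to hinder gradient-based attacks.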
Submitted 28 October, 2019; v1 submitted 24 May, 2019;
originally announced May 2019.
-
Multimodal Deep Network Embedding with Integrated Structure and Attribute Information
Authors:
Conghui Zheng,
Li Pan,
Peng Wu
Abstract:
Network embedding is the process of learning low-dimensional representations for nodes in a network, while preserving node features. Existing studies only leverage network structure information and focus on preserving structural features. However, nodes in real-world networks often have a rich set of attributes providing extra semantic information. It has been demonstrated that both structural and attribute features are important for network analysis tasks. To preserve both features, we investigate the problem of integrating structure and attribute information to perform network embedding and propose a Multimodal Deep Network Embedding (MDNE) method. MDNE captures the non-linear network structures and the complex interactions among structures and attributes, using a deep model consisting of multiple layers of non-linear functions. Since structures and attributes are two different types of information, a multimodal learning method is adopted to pre-process them and help the model to better capture the correlations between node structure and attribute information. We employ both structural proximity and attribute proximity in the loss function to preserve the respective features and the representations are obtained by minimizing the loss function. Results of extensive experiments on four real-world datasets show that the proposed method performs significantly better than baselines on a variety of tasks, which demonstrate the effectiveness and generality of our method.
Submitted 28 March, 2019;
originally announced March 2019.
-
Rethinking Generative Mode Coverage: A Pointwise Guaranteed Approach
Authors:
Peilin Zhong,
Yuchen Mo,
Chang Xiao,
Pengyu Chen,
Changxi Zheng
Abstract:
Many generative models have to combat $\textit{missing modes}$. The conventional wisdom to this end is by reducing through training a statistical distance (such as $f$-divergence) between the generated distribution and provided data distribution. But this is more of a heuristic than a guarantee. The statistical distance measures a $\textit{global}$, but not $\textit{local}$, similarity between two distributions. Even if it is small, it does not imply a plausible mode coverage. Rethinking this problem from a game-theoretic perspective, we show that a complete mode coverage is firmly attainable. If a generative model can approximate a data distribution moderately well under a global statistical distance measure, then we will be able to find a mixture of generators that collectively covers $\textit{every}$ data point and thus $\textit{every}$ mode, with a lower-bounded generation probability. Constructing the generator mixture has a connection to the multiplicative weights update rule, upon which we propose our algorithm. We prove that our algorithm guarantees complete mode coverage. And our experiments on real and synthetic datasets confirm better mode coverage over recent approaches, ones that also use generator mixtures but rely on global statistical distances.
Submitted 24 October, 2019; v1 submitted 12 February, 2019;
originally announced February 2019.
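The multiplicative weights update rule mentioned in the abstract above can be illustrated in its generic textbook form. The sketch below is not the authors' algorithm; the function name `multiplicative_weights_step`, the step size `eta`, and the per-item `losses` are assumptions made purely for illustration.

```python
def multiplicative_weights_step(weights, losses, eta=0.5):
    """One generic multiplicative weights update (illustrative only).

    weights: current nonnegative weights, one per item (hypothetical input)
    losses:  losses in [0, 1], one per item (hypothetical input)
    eta:     step size, assumed to lie in (0, 1/2]
    """
    # Shrink each weight in proportion to the loss it incurred.
    updated = [w * (1.0 - eta * l) for w, l in zip(weights, losses)]
    total = sum(updated)
    # Renormalize so the weights again form a probability distribution.
    return [w / total for w in updated]


# The item with loss 1 loses weight relative to the item with loss 0.
weights = multiplicative_weights_step([0.5, 0.5], [1.0, 0.0])
```

Items that repeatedly incur high loss see their weight decay geometrically, which is the general mechanism by which such a rule shifts emphasis over rounds.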
-
Revealing interpretable object representations from human behavior
Authors:
Charles Y. Zheng,
Francisco Pereira,
Chris I. Baker,
Martin N. Hebart
Abstract:
To study how mental object representations are related to behavior, we estimated sparse, non-negative representations of objects using human behavioral judgments on images representative of 1,854 object categories. These representations predicted a latent similarity structure between objects, which captured most of the explainable variance in human behavioral judgments. Individual dimensions in the low-dimensional embedding were found to be highly reproducible and interpretable as conveying degrees of taxonomic membership, functionality, and perceptual attributes. We further demonstrated the predictive power of the embeddings for explaining other forms of human behavior, including categorization, typicality judgments, and feature ratings, suggesting that the dimensions reflect human conceptual representations of objects beyond the specific task.
Submitted 9 January, 2019;
originally announced January 2019.
-
On High Dimensional Covariate Adjustment for Estimating Causal Effects in Randomized Trials with Survival Outcomes
Authors:
Ran Dai,
Cheng Zheng,
Mei-Jie Zhang
Abstract:
The purpose of this work is to improve the efficiency in estimating the average causal effect (ACE) on the survival scale where right-censoring exists and high-dimensional covariate information is available. We propose new estimators using regularized survival regression and survival random forests (SRF) to adjust for the high-dimensional covariates and thereby improve efficiency. We study the behavior of the adjusted estimator under mild assumptions and show theoretical guarantees that the proposed estimators are asymptotically more efficient than the unadjusted ones when using SRF for adjustment. In addition, these adjusted estimators are $\sqrt{n}$-consistent and asymptotically normally distributed. The finite-sample behavior of our methods is studied by simulation, and the results are in agreement with the theoretical results. We also illustrate our methods by analyzing real data from transplant research to identify the relative effectiveness of identical sibling donors compared to unrelated donors with the adjustment of cytogenetic abnormalities.
Submitted 25 June, 2021; v1 submitted 5 December, 2018;
originally announced December 2018.
-
Knowing what you know in brain segmentation using Bayesian deep neural networks
Authors:
Patrick McClure,
Nao Rho,
John A. Lee,
Jakub R. Kaczmarzyk,
Charles Zheng,
Satrajit S. Ghosh,
Dylan Nielson,
Adam G. Thomas,
Peter Bandettini,
Francisco Pereira
Abstract:
In this paper, we describe a Bayesian deep neural network (DNN) for predicting FreeSurfer segmentations of structural MRI volumes, in minutes rather than hours. The network was trained and evaluated on a large dataset (n = 11,480), obtained by combining data from more than a hundred different sites, and also evaluated on another completely held-out dataset (n = 418). The network was trained using a novel spike-and-slab dropout-based variational inference approach. We show that, on these datasets, the proposed Bayesian DNN outperforms previously proposed methods, in terms of the similarity between the segmentation predictions and the FreeSurfer labels, and the usefulness of the estimate uncertainty of these predictions. In particular, we demonstrated that the prediction uncertainty of this network at each voxel is a good indicator of whether the network has made an error and that the uncertainty across the whole brain can predict the manual quality control ratings of a scan. The proposed Bayesian DNN method should be applicable to any new network architecture for addressing the segmentation problem.
Submitted 18 September, 2019; v1 submitted 3 December, 2018;
originally announced December 2018.
-
Auto-Encoding Knockoff Generator for FDR Controlled Variable Selection
Authors:
Ying Liu,
Cheng Zheng
Abstract:
A new statistical procedure (Model-X \cite{candes2018}) has provided a way to identify important factors using any supervised learning method while controlling for FDR. This line of research has shown great potential to expand the horizon of machine learning methods beyond the task of prediction, to serve the broader needs in scientific research for interpretable findings. However, the lack of a practical and flexible method to generate knockoffs remains the major obstacle for wide application of the Model-X procedure. This paper fills the gap by proposing a model-free knockoff generator which approximates the correlation structure between features through latent variable representation. We demonstrate that our proposed method can achieve FDR control and better power than two existing methods in various simulated settings and a real data example for finding mutations associated with drug resistance in HIV-1 patients.
Submitted 27 September, 2018;
originally announced September 2018.
-
Distributed Weight Consolidation: A Brain Segmentation Case Study
Authors:
Patrick McClure,
Charles Y. Zheng,
Jakub R. Kaczmarzyk,
John A. Lee,
Satrajit S. Ghosh,
Dylan Nielson,
Peter Bandettini,
Francisco Pereira
Abstract:
Collecting the large datasets needed to train deep neural networks can be very difficult, particularly for the many applications for which sharing and pooling data is complicated by practical, ethical, or legal concerns. However, it may be the case that derivative datasets or predictive models developed within individual sites can be shared and combined with fewer restrictions. Training on distributed data and combining the resulting networks is often viewed as continual learning, but these methods require networks to be trained sequentially. In this paper, we introduce distributed weight consolidation (DWC), a continual learning method to consolidate the weights of separate neural networks, each trained on an independent dataset. We evaluated DWC with a brain segmentation case study, where we consolidated dilated convolutional neural networks trained on independent structural magnetic resonance imaging (sMRI) datasets from different sites. We found that DWC led to increased performance on test sets from the different sites, while maintaining generalization performance for a very large and completely independent multi-site dataset, compared to an ensemble baseline.
Submitted 16 January, 2019; v1 submitted 28 May, 2018;
originally announced May 2018.
-
BourGAN: Generative Networks with Metric Embeddings
Authors:
Chang Xiao,
Peilin Zhong,
Changxi Zheng
Abstract:
This paper addresses the mode collapse for generative adversarial networks (GANs). We view modes as a geometric structure of data distribution in a metric space. Under this geometric lens, we embed subsamples of the dataset from an arbitrary metric space into the l2 space, while preserving their pairwise distance distribution. Not only does this metric embedding determine the dimensionality of the latent space automatically, it also enables us to construct a mixture of Gaussians to draw latent space random vectors. We use the Gaussian mixture model in tandem with a simple augmentation of the objective function to train GANs. Every major step of our method is supported by theoretical analysis, and our experiments on real and synthetic data confirm that the generator is able to produce samples spreading over most of the modes while avoiding unwanted samples, outperforming several recent GAN variants on a number of metrics and offering new features.
Submitted 2 December, 2018; v1 submitted 19 May, 2018;
originally announced May 2018.
-
Extrapolating Expected Accuracies for Large Multi-Class Problems
Authors:
Charles Zheng,
Rakesh Achanta,
Yuval Benjamini
Abstract:
The difficulty of multi-class classification generally increases with the number of classes. Using data from a subset of the classes, can we predict how well a classifier will scale with an increased number of classes? Under the assumptions that the classes are sampled identically and independently from a population, and that the classifier is based on independently learned scoring functions, we show that the expected accuracy when the classifier is trained on k classes is the (k-1)st moment of a certain distribution that can be estimated from data. We present an unbiased estimation method based on the theory, and demonstrate its application on a facial recognition example.
Submitted 27 December, 2017;
originally announced December 2017.
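The moment identity stated in the abstract above can be illustrated with a small numerical sketch. It assumes, as the abstract does, that classes are sampled independently; the quantity `u_values` (for each test example, the probability that a randomly drawn incorrect class scores below the correct class) and the function name `expected_accuracy` are hypothetical names introduced here. Plugging in empirical values this way is only an illustration and is not the unbiased estimator the authors describe.

```python
def expected_accuracy(u_values, k):
    """Average of u**(k-1) over test examples (illustrative only).

    u_values: for each test example, the probability that a randomly
              drawn incorrect class scores below the correct class
              (hypothetical input).
    k:        number of classes the classifier would be trained on.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # With k - 1 independently drawn competing classes, an example is
    # classified correctly only if all of them score below the correct
    # class, which has probability u**(k-1).
    return sum(u ** (k - 1) for u in u_values) / len(u_values)


acc_1 = expected_accuracy([1.0, 0.5], 1)  # a single class is always correct
acc_3 = expected_accuracy([1.0, 0.5], 3)  # (1.0**2 + 0.5**2) / 2
```

Because each term decays as `k` grows unless `u` equals 1, the sketch also shows why accuracy generally falls as classes are added.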
-
Model Selection Confidence Sets by Likelihood Ratio Testing
Authors:
Chao Zheng,
Davide Ferrari,
Yuhong Yang
Abstract:
The traditional activity of model selection aims at discovering a single model superior to other candidate models. In the presence of pronounced noise, however, multiple models are often found to explain the same data equally well. To resolve this model selection ambiguity, we introduce the general approach of model selection confidence sets (MSCSs) based on likelihood ratio testing. An MSCS is defined as a list of models statistically indistinguishable from the true model at a user-specified level of confidence, which extends the familiar notion of confidence intervals to the model-selection framework. Our approach guarantees asymptotically correct coverage probability of the true model when both sample size and model dimension increase. We derive conditions under which the MSCS contains all the relevant information about the true model structure. In addition, we propose natural statistics based on the MSCS to measure the importance of variables in a principled way that accounts for the overall model uncertainty. When the space of feasible models is large, the MSCS is implemented by an adaptive stochastic search algorithm which samples MSCS models with high probability. The MSCS methodology is illustrated through numerical experiments on synthetic data and real data examples.
Submitted 13 September, 2017;
originally announced September 2017.
-
Estimating mutual information in high dimensions via classification error
Authors:
Charles Y. Zheng,
Yuval Benjamini
Abstract:
Multivariate pattern analysis approaches in neuroimaging are fundamentally concerned with investigating the quantity and type of information processed by various regions of the human brain; typically, estimates of classification accuracy are used to quantify information. While an extensive and powerful library of methods can be applied to train and assess classifiers, it is not always clear how to use the resulting measures of classification performance to draw scientific conclusions, e.g., for the purpose of evaluating redundancy between brain regions. An additional confound for interpreting classification performance is the dependence of the error rate on the number and choice of distinct classes obtained for the classification task. In contrast, mutual information is a quantity defined independently of the experimental design, and has ideal properties for comparative analyses. Unfortunately, estimating the mutual information based on observations becomes statistically infeasible in high dimensions without some kind of assumption or prior.
In this paper, we construct a novel classification-based estimator of mutual information based on high-dimensional asymptotics. We show that in a particular limiting regime, the mutual information is an invertible function of the expected $k$-class Bayes error. While the theory is based on a large-sample, high-dimensional limit, we demonstrate through simulations that our proposed estimator has superior performance to the alternatives in problems of moderate dimensionality.
Submitted 10 October, 2016; v1 submitted 16 June, 2016;
originally announced June 2016.
-
How many faces can be recognized? Performance extrapolation for multi-class classification
Authors:
Charles Y. Zheng,
Rakesh Achanta,
Yuval Benjamini
Abstract:
The difficulty of multi-class classification generally increases with the number of classes. Using data from a subset of the classes, can we predict how well a classifier will scale with an increased number of classes? Under the assumption that the classes are sampled exchangeably, and under the assumption that the classifier is generative (e.g. QDA or Naive Bayes), we show that the expected accuracy when the classifier is trained on $k$ classes is the $(k-1)$st moment of a \emph{conditional accuracy distribution}, which can be estimated from data. This provides the theoretical foundation for performance extrapolation based on pseudolikelihood, unbiased estimation, and high-dimensional asymptotics. We investigate the robustness of our methods to non-generative classifiers in simulations and one optical character recognition example.
Submitted 16 June, 2016;
originally announced June 2016.
-
On a Shape-Invariant Hazard Regression Model
Authors:
Cheng Zheng,
Ying Qing Chen
Abstract:
In survival analysis, the Cox model is widely used for most clinical trial data. Alternatives include the additive hazard model, the accelerated failure time (AFT) model and a more general transformation model. All these models assume that the effects for all covariates are on the same scale. However, it is possible that for different covariates, the effects are on different scales. In this paper, we propose a shape-invariant hazard regression model that allows us to estimate the multiplicative treatment effect with adjustment of covariates that have non-multiplicative effects. We propose moment-based inference procedures for the regression parameters. We also discuss risk prediction and a goodness-of-fit test for our proposed model. Numerical studies show good finite-sample performance of our proposed estimator. We applied our method to Veteran's Administration (VA) lung cancer data and the HIVNET 012 data. For the latter, we found that single-dose nevirapine treatment significantly improves 18-month survival with appropriate adjustment for maternal CD4 counts and virus load.
Submitted 22 March, 2016;
originally announced March 2016.
-
Instrumental Variable with Competing Risk Model
Authors:
Cheng Zheng,
Ran Dai,
Parameswaran Hari,
Mei-Jie Zhang
Abstract:
In this paper, we discuss causal inference on the efficacy of a treatment or medication on a time-to-event outcome with competing risks. Although the treatment group can be randomized, there can be confounding between compliance and the outcome. Unmeasured confounding may exist even after adjustment for measured covariates. Instrumental variable (IV) methods are commonly used to yield consistent estimates of causal parameters in the presence of unmeasured confounding. Based on a semi-parametric additive hazard model for the subdistribution hazard, we propose an instrumental variable estimator to yield consistent estimation of efficacy in the presence of unmeasured confounding for competing risk settings. We derived the asymptotic properties of the proposed estimator. Simulation results show that the estimator performs well in finite samples. We applied our method to a real transplant data example and showed that the unmeasured confounding leads to significant bias in the estimation of the effect (about 50% attenuated).
Submitted 4 December, 2016; v1 submitted 6 March, 2016;
originally announced March 2016.
-
On estimating causal controlled direct and mediator effects for count outcomes without assuming sequential ignorability
Authors:
Cheng Zheng,
David C. Atkins,
Melissa A. Lewis,
Xiao-Hua Zhou
Abstract:
Causal mediation analysis is an important statistical method in social and medical studies, as it can provide insights about why an intervention works and inform the development of future interventions. Currently, most causal mediation methods focus on mediation effects defined on a mean scale. However, in health-risk studies, such as alcohol or risky sex, outcomes are typically count data and heavily skewed. Thus, mediation effects in these settings would be natural on a rate ratio scale, such as in Poisson and negative binomial regression methods. Existing methods also mainly rely on the assumption of no unmeasured confounding between mediator and outcome. To allow for potential confounders between the mediator and outcome, we define the direct and mediator effects on a new scale and propose a multiplicative structural mean model for mediation analysis with count outcomes. The estimator is compared with both Poisson and negative binomial regression methods assuming sequential ignorability using a simulation study and a real-world example about an alcohol-related intervention study. Mediation analyses using the new methods confirm the study hypothesis that the intervention decreases drinking by decreasing individuals' normative perceptions of alcohol use.
Submitted 25 January, 2016;
originally announced January 2016.
-
Ranking genetic factors related to age-related macular degeneration by variable selection confidence sets
Authors:
Chao Zheng,
Davide Ferrari,
Michael Zhang,
Paul Baird
Abstract:
The widespread use of generalized linear models in case-control genetic studies has helped identify many disease-associated risk factors typically defined as DNA variants, or single nucleotide polymorphisms (SNPs). Up to now, most literature has focused on selecting a unique best subset of SNPs based on some statistical perspectives. In the presence of pronounced noise, however, multiple biological paths are often found to be equally supported by a given dataset when dealing with complex genetic diseases. We address the ambiguity related to SNP selection by constructing a list of models called variable selection confidence set (VSCS), which contains the collection of all well-supported SNP combinations at a user-specified confidence level. The VSCS extends the familiar notion of confidence intervals in the variable selection setting and provides the practitioner with new tools aiding the variable selection activity beyond trusting a single model. Based on the VSCS, we consider natural graphical and numerical statistics measuring the inclusion importance of a SNP based on its frequency in the most parsimonious VSCS models. This work is motivated by available case-control genetic data on age-related macular degeneration, a widespread complex disease and leading cause of vision loss.
Submitted 4 March, 2019; v1 submitted 16 December, 2015;
originally announced December 2015.
-
Learning Summary Statistic for Approximate Bayesian Computation via Deep Neural Network
Authors:
Bai Jiang,
Tung-yu Wu,
Charles Zheng,
Wing H. Wong
Abstract:
Approximate Bayesian Computation (ABC) methods are used to approximate posterior distributions in models with unknown or computationally intractable likelihoods. Both the accuracy and computational efficiency of ABC depend on the choice of summary statistic, but outside of special cases where the optimal summary statistics are known, it is unclear which guiding principles can be used to construct effective summary statistics. In this paper we explore the possibility of automating the process of constructing summary statistics by training deep neural networks to predict the parameters from artificially generated data: the resulting summary statistics are approximately posterior means of the parameters. With minimal model-specific tuning, our method constructs summary statistics for the Ising model and the moving-average model, which match or exceed theoretically-motivated summary statistics in terms of the accuracies of the resulting posteriors.
Submitted 16 March, 2017; v1 submitted 7 October, 2015;
originally announced October 2015.
-
Two-Sample Smooth Tests for the Equality of Distributions
Authors:
Wen-Xin Zhou,
Chao Zheng,
Zhen Zhang
Abstract:
This paper considers the problem of testing the equality of two unspecified distributions. The classical omnibus tests such as the Kolmogorov-Smirnov and Cramér-von Mises are known to suffer from low power against essentially all but location-scale alternatives. We propose a new two-sample test that modifies Neyman's smooth test and extends it to the multivariate case based on the idea of projection pursuit. The asymptotic null property of the test and its power against local alternatives are studied. The multiplier bootstrap method is employed to compute the critical value of the multivariate test. We establish validity of the bootstrap approximation in the case where the dimension is allowed to grow with the sample size. Numerical studies show that the new testing procedures perform well even for small sample sizes and are powerful in detecting local features or high-frequency components.
Submitted 14 September, 2015; v1 submitted 11 September, 2015;
originally announced September 2015.
-
Reliable inference for complex models by discriminative composite likelihood estimation
Authors:
Davide Ferrari,
Chao Zheng
Abstract:
Composite likelihood estimation has an important role in the analysis of multivariate data for which the full likelihood function is intractable. An important issue in composite likelihood inference is the choice of the weights associated with lower-dimensional data sub-sets, since the presence of incompatible sub-models can deteriorate the accuracy of the resulting estimator. In this paper, we introduce a new approach for simultaneous parameter estimation by tilting, or re-weighting, each sub-likelihood component called discriminative composite likelihood estimation (D-McLE). The data-adaptive weights maximize the composite likelihood function, subject to moving a given distance from uniform weights; then, the resulting weights can be used to rank lower-dimensional likelihoods in terms of their influence in the composite likelihood function. Our analytical findings and numerical examples support the stability of the resulting estimator compared to estimators constructed using standard composition strategies based on uniform weights. The properties of the new method are illustrated through simulated data and real spatial data on multivariate precipitation extremes.
Submitted 14 December, 2015; v1 submitted 16 February, 2015;
originally announced February 2015.