Assumption-lean and Data-adaptive Post-Prediction Inference
Authors:
Jiacheng Miao,
Xinran Miao,
Yixuan Wu,
Jiwei Zhao,
Qiongshi Lu
Abstract:
A primary challenge facing modern scientific research is the limited availability of gold-standard data which can be both costly and labor-intensive to obtain. With the rapid development of machine learning (ML), scientists have relied on ML algorithms to predict these gold-standard outcomes with easily obtained covariates. However, these predicted outcomes are often used directly in subsequent st…
▽ More
A primary challenge facing modern scientific research is the limited availability of gold-standard data which can be both costly and labor-intensive to obtain. With the rapid development of machine learning (ML), scientists have relied on ML algorithms to predict these gold-standard outcomes with easily obtained covariates. However, these predicted outcomes are often used directly in subsequent statistical analyses, ignoring imprecision and heterogeneity introduced by the prediction procedure. This will likely result in false positive findings and invalid scientific conclusions. In this work, we introduce an assumption-lean and data-adaptive Post-Prediction Inference (POP-Inf) procedure that allows valid and powerful inference based on ML-predicted outcomes. Its "assumption-lean" property guarantees reliable statistical inference without assumptions on the ML-prediction, for a wide range of statistical quantities. Its "data-adaptive'" feature guarantees an efficiency gain over existing post-prediction inference methods, regardless of the accuracy of ML-prediction. We demonstrate the superiority and applicability of our method through simulations and large-scale genomic data.
△ Less
Submitted 6 February, 2024; v1 submitted 23 November, 2023;
originally announced November 2023.
Source data selection for out-of-domain generalization
Authors:
Xinran Miao,
Kris Sankaran
Abstract:
Models that perform out-of-domain generalization borrow knowledge from heterogeneous source data and apply it to a related but distinct target task. Transfer learning has proven effective for accomplishing this generalization in many applications. However, poor selection of a source dataset can lead to poor performance on the target, a phenomenon called negative transfer. In order to take full adv…
▽ More
Models that perform out-of-domain generalization borrow knowledge from heterogeneous source data and apply it to a related but distinct target task. Transfer learning has proven effective for accomplishing this generalization in many applications. However, poor selection of a source dataset can lead to poor performance on the target, a phenomenon called negative transfer. In order to take full advantage of available source data, this work studies source data selection with respect to a target task. We propose two source selection methods that are based on the multi-bandit theory and random search, respectively. We conduct a thorough empirical evaluation on both simulated and real data. Our proposals can be also viewed as diagnostics for the existence of a reweighted source subsamples that perform better than the random selection of available samples.
△ Less
Submitted 4 February, 2022;
originally announced February 2022.
DeGNN: Characterizing and Improving Graph Neural Networks with Graph Decomposition
Authors:
Xupeng Miao,
Nezihe Merve Gürel,
Wentao Zhang,
Zhichao Han,
Bo Li,
Wei Min,
Xi Rao,
Hansheng Ren,
Yinan Shan,
Yingxia Shao,
Yujie Wang,
Fan Wu,
Hui Xue,
Yaming Yang,
Zitao Zhang,
Yang Zhao,
Shuai Zhang,
Yu**g Wang,
Bin Cui,
Ce Zhang
Abstract:
Despite the wide application of Graph Convolutional Network (GCN), one major limitation is that it does not benefit from the increasing depth and suffers from the oversmoothing problem. In this work, we first characterize this phenomenon from the information-theoretic perspective and show that under certain conditions, the mutual information between the output after $l$ layers and the input of GCN…
▽ More
Despite the wide application of Graph Convolutional Network (GCN), one major limitation is that it does not benefit from the increasing depth and suffers from the oversmoothing problem. In this work, we first characterize this phenomenon from the information-theoretic perspective and show that under certain conditions, the mutual information between the output after $l$ layers and the input of GCN converges to 0 exponentially with respect to $l$. We also show that, on the other hand, graph decomposition can potentially weaken the condition of such convergence rate, which enabled our analysis for GraphCNN. While different graph structures can only benefit from the corresponding decomposition, in practice, we propose an automatic connectivity-aware graph decomposition algorithm, DeGNN, to improve the performance of general graph neural networks. Extensive experiments on widely adopted benchmark datasets demonstrate that DeGNN can not only significantly boost the performance of corresponding GNNs, but also achieves the state-of-the-art performances.
△ Less
Submitted 29 June, 2020; v1 submitted 10 October, 2019;
originally announced October 2019.