
Spatially Adaptive Variable Screening in Presurgical fMRI Data Analysis

Abstract: Accurate delineation of tumor-adjacent functional brain regions is essential for planning function-preserving neurosurgery. Functional magnetic resonance imaging (fMRI) is increasingly used for presurgical counseling and planning. When analyzing presurgical fMRI data, false negatives are more dangerous to the patients than false positives because patients are more likely to experience significant harm from failing to identify functional regions and subsequently resecting critical tissues. In this paper, we propose a novel spatially adaptive variable screening procedure to enable effective control of false negatives while leveraging the spatial structure of fMRI data. Compared to existing statistical methods in fMRI data analysis, the new procedure directly controls false negatives at a desirable level and is completely data-driven. The new method is also substantially different from existing false-negative control procedures which do not take spatial information into account. Numerical examples show that the new method outperforms several state-of-the-art methods in retaining signal voxels, especially the subtle ones at the boundaries of functional regions, while providing cleaner separation of functional regions from background noise. Such results could be valuable to preserve critical tissues in neurosurgery.

Submitted 20 December, 2023; originally announced December 2023.

arXiv:2212.13574 [pdf, other]

Weak Signal Inclusion Under Dependence and Applications in Genome-wide Association Study

Authors: X. Jessie Jeng, Yifei Hu, Quan Sun, Yun Li

Abstract: Motivated by the inquiries of weak signals in underpowered genome-wide association studies (GWASs), we consider the problem of retaining true signals that are not strong enough to be individually separable from a large amount of noise. We address the challenge from the perspective of false negative control and present false negative control (FNC) screening, a data-driven method to efficiently regulate false negative proportion at a user-specified level. FNC screening is developed in a realistic setting with arbitrary covariance dependence between variables. We calibrate the overall dependence through a parameter whose scale is compatible with the existing phase diagram in high-dimensional sparse inference. Utilizing the new calibration, we asymptotically explicate the joint effect of covariance dependence, signal sparsity, and signal intensity on the proposed method. We interpret the results using a new phase diagram, which shows that FNC screening can efficiently select a set of candidate variables to retain a high proportion of signals even when the signals are not individually separable from noise. Finite sample performance of FNC screening is compared to those of several existing methods in simulation studies. The proposed method outperforms the others in adapting to a user-specified false negative control level. We implement FNC screening to empower a two-stage GWAS procedure, which demonstrates substantial power gain when working with limited sample sizes in real applications.

Submitted 2 February, 2024; v1 submitted 27 December, 2022; originally announced December 2022.

Comments: arXiv admin note: text overlap with arXiv:2006.15667

arXiv:2102.09053 [pdf, other]

Estimating The Proportion of Signal Variables Under Arbitrary Covariance Dependence

Authors: X. Jessie Jeng

Abstract: Estimating the proportion of signals hidden in a large amount of noise variables is of interest in many scientific inquiries. In this paper, we consider realistic but theoretically challenging settings with arbitrary covariance dependence between variables. We define mean absolute correlation (MAC) to measure the overall dependence level and investigate a family of estimators for their performances in the full range of MAC. We explicate the joint effect of MAC dependence and signal sparsity on the performances of the family of estimators and discover that no single estimator in the family is most powerful under different MAC dependence levels. Informed by the theoretical insight, we propose a new estimator to better adapt to arbitrary covariance dependence. The proposed method compares favorably to several existing methods in extensive finite-sample settings with strong to weak covariance dependence and real dependence structures from genetic association studies.

Submitted 9 April, 2021; v1 submitted 17 February, 2021; originally announced February 2021.

arXiv:2006.15667 [pdf, other]

Weak Signal Inclusion Under Sparsity and Dependence

Authors: X. Jessie Jeng, Yifei Hu

Abstract: We consider the scenario where important signals are not strong enough to be separable from a large amount of noise. Such weak signals commonly exist in large-scale data analysis and play vital roles in many biomedical applications. Existing methods, however, are mostly underpowered for such weak signals. We address the challenge from the perspective of false negative control and develop a new method to efficiently regulate false negative proportion at a user-specified level. The new method is developed in a realistic setting with arbitrary covariance dependence between variables. We calibrate the overall dependence through a parameter whose scale is compatible with the existing phase diagram in high-dimensional sparse inference. Utilizing the new calibration, we asymptotically explicate the joint effect of covariance dependence, signal sparsity, and signal intensity on the proposed method. We interpret the results using a new phase diagram, which shows that the proposed method can efficiently retain a high proportion of signals even when they cannot be well-separated from noise. Finite sample performance of the proposed method is compared to those of several existing methods in simulation studies. The proposed method outperforms the others in adapting to a user-specified false negative control level. We apply the new method to analyze an fMRI dataset to locate voxels that are functionally relevant to saccadic eye movements. The new method exhibits a nice balance in identifying functionally relevant regions and avoiding excessive noise voxels.

Submitted 24 January, 2022; v1 submitted 28 June, 2020; originally announced June 2020.

arXiv:1806.06304 [pdf, other]

Post-Lasso Inference for High-Dimensional Regression

Authors: X. Jessie Jeng, Huimin Peng, Wenbin Lu

Abstract: Among the most popular variable selection procedures in high-dimensional regression, Lasso provides a solution path to rank the variables and determines a cut-off position on the path to select variables and estimate coefficients. In this paper, we consider variable selection from a new perspective motivated by the frequently occurring phenomenon that relevant variables are not completely distinguishable from noise variables on the solution path. We propose to characterize the positions of the first noise variable and the last relevant variable on the path. We then develop a new variable selection procedure to control over-selection of the noise variables ranking after the last relevant variable, and, at the same time, retain a high proportion of relevant variables ranking before the first noise variable. Our procedure utilizes the recently developed covariance test statistic and Q statistic in post-selection inference. In numerical examples, our method compares favorably with other existing methods in selection accuracy and the ability to interpret its results.

Submitted 16 June, 2018; originally announced June 2018.

arXiv:1805.10570 [pdf, other]

Efficient Signal Inclusion With Genomic Applications

Authors: X. Jessie Jeng, Teng Zhang, Jung-Ying Tzeng

Abstract: This paper addresses the challenge of efficiently capturing a high proportion of true signals for subsequent data analyses when sample sizes are relatively limited with respect to data dimension. We propose the signal missing rate as a new measure for false negative control to account for the variability of false negative proportion. Novel data-adaptive procedures are developed to control signal missing rate without incurring many unnecessary false positives under dependence. We justify the efficiency and adaptivity of the proposed methods via theory and simulation. The proposed methods are applied to GWAS on human height to effectively remove irrelevant SNPs while retaining a high proportion of relevant SNPs for subsequent polygenic analysis.

Submitted 28 August, 2018; v1 submitted 26 May, 2018; originally announced May 2018.

arXiv:1805.05170 [pdf, ps, other]

FastLORS: Joint Modeling for eQTL Mapping in R

Authors: Jacob Rhyne, Eric Chi, Jung-Ying Tzeng, X. Jessie Jeng

Abstract: Yang et al. (2013) introduced LORS, a method that jointly models the expression of genes, SNPs, and hidden factors for eQTL mapping. LORS solves a convex optimization problem and has guaranteed convergence. However, it can be computationally expensive for large datasets. In this paper we introduce Fast-LORS, which uses the proximal gradient method to solve the LORS problem with significantly reduced computational burden. We apply Fast-LORS and LORS to data from the third phase of the International HapMap Project and obtain comparable results. Nevertheless, Fast-LORS shows substantial computational improvement compared to LORS.

Submitted 14 May, 2018; originally announced May 2018.

Comments: All functions are available in the FastLORS R package, available at https://github.com/jdrhyne2/FastLORS

arXiv:1804.03274 [pdf, other]

Efficient Predictor Ranking and False Discovery Proportion Control in High-Dimensional Regression

Authors: X. Jessie Jeng, Xiongzhi Chen

Abstract: We propose a ranking and selection procedure to prioritize relevant predictors and control false discovery proportion (FDP) of variable selection. Our procedure utilizes a new ranking method built upon the de-sparsified Lasso estimator. We show that the new ranking method achieves the optimal order of minimum non-zero effects in ranking relevant predictors ahead of irrelevant ones. Adopting the new ranking method, we develop a variable selection procedure to asymptotically control FDP at a user-specified level. We show that our procedure can consistently estimate the FDP of variable selection as long as the de-sparsified Lasso estimator is asymptotically normal. In numerical analyses, our procedure compares favorably to existing methods in ranking efficiency and FDP control when the regression model is relatively sparse.

Submitted 10 December, 2018; v1 submitted 9 April, 2018; originally announced April 2018.

Comments: 16 pages; 3 figures; this version accepted by Journal of Multivariate Analysis

MSC Class: 62H12; 62F12

arXiv:1804.02737 [pdf, other]

eQTL Mapping via Effective SNP Ranking and Screening

Authors: Jacob Rhyne, Jung-Ying Tzeng, Teng Zhang, X. Jessie Jeng

Abstract: Genome-wide eQTL mapping explores the relationship between gene expression values and DNA variants to understand genetic causes of human disease. Due to the large number of genes and DNA variants that need to be assessed simultaneously, current methods for eQTL mapping often suffer from low detection power, especially for identifying trans-eQTLs. In this paper, we propose a new method that utilizes advanced techniques in large-scale signal detection to pursue the structure of eQTL data and improve the power for eQTL mapping. The new method greatly reduces the burden of joint modeling by developing a new ranking and screening strategy based on the higher criticism statistic. Numerical results in simulation studies demonstrate the superior performance of our method in detecting true eQTLs with reduced computational expense. The proposed method is also evaluated in HapMap eQTL data analysis and the results are compared to a database of known eQTLs.

Submitted 8 April, 2018; originally announced April 2018.

arXiv:1305.0220 [pdf, ps, other]

Identification of Signal, Noise, and Indistinguishable Subsets in High-Dimensional Data Analysis

Authors: X. Jessie Jeng

Abstract: Motivated by applications in high-dimensional data analysis where strong signals often stand out easily and weak ones may be indistinguishable from the noise, we develop a statistical framework to provide a novel categorization of the data into the signal, noise, and indistinguishable subsets. The three-subset categorization is especially relevant under high-dimensionality as a large proportion of signals can be obscured by the large amount of noise. Understanding the three-subset phenomenon is important for the researchers in real applications to design efficient follow-up studies. For example, candidates belonging to the signal subset may have priority for more focused study, while those in the noise subset can be removed; and, for candidates in the indistinguishable subset, additional data may be collected to further separate weak signals from the noise. We develop an efficient data-driven procedure to identify the three subsets. Theoretical study shows that, under certain conditions, only signals are included in the identified signal subset while the remaining signals are included in the identified indistinguishable subset with high probability. Moreover, the proposed procedure adapts to the unknown signal intensity, so that the identified indistinguishable subset shrinks with the true indistinguishable subset when signals become stronger. The procedure is examined and compared with methods based on FDR control using Monte Carlo simulation. Further, it is applied successfully in a real-data application to identify genomic variants having different signal intensity.

Submitted 1 May, 2013; originally announced May 2013.

Comments: 30 pages

Showing 1–10 of 10 results for author: Jeng, X J