-
Inference with non-differentiable surrogate loss in a general high-dimensional classification framework
Authors:
Muxuan Liang,
Yang Ning,
Maureen A Smith,
Ying-Qi Zhao
Abstract:
Penalized empirical risk minimization with a surrogate loss function is often used to derive a high-dimensional linear decision rule in classification problems. Although much of the literature focuses on the generalization error, there is a lack of valid inference procedures to identify the driving factors of the estimated decision rule, especially when the surrogate loss is non-differentiable. In this work, we propose a kernel-smoothed decorrelated score to construct hypothesis testing and interval estimations for the linear decision rule estimated using a piece-wise linear surrogate loss, which has a discontinuous gradient and non-regular Hessian. Specifically, we adopt kernel approximations to smooth the discontinuous gradient near discontinuity points and approximate the non-regular Hessian of the surrogate loss. In applications where additional nuisance parameters are involved, we propose a novel cross-fitted version to accommodate flexible nuisance estimates and kernel approximations. We establish the limiting distribution of the kernel-smoothed decorrelated score and its cross-fitted version in a high-dimensional setup. Simulation and real data analysis are conducted to demonstrate the validity and superiority of the proposed method.
Submitted 19 May, 2024;
originally announced May 2024.
-
Handling Nonmonotone Missing Data with Available Complete-Case Missing Value Assumption
Authors:
Gang Cheng,
Yen-Chi Chen,
Maureen A. Smith,
Ying-Qi Zhao
Abstract:
Nonmonotone missing data is a common problem in scientific studies. The conventional ignorability and missing-at-random (MAR) conditions are unlikely to hold for nonmonotone missing data and data analysis can be very challenging with few complete data. In this paper, we introduce the available complete-case missing value (ACCMV) assumption for handling nonmonotone and missing-not-at-random (MNAR) problems. Our ACCMV assumption is applicable to data sets with a small set of complete observations, and we show that the ACCMV assumption leads to nonparametric identification of the distribution for the variables of interest. We further propose an inverse probability weighting estimator, a regression adjustment estimator, and a multiply-robust estimator for estimating a parameter of interest. We study the underlying asymptotic and efficiency theories of the proposed estimators. We show the validity of our method with simulation studies and further illustrate the applicability of our method by applying it to a diabetes data set from electronic health records.
Submitted 5 July, 2022;
originally announced July 2022.
-
Cross-Population Amplitude Coupling in High-Dimensional Oscillatory Neural Time Series
Authors:
Heejong Bong,
Valérie Ventura,
Eric A. Yttri,
Matthew A. Smith,
Robert E. Kass
Abstract:
An important outstanding problem in analysis of neural data is to characterize interactions across brain regions from high-dimensional multiple-electrode recordings during a behavioral experiment. A leading theory, based on a considerable body of research, is that oscillations represent coordinated activity across populations of neurons. We sought to quantify time-varying covariation of oscillatory amplitudes across two brain regions, during a memory task, based on neural potentials recorded from 96 electrodes in each region. We extended probabilistic Canonical Correlation Analysis (CCA) to the time series setting, which provides a new interpretation of multiset CCA based on cross-correlation of latent time series. Because the latent time series covariance matrix is high-dimensional, we assumed sparsity of partial correlations within a range of possible interesting time series lead-lag effects to derive procedures for estimation and inference. We found the resulting methodology to perform well in realistic settings, and we applied it to data recorded from prefrontal cortex and visual area V4 to produce results that are highly plausible based on existing literature.
Submitted 17 January, 2023; v1 submitted 7 May, 2021;
originally announced May 2021.
-
Estimation and inference on high-dimensional individualized treatment rule in observational data using split-and-pooled de-correlated score
Authors:
Muxuan Liang,
Young-Geun Choi,
Yang Ning,
Maureen A Smith,
Ying-Qi Zhao
Abstract:
With the increasing adoption of electronic health records, there is an increasing interest in developing individualized treatment rules, which recommend treatments according to patients' characteristics, from large observational data. However, there is a lack of valid inference procedures for such rules developed from this type of data in the presence of high-dimensional covariates. In this work, we develop a penalized doubly robust method to estimate the optimal individualized treatment rule from high-dimensional data. We propose a split-and-pooled de-correlated score to construct hypothesis tests and confidence intervals. Our proposal utilizes data splitting to conquer the slow convergence rate of nuisance parameter estimations, such as non-parametric methods for outcome regression or propensity models. We establish the limiting distributions of the split-and-pooled de-correlated score test and the corresponding one-step estimator in the high-dimensional setting. Simulation and real data analysis are conducted to demonstrate the superiority of the proposed method.
Submitted 3 May, 2021; v1 submitted 8 July, 2020;
originally announced July 2020.
-
False discovery rate regression: an application to neural synchrony detection in primary visual cortex
Authors:
James G. Scott,
Ryan C. Kelly,
Matthew A. Smith,
Pengcheng Zhou,
Robert E. Kass
Abstract:
Many approaches for multiple testing begin with the assumption that all tests in a given study should be combined into a global false-discovery-rate analysis. But this may be inappropriate for many of today's large-scale screening problems, where auxiliary information about each test is often available, and where a combined analysis can lead to poorly calibrated error rates within different subsets of the experiment. To address this issue, we introduce an approach called false-discovery-rate regression that directly uses this auxiliary information to inform the outcome of each test. The method can be motivated by a two-groups model in which covariates are allowed to influence the local false discovery rate, or equivalently, the posterior probability that a given observation is a signal. This poses many subtle issues at the interface between inference and computation, and we investigate several variations of the overall approach. Simulation evidence suggests that: (1) when covariate effects are present, FDR regression improves power for a fixed false-discovery rate; and (2) when covariate effects are absent, the method is robust, in the sense that it does not lead to inflated error rates. We apply the method to neural recordings from primary visual cortex. The goal is to detect pairs of neurons that exhibit fine-time-scale interactions, in the sense that they fire together more often than expected due to chance. Our method detects roughly 50% more synchronous pairs versus a standard FDR-controlling analysis. The companion R package FDRreg implements all methods described in the paper.
Submitted 8 June, 2014; v1 submitted 12 July, 2013;
originally announced July 2013.