Search | arXiv e-print repository

Inference with non-differentiable surrogate loss in a general high-dimensional classification framework

Authors: Muxuan Liang, Yang Ning, Maureen A Smith, Ying-Qi Zhao

Abstract: Penalized empirical risk minimization with a surrogate loss function is often used to derive a high-dimensional linear decision rule in classification problems. Although much of the literature focuses on the generalization error, there is a lack of valid inference procedures to identify the driving factors of the estimated decision rule, especially when the surrogate loss is non-differentiable. In this work, we propose a kernel-smoothed decorrelated score to construct hypothesis testing and interval estimations for the linear decision rule estimated using a piece-wise linear surrogate loss, which has a discontinuous gradient and non-regular Hessian. Specifically, we adopt kernel approximations to smooth the discontinuous gradient near discontinuity points and approximate the non-regular Hessian of the surrogate loss. In applications where additional nuisance parameters are involved, we propose a novel cross-fitted version to accommodate flexible nuisance estimates and kernel approximations. We establish the limiting distribution of the kernel-smoothed decorrelated score and its cross-fitted version in a high-dimensional setup. Simulation and real data analysis are conducted to demonstrate the validity and superiority of the proposed method.

Submitted 19 May, 2024; originally announced May 2024.

Comments: 27 pages, 4 figures

arXiv:2207.02289

Handling Nonmonotone Missing Data with Available Complete-Case Missing Value Assumption

Authors: Gang Cheng, Yen-Chi Chen, Maureen A. Smith, Ying-Qi Zhao

Abstract: Nonmonotone missing data is a common problem in scientific studies. The conventional ignorability and missing-at-random (MAR) conditions are unlikely to hold for nonmonotone missing data, and data analysis can be very challenging with few complete data. In this paper, we introduce the available complete-case missing value (ACCMV) assumption for handling nonmonotone and missing-not-at-random (MNAR) problems. Our ACCMV assumption is applicable to data sets with a small set of complete observations, and we show that the ACCMV assumption leads to nonparametric identification of the distribution for the variables of interest. We further propose an inverse probability weighting estimator, a regression adjustment estimator, and a multiply-robust estimator for estimating a parameter of interest. We studied the underlying asymptotic and efficiency theories of the proposed estimators. We show the validity of our method with simulation studies and further illustrate the applicability of our method by applying it to a diabetes data set from electronic health records.

Submitted 5 July, 2022; originally announced July 2022.

Comments: 48 pages

arXiv:2105.03508

Cross-Population Amplitude Coupling in High-Dimensional Oscillatory Neural Time Series

Authors: Heejong Bong, Valérie Ventura, Eric A. Yttri, Matthew A. Smith, Robert E. Kass

Abstract: An important outstanding problem in analysis of neural data is to characterize interactions across brain regions from high-dimensional multiple-electrode recordings during a behavioral experiment. A leading theory, based on a considerable body of research, is that oscillations represent coordinated activity across populations of neurons. We sought to quantify time-varying covariation of oscillatory amplitudes across two brain regions, during a memory task, based on neural potentials recorded from 96 electrodes in each region. We extended probabilistic Canonical Correlation Analysis (CCA) to the time series setting, which provides a new interpretation of multiset CCA based on cross-correlation of latent time series. Because the latent time series covariance matrix is high-dimensional, we assumed sparsity of partial correlations within a range of possible interesting time series lead-lag effects to derive procedures for estimation and inference. We found the resulting methodology to perform well in realistic settings, and we applied it to data recorded from prefrontal cortex and visual area V4 to produce results that are highly plausible based on existing literature.

Submitted 17 January, 2023; v1 submitted 7 May, 2021; originally announced May 2021.

Comments: 21 pages, 12 figures, submitted to The Annals of Applied Statistics

MSC Class: 62P10 (Primary) 62H22; 62H25; 92B20 (Secondary)

arXiv:2007.04445

Estimation and inference on high-dimensional individualized treatment rule in observational data using split-and-pooled de-correlated score

Authors: Muxuan Liang, Young-Geun Choi, Yang Ning, Maureen A Smith, Ying-Qi Zhao

Abstract: With the increasing adoption of electronic health records, there is an increasing interest in developing individualized treatment rules, which recommend treatments according to patients' characteristics, from large observational data. However, there is a lack of valid inference procedures for such rules developed from this type of data in the presence of high-dimensional covariates. In this work, we develop a penalized doubly robust method to estimate the optimal individualized treatment rule from high-dimensional data. We propose a split-and-pooled de-correlated score to construct hypothesis tests and confidence intervals. Our proposal utilizes the data splitting to conquer the slow convergence rate of nuisance parameter estimations, such as non-parametric methods for outcome regression or propensity models. We establish the limiting distributions of the split-and-pooled de-correlated score test and the corresponding one-step estimator in high-dimensional setting. Simulation and real data analysis are conducted to demonstrate the superiority of the proposed method.

Submitted 3 May, 2021; v1 submitted 8 July, 2020; originally announced July 2020.

Comments: 15 pages, 2 figures, 2 tables

arXiv:1307.3495

False discovery rate regression: an application to neural synchrony detection in primary visual cortex

Authors: James G. Scott, Ryan C. Kelly, Matthew A. Smith, Pengcheng Zhou, Robert E. Kass

Abstract: Many approaches for multiple testing begin with the assumption that all tests in a given study should be combined into a global false-discovery-rate analysis. But this may be inappropriate for many of today's large-scale screening problems, where auxiliary information about each test is often available, and where a combined analysis can lead to poorly calibrated error rates within different subsets of the experiment. To address this issue, we introduce an approach called false-discovery-rate regression that directly uses this auxiliary information to inform the outcome of each test. The method can be motivated by a two-groups model in which covariates are allowed to influence the local false discovery rate, or equivalently, the posterior probability that a given observation is a signal. This poses many subtle issues at the interface between inference and computation, and we investigate several variations of the overall approach. Simulation evidence suggests that: (1) when covariate effects are present, FDR regression improves power for a fixed false-discovery rate; and (2) when covariate effects are absent, the method is robust, in the sense that it does not lead to inflated error rates. We apply the method to neural recordings from primary visual cortex. The goal is to detect pairs of neurons that exhibit fine-time-scale interactions, in the sense that they fire together more often than expected due to chance. Our method detects roughly 50% more synchronous pairs versus a standard FDR-controlling analysis. The companion R package FDRreg implements all methods described in the paper.

Submitted 8 June, 2014; v1 submitted 12 July, 2013; originally announced July 2013.

Showing 1–5 of 5 results for author: Smith, M A