Search | arXiv e-print repository

Conformal Classification with Equalized Coverage for Adaptively Selected Groups

Abstract: This paper introduces a conformal inference method to evaluate uncertainty in classification by generating prediction sets with valid coverage conditional on adaptively chosen features. These features are carefully selected to reflect potential model limitations or biases. This can be useful to find a practical compromise between efficiency -- by providing informative predictions -- and algorithmi… ▽ More This paper introduces a conformal inference method to evaluate uncertainty in classification by generating prediction sets with valid coverage conditional on adaptively chosen features. These features are carefully selected to reflect potential model limitations or biases. This can be useful to find a practical compromise between efficiency -- by providing informative predictions -- and algorithmic fairness -- by ensuring equalized coverage for the most sensitive groups. We demonstrate the validity and effectiveness of this method on simulated and real data sets. △ Less

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2404.17561 [pdf, other]

Structured Conformal Inference for Matrix Completion with Applications to Group Recommender Systems

Authors: Ziyi Liang, Tianmin Xie, Xin Tong, Matteo Sesia

Abstract: We develop a conformal inference method to construct joint confidence regions for structured groups of missing entries within a sparsely observed matrix. This method is useful to provide reliable uncertainty estimation for group-level collaborative filtering; for example, it can be applied to help suggest a movie for a group of friends to watch together. Unlike standard conformal techniques, which… ▽ More We develop a conformal inference method to construct joint confidence regions for structured groups of missing entries within a sparsely observed matrix. This method is useful to provide reliable uncertainty estimation for group-level collaborative filtering; for example, it can be applied to help suggest a movie for a group of friends to watch together. Unlike standard conformal techniques, which make inferences for one individual at a time, our method achieves stronger group-level guarantees by carefully assembling a structured calibration data set mimicking the patterns expected among the test group of interest. We propose a generalized weighted conformalization framework to deal with the lack of exchangeability arising from such structured calibration, and in this process we introduce several innovations to overcome computational challenges. The practicality and effectiveness of our method are demonstrated through extensive numerical experiments and an analysis of the MovieLens 100K data set. △ Less

Submitted 26 April, 2024; originally announced April 2024.

arXiv:2404.03163 [pdf, other]

Uncertainty in Language Models: Assessment through Rank-Calibration

Authors: Xinmeng Huang, Shuo Li, Mengxin Yu, Matteo Sesia, Hamed Hassani, Insup Lee, Osbert Bastani, Edgar Dobriban

Abstract: Language Models (LMs) have shown promising performance in natural language generation. However, as LMs often generate incorrect or hallucinated responses, it is crucial to correctly quantify their uncertainty in responding to given inputs. In addition to verbalized confidence elicited via prompting, many uncertainty measures ($e.g.$, semantic entropy and affinity-graph-based measures) have been pr… ▽ More Language Models (LMs) have shown promising performance in natural language generation. However, as LMs often generate incorrect or hallucinated responses, it is crucial to correctly quantify their uncertainty in responding to given inputs. In addition to verbalized confidence elicited via prompting, many uncertainty measures ($e.g.$, semantic entropy and affinity-graph-based measures) have been proposed. However, these measures can differ greatly, and it is unclear how to compare them, partly because they take values over different ranges ($e.g.$, $[0,\infty)$ or $[0,1]$). In this work, we address this issue by develo** a novel and practical framework, termed $Rank$-$Calibration$, to assess uncertainty and confidence measures for LMs. Our key tenet is that higher uncertainty (or lower confidence) should imply lower generation quality, on average. Rank-calibration quantifies deviations from this ideal relationship in a principled manner, without requiring ad hoc binary thresholding of the correctness score ($e.g.$, ROUGE or METEOR). The broad applicability and the granular interpretability of our methods are demonstrated empirically. △ Less

Submitted 3 April, 2024; originally announced April 2024.

arXiv:2402.09623 [pdf, other]

Conformalized Adaptive Forecasting of Heterogeneous Trajectories

Authors: Yanfei Zhou, Lars Lindemann, Matteo Sesia

Abstract: This paper presents a new conformal method for generating simultaneous forecasting bands guaranteed to cover the entire path of a new random trajectory with sufficiently high probability. Prompted by the need for dependable uncertainty estimates in motion planning applications where the behavior of diverse objects may be more or less unpredictable, we blend different techniques from online conform… ▽ More This paper presents a new conformal method for generating simultaneous forecasting bands guaranteed to cover the entire path of a new random trajectory with sufficiently high probability. Prompted by the need for dependable uncertainty estimates in motion planning applications where the behavior of diverse objects may be more or less unpredictable, we blend different techniques from online conformal prediction of single and multiple time series, as well as ideas for addressing heteroscedasticity in regression. This solution is both principled, providing precise finite-sample guarantees, and effective, often leading to more informative predictions than prior methods. △ Less

Submitted 15 May, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

arXiv:2309.15408 [pdf, other]

A smoothed-Bayesian approach to frequency recovery from sketched data

Authors: Mario Beraha, Stefano Favaro, Matteo Sesia

Abstract: We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory: recovering the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing. Departing from traditional algorithmic approaches, recent works have proposed Bayesian nonparametric (BNP) methods… ▽ More We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory: recovering the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing. Departing from traditional algorithmic approaches, recent works have proposed Bayesian nonparametric (BNP) methods that can provide more informative frequency estimates by leveraging modeling assumptions about the distribution of the sketched data. In this paper, we propose a {\em smoothed-Bayesian} method, inspired by existing BNP approaches but designed in a frequentist framework to overcome the computational limitations of the BNP approaches when dealing with large-scale data from realistic distributions, including those with power-law tail behaviors. For sketches obtained with a single hash function, our approach is supported by rigorous frequentist properties, including unbiasedness and optimality under a squared error loss function within an intuitive class of linear estimators. For sketches with multiple hash functions, we introduce an approach based on \emph{multi-view} learning to construct computationally efficient frequency estimators. We validate our method on synthetic and real data, comparing its performance to that of existing alternatives. △ Less

Submitted 12 June, 2024; v1 submitted 27 September, 2023; originally announced September 2023.

arXiv:2309.05092 [pdf, other]

Adaptive conformal classification with noisy labels

Authors: Matteo Sesia, Y. X. Rachel Wang, Xin Tong

Abstract: This paper develops novel conformal prediction methods for classification tasks that can automatically adapt to random label contamination in the calibration sample, leading to more informative prediction sets with stronger coverage guarantees compared to state-of-the-art approaches. This is made possible by a precise characterization of the effective coverage inflation (or deflation) suffered by… ▽ More This paper develops novel conformal prediction methods for classification tasks that can automatically adapt to random label contamination in the calibration sample, leading to more informative prediction sets with stronger coverage guarantees compared to state-of-the-art approaches. This is made possible by a precise characterization of the effective coverage inflation (or deflation) suffered by standard conformal inferences in the presence of label contamination, which is then made actionable through new calibration algorithms. Our solution is flexible and can leverage different modeling assumptions about the label contamination process, while requiring no knowledge of the underlying data distribution or of the inner workings of the machine-learning classifier. The advantages of the proposed methods are demonstrated through extensive simulations and an application to object classification with the CIFAR-10H image data set. △ Less

Submitted 21 February, 2024; v1 submitted 10 September, 2023; originally announced September 2023.

Comments: 28 pages (127 pages including references and appendices)

arXiv:2303.15029 [pdf, other]

Random measure priors in Bayesian recovery from sketches

Authors: Mario Beraha, Stefano Favaro, Matteo Sesia

Abstract: This paper introduces a Bayesian nonparametric approach to frequency recovery from lossy-compressed discrete data, leveraging all information contained in a sketch obtained through random hashing. By modeling the data points as random samples from an unknown discrete distribution endowed with a Poisson-Kingman prior, we derive the posterior distribution of a symbol's empirical frequency given the… ▽ More This paper introduces a Bayesian nonparametric approach to frequency recovery from lossy-compressed discrete data, leveraging all information contained in a sketch obtained through random hashing. By modeling the data points as random samples from an unknown discrete distribution endowed with a Poisson-Kingman prior, we derive the posterior distribution of a symbol's empirical frequency given the sketch. This leads to principled frequency estimates through mean functionals, e.g., the posterior mean, median and mode. We highlight applications of this general result to Dirichlet process and Pitman-Yor process priors. Notably, we prove that the former prior uniquely satisfies a sufficiency property that simplifies the posterior distribution, while the latter enables a convenient large-sample asymptotic approximation. Additionally, we extend our approach to the problem of cardinality recovery, estimating the number of distinct symbols in the sketched dataset. Our approach to frequency recovery also adapts to a more general ``traits'' setting, where each data point has integer levels of association with multiple symbols, typically referred to as ``traits''. By employing a generalized Indian buffet process, we compute the posterior distribution of a trait's frequency using both the Poisson and Bernoulli distributions for the trait association levels, respectively yielding exact and approximate posterior frequency distributions. △ Less

Submitted 4 June, 2024; v1 submitted 27 March, 2023; originally announced March 2023.

arXiv:2302.07294 [pdf, other]

Derandomized Novelty Detection with FDR Control via Conformal E-values

Authors: Meshi Bashari, Amir Epstein, Yaniv Romano, Matteo Sesia

Abstract: Conformal inference provides a general distribution-free method to rigorously calibrate the output of any machine learning algorithm for novelty detection. While this approach has many strengths, it has the limitation of being randomized, in the sense that it may lead to different results when analyzing twice the same data, and this can hinder the interpretation of any findings. We propose to make… ▽ More Conformal inference provides a general distribution-free method to rigorously calibrate the output of any machine learning algorithm for novelty detection. While this approach has many strengths, it has the limitation of being randomized, in the sense that it may lead to different results when analyzing twice the same data, and this can hinder the interpretation of any findings. We propose to make conformal inferences more stable by leveraging suitable conformal e-values instead of p-values to quantify statistical significance. This solution allows the evidence gathered from multiple analyses of the same data to be aggregated effectively while provably controlling the false discovery rate. Further, we show that the proposed method can reduce randomness without much loss of power compared to standard conformal inference, partly thanks to an innovative way of weighting conformal e-values based on additional side information carefully extracted from the same data. Simulations with synthetic and real data confirm this solution can be effective at eliminating random noise in the inferences obtained with state-of-the-art alternative techniques, sometimes also leading to higher power. △ Less

Submitted 23 October, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

Comments: 35 pages, 24 figures

arXiv:2301.11556 [pdf, other]

Conformal inference is (almost) free for neural networks trained with early stop**

Authors: Ziyi Liang, Yanfei Zhou, Matteo Sesia

Abstract: Early stop** based on hold-out data is a popular regularization technique designed to mitigate overfitting and increase the predictive accuracy of neural networks. Models trained with early stop** often provide relatively accurate predictions, but they generally still lack precise statistical guarantees unless they are further calibrated using independent hold-out data. This paper addresses th… ▽ More Early stop** based on hold-out data is a popular regularization technique designed to mitigate overfitting and increase the predictive accuracy of neural networks. Models trained with early stop** often provide relatively accurate predictions, but they generally still lack precise statistical guarantees unless they are further calibrated using independent hold-out data. This paper addresses the above limitation with conformalized early stop**: a novel method that combines early stop** with conformal calibration while efficiently recycling the same hold-out data. This leads to models that are both accurate and able to provide exact predictive inferences without multiple data splits nor overly conservative adjustments. Practical implementations are developed for different learning tasks -- outlier detection, multi-class classification, regression -- and their competitive performance is demonstrated on real data. △ Less

Submitted 26 June, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

Comments: Updates: extension to quantile regression, some further details about methodology, more numerical experiments

arXiv:2211.04612 [pdf, other]

Conformal Frequency Estimation using Discrete Sketched Data with Coverage for Distinct Queries

Authors: Matteo Sesia, Stefano Favaro, Edgar Dobriban

Abstract: This paper develops conformal inference methods to construct a confidence interval for the frequency of a queried object in a very large discrete data set, based on a sketch with a lower memory footprint. This approach requires no knowledge of the data distribution and can be combined with any sketching algorithm, including but not limited to the renowned count-min sketch, the count-sketch, and va… ▽ More This paper develops conformal inference methods to construct a confidence interval for the frequency of a queried object in a very large discrete data set, based on a sketch with a lower memory footprint. This approach requires no knowledge of the data distribution and can be combined with any sketching algorithm, including but not limited to the renowned count-min sketch, the count-sketch, and variations thereof. After explaining how to achieve marginal coverage for exchangeable random queries, we extend our solution to provide stronger inferences that can account for the discreteness of the data and for heterogeneous query frequencies, increasing also robustness to possible distribution shifts. These results are facilitated by a novel conformal calibration technique that guarantees valid coverage for a large fraction of distinct random queries. Finally, we show our methods have improved empirical performance compared to existing frequentist and Bayesian alternatives in simulations as well as in examples of text and SARS-CoV-2 DNA data. △ Less

Submitted 15 August, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

Comments: 79 pages, 47 figures, 2 tables. Extended version of arXiv:2204.04270

arXiv:2209.02135 [pdf, other]

Bayesian nonparametric estimation of coverage probabilities and distinct counts from sketched data

Authors: Stefano Favaro, Matteo Sesia

Abstract: The estimation of coverage probabilities, and in particular of the missing mass, is a classical statistical problem with applications in numerous scientific fields. In this paper, we study this problem in relation to randomized data compression, or sketching. This is a novel but practically relevant perspective, and it refers to situations in which coverage probabilities must be estimated based on… ▽ More The estimation of coverage probabilities, and in particular of the missing mass, is a classical statistical problem with applications in numerous scientific fields. In this paper, we study this problem in relation to randomized data compression, or sketching. This is a novel but practically relevant perspective, and it refers to situations in which coverage probabilities must be estimated based on a compressed and imperfect summary, or sketch, of the true data, because neither the full data nor the empirical frequencies of distinct symbols can be observed directly. Our contribution is a Bayesian nonparametric methodology to estimate coverage probabilities from data sketched through random hashing, which also solves the challenging problems of recovering the numbers of distinct counts in the true data and of distinct counts with a specified empirical frequency of interest. The proposed Bayesian estimators are shown to be easily applicable to large-scale analyses in combination with a Dirichlet process prior, although they involve some open computational challenges under the more general Pitman-Yor process prior. The empirical effectiveness of our methodology is demonstrated through numerical experiments and applications to real data sets of Covid DNA sequences, classic English literature, and IP addresses. △ Less

Submitted 5 September, 2022; originally announced September 2022.

Comments: 35 pages

arXiv:2208.11111 [pdf, other]

Integrative conformal p-values for powerful out-of-distribution testing with labeled outliers

Authors: Ziyi Liang, Matteo Sesia, Wenguang Sun

Abstract: This paper develops novel conformal methods to test whether a new observation was sampled from the same distribution as a reference set. Blending inductive and transductive conformal inference in an innovative way, the described methods can re-weight standard conformal p-values based on dependent side information from known out-of-distribution data in a principled way, and can automatically take a… ▽ More This paper develops novel conformal methods to test whether a new observation was sampled from the same distribution as a reference set. Blending inductive and transductive conformal inference in an innovative way, the described methods can re-weight standard conformal p-values based on dependent side information from known out-of-distribution data in a principled way, and can automatically take advantage of the most powerful model from any collection of one-class and binary classifiers. The solution can be implemented either through sample splitting or via a novel transductive cross-validation+ scheme which may also be useful in other applications of conformal inference, due to tighter guarantees compared to existing cross-validation approaches. After studying false discovery rate control and power within a multiple testing framework with several possible outliers, the proposed solution is shown to outperform standard conformal p-values through simulations as well as applications to image recognition and tabular data. △ Less

Submitted 23 August, 2022; originally announced August 2022.

arXiv:2206.00885 [pdf, other]

Coordinated Double Machine Learning

Authors: Nitai Fingerhut, Matteo Sesia, Yaniv Romano

Abstract: Double machine learning is a statistical method for leveraging complex black-box models to construct approximately unbiased treatment effect estimates given observational data with high-dimensional covariates, under the assumption of a partially linear model. The idea is to first fit on a subset of the samples two non-linear predictive models, one for the continuous outcome of interest and one for… ▽ More Double machine learning is a statistical method for leveraging complex black-box models to construct approximately unbiased treatment effect estimates given observational data with high-dimensional covariates, under the assumption of a partially linear model. The idea is to first fit on a subset of the samples two non-linear predictive models, one for the continuous outcome of interest and one for the observed treatment, and then to estimate a linear coefficient for the treatment using the remaining samples through a simple orthogonalized regression. While this methodology is flexible and can accommodate arbitrary predictive models, typically trained independently of one another, this paper argues that a carefully coordinated learning algorithm for deep neural networks may reduce the estimation bias. The improved empirical performance of the proposed method is demonstrated through numerical experiments on both simulated and real data. △ Less

Submitted 2 June, 2022; originally announced June 2022.

Comments: 9 pages, 6 figures

arXiv:2205.08653 [pdf, other]

Searching for subgroup-specific associations while controlling the false discovery rate

Authors: Matteo Sesia, Tianshu Sun

Abstract: This paper introduces an innovative method for conducting conditional independence testing in high-dimensional data, facilitating the automated discovery of significant associations within distinct subgroups of a population, all while controlling the false discovery rate. This is achieved by expanding upon the model-X knockoff filter to provide more informative inferences. Our enhanced inferences… ▽ More This paper introduces an innovative method for conducting conditional independence testing in high-dimensional data, facilitating the automated discovery of significant associations within distinct subgroups of a population, all while controlling the false discovery rate. This is achieved by expanding upon the model-X knockoff filter to provide more informative inferences. Our enhanced inferences can help explain sample heterogeneity and uncover interactions, making better use of the capabilities offered by modern machine learning models. Specifically, our method is able to leverage any model for the identification of data-driven hypotheses pertaining to interesting population subgroups. Then, it rigorously test these hypotheses without succumbing to selection bias. Importantly, our approach is efficient and does not require sample splitting. We demonstrate the effectiveness of our method through simulations and numerical experiments, using data derived from a randomized experiment featuring multiple treatment variables. △ Less

Submitted 15 September, 2023; v1 submitted 17 May, 2022; originally announced May 2022.

Comments: 13 pages (25 pages including references and appendices)

arXiv:2205.05878 [pdf, other]

Training Uncertainty-Aware Classifiers with Conformalized Deep Learning

Authors: Bat-Sheva Einbinder, Yaniv Romano, Matteo Sesia, Yanfei Zhou

Abstract: Deep neural networks are powerful tools to detect hidden patterns in data and leverage them to make predictions, but they are not designed to understand uncertainty and estimate reliable probabilities. In particular, they tend to be overconfident. We begin to address this problem in the context of multi-class classification by develo** a novel training algorithm producing models with more depend… ▽ More Deep neural networks are powerful tools to detect hidden patterns in data and leverage them to make predictions, but they are not designed to understand uncertainty and estimate reliable probabilities. In particular, they tend to be overconfident. We begin to address this problem in the context of multi-class classification by develo** a novel training algorithm producing models with more dependable uncertainty estimates, without sacrificing predictive power. The idea is to mitigate overconfidence by minimizing a loss function, inspired by advances in conformal inference, that quantifies model uncertainty by carefully leveraging hold-out data. Experiments with synthetic and real data demonstrate this method can lead to smaller conformal prediction sets with higher conditional coverage, after exact calibration with hold-out data, compared to state-of-the-art alternatives. △ Less

Submitted 8 November, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

Comments: 46 pages, 48 figures, 5 tables

arXiv:2204.04270 [pdf, other]

Conformal Frequency Estimation with Sketched Data

Authors: Matteo Sesia, Stefano Favaro

Abstract: A flexible conformal inference method is developed to construct confidence intervals for the frequencies of queried objects in very large data sets, based on a much smaller sketch of those data. The approach is data-adaptive and requires no knowledge of the data distribution or of the details of the sketching algorithm; instead, it constructs provably valid frequentist confidence intervals under t… ▽ More A flexible conformal inference method is developed to construct confidence intervals for the frequencies of queried objects in very large data sets, based on a much smaller sketch of those data. The approach is data-adaptive and requires no knowledge of the data distribution or of the details of the sketching algorithm; instead, it constructs provably valid frequentist confidence intervals under the sole assumption of data exchangeability. Although our solution is broadly applicable, this paper focuses on applications involving the count-min sketch algorithm and a non-linear variation thereof. The performance is compared to that of frequentist and Bayesian alternatives through simulations and experiments with data sets of SARS-CoV-2 DNA sequences and classic English literature. △ Less

Submitted 8 November, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: 28 pages, 24 figures, 2 tables

arXiv:2108.08813 [pdf, other]

Transfer learning in genome-wide association studies with knockoffs

Authors: Shuangning Li, Zhimei Ren, Chiara Sabatti, Matteo Sesia

Abstract: This paper presents and compares alternative transfer learning methods that can increase the power of conditional testing via knockoffs by leveraging prior information in external data sets collected from different populations or measuring related outcomes. The relevance of this methodology is explored in particular within the context of genome-wide association studies, where it can be helpful to… ▽ More This paper presents and compares alternative transfer learning methods that can increase the power of conditional testing via knockoffs by leveraging prior information in external data sets collected from different populations or measuring related outcomes. The relevance of this methodology is explored in particular within the context of genome-wide association studies, where it can be helpful to address the pressing need for principled ways to suitably account for, and efficiently learn from the genetic variation associated to diverse ancestries. Finally, we apply these methods to analyze several phenotypes in the UK Biobank data set, demonstrating that transfer learning helps knockoffs discover more numerous associations in the data collected from minority populations, potentially opening the way to the development of more accurate polygenic risk scores. △ Less

Submitted 19 August, 2021; originally announced August 2021.

arXiv:2106.04118 [pdf, other]

Searching for consistent associations with a multi-environment knockoff filter

Authors: Shuangning Li, Matteo Sesia, Yaniv Romano, Emmanuel Candès, Chiara Sabatti

Abstract: This paper develops a method based on model-X knockoffs to find conditional associations that are consistent across diverse environments, controlling the false discovery rate. The motivation for this problem is that large data sets may contain numerous associations that are statistically significant and yet misleading, as they are induced by confounders or sampling imperfections. However, associat… ▽ More This paper develops a method based on model-X knockoffs to find conditional associations that are consistent across diverse environments, controlling the false discovery rate. The motivation for this problem is that large data sets may contain numerous associations that are statistically significant and yet misleading, as they are induced by confounders or sampling imperfections. However, associations consistently replicated under different conditions may be more interesting. In fact, consistency sometimes provably leads to valid causal inferences even if conditional associations do not. While the proposed method is flexible and can be deployed in a wide range of applications, this paper highlights its relevance to genome-wide association studies, in which consistency across populations with diverse ancestries mitigates confounding due to unmeasured variants. The effectiveness of this approach is demonstrated by simulations and applications to the UK Biobank data. △ Less

Submitted 8 June, 2021; originally announced June 2021.

Comments: 41 pages, 21 figures, 8 tables

arXiv:2105.08747 [pdf, other]

Conformal Prediction using Conditional Histograms

Authors: Matteo Sesia, Yaniv Romano

Abstract: This paper develops a conformal method to compute prediction intervals for non-parametric regression that can automatically adapt to skewed data. Leveraging black-box machine learning algorithms to estimate the conditional distribution of the outcome using histograms, it translates their output into the shortest prediction intervals with approximate conditional coverage. The resulting prediction i… ▽ More This paper develops a conformal method to compute prediction intervals for non-parametric regression that can automatically adapt to skewed data. Leveraging black-box machine learning algorithms to estimate the conditional distribution of the outcome using histograms, it translates their output into the shortest prediction intervals with approximate conditional coverage. The resulting prediction intervals provably have marginal coverage in finite samples, while asymptotically achieving conditional coverage and optimal length if the black-box model is consistent. Numerical experiments with simulated and real data demonstrate improved performance compared to state-of-the-art alternatives, including conformalized quantile regression and other distributional conformal prediction approaches. △ Less

Submitted 23 October, 2021; v1 submitted 18 May, 2021; originally announced May 2021.

Comments: 12 pages, 4 figures. Supplement: 15 pages, 3 figures, 1 table

arXiv:2104.08279 [pdf, other]

doi 10.1214/22-AOS2244

Testing for Outliers with Conformal p-values

Authors: Stephen Bates, Emmanuel Candès, Lihua Lei, Yaniv Romano, Matteo Sesia

Abstract: This paper studies the construction of p-values for nonparametric outlier detection, taking a multiple-testing perspective. The goal is to test whether new independent samples belong to the same distribution as a reference data set or are outliers. We propose a solution based on conformal inference, a broadly applicable framework which yields p-values that are marginally valid but mutually depende… ▽ More This paper studies the construction of p-values for nonparametric outlier detection, taking a multiple-testing perspective. The goal is to test whether new independent samples belong to the same distribution as a reference data set or are outliers. We propose a solution based on conformal inference, a broadly applicable framework which yields p-values that are marginally valid but mutually dependent for different test points. We prove these p-values are positively dependent and enable exact false discovery rate control, although in a relatively weak marginal sense. We then introduce a new method to compute p-values that are both valid conditionally on the training data and independent of each other for different test points; this paves the way to stronger type-I error guarantees. Our results depart from classical conformal inference as we leverage concentration inequalities rather than combinatorial arguments to establish our finite-sample guarantees. Furthermore, our techniques also yield a uniform confidence bound for the false positive rate of any outlier detection algorithm, as a function of the threshold applied to its raw statistics. Finally, the relevance of our results is demonstrated by numerical experiments on real and simulated data. △ Less

Submitted 24 May, 2022; v1 submitted 16 April, 2021; originally announced April 2021.

Comments: Revision May 24, 2022: added "asymptotic" and "Monte Carlo" conditional calibration methods; added power analyses; updated numerical experiments to include new methods

Journal ref: Ann. Statist. 51(1): 149-178 (February 2023)

arXiv:2006.04937 [pdf, other]

doi 10.1109/JBHI.2021.3094873

Interpretable Classification of Bacterial Raman Spectra with Knockoff Wavelets

Authors: Charmaine Chia, Matteo Sesia, Chi-Sing Ho, Stefanie S. Jeffrey, Jennifer Dionne, Emmanuel J. Candès, Roger T. Howe

Abstract: Deep neural networks and other sophisticated machine learning models are widely applied to biomedical signal data because they can detect complex patterns and compute accurate predictions. However, the difficulty of interpreting such models is a limitation, especially for applications involving high-stakes decision, including the identification of bacterial infections. In this paper, we consider f… ▽ More Deep neural networks and other sophisticated machine learning models are widely applied to biomedical signal data because they can detect complex patterns and compute accurate predictions. However, the difficulty of interpreting such models is a limitation, especially for applications involving high-stakes decision, including the identification of bacterial infections. In this paper, we consider fast Raman spectroscopy data and demonstrate that a logistic regression model with carefully selected features achieves accuracy comparable to that of neural networks, while being much simpler and more transparent. Our analysis leverages wavelet features with intuitive chemical interpretations, and performs controlled variable selection with knockoffs to ensure the predictors are relevant and non-redundant. Although we focus on a particular data set, the proposed approach is broadly applicable to other types of signal data for which interpretability may be important. △ Less

Submitted 1 May, 2021; v1 submitted 8 June, 2020; originally announced June 2020.

Comments: 9 pages, 6 figures, 4 tables

arXiv:2006.02544 [pdf, other]

Classification with Valid and Adaptive Coverage

Authors: Yaniv Romano, Matteo Sesia, Emmanuel J. Candès

Abstract: Conformal inference, cross-validation+, and the jackknife+ are hold-out methods that can be combined with virtually any machine learning algorithm to construct prediction sets with guaranteed marginal coverage. In this paper, we develop specialized versions of these techniques for categorical and unordered response labels that, in addition to providing marginal coverage, are also fully adaptive to… ▽ More Conformal inference, cross-validation+, and the jackknife+ are hold-out methods that can be combined with virtually any machine learning algorithm to construct prediction sets with guaranteed marginal coverage. In this paper, we develop specialized versions of these techniques for categorical and unordered response labels that, in addition to providing marginal coverage, are also fully adaptive to complex data distributions, in the sense that they perform favorably in terms of approximate conditional coverage compared to alternative methods. The heart of our contribution is a novel conformity score, which we explicitly demonstrate to be powerful and intuitive for classification problems, but whose underlying principle is potentially far more general. Experiments on synthetic and real data demonstrate the practical value of our theoretical guarantees, as well as the statistical advantages of the proposed methods over the existing alternatives. △ Less

Submitted 3 June, 2020; originally announced June 2020.

Comments: 10 pages, 3 figures; 13 supplementary pages, 4 supplementary figures, 4 supplementary tables

Journal ref: Advances in Neural Information Processing Systems 33 (NeurIPS 2020)

arXiv:2002.09644 [pdf, other]

doi 10.1073/pnas.2007743117

Causal Inference in Genetic Trio Studies

Authors: Stephen Bates, Matteo Sesia, Chiara Sabatti, Emmanuel Candes

Abstract: We introduce a method to rigorously draw causal inferences---inferences immune to all possible confounding---from genetic data that include parents and offspring. Causal conclusions are possible with these data because the natural randomness in meiosis can be viewed as a high-dimensional randomized experiment. We make this observation actionable by develo** a novel conditional independence test… ▽ More We introduce a method to rigorously draw causal inferences---inferences immune to all possible confounding---from genetic data that include parents and offspring. Causal conclusions are possible with these data because the natural randomness in meiosis can be viewed as a high-dimensional randomized experiment. We make this observation actionable by develo** a novel conditional independence test that identifies regions of the genome containing distinct causal variants. The proposed Digital Twin Test compares an observed offspring to carefully constructed synthetic offspring from the same parents in order to determine statistical significance, and it can leverage any black-box multivariate model and additional non-trio genetic data in order to increase power. Crucially, our inferences are based only on a well-established mathematical description of the rearrangement of genetic material during meiosis and make no assumptions about the relationship between the genotypes and phenotypes. △ Less

Submitted 22 February, 2020; originally announced February 2020.

Journal ref: Proc. Natl. Acad. Sci. U.S.A. 177 (2020) 24117-24126

arXiv:1909.05433 [pdf, other]

doi 10.1002/sta4.261

A comparison of some conformal quantile regression methods

Authors: Matteo Sesia, Emmanuel J. Candès

Abstract: We compare two recently proposed methods that combine ideas from conformal inference and quantile regression to produce locally adaptive and marginally valid prediction intervals under sample exchangeability (Romano et al., 2019; Kivaranovic et al., 2019). First, we prove that these two approaches are asymptotically efficient in large samples, under some additional assumptions. Then we compare the… ▽ More We compare two recently proposed methods that combine ideas from conformal inference and quantile regression to produce locally adaptive and marginally valid prediction intervals under sample exchangeability (Romano et al., 2019; Kivaranovic et al., 2019). First, we prove that these two approaches are asymptotically efficient in large samples, under some additional assumptions. Then we compare them empirically on simulated and real data. Our results demonstrate that the method in Romano et al. (2019) typically yields tighter prediction intervals in finite samples. Finally, we discuss how to tune these procedures by fixing the relative proportions of observations used for training and conformalization. △ Less

Submitted 11 September, 2019; originally announced September 2019.

Comments: 20 pages, 9 figures, 3 tables

Journal ref: Stat. 2020; 9:e261

arXiv:1903.05701 [pdf, other]

doi 10.1093/biomet/asy075

Rejoinder: "Gene Hunting with Hidden Markov Model Knockoffs"

Authors: Matteo Sesia, Chiara Sabatti, Emmanuel J. Candès

Abstract: In this paper we deepen and enlarge the reflection on the possible advantages of a knockoff approach to genome wide association studies (Sesia et al., 2018), starting from the discussions in Bottolo & Richardson (2019); Jewell & Witten (2019); Rosenblatt et al. (2019) and Marchini (2019). The discussants bring up a number of important points, either related to the knockoffs methodology in general,… ▽ More In this paper we deepen and enlarge the reflection on the possible advantages of a knockoff approach to genome wide association studies (Sesia et al., 2018), starting from the discussions in Bottolo & Richardson (2019); Jewell & Witten (2019); Rosenblatt et al. (2019) and Marchini (2019). The discussants bring up a number of important points, either related to the knockoffs methodology in general, or to its specific application to genetic studies. In the following we offer some clarifications, mention relevant recent developments and highlight some of the still open problems. △ Less

Submitted 13 March, 2019; originally announced March 2019.

Comments: 12 pages, 4 figures

Journal ref: Biometrika, Volume 106, Issue 1, 1 March 2019, Pages 35-45

arXiv:1811.06687 [pdf, other]

doi 10.1080/01621459.2019.1660174

Deep Knockoffs

Authors: Yaniv Romano, Matteo Sesia, Emmanuel J. Candès

Abstract: This paper introduces a machine for sampling approximate model-X knockoffs for arbitrary and unspecified data distributions using deep generative models. The main idea is to iteratively refine a knockoff sampling mechanism until a criterion measuring the validity of the produced knockoffs is optimized; this criterion is inspired by the popular maximum mean discrepancy in machine learning and can b… ▽ More This paper introduces a machine for sampling approximate model-X knockoffs for arbitrary and unspecified data distributions using deep generative models. The main idea is to iteratively refine a knockoff sampling mechanism until a criterion measuring the validity of the produced knockoffs is optimized; this criterion is inspired by the popular maximum mean discrepancy in machine learning and can be thought of as measuring the distance to pairwise exchangeability between original and knockoff features. By building upon the existing model-X framework, we thus obtain a flexible and model-free statistical tool to perform controlled variable selection. Extensive numerical experiments and quantitative tests confirm the generality, effectiveness, and power of our deep knockoff machines. Finally, we apply this new method to a real study of mutations linked to changes in drug resistance in the human immunodeficiency virus. △ Less

Submitted 16 November, 2018; originally announced November 2018.

Comments: 37 pages, 23 figures, 1 table

Journal ref: J. Am. Stat. Assoc., Volume 0, Issue 0, 17 Oct 2019, Pages 1-12

arXiv:1706.04677 [pdf, other]

doi 10.1093/biomet/asy033

Gene Hunting with Knockoffs for Hidden Markov Models

Authors: Matteo Sesia, Chiara Sabatti, Emmanuel J. Candès

Abstract: Modern scientific studies often require the identification of a subset of relevant explanatory variables, in the attempt to understand an interesting phenomenon. Several statistical methods have been developed to automate this task, but only recently has the framework of model-free knockoffs proposed a general solution that can perform variable selection under rigorous type-I error control, withou… ▽ More Modern scientific studies often require the identification of a subset of relevant explanatory variables, in the attempt to understand an interesting phenomenon. Several statistical methods have been developed to automate this task, but only recently has the framework of model-free knockoffs proposed a general solution that can perform variable selection under rigorous type-I error control, without relying on strong modeling assumptions. In this paper, we extend the methodology of model-free knockoffs to a rich family of problems where the distribution of the covariates can be described by a hidden Markov model (HMM). We develop an exact and efficient algorithm to sample knockoff copies of an HMM. We then argue that combined with the knockoffs selective framework, they provide a natural and powerful tool for performing principled inference in genome-wide association studies with guaranteed FDR control. Finally, we apply our methodology to several datasets aimed at studying the Crohn's disease and several continuous phenotypes, e.g. levels of cholesterol. △ Less

Submitted 14 June, 2017; originally announced June 2017.

Comments: 35 pages, 13 figues, 9 tables

Journal ref: Biometrika, Volume 106, Issue 1, 1 March 2019, Pages 1-18

Showing 1–27 of 27 results for author: Sesia, M