Search | arXiv e-print repository

Sharp error bounds for imbalanced classification: how many examples in the minority class?

Authors: Anass Aghbalou, François Portier, Anne Sabourin

Abstract: When dealing with imbalanced classification data, reweighting the loss function is a standard procedure allowing to equilibrate between the true positive and true negative rates within the risk measure. Despite significant theoretical work in this area, existing results do not adequately address a main challenge within the imbalanced classification framework, which is the negligible size of one cl… ▽ More When dealing with imbalanced classification data, reweighting the loss function is a standard procedure allowing to equilibrate between the true positive and true negative rates within the risk measure. Despite significant theoretical work in this area, existing results do not adequately address a main challenge within the imbalanced classification framework, which is the negligible size of one class in relation to the full sample size and the need to rescale the risk function by a probability tending to zero. To address this gap, we present two novel contributions in the setting where the rare class probability approaches zero: (1) a non asymptotic fast rate probability bound for constrained balanced empirical risk minimization, and (2) a consistent upper bound for balanced nearest neighbors estimates. Our findings provide a clearer understanding of the benefits of class-weighting in realistic settings, opening new avenues for further research in this field. △ Less

Submitted 16 April, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

arXiv:2308.01023 [pdf, other]

Regular Variation in Hilbert Spaces and Principal Component Analysis for Functional Extremes

Authors: Stephan Clémençon, Nathan Huet, Anne Sabourin

Abstract: Motivated by the increasing availability of data of functional nature, we develop a general probabilistic and statistical framework for extremes of regularly varying random elements $X$ in $L^2[0,1]$. We place ourselves in a Peaks-Over-Threshold framework where a functional extreme is defined as an observation $X$ whose $L^2$-norm $\|X\|$ is comparatively large. Our goal is to propose a dimension… ▽ More Motivated by the increasing availability of data of functional nature, we develop a general probabilistic and statistical framework for extremes of regularly varying random elements $X$ in $L^2[0,1]$. We place ourselves in a Peaks-Over-Threshold framework where a functional extreme is defined as an observation $X$ whose $L^2$-norm $\|X\|$ is comparatively large. Our goal is to propose a dimension reduction framework resulting into finite dimensional projections for such extreme observations. Our contribution is double. First, we investigate the notion of Regular Variation for random quantities valued in a general separable Hilbert space, for which we propose a novel concrete characterization involving solely stochastic convergence of real-valued random variables. Second, we propose a notion of functional Principal Component Analysis (PCA) accounting for the principal `directions' of functional extremes. We investigate the statistical properties of the empirical covariance operator of the angular component of extreme functions, by upper-bounding the Hilbert-Schmidt norm of the estimation error for finite sample sizes. Numerical experiments with simulated and real data illustrate this work. △ Less

Submitted 2 August, 2023; originally announced August 2023.

Comments: 29 pages (main paper), 5 pages (appendix)

arXiv:2303.03084 [pdf, other]

On Regression in Extreme Regions

Authors: Nathan Huet, Stephan Clémençon, Anne Sabourin

Abstract: The statistical learning problem consists in building a predictive function $\hat{f}$ based on independent copies of $(X,Y)$ so that $Y$ is approximated by $\hat{f}(X)$ with minimum (squared) error. Motivated by various applications, special attention is paid here to the case of extreme (i.e. very large) observations $X$. Because of their rarity, the contributions of such observations to the (empi… ▽ More The statistical learning problem consists in building a predictive function $\hat{f}$ based on independent copies of $(X,Y)$ so that $Y$ is approximated by $\hat{f}(X)$ with minimum (squared) error. Motivated by various applications, special attention is paid here to the case of extreme (i.e. very large) observations $X$. Because of their rarity, the contributions of such observations to the (empirical) error is negligible, and the predictive performance of empirical risk minimizers can be consequently very poor in extreme regions. In this paper, we develop a general framework for regression on extremes. Under appropriate regular variation assumptions regarding the pair $(X,Y)$, we show that an asymptotic notion of risk can be tailored to summarize appropriately predictive performance in extreme regions. It is also proved that minimization of an empirical and nonasymptotic version of this 'extreme risk', based on a fraction of the largest observations solely, yields good generalization capacity. In addition, numerical results providing strong empirical evidence of the relevance of the approach proposed are displayed. △ Less

Submitted 10 April, 2024; v1 submitted 6 March, 2023; originally announced March 2023.

Comments: 16 pages (main paper), 13 pages (appendix)

arXiv:2108.01432 [pdf, other]

Tail inverse regression for dimension reduction with extreme response

Authors: Anass Aghbalou, François Portier, Anne Sabourin, Chen Zhou

Abstract: We consider the problem of supervised dimension reduction with a particular focus on extreme values of the target $Y\in\mathbb{R}$ to be explained by a covariate vector $X \in \mathbb{R}^p$. The general purpose is to define and estimate a projection on a lower dimensional subspace of the covariate space which is sufficient for predicting exceedances of the target above high thresholds. We propose… ▽ More We consider the problem of supervised dimension reduction with a particular focus on extreme values of the target $Y\in\mathbb{R}$ to be explained by a covariate vector $X \in \mathbb{R}^p$. The general purpose is to define and estimate a projection on a lower dimensional subspace of the covariate space which is sufficient for predicting exceedances of the target above high thresholds. We propose an original definition of Tail Conditional Independence which matches this purpose. Inspired by Sliced Inverse Regression (SIR) methods, we develop a novel framework (TIREX, Tail Inverse Regression for EXtreme response) in order to estimate an extreme sufficient dimension reduction (SDR) space of potentially smaller dimension than that of a classical SDR space. We prove the weak convergence of tail empirical processes involved in the estimation procedure and we illustrate the relevance of the proposed approach on simulated and real world data. △ Less

Submitted 24 February, 2023; v1 submitted 30 July, 2021; originally announced August 2021.

Comments: main paper: 31 pages + supplementary material: 16 pages

MSC Class: 62G32; 62H25; 62G08; 62G30

arXiv:2104.03966 [pdf, other]

Concentration bounds for the empirical angular measure with statistical learning applications

Authors: Stéphan Clémençon, Hamid Jalalzai, Stéphane Lhaut, Anne Sabourin, Johan Segers

Abstract: The angular measure on the unit sphere characterizes the first-order dependence structure of the components of a random vector in extreme regions and is defined in terms of standardized margins. Its statistical recovery is an important step in learning problems involving observations far away from the center. In the common situation that the components of the vector have different distributions, t… ▽ More The angular measure on the unit sphere characterizes the first-order dependence structure of the components of a random vector in extreme regions and is defined in terms of standardized margins. Its statistical recovery is an important step in learning problems involving observations far away from the center. In the common situation that the components of the vector have different distributions, the rank transformation offers a convenient and robust way of standardizing data in order to build an empirical version of the angular measure based on the most extreme observations. However, the study of the sampling distribution of the resulting empirical angular measure is challenging. It is the purpose of the paper to establish finite-sample bounds for the maximal deviations between the empirical and true angular measures, uniformly over classes of Borel sets of controlled combinatorial complexity. The bounds are valid with high probability and, up to logarithmic factors, scale as the square root of the effective sample size. The bounds are applied to provide performance guarantees for two statistical learning procedures tailored to extreme regions of the input space and built upon the empirical angular measure: binary classification in extreme regions through empirical risk minimization and unsupervised anomaly detection through minimum-volume sets of the sphere. △ Less

Submitted 17 October, 2022; v1 submitted 7 April, 2021; originally announced April 2021.

Comments: 24 pages (main paper), 21 pages (supplement), 2 figures

MSC Class: Primary 62G05; 62G30; 62G32; secondary 62H30

arXiv:2003.11593 [pdf, other]

Heavy-tailed Representations, Text Polarity Classification & Data Augmentation

Authors: Hamid Jalalzai, Pierre Colombo, Chloé Clavel, Eric Gaussier, Giovanna Varni, Emmanuel Vignon, Anne Sabourin

Abstract: The dominant approaches to text representation in natural language rely on learning embeddings on massive corpora which have convenient properties such as compositionality and distance preservation. In this paper, we develop a novel method to learn a heavy-tailed embedding with desirable regularity properties regarding the distributional tails, which allows to analyze the points far away from the… ▽ More The dominant approaches to text representation in natural language rely on learning embeddings on massive corpora which have convenient properties such as compositionality and distance preservation. In this paper, we develop a novel method to learn a heavy-tailed embedding with desirable regularity properties regarding the distributional tails, which allows to analyze the points far away from the distribution bulk using the framework of multivariate extreme value theory. In particular, a classifier dedicated to the tails of the proposed embedding is obtained which performance outperforms the baseline. This classifier exhibits a scale invariance property which we leverage by introducing a novel text generation method for label preserving dataset augmentation. Numerical experiments on synthetic and real text data demonstrate the relevance of the proposed framework and confirm that this method generates meaningful sentences with controllable attribute, e.g. positive or negative sentiment. △ Less

Submitted 25 March, 2021; v1 submitted 25 March, 2020; originally announced March 2020.

Journal ref: Advances in Neural Information Processing Systems (NeurIPS), Dec 2020

arXiv:1907.07523 [pdf, other]

A Multivariate Extreme Value Theory Approach to Anomaly Clustering and Visualization

Authors: Maël Chiapino, Stéphan Clémençon, Vincent Feuillard, Anne Sabourin

Abstract: In a wide variety of situations, anomalies in the behaviour of a complex system, whose health is monitored through the observation of a random vector X = (X1,. .. , X d) valued in R d , correspond to the simultaneous occurrence of extreme values for certain subgroups $α$ $\subset$ {1,. .. , d} of variables Xj. Under the heavy-tail assumption, which is precisely appropriate for modeling these pheno… ▽ More In a wide variety of situations, anomalies in the behaviour of a complex system, whose health is monitored through the observation of a random vector X = (X1,. .. , X d) valued in R d , correspond to the simultaneous occurrence of extreme values for certain subgroups $α$ $\subset$ {1,. .. , d} of variables Xj. Under the heavy-tail assumption, which is precisely appropriate for modeling these phenomena, statistical methods relying on multivariate extreme value theory have been developed in the past few years for identifying such events/subgroups. This paper exploits this approach much further by means of a novel mixture model that permits to describe the distribution of extremal observations and where the anomaly type $α$ is viewed as a latent variable. One may then take advantage of the model by assigning to any extreme point a posterior probability for each anomaly type $α$, defining implicitly a similarity measure between anomalies. It is explained at length how the latter permits to cluster extreme observations and obtain an informative planar representation of anomalies using standard graph-mining tools. The relevance and usefulness of the clustering and 2-d visual display thus designed is illustrated on simulated datasets and on real observations as well, in the aeronautics application domain. △ Less

Submitted 17 July, 2019; originally announced July 2019.

arXiv:1906.11043 [pdf, other]

Principal Component Analysis for Multivariate Extremes

Authors: Holger Drees, Anne Sabourin

Abstract: The first order behavior of multivariate heavy-tailed random vectors above large radial thresholds is ruled by a limit measure in a regular variation framework. For a high dimensional vector, a reasonable assumption is that the support of this measure is concentrated on a lower dimensional subspace, meaning that certain linear combinations of the components are much likelier to be large than other… ▽ More The first order behavior of multivariate heavy-tailed random vectors above large radial thresholds is ruled by a limit measure in a regular variation framework. For a high dimensional vector, a reasonable assumption is that the support of this measure is concentrated on a lower dimensional subspace, meaning that certain linear combinations of the components are much likelier to be large than others. Identifying this subspace and thus reducing the dimension will facilitate a refined statistical analysis. In this work we apply Principal Component Analysis (PCA) to a re-scaled version of radially thresholded observations. Within the statistical learning framework of empirical risk minimization, our main focus is to analyze the squared reconstruction error for the exceedances over large radial thresholds. We prove that the empirical risk converges to the true risk, uniformly over all projection subspaces. As a consequence, the best projection subspace is shown to converge in probability to the optimal one, in terms of the Hausdorff distance between their intersections with the unit sphere. In addition, if the exceedances are re-scaled to the unit ball, we obtain finite sample uniform guarantees to the reconstruction error pertaining to the estimated projection sub-space. Numerical experiments illustrate the relevance of the proposed framework for practical purposes. △ Less

Submitted 26 June, 2019; originally announced June 2019.

arXiv:1802.09977 [pdf, ps, other]

Identifying groups of variables with the potential of being large simultaneously

Authors: Maël Chiapino, Anne Sabourin, Johan Segers

Abstract: Identifying groups of variables that may be large simultaneously amounts to finding out which joint tail dependence coefficients of a multivariate distribution are positive. The asymptotic distribution of a vector of nonparametric, rank-based estimators of these coefficients justifies a stop** criterion in an algorithm that searches the collection of all possible groups of variables in a systema… ▽ More Identifying groups of variables that may be large simultaneously amounts to finding out which joint tail dependence coefficients of a multivariate distribution are positive. The asymptotic distribution of a vector of nonparametric, rank-based estimators of these coefficients justifies a stop** criterion in an algorithm that searches the collection of all possible groups of variables in a systematic way, from smaller groups to larger ones. The issue that the tolerance level in the stop** criterion should depend on the size of the groups is circumvented by the use of a conditional tail dependence coefficient. Alternatively, such stop** criteria can be based on limit distributions of rank-based estimators of the coefficient of tail dependence, quantifying the speed of decay of joint survival functions. Numerical experiments indicate that the algorithm's effectiveness for detecting tail-dependent groups of variables is highest when paired with a criterion based on a Hill-type estimator of the coefficient of tail dependence. △ Less

Submitted 27 February, 2018; originally announced February 2018.

Comments: 23 pages, 2 tables

MSC Class: 62G32

arXiv:1707.08820 [pdf, other]

Max K-armed bandit: On the ExtremeHunter algorithm and beyond

Authors: Mastane Achab, Stephan Clémençon, Aurélien Garivier, Anne Sabourin, Claire Vernade

Abstract: This paper is devoted to the study of the max K-armed bandit problem, which consists in sequentially allocating resources in order to detect extreme values. Our contribution is twofold. We first significantly refine the analysis of the ExtremeHunter algorithm carried out in Carpentier and Valko (2014), and next propose an alternative approach, showing that, remarkably, Extreme Bandits can be reduc… ▽ More This paper is devoted to the study of the max K-armed bandit problem, which consists in sequentially allocating resources in order to detect extreme values. Our contribution is twofold. We first significantly refine the analysis of the ExtremeHunter algorithm carried out in Carpentier and Valko (2014), and next propose an alternative approach, showing that, remarkably, Extreme Bandits can be reduced to a classical version of the bandit problem to a certain extent. Beyond the formal analysis, these two approaches are compared through numerical experiments. △ Less

Submitted 27 July, 2017; originally announced July 2017.

arXiv:1603.09584 [pdf, other]

Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking

Authors: Nicolas Goix, Anne Sabourin, Stéphan Clémençon

Abstract: Extremes play a special role in Anomaly Detection. Beyond inference and simulation purposes, probabilistic tools borrowed from Extreme Value Theory (EVT), such as the angular measure, can also be used to design novel statistical learning methods for Anomaly Detection/ranking. This paper proposes a new algorithm based on multivariate EVT to learn how to rank observations in a high dimensional space… ▽ More Extremes play a special role in Anomaly Detection. Beyond inference and simulation purposes, probabilistic tools borrowed from Extreme Value Theory (EVT), such as the angular measure, can also be used to design novel statistical learning methods for Anomaly Detection/ranking. This paper proposes a new algorithm based on multivariate EVT to learn how to rank observations in a high dimensional space with respect to their degree of 'abnormality'. The procedure relies on an original dimension-reduction technique in the extreme domain that possibly produces a sparse representation of multivariate extremes and allows to gain insight into the dependence structure thereof, esca** the curse of dimensionality. The representation output by the unsupervised methodology we propose here can be combined with any Anomaly Detection technique tailored to non-extreme data. As it performs linearly with the dimension and almost linearly in the data (in O(dn log n)), it fits to large scale problems. The approach in this paper is novel in that EVT has never been used in its multivariate version in the field of Anomaly Detection. Illustrative experimental results provide strong empirical evidence of the relevance of our approach. △ Less

Submitted 31 March, 2016; originally announced March 2016.

Comments: arXiv admin note: text overlap with arXiv:1507.05899

arXiv:1507.05899 [pdf, other]

Sparsity in Multivariate Extremes with Applications to Anomaly Detection

Authors: Nicolas Goix, Anne Sabourin, Stéphan Clémençon

Abstract: Capturing the dependence structure of multivariate extreme events is a major concern in many fields involving the management of risks stemming from multiple sources, e.g. portfolio monitoring, insurance, environmental risk management and anomaly detection. One convenient (non-parametric) characterization of extremal dependence in the framework of multivariate Extreme Value Theory (EVT) is the angu… ▽ More Capturing the dependence structure of multivariate extreme events is a major concern in many fields involving the management of risks stemming from multiple sources, e.g. portfolio monitoring, insurance, environmental risk management and anomaly detection. One convenient (non-parametric) characterization of extremal dependence in the framework of multivariate Extreme Value Theory (EVT) is the angular measure, which provides direct information about the probable 'directions' of extremes, that is, the relative contribution of each feature/coordinate of the 'largest' observations. Modeling the angular measure in high dimensional problems is a major challenge for the multivariate analysis of rare events. The present paper proposes a novel methodology aiming at exhibiting a sparsity pattern within the dependence structure of extremes. This is done by estimating the amount of mass spread by the angular measure on representative sets of directions, corresponding to specific sub-cones of $R^d\_+$. This dimension reduction technique paves the way towards scaling up existing multivariate EVT methods. Beyond a non-asymptotic study providing a theoretical validity framework for our method, we propose as a direct application a --first-- anomaly detection algorithm based on multivariate EVT. This algorithm builds a sparse 'normal profile' of extreme behaviours, to be confronted with new (possibly abnormal) extreme observations. Illustrative experimental results provide strong empirical evidence of the relevance of our approach. △ Less

Submitted 14 March, 2016; v1 submitted 21 July, 2015; originally announced July 2015.

arXiv:1502.01684 [pdf, other]

On Anomaly Ranking and Excess-Mass Curves

Authors: Nicolas Goix, Anne Sabourin, Stéphan Clémençon

Abstract: Learning how to rank multivariate unlabeled observations depending on their degree of abnormality/novelty is a crucial problem in a wide range of applications. In practice, it generally consists in building a real valued "scoring" function on the feature space so as to quantify to which extent observations should be considered as abnormal. In the 1-d situation, measurements are generally considere… ▽ More Learning how to rank multivariate unlabeled observations depending on their degree of abnormality/novelty is a crucial problem in a wide range of applications. In practice, it generally consists in building a real valued "scoring" function on the feature space so as to quantify to which extent observations should be considered as abnormal. In the 1-d situation, measurements are generally considered as "abnormal" when they are remote from central measures such as the mean or the median. Anomaly detection then relies on tail analysis of the variable of interest. Extensions to the multivariate setting are far from straightforward and it is precisely the main purpose of this paper to introduce a novel and convenient (functional) criterion for measuring the performance of a scoring function regarding the anomaly ranking task, referred to as the Excess-Mass curve (EM curve). In addition, an adaptive algorithm for building a scoring function based on unlabeled data X1 , . . . , Xn with a nearly optimal EM is proposed and is analyzed from a statistical perspective. △ Less

Submitted 5 February, 2015; originally announced February 2015.

arXiv:1412.0838 [pdf, other]

Semi-parametric modeling of excesses above high multivariate thresholds with censored data

Authors: Anne Sabourin

Abstract: How to include censored data in a statistical analysis is a recur-rent issue in statistics. In multivariate extremes, the dependence structure of large observations can be characterized in terms of a non parametric angular measure, while marginal excesses above asymptotically large thresholds have a parametric distribution. In this work, a flexible semi-parametric Dirichlet mix-ture model for angu… ▽ More How to include censored data in a statistical analysis is a recur-rent issue in statistics. In multivariate extremes, the dependence structure of large observations can be characterized in terms of a non parametric angular measure, while marginal excesses above asymptotically large thresholds have a parametric distribution. In this work, a flexible semi-parametric Dirichlet mix-ture model for angular measures is adapted to the context of censored data and missing components. One major issue is to take into account censoring intervals overlap** the extremal threshold, without knowing whether the correspond-ing hidden data is actually extreme. Further, the censored likelihood needed for Bayesian inference has no analytic expression. The first issue is tackled using a Poisson process model for extremes, whereas a data augmentation scheme avoids multivariate integration of the Poisson process intensity over both the censored intervals and the failure region above threshold. The implemented MCMC algorithm allows simultaneous estimation of marginal and dependence parameters, so that all sources of uncertainty other than model bias are cap-tured by posterior credible intervals. The method is illustrated on simulated and real data. △ Less

Submitted 2 December, 2014; originally announced December 2014.

arXiv:1411.7782 [pdf, other]

doi 10.1002/2015WR017320

Combining regional estimation and historical floods: a multivariate semi-parametric peaks-over-threshold model with censored data

Authors: Anne Sabourin, Benjamin Renard

Abstract: The estimation of extreme flood quantiles is challenging due to the relative scarcity of extreme data compared to typical target return periods. Several approaches have been developed over the years to face this challenge, including regional estimation and the use of historical flood data. This paper investigates the combination of both approaches using a multivariate peaks-over-threshold model, t… ▽ More The estimation of extreme flood quantiles is challenging due to the relative scarcity of extreme data compared to typical target return periods. Several approaches have been developed over the years to face this challenge, including regional estimation and the use of historical flood data. This paper investigates the combination of both approaches using a multivariate peaks-over-threshold model, that allows estimating altogether the intersite dependence structure and the marginal distributions at each site. The joint distribution of extremes at several sites is constructed using a semi-parametric Dirichlet Mixture model. The existence of partially missing and censored observations (historical data) is accounted for within a data augmentation scheme. This model is applied to a case study involving four catchments in Southern France, for which historical data are available since 1604. The comparison of marginal estimates from four versions of the model (with or without regionalizing the shape parameter; using or ignoring historical floods) highlights significant differences in terms of return level estimates. Moreover, the availability of historical data on several nearby catchments allows investigating the asymptotic dependence properties of extreme floods. Catchments display a a significant amount of asymptotic dependence, calling for adapted multivariate statistical models. △ Less

Submitted 28 November, 2014; originally announced November 2014.

MSC Class: 86A05 (primary); 62P12 (secondary)

Showing 1–15 of 15 results for author: Sabourin, A