
Showing 1–50 of 77 results for author: Clémençon, S

  1. arXiv:2406.16759  [pdf, other]

    math.ST

    Anomaly Detection based on Markov Data: A Statistical Depth Approach

    Authors: Carlos Fernández, Stephan Clémençon

    Abstract: It is the main purpose of this article to extend the notion of statistical depth to the case of sample paths of a Markov chain, a very popular probabilistic model to describe parsimoniously random phenomena with a temporal causality. Initially introduced to define a center-outward ordering of points in the support of a multivariate distribution, depth functions permit to generalize the notions of…

    Submitted 25 June, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

    MSC Class: 62M10

  2. arXiv:2406.06849  [pdf, other]

    stat.ML cs.LG

    Flexible Parametric Inference for Space-Time Hawkes Processes

    Authors: Emilia Siviero, Guillaume Staerman, Stephan Clémençon, Thomas Moreau

    Abstract: Many modern spatio-temporal data sets, in sociology, epidemiology or seismology, for example, exhibit self-exciting characteristics, triggering and clustering behaviors both at the same time, that a suitable Hawkes space-time process can accurately capture. This paper aims to develop a fast and flexible parametric inference technique to recover the parameters of the kernel functions involved in th…

    Submitted 17 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

  3. arXiv:2403.07464  [pdf, other]

    math.ST stat.ME stat.ML

    On Ranking-based Tests of Independence

    Authors: Myrto Limnios, Stéphan Clémençon

    Abstract: In this paper we develop a novel nonparametric framework to test the independence of two random variables $\mathbf{X}$ and $\mathbf{Y}$ with unknown respective marginals $H(dx)$ and $G(dy)$ and joint distribution $F(dx dy)$, based on {\it Receiver Operating Characteristic} (ROC) analysis and bipartite ranking. The rationale behind our approach relies on the fact that the independence hypothesis…

    Submitted 12 March, 2024; originally announced March 2024.

  4. arXiv:2308.01023  [pdf, other]

    math.ST math.FA stat.ML

    Regular Variation in Hilbert Spaces and Principal Component Analysis for Functional Extremes

    Authors: Stephan Clémençon, Nathan Huet, Anne Sabourin

    Abstract: Motivated by the increasing availability of data of functional nature, we develop a general probabilistic and statistical framework for extremes of regularly varying random elements $X$ in $L^2[0,1]$. We place ourselves in a Peaks-Over-Threshold framework where a functional extreme is defined as an observation $X$ whose $L^2$-norm $\|X\|$ is comparatively large. Our goal is to propose a dimension…

    Submitted 2 August, 2023; originally announced August 2023.

    Comments: 29 pages (main paper), 5 pages (appendix)

  5. arXiv:2305.10284  [pdf, other]

    cs.CL cs.AI

    Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks

    Authors: Anas Himmi, Ekhine Irurozki, Nathan Noiry, Stephan Clemencon, Pierre Colombo

    Abstract: The evaluation of natural language processing (NLP) systems is crucial for advancing the field, but current benchmarking approaches often assume that all systems have scores available for all tasks, which is not always practical. In reality, several factors such as the cost of running baseline, private systems, computational limitations, or incomplete data may prevent some systems from being evalu…

    Submitted 17 May, 2023; originally announced May 2023.

  6. arXiv:2303.12878  [pdf, other]

    cs.LG stat.ML

    Robust Consensus in Ranking Data Analysis: Definitions, Properties and Computational Issues

    Authors: Morgane Goibert, Clément Calauzènes, Ekhine Irurozki, Stéphan Clémençon

    Abstract: As the issue of robustness in AI systems becomes vital, statistical learning techniques that are reliable even in the presence of partly contaminated data have to be developed. Preference data, in the form of (complete) rankings in the simplest situations, are no exception and the demand for appropriate concepts and tools is all the more pressing given that technologies fed by or producing this type o…

    Submitted 22 March, 2023; originally announced March 2023.

  7. arXiv:2303.03084  [pdf, other]

    stat.ML cs.LG math.ST

    On Regression in Extreme Regions

    Authors: Nathan Huet, Stephan Clémençon, Anne Sabourin

    Abstract: The statistical learning problem consists in building a predictive function $\hat{f}$ based on independent copies of $(X,Y)$ so that $Y$ is approximated by $\hat{f}(X)$ with minimum (squared) error. Motivated by various applications, special attention is paid here to the case of extreme (i.e. very large) observations $X$. Because of their rarity, the contributions of such observations to the (empi…

    Submitted 10 April, 2024; v1 submitted 6 March, 2023; originally announced March 2023.

    Comments: 16 pages (main paper), 13 pages (appendix)

  8. arXiv:2302.03592  [pdf, other]

    math.ST

    A Bipartite Ranking Approach to the Two-Sample Problem

    Authors: Stephan Clémençon, Myrto Limnios, Nicolas Vayatis

    Abstract: The two-sample problem, which consists in testing whether independent samples on $\mathbb{R}^d$ are drawn from the same (unknown) distribution, finds applications in many areas. Its study in high-dimension is the subject of much attention, especially because the information acquisition processes at work in the Big Data era often involve various sources, poorly controlled, leading to datasets possi…

    Submitted 8 February, 2023; v1 submitted 7 February, 2023; originally announced February 2023.

  9. arXiv:2211.07245  [pdf, other]

    cs.CV cs.AI cs.LG stat.ML

    Assessing Uncertainty in Similarity Scoring: Performance & Fairness in Face Recognition

    Authors: Jean-Rémy Conti, Stéphan Clémençon

    Abstract: The ROC curve is the major tool for assessing not only the performance but also the fairness properties of a similarity scoring function. In order to draw reliable conclusions based on empirical ROC analysis, accurately evaluating the uncertainty level related to statistical versions of the ROC curves of interest is absolutely necessary, especially for applications with considerable societal impac…

    Submitted 20 February, 2024; v1 submitted 14 November, 2022; originally announced November 2022.

    Comments: Accepted to ICLR 2024

  10. arXiv:2211.00603  [pdf, other]

    stat.ML cs.LG

    On Medians of (Randomized) Pairwise Means

    Authors: Pierre Laforgue, Stephan Clémençon, Patrice Bertail

    Abstract: Tournament procedures, recently introduced in Lugosi & Mendelson (2016), offer an appealing alternative, from a theoretical perspective at least, to the principle of Empirical Risk Minimization in machine learning. Statistical learning by Median-of-Means (MoM) basically consists in segmenting the training data into blocks of equal size and comparing the statistical performance of every pair of can…

    Submitted 1 November, 2022; originally announced November 2022.

  11. arXiv:2210.13664  [pdf, other]

    cs.CV cs.AI

    Mitigating Gender Bias in Face Recognition Using the von Mises-Fisher Mixture Model

    Authors: Jean-Rémy Conti, Nathan Noiry, Vincent Despiegel, Stéphane Gentric, Stéphan Clémençon

    Abstract: In spite of the high performance and reliability of deep learning algorithms in a wide range of everyday applications, many investigations tend to show that a lot of models exhibit biases, discriminating against specific subgroups of the population (e.g. gender, ethnicity). This urges the practitioner to develop fair systems with a uniform/comparable performance across sensitive groups. In this wo…

    Submitted 22 February, 2024; v1 submitted 24 October, 2022; originally announced October 2022.

    Comments: Accepted to ICML 2022

    Journal ref: Proceedings of the 39th International Conference on Machine Learning, PMLR 162:4344-4369, 2022

  12. EMOTHAW: A novel database for emotional state recognition from handwriting

    Authors: Laurence Likforman-Sulem, Anna Esposito, Marcos Faundez-Zanuy, Stephan Clemençon, Gennaro Cordasco

    Abstract: The detection of negative emotions through daily activities such as handwriting is useful for promoting well-being. The spread of human-machine interfaces such as tablets makes the collection of handwriting samples easier. In this context, we present a first publicly available handwriting database which relates emotional states to handwriting, that we call EMOTHAW. This database includes samples o…

    Submitted 23 February, 2022; originally announced February 2022.

    Comments: 31 pages

    Journal ref: IEEE Transactions on Human-Machine Systems, vol. 47, no. 2, pp. 273-284, April 2017

  13. arXiv:2202.07365  [pdf, other]

    stat.ML cs.LG

    A Statistical Learning View of Simple Kriging

    Authors: Emilia Siviero, Emilie Chautru, Stephan Clémençon

    Abstract: In the Big Data era, with the ubiquity of geolocation sensors in particular, massive datasets exhibiting a possibly complex spatial dependence structure are becoming increasingly available. In this context, the standard probabilistic theory of statistical learning does not apply directly and guarantees of the generalization capacity of predictive rules learned from such data are left to establish.…

    Submitted 2 February, 2024; v1 submitted 15 February, 2022; originally announced February 2022.

    Comments: 41 pages

  14. arXiv:2202.03799  [pdf, other]

    cs.CL cs.AI

    What are the best systems? New perspectives on NLP Benchmarking

    Authors: Pierre Colombo, Nathan Noiry, Ekhine Irurozki, Stephan Clemencon

    Abstract: In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics together with a way to aggregate different systems performances. They are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case for NLP with the development of large pre-trained model…

    Submitted 7 October, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

  15. arXiv:2201.08105  [pdf, other]

    cs.LG stat.ML

    Statistical Depth Functions for Ranking Distributions: Definitions, Statistical Learning and Applications

    Authors: Morgane Goibert, Stéphan Clémençon, Ekhine Irurozki, Pavlo Mozharovskyi

    Abstract: The concept of median/consensus has been widely investigated in order to provide a statistical summary of ranking data, i.e. realizations of a random permutation $Σ$ of a finite set, $\{1,\; \ldots,\; n\}$ with $n\geq 1$ say. As it sheds light onto only one aspect of $Σ$'s distribution $P$, it may neglect other informative features. It is the purpose of this paper to define analogs of quantiles, r…

    Submitted 20 January, 2022; originally announced January 2022.

  16. arXiv:2201.06616  [pdf, other]

    stat.ML cs.LG

    Improving the quality control of seismic data through active learning

    Authors: Mathieu Chambefort, Raphaël Butez, Emilie Chautru, Stephan Clémençon

    Abstract: In image denoising problems, the increasing density of available images makes an exhaustive visual inspection impossible and therefore automated methods based on machine-learning must be deployed for this purpose. This is particularly the case in seismic signal processing. Engineers/geophysicists have to deal with millions of seismic time series. Finding the sub-surface properties useful for the oi…

    Submitted 20 January, 2022; v1 submitted 17 January, 2022; originally announced January 2022.

    Comments: 10 pages

  17. arXiv:2201.05115  [pdf, other]

    stat.ML cs.LG

    Functional Anomaly Detection: a Benchmark Study

    Authors: Guillaume Staerman, Eric Adjakossa, Pavlo Mozharovskyi, Vera Hofer, Jayant Sen Gupta, Stephan Clémençon

    Abstract: The increasing automation in many areas of the Industry expressly demands to design efficient machine-learning solutions for the detection of abnormal events. With the ubiquitous deployment of sensors monitoring nearly continuously the health of complex infrastructures, anomaly detection can now rely on measurements sampled at a very high frequency, providing a very rich representation of the phen…

    Submitted 13 January, 2022; originally announced January 2022.

  18. arXiv:2109.09590  [pdf, other]

    math.ST stat.ML

    Learning to Rank Anomalies: Scalar Performance Criteria and Maximization of Two-Sample Rank Statistics

    Authors: Myrto Limnios, Nathan Noiry, Stéphan Clémençon

    Abstract: The ability to collect and store ever more massive databases has been accompanied by the need to process them efficiently. In many cases, most observations have the same behavior, while a probable small proportion of these observations are abnormal. Detecting the latter, defined as outliers, is one of the major challenges for machine learning applications (e.g. in fraud detection or in predictive…

    Submitted 20 September, 2021; originally announced September 2021.

  19. arXiv:2109.02357  [pdf, other]

    cs.CV cs.CY cs.LG stat.ML

    Fighting Selection Bias in Statistical Learning: Application to Visual Recognition from Biased Image Databases

    Authors: Stephan Clémençon, Pierre Laforgue, Robin Vogel

    Abstract: In practice, and especially when training deep neural networks, visual recognition rules are often learned based on various sources of information. On the other hand, the recent deployment of facial recognition systems with uneven performances on different population segments has highlighted the representativeness issues induced by a naive aggregation of the datasets. In this paper, we show how bi…

    Submitted 1 November, 2022; v1 submitted 6 September, 2021; originally announced September 2021.

  20. arXiv:2107.12825  [pdf]

    cs.LG stat.ML

    Individual Survival Curves with Conditional Normalizing Flows

    Authors: Guillaume Ausset, Tom Ciffreo, Francois Portier, Stephan Clémençon, Timothée Papin

    Abstract: Survival analysis, or time-to-event modelling, is a classical statistical problem that has garnered a lot of interest for its practical use in epidemiology, demographics or actuarial sciences. Recent advances on the subject from the point of view of machine learning have been concerned with precise per-individual predictions instead of population studies, driven by the rise of individualized medic…

    Submitted 27 July, 2021; originally announced July 2021.

    Comments: IEEE DSAA '21

  21. arXiv:2106.11068  [pdf, other]

    stat.ML cs.LG

    Affine-Invariant Integrated Rank-Weighted Depth: Definition, Properties and Finite Sample Analysis

    Authors: Guillaume Staerman, Pavlo Mozharovskyi, Stéphan Clémençon

    Abstract: Because it determines a center-outward ordering of observations in $\mathbb{R}^d$ with $d\geq 2$, the concept of statistical depth permits to define quantiles and ranks for multivariate data and use them for various statistical tasks (e.g. inference, hypothesis testing). Whereas many depth functions have been proposed \textit{ad-hoc} in the literature since the seminal contribution of \cite{Tukey7…

    Submitted 4 February, 2022; v1 submitted 21 June, 2021; originally announced June 2021.

  22. arXiv:2104.03966  [pdf, other]

    math.ST stat.ML

    Concentration bounds for the empirical angular measure with statistical learning applications

    Authors: Stéphan Clémençon, Hamid Jalalzai, Stéphane Lhaut, Anne Sabourin, Johan Segers

    Abstract: The angular measure on the unit sphere characterizes the first-order dependence structure of the components of a random vector in extreme regions and is defined in terms of standardized margins. Its statistical recovery is an important step in learning problems involving observations far away from the center. In the common situation that the components of the vector have different distributions, t…

    Submitted 17 October, 2022; v1 submitted 7 April, 2021; originally announced April 2021.

    Comments: 24 pages (main paper), 21 pages (supplement), 2 figures

    MSC Class: Primary 62G05; 62G30; 62G32; secondary 62H30

  23. arXiv:2104.02943  [pdf, other]

    math.ST stat.ML

    Concentration Inequalities for Two-Sample Rank Processes with Application to Bipartite Ranking

    Authors: Stéphan Clémençon, Myrto Limnios, Nicolas Vayatis

    Abstract: The ROC curve is the gold standard for measuring the performance of a test/scoring statistic regarding its capacity to discriminate between two statistical populations in a wide variety of applications, ranging from anomaly detection in signal processing to information retrieval, through medical diagnosis. Most practical performance measures used in scoring/ranking applications such as the AUC, th…

    Submitted 24 January, 2023; v1 submitted 7 April, 2021; originally announced April 2021.

    Journal ref: Electronic Journal of Statistics, Shaker Heights, OH: Institute of Mathematical Statistics, 2021, 15 (2), pp. 4659-4717

  24. arXiv:2103.15708  [pdf, other]

    cs.CR

    Dynamically Modelling Heterogeneous Higher-Order Interactions for Malicious Behavior Detection in Event Logs

    Authors: Corentin Larroche, Johan Mazel, Stephan Clémençon

    Abstract: Anomaly detection in event logs is a promising approach for intrusion detection in enterprise networks. By building a statistical model of usual activity, it aims to detect multiple kinds of malicious behavior, including stealthy tactics, techniques and procedures (TTPs) designed to evade signature-based detection systems. However, finding suitable anomaly detection methods for event logs remains…

    Submitted 28 June, 2022; v1 submitted 29 March, 2021; originally announced March 2021.

  25. arXiv:2103.12711  [pdf, other]

    stat.ML cs.LG

    A Pseudo-Metric between Probability Distributions based on Depth-Trimmed Regions

    Authors: Guillaume Staerman, Pavlo Mozharovskyi, Pierre Colombo, Stéphan Clémençon, Florence d'Alché-Buc

    Abstract: The design of a metric between probability distributions is a longstanding problem motivated by numerous applications in Machine Learning. Focusing on continuous probability distributions on the Euclidean space $\mathbb{R}^d$, we introduce a novel pseudo-metric between probability distributions by leveraging the extension of univariate quantiles to multivariate spaces. Data depth is a nonparametri…

    Submitted 10 October, 2022; v1 submitted 23 March, 2021; originally announced March 2021.

  26. arXiv:2006.15043  [pdf, other]

    cs.LG stat.ML

    Nearest Neighbour Based Estimates of Gradients: Sharp Nonasymptotic Bounds and Applications

    Authors: Guillaume Ausset, Stephan Clémençon, François Portier

    Abstract: Motivated by a wide variety of applications, ranging from stochastic optimization to dimension reduction through variable selection, the problem of estimating gradients accurately is of crucial importance in statistics and learning theory. We consider here the classic regression setup, where a real valued square integrable r.v. $Y$ is to be predicted upon observing a (possibly high dimensional) ra…

    Submitted 26 June, 2020; originally announced June 2020.

  27. arXiv:2006.05240  [pdf, other]

    stat.ML cs.LG

    Generalization Bounds in the Presence of Outliers: a Median-of-Means Study

    Authors: Pierre Laforgue, Guillaume Staerman, Stephan Clémençon

    Abstract: In contrast to the empirical mean, the Median-of-Means (MoM) is an estimator of the mean $θ$ of a square integrable r.v. $Z$, around which accurate nonasymptotic confidence bounds can be built, even when $Z$ does not exhibit a sub-Gaussian tail behavior. Thanks to the high confidence it achieves on heavy-tailed data, MoM has found various applications in machine learning, where it is used to desig…

    Submitted 7 February, 2021; v1 submitted 9 June, 2020; originally announced June 2020.

  28. arXiv:2003.07703  [pdf, other]

    cs.CY

    Flexible and Context-Specific AI Explainability: A Multidisciplinary Approach

    Authors: Valérie Beaudouin, Isabelle Bloch, David Bounie, Stéphan Clémençon, Florence d'Alché-Buc, James Eagan, Winston Maxwell, Pavlo Mozharovskyi, Jayneel Parekh

    Abstract: The recent enthusiasm for artificial intelligence (AI) is due principally to advances in deep learning. Deep learning methods are remarkably accurate, but also opaque, which limits their potential use in safety-critical applications. To achieve trust and accountability, designers and operators of machine learning algorithms must be able to explain the inner workings, the results and the causes of…

    Submitted 13 March, 2020; originally announced March 2020.

  29. arXiv:2002.09420  [pdf, other]

    stat.ML cs.LG

    A Multiclass Classification Approach to Label Ranking

    Authors: Stephan Clémençon, Robin Vogel

    Abstract: In multiclass classification, the goal is to learn how to predict a random label $Y$, valued in $\mathcal{Y}=\{1,\; \ldots,\; K \}$ with $K\geq 3$, based upon observing a r.v. $X$, taking its values in $\mathbb{R}^q$ with $q\geq 1$ say, by means of a classification rule $g:\mathbb{R}^q\to \mathcal{Y}$ with minimum probability of error $\mathbb{P}\{Y\neq g(X) \}$. However, in a wide variety of situ…

    Submitted 21 February, 2020; originally announced February 2020.

    Comments: 15 pages, 6 figures

  30. arXiv:2002.08159  [pdf, other]

    stat.ML cs.LG

    Learning Fair Scoring Functions: Bipartite Ranking under ROC-based Fairness Constraints

    Authors: Robin Vogel, Aurélien Bellet, Stephan Clémençon

    Abstract: Many applications of AI involve scoring individuals using a learned function of their attributes. These predictive risk scores are then used to take decisions based on whether the score exceeds a certain threshold, which may vary depending on the context. The level of delegation granted to such systems in critical applications like credit lending and medical diagnosis will heavily depend on how qu…

    Submitted 25 February, 2021; v1 submitted 19 February, 2020; originally announced February 2020.

    Comments: 35 pages, 13 figures, 6 tables

  31. arXiv:2002.05145  [pdf, other]

    stat.ML cs.LG

    Weighted Empirical Risk Minimization: Sample Selection Bias Correction based on Importance Sampling

    Authors: Robin Vogel, Mastane Achab, Stéphan Clémençon, Charles Tillier

    Abstract: We consider statistical learning problems, when the distribution $P'$ of the training observations $Z'_1,\; \ldots,\; Z'_n$ differs from the distribution $P$ involved in the risk one seeks to minimize (referred to as the test distribution) but is still defined on the same measurable space as $P$ and dominates it. In the unrealistic case where the likelihood ratio $Φ(z)=dP/dP'(z)$ is known, one may…

    Submitted 19 February, 2020; v1 submitted 12 February, 2020; originally announced February 2020.

    Comments: 20 pages, 7 tables and figures

  32. arXiv:1910.04085  [pdf, other]

    stat.ML cs.LG math.ST stat.ME

    The Area of the Convex Hull of Sampled Curves: a Robust Functional Statistical Depth Measure

    Authors: Guillaume Staerman, Pavlo Mozharovskyi, Stephan Clémençon

    Abstract: With the ubiquity of sensors in the IoT era, statistical observations are becoming increasingly available in the form of massive (multivariate) time-series. Formulated as unsupervised anomaly detection tasks, an abundance of applications like aviation safety management, the health monitoring of complex infrastructures or fraud detection can now rely on such functional data, acquired and stored wit…

    Submitted 13 February, 2020; v1 submitted 9 October, 2019; originally announced October 2019.

  33. arXiv:1907.07523  [pdf, other]

    stat.ME stat.AP stat.ML

    A Multivariate Extreme Value Theory Approach to Anomaly Clustering and Visualization

    Authors: Maël Chiapino, Stéphan Clémençon, Vincent Feuillard, Anne Sabourin

    Abstract: In a wide variety of situations, anomalies in the behaviour of a complex system, whose health is monitored through the observation of a random vector $X = (X_1, \ldots, X_d)$ valued in $\mathbb{R}^d$, correspond to the simultaneous occurrence of extreme values for certain subgroups $α \subset \{1, \ldots, d\}$ of variables $X_j$. Under the heavy-tail assumption, which is precisely appropriate for modeling these pheno…

    Submitted 17 July, 2019; originally announced July 2019.

  34. arXiv:1906.12304  [pdf, other]

    stat.ML cs.LG

    Statistical Learning from Biased Training Samples

    Authors: Stephan Clémençon, Pierre Laforgue

    Abstract: With the deluge of digitized information in the Big Data era, massive datasets are becoming increasingly available for learning predictive models. However, in many practical situations, the poor control of the data acquisition processes may naturally jeopardize the outputs of machine learning algorithms, and selection bias issues are now the subject of much attention in the literature. The present…

    Submitted 1 November, 2022; v1 submitted 28 June, 2019; originally announced June 2019.

  35. On Tree-based Methods for Similarity Learning

    Authors: Stéphan Clémençon, Robin Vogel

    Abstract: In many situations, the choice of an adequate similarity measure or metric on the feature space dramatically determines the performance of machine learning methods. Building automatically such measures is the specific purpose of metric/similarity learning. In Vogel et al. (2018), similarity learning is formulated as a pairwise bipartite ranking problem: ideally, the larger the probability that two…

    Submitted 21 June, 2019; originally announced June 2019.

    Comments: 17 pages, 4 figures

  36. arXiv:1906.09234  [pdf, other]

    stat.ML cs.LG

    Trade-offs in Large-Scale Distributed Tuplewise Estimation and Learning

    Authors: Robin Vogel, Aurélien Bellet, Stephan Clémençon, Ons Jelassi, Guillaume Papa

    Abstract: The development of cluster computing frameworks has allowed practitioners to scale out various statistical estimation and machine learning algorithms with minimal programming effort. This is especially true for machine learning problems whose objective function is nicely separable across individual data points, such as classification and regression. In contrast, statistical learning tasks involvin…

    Submitted 21 June, 2019; originally announced June 2019.

    Comments: 23 pages, 6 figures, ECML 2019

  37. arXiv:1906.01908  [pdf, other]

    cs.LG math.ST stat.ML

    Empirical Risk Minimization under Random Censorship: Theory and Practice

    Authors: Guillaume Ausset, Stéphan Clémençon, François Portier

    Abstract: We consider the classic supervised learning problem, where a continuous non-negative random label $Y$ (i.e. a random duration) is to be predicted based upon observing a random vector $X$ valued in $\mathbb{R}^d$ with $d\geq 1$ by means of a regression rule with minimum least square error. In various applications, ranging from industrial quality control to public health through credit risk analysis…

    Submitted 5 June, 2019; originally announced June 2019.

    Comments: Submitted to JMLR. 18 pages + Appendix

  38. arXiv:1904.04573  [pdf, other]

    stat.ML cs.LG

    Functional Isolation Forest

    Authors: Guillaume Staerman, Pavlo Mozharovskyi, Stephan Clémençon, Florence d'Alché-Buc

    Abstract: For the purpose of monitoring the behavior of complex infrastructures (e.g. aircrafts, transport or energy networks), high-rate sensors are deployed to capture multivariate data, generally unlabeled, in quasi continuous-time to detect quickly the occurrence of anomalies that may jeopardize the smooth operation of the system of interest. The statistical analysis of such massive data of functional n…

    Submitted 9 October, 2019; v1 submitted 9 April, 2019; originally announced April 2019.

  39. arXiv:1810.06291  [pdf, other]

    stat.ML cs.LG

    Dimensionality Reduction and (Bucket) Ranking: a Mass Transportation Approach

    Authors: Mastane Achab, Anna Korba, Stephan Clémençon

    Abstract: Whereas most dimensionality reduction techniques (e.g. PCA, ICA, NMF) for multivariate data essentially rely on linear algebra to a certain extent, summarizing ranking data, viewed as realizations of a random permutation $Σ$ on a set of items indexed by $i\in \{1,\ldots,\; n\}$, is a great statistical challenge, due to the absence of vector space structure for the set of permutations…

    Submitted 30 August, 2019; v1 submitted 15 October, 2018; originally announced October 2018.

  40. arXiv:1807.06981  [pdf, other]

    stat.ML cs.AI cs.LG

    A Probabilistic Theory of Supervised Similarity Learning for Pointwise ROC Curve Optimization

    Authors: Robin Vogel, Aurélien Bellet, Stéphan Clémençon

    Abstract: The performance of many machine learning techniques depends on the choice of an appropriate similarity or distance measure on the input space. Similarity learning (or metric learning) aims at building such a measure from training data so that observations with the same (resp. different) label are as close (resp. far) as possible. In this paper, similarity learning is investigated from the perspect…

    Submitted 18 July, 2018; originally announced July 2018.

    Comments: 8 pages main paper, 22 pages with appendices, proceedings of ICML 2018

    Journal ref: PMLR 80 (2018) 5062-5071

  41. arXiv:1805.11028  [pdf, other]

    stat.ML cs.LG

    Autoencoding any Data through Kernel Autoencoders

    Authors: Pierre Laforgue, Stephan Clémençon, Florence d'Alché-Buc

    Abstract: This paper investigates a novel algorithmic approach to data representation based on kernel methods. Assuming that the observations lie in a Hilbert space X, the introduced Kernel Autoencoder (KAE) is the composition of mappings from vector-valued Reproducing Kernel Hilbert Spaces (vv-RKHSs) that minimizes the expected reconstruction error. Beyond a first extension of the autoencoding scheme to po…

    Submitted 2 December, 2020; v1 submitted 28 May, 2018; originally announced May 2018.

  42. arXiv:1805.02908  [pdf, other]

    stat.ML cs.LG

    Profitable Bandits

    Authors: Mastane Achab, Stephan Clémençon, Aurélien Garivier

    Abstract: Originally motivated by default risk management applications, this paper investigates a novel problem, referred to as the profitable bandit problem here. At each step, an agent chooses a subset of the K possible actions. For each action chosen, she then receives the sum of a random number of rewards. Her objective is to maximize her cumulated earnings. We adapt and study three well-known strategie…

    Submitted 8 May, 2018; originally announced May 2018.

  43. arXiv:1801.05772  [pdf, other]

    stat.ML

    Ranking Data with Continuous Labels through Oriented Recursive Partitions

    Authors: Stephan Clémençon, Mastane Achab

    Abstract: We formulate a supervised learning problem, referred to as continuous ranking, where a continuous real-valued label Y is assigned to an observable r.v. X taking its values in a feature space $\mathcal{X}$ and the goal is to order all possible observations x in $\mathcal{X}$ by means of a scoring function $s:\mathcal{X}\rightarrow \mathbb{R}$ so that s(X) and Y tend to increase or decrease together…

    Submitted 17 January, 2018; originally announced January 2018.

  44. arXiv:1711.00070  [pdf, other]

    math.ST stat.ML

    Ranking Median Regression: Learning to Order through Local Consensus

    Authors: Stephan Clémençon, Anna Korba, Eric Sibony

    Abstract: This article is devoted to the problem of predicting the value taken by a random permutation $Σ$, describing the preferences of an individual over a set of numbered items $\{1,\; \ldots,\; n\}$ say, based on the observation of an input/explanatory r.v. $X$ (e.g. characteristics of the individual), when error is measured by the Kendall $τ$ distance. In the probabilistic formulation of the 'Learning…

    Submitted 18 December, 2017; v1 submitted 31 October, 2017; originally announced November 2017.

  45. arXiv:1707.08820  [pdf, other]

    stat.ML cs.LG

    Max K-armed bandit: On the ExtremeHunter algorithm and beyond

    Authors: Mastane Achab, Stephan Clémençon, Aurélien Garivier, Anne Sabourin, Claire Vernade

    Abstract: This paper is devoted to the study of the max K-armed bandit problem, which consists in sequentially allocating resources in order to detect extreme values. Our contribution is twofold. We first significantly refine the analysis of the ExtremeHunter algorithm carried out in Carpentier and Valko (2014), and next propose an alternative approach, showing that, remarkably, Extreme Bandits can be reduc…

    Submitted 27 July, 2017; originally announced July 2017.

  46. arXiv:1705.01305  [pdf, other]

    stat.ML

    Mass Volume Curves and Anomaly Ranking

    Authors: Stephan Clémençon, Albert Thomas

    Abstract: This paper aims at formulating the issue of ranking multivariate unlabeled observations depending on their degree of abnormality as an unsupervised statistical learning task. In the 1-d situation, this problem is usually tackled by means of tail estimation techniques: univariate observations are viewed as all the more 'abnormal' as they are located far in the tail(s) of the underlying probability…

    Submitted 3 September, 2018; v1 submitted 3 May, 2017; originally announced May 2017.

  47. arXiv:1610.03776  [pdf, ps, other]

    math.ST

    Sharp exponential inequalities in survey sampling: conditional Poisson sampling schemes

    Authors: Patrice Bertail, Stephan Clémençon

    Abstract: This paper is devoted to establishing exponential bounds for the probabilities of deviation of a sample sum from its expectation, when the variables involved in the summation are obtained by sampling in a finite population according to a rejective scheme, generalizing sampling without replacement, and by using an appropriate normalization. In contrast to Poisson sampling, classical deviation inequ…

    Submitted 12 October, 2016; originally announced October 2016.

  48. arXiv:1606.02421  [pdf, other]

    stat.ML cs.AI cs.DC cs.LG eess.SY

    Gossip Dual Averaging for Decentralized Optimization of Pairwise Functions

    Authors: Igor Colin, Aurélien Bellet, Joseph Salmon, Stéphan Clémençon

    Abstract: In decentralized networks (of sensors, connected objects, etc.), there is an important need for efficient algorithms to optimize a global cost function, for instance to learn a global model from the local data collected by each computing unit. In this paper, we address the problem of decentralized minimization of pairwise functions of the data points, where these points are distributed over the no…

    Submitted 8 June, 2016; originally announced June 2016.

  49. arXiv:1603.09584  [pdf, other]

    stat.ML

    Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking

    Authors: Nicolas Goix, Anne Sabourin, Stéphan Clémençon

    Abstract: Extremes play a special role in Anomaly Detection. Beyond inference and simulation purposes, probabilistic tools borrowed from Extreme Value Theory (EVT), such as the angular measure, can also be used to design novel statistical learning methods for Anomaly Detection/ranking. This paper proposes a new algorithm based on multivariate EVT to learn how to rank observations in a high dimensional space…

    Submitted 31 March, 2016; originally announced March 2016.

    Comments: arXiv admin note: text overlap with arXiv:1507.05899

  50. arXiv:1601.00399  [pdf, other]

    math.ST

    A Multiresolution Analysis Framework for the Statistical Analysis of Incomplete Rankings

    Authors: Eric Sibony, Stéphan Clémençon, Jérémie Jakubowicz

    Abstract: Though the statistical analysis of ranking data has been a subject of interest over the past centuries, especially in economics, psychology or social choice theory, it has been revitalized in the past 15 years by recent applications such as recommender or search engines and is receiving now increasing interest in the machine learning literature. Numerous modern systems indeed generate ranking data…

    Submitted 4 January, 2016; originally announced January 2016.