Search | arXiv e-print repository

Deep Sketched Output Kernel Regression for Structured Prediction

Authors: Tamim El Ahmad, Junjie Yang, Pierre Laforgue, Florence d'Alché-Buc

Abstract: By leveraging the kernel trick in the output space, kernel-induced losses provide a principled way to define structured output prediction tasks for a wide variety of output modalities. In particular, they have been successfully used in the context of surrogate non-parametric regression, where the kernel trick is typically exploited in the input space as well. However, when inputs are images or tex… ▽ More By leveraging the kernel trick in the output space, kernel-induced losses provide a principled way to define structured output prediction tasks for a wide variety of output modalities. In particular, they have been successfully used in the context of surrogate non-parametric regression, where the kernel trick is typically exploited in the input space as well. However, when inputs are images or texts, more expressive models such as deep neural networks seem more suited than non-parametric methods. In this work, we tackle the question of how to train neural networks to solve structured output prediction tasks, while still benefiting from the versatility and relevance of kernel-induced losses. We design a novel family of deep neural architectures, whose last layer predicts in a data-dependent finite-dimensional subspace of the infinite-dimensional output feature space deriving from the kernel-induced loss. This subspace is chosen as the span of the eigenfunctions of a randomly-approximated version of the empirical kernel covariance operator. Interestingly, this approach unlocks the use of gradient descent algorithms (and consequently of any neural architecture) for structured prediction. Experiments on synthetic tasks as well as real-world supervised graph prediction problems show the relevance of our method. △ Less

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2312.16139 [pdf, other]

Anomaly component analysis

Authors: Romain Valla, Pavlo Mozharovskyi, Florence d'Alché-Buc

Abstract: At the crossway of machine learning and data analysis, anomaly detection aims at identifying observations that exhibit abnormal behaviour. Be it measurement errors, disease development, severe weather, production quality default(s) (items) or failed equipment, financial frauds or crisis events, their on-time identification and isolation constitute an important task in almost any area of industry a… ▽ More At the crossway of machine learning and data analysis, anomaly detection aims at identifying observations that exhibit abnormal behaviour. Be it measurement errors, disease development, severe weather, production quality default(s) (items) or failed equipment, financial frauds or crisis events, their on-time identification and isolation constitute an important task in almost any area of industry and science. While a substantial body of literature is devoted to detection of anomalies, little attention is payed to their explanation. This is the case mostly due to intrinsically non-supervised nature of the task and non-robustness of the exploratory methods like principal component analysis (PCA). We introduce a new statistical tool dedicated for exploratory analysis of abnormal observations using data depth as a score. Anomaly component analysis (shortly ACA) is a method that searches a low-dimensional data representation that best visualises and explains anomalies. This low-dimensional representation not only allows to distinguish groups of anomalies better than the methods of the state of the art, but as well provides a -- linear in variables and thus easily interpretable -- explanation for anomalies. In a comparative simulation and real-data study, ACA also proves advantageous for anomaly analysis with respect to methods present in the literature. △ Less

Submitted 26 December, 2023; originally announced December 2023.

Comments: 41 pages, 25 figures, 13 tables

arXiv:2312.14136 [pdf, other]

Fast kernel half-space depth for data with non-convex supports

Authors: Arturo Castellanos, Pavlo Mozharovskyi, Florence d'Alché-Buc, Hicham Janati

Abstract: Data depth is a statistical function that generalizes order and quantiles to the multivariate setting and beyond, with applications spanning over descriptive and visual statistics, anomaly detection, testing, etc. The celebrated halfspace depth exploits data geometry via an optimization program to deliver properties of invariances, robustness, and non-parametricity. Nevertheless, it implicitly ass… ▽ More Data depth is a statistical function that generalizes order and quantiles to the multivariate setting and beyond, with applications spanning over descriptive and visual statistics, anomaly detection, testing, etc. The celebrated halfspace depth exploits data geometry via an optimization program to deliver properties of invariances, robustness, and non-parametricity. Nevertheless, it implicitly assumes convex data supports and requires exponential computational cost. To tackle distribution's multimodality, we extend the halfspace depth in a Reproducing Kernel Hilbert Space (RKHS). We show that the obtained depth is intuitive and establish its consistency with provable concentration bounds that allow for homogeneity testing. The proposed depth can be computed using manifold gradient making faster than halfspace depth by several orders of magnitude. The performance of our depth is demonstrated through numerical simulations as well as applications such as anomaly detection on real data and homogeneity testing. △ Less

Submitted 21 December, 2023; originally announced December 2023.

Comments: 30 pages

arXiv:2311.01434 [pdf, other]

Tailoring Mixup to Data for Calibration

Authors: Quentin Bouniot, Pavlo Mozharovskyi, Florence d'Alché-Buc

Abstract: Among all data augmentation techniques proposed so far, linear interpolation of training samples, also called Mixup, has found to be effective for a large panel of applications. Along with improved performance, Mixup is also a good technique for improving calibration and predictive uncertainty. However, mixing data carelessly can lead to manifold intrusion, i.e., conflicts between the synthetic la… ▽ More Among all data augmentation techniques proposed so far, linear interpolation of training samples, also called Mixup, has found to be effective for a large panel of applications. Along with improved performance, Mixup is also a good technique for improving calibration and predictive uncertainty. However, mixing data carelessly can lead to manifold intrusion, i.e., conflicts between the synthetic labels assigned and the true label distributions, which can deteriorate calibration. In this work, we argue that the likelihood of manifold intrusion increases with the distance between data to mix. To this end, we propose to dynamically change the underlying distributions of interpolation coefficients depending on the similarity between samples to mix, and define a flexible framework to do so without losing in diversity. We provide extensive experiments for classification and regression tasks, showing that our proposed method improves performance and calibration of models, while being much more efficient. The code for our work is available at https://github.com/qbouniot/sim_kernel_mixup. △ Less

Submitted 11 June, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

arXiv:2309.16604 [pdf, other]

Exploiting Edge Features in Graphs with Fused Network Gromov-Wasserstein Distance

Authors: Junjie Yang, Matthieu Labeau, Florence d'Alché-Buc

Abstract: Pairwise comparison of graphs is key to many applications in Machine learning ranging from clustering, kernel-based classification/regression and more recently supervised graph prediction. Distances between graphs usually rely on informative representations of these structured objects such as bag of substructures or other graph embeddings. A recently popular solution consists in representing graph… ▽ More Pairwise comparison of graphs is key to many applications in Machine learning ranging from clustering, kernel-based classification/regression and more recently supervised graph prediction. Distances between graphs usually rely on informative representations of these structured objects such as bag of substructures or other graph embeddings. A recently popular solution consists in representing graphs as metric measure spaces, allowing to successfully leverage Optimal Transport, which provides meaningful distances allowing to compare them: the Gromov-Wasserstein distances. However, this family of distances overlooks edge attributes, which are essential for many structured objects. In this work, we introduce an extension of Gromov-Wasserstein distance for comparing graphs whose both nodes and edges have features. We propose novel algorithms for distance and barycenter computation. We empirically show the effectiveness of the novel distance in learning tasks where graphs occur in either input space or output space, such as classification and graph prediction. △ Less

Submitted 28 September, 2023; originally announced September 2023.

arXiv:2302.10128 [pdf, other]

Sketch In, Sketch Out: Accelerating both Learning and Inference for Structured Prediction with Kernels

Authors: Tamim El Ahmad, Luc Brogat-Motte, Pierre Laforgue, Florence d'Alché-Buc

Abstract: Leveraging the kernel trick in both the input and output spaces, surrogate kernel methods are a flexible and theoretically grounded solution to structured output prediction. If they provide state-of-the-art performance on complex data sets of moderate size (e.g., in chemoinformatics), these approaches however fail to scale. We propose to equip surrogate kernel methods with sketching-based approxim… ▽ More Leveraging the kernel trick in both the input and output spaces, surrogate kernel methods are a flexible and theoretically grounded solution to structured output prediction. If they provide state-of-the-art performance on complex data sets of moderate size (e.g., in chemoinformatics), these approaches however fail to scale. We propose to equip surrogate kernel methods with sketching-based approximations, applied to both the input and output feature maps. We prove excess risk bounds on the original structured prediction problem, showing how to attain close-to-optimal rates with a reduced sketch size that depends on the eigendecay of the input/output covariance operators. From a computational perspective, we show that the two approximations have distinct but complementary impacts: sketching the input kernel mostly reduces training time, while sketching the output kernel decreases the inference time. Empirically, our approach is shown to scale, achieving state-of-the-art performance on benchmark data sets where non-sketched methods are intractable. △ Less

Submitted 6 May, 2024; v1 submitted 20 February, 2023; originally announced February 2023.

Journal ref: Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:109-117, 2024

arXiv:2211.08958 [pdf, other]

Vector-Valued Least-Squares Regression under Output Regularity Assumptions

Authors: Luc Brogat-Motte, Alessandro Rudi, Céline Brouard, Juho Rousu, Florence d'Alché-Buc

Abstract: We propose and analyse a reduced-rank method for solving least-squares regression problems with infinite dimensional output. We derive learning bounds for our method, and study under which setting statistical performance is improved in comparison to full-rank method. Our analysis extends the interest of reduced-rank regression beyond the standard low-rank setting to more general output regularity… ▽ More We propose and analyse a reduced-rank method for solving least-squares regression problems with infinite dimensional output. We derive learning bounds for our method, and study under which setting statistical performance is improved in comparison to full-rank method. Our analysis extends the interest of reduced-rank regression beyond the standard low-rank setting to more general output regularity assumptions. We illustrate our theoretical insights on synthetic least-squares problems. Then, we propose a surrogate structured prediction method derived from this reduced-rank method. We assess its benefits on three different problems: image reconstruction, multi-label classification, and metabolite identification. △ Less

Submitted 16 November, 2022; originally announced November 2022.

arXiv:2206.08220 [pdf, other]

Functional Output Regression with Infimal Convolution: Exploring the Huber and $ε$-insensitive Losses

Authors: Alex Lambert, Dimitri Bouche, Zoltan Szabo, Florence d'Alché-Buc

Abstract: The focus of the paper is functional output regression (FOR) with convoluted losses. While most existing work consider the square loss setting, we leverage extensions of the Huber and the $ε$-insensitive loss (induced by infimal convolution) and propose a flexible framework capable of handling various forms of outliers and sparsity in the FOR family. We derive computationally tractable algorithms… ▽ More The focus of the paper is functional output regression (FOR) with convoluted losses. While most existing work consider the square loss setting, we leverage extensions of the Huber and the $ε$-insensitive loss (induced by infimal convolution) and propose a flexible framework capable of handling various forms of outliers and sparsity in the FOR family. We derive computationally tractable algorithms relying on duality to tackle the resulting tasks in the context of vector-valued reproducing kernel Hilbert spaces. The efficiency of the approach is demonstrated and contrasted with the classical squared loss setting on both synthetic and real-world benchmarks. △ Less

Submitted 16 June, 2022; originally announced June 2022.

Comments: 24 pages, ICML 2022

arXiv:2206.03827 [pdf, other]

Fast Kernel Methods for Generic Lipschitz Losses via $p$-Sparsified Sketches

Authors: Tamim El Ahmad, Pierre Laforgue, Florence d'Alché-Buc

Abstract: Kernel methods are learning algorithms that enjoy solid theoretical foundations while suffering from important computational limitations. Sketching, which consists in looking for solutions among a subspace of reduced dimension, is a well studied approach to alleviate these computational burdens. However, statistically-accurate sketches, such as the Gaussian one, usually contain few null entries, s… ▽ More Kernel methods are learning algorithms that enjoy solid theoretical foundations while suffering from important computational limitations. Sketching, which consists in looking for solutions among a subspace of reduced dimension, is a well studied approach to alleviate these computational burdens. However, statistically-accurate sketches, such as the Gaussian one, usually contain few null entries, such that their application to kernel methods and their non-sparse Gram matrices remains slow in practice. In this paper, we show that sparsified Gaussian (and Rademacher) sketches still produce theoretically-valid approximations while allowing for important time and space savings thanks to an efficient \emph{decomposition trick}. To support our method, we derive excess risk bounds for both single and multiple output kernel problems, with generic Lipschitz losses, hereby providing new guarantees for a wide range of applications, from robust regression to multiple quantile regression. Our theoretical results are complemented with experiments showing the empirical superiority of our approach over SOTA sketching methods. △ Less

Submitted 6 November, 2023; v1 submitted 8 June, 2022; originally announced June 2022.

Journal ref: Transactions on Machine Learning Research (2023)

arXiv:2204.09362 [pdf, other]

Wind power predictions from nowcasts to 4-hour forecasts: a learning approach with variable selection

Authors: Dimitri Bouche, Rémi Flamary, Florence d'Alché-Buc, Riwal Plougonven, Marianne Clausel, Jordi Badosa, Philippe Drobinski

Abstract: We study short-term prediction of wind speed and wind power (every 10 minutes up to 4 hours ahead). Accurate forecasts for these quantities are crucial to mitigate the negative effects of wind farms' intermittent production on energy systems and markets. We use machine learning to combine outputs from numerical weather prediction models with local observations. The former provide valuable informat… ▽ More We study short-term prediction of wind speed and wind power (every 10 minutes up to 4 hours ahead). Accurate forecasts for these quantities are crucial to mitigate the negative effects of wind farms' intermittent production on energy systems and markets. We use machine learning to combine outputs from numerical weather prediction models with local observations. The former provide valuable information on higher scales dynamics while the latter gives the model fresher and location-specific data. So as to make the results usable for practitioners, we focus on well-known methods which can handle a high volume of data. We study first variable selection using both a linear technique and a nonlinear one. Then we exploit these results to forecast wind speed and wind power still with an emphasis on linear models versus nonlinear ones. For the wind power prediction, we also compare the indirect approach (wind speed predictions passed through a power curve) and the indirect one (directly predict wind power). △ Less

Submitted 13 December, 2022; v1 submitted 20 April, 2022; originally announced April 2022.

arXiv:2202.03813 [pdf, other]

Learning to Predict Graphs with Fused Gromov-Wasserstein Barycenters

Authors: Luc Brogat-Motte, Rémi Flamary, Céline Brouard, Juho Rousu, Florence d'Alché-Buc

Abstract: This paper introduces a novel and generic framework to solve the flagship task of supervised labeled graph prediction by leveraging Optimal Transport tools. We formulate the problem as regression with the Fused Gromov-Wasserstein (FGW) loss and propose a predictive model relying on a FGW barycenter whose weights depend on inputs. First we introduce a non-parametric estimator based on kernel ridge… ▽ More This paper introduces a novel and generic framework to solve the flagship task of supervised labeled graph prediction by leveraging Optimal Transport tools. We formulate the problem as regression with the Fused Gromov-Wasserstein (FGW) loss and propose a predictive model relying on a FGW barycenter whose weights depend on inputs. First we introduce a non-parametric estimator based on kernel ridge regression for which theoretical results such as consistency and excess risk bound are proved. Next we propose an interpretable parametric model where the barycenter weights are modeled with a neural network and the graphs on which the FGW barycenter is calculated are additionally learned. Numerical experiments show the strength of the method and its ability to interpolate in the labeled graph space on simulated data and on a difficult metabolic identification problem where it can reach very good performance with very little engineering. △ Less

Submitted 24 June, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

arXiv:2103.12711 [pdf, other]

A Pseudo-Metric between Probability Distributions based on Depth-Trimmed Regions

Authors: Guillaume Staerman, Pavlo Mozharovskyi, Pierre Colombo, Stéphan Clémençon, Florence d'Alché-Buc

Abstract: The design of a metric between probability distributions is a longstanding problem motivated by numerous applications in Machine Learning. Focusing on continuous probability distributions on the Euclidean space $\mathbb{R}^d$, we introduce a novel pseudo-metric between probability distributions by leveraging the extension of univariate quantiles to multivariate spaces. Data depth is a nonparametri… ▽ More The design of a metric between probability distributions is a longstanding problem motivated by numerous applications in Machine Learning. Focusing on continuous probability distributions on the Euclidean space $\mathbb{R}^d$, we introduce a novel pseudo-metric between probability distributions by leveraging the extension of univariate quantiles to multivariate spaces. Data depth is a nonparametric statistical tool that measures the centrality of any element $x\in\mathbb{R}^d$ with respect to (w.r.t.) a probability distribution or a data set. It is a natural median-oriented extension of the cumulative distribution function (cdf) to the multivariate case. Thus, its upper-level sets -- the depth-trimmed regions -- give rise to a definition of multivariate quantiles. The new pseudo-metric relies on the average of the Hausdorff distance between the depth-based quantile regions w.r.t. each distribution. Its good behavior w.r.t. major transformation groups, as well as its ability to factor out translations, are depicted. Robustness, an appealing feature of this pseudo-metric, is studied through the finite sample breakdown point. Moreover, we propose an efficient approximation method with linear time complexity w.r.t. the size of the data set and its dimension. The quality of this approximation as well as the performance of the proposed approach are illustrated in numerical experiments. △ Less

Submitted 10 October, 2022; v1 submitted 23 March, 2021; originally announced March 2021.

arXiv:2102.05075 [pdf, other]

Emotion Transfer Using Vector-Valued Infinite Task Learning

Authors: Alex Lambert, Sanjeel Parekh, Zoltán Szabó, Florence d'Alché-Buc

Abstract: Style transfer is a significant problem of machine learning with numerous successful applications. In this work, we present a novel style transfer framework building upon infinite task learning and vector-valued reproducing kernel Hilbert spaces. We instantiate the idea in emotion transfer where the goal is to transform facial images to different target emotions. The proposed approach provides a p… ▽ More Style transfer is a significant problem of machine learning with numerous successful applications. In this work, we present a novel style transfer framework building upon infinite task learning and vector-valued reproducing kernel Hilbert spaces. We instantiate the idea in emotion transfer where the goal is to transform facial images to different target emotions. The proposed approach provides a principled way to gain explicit control over the continuous style space. We demonstrate the efficiency of the technique on popular facial emotion benchmarks, achieving low reconstruction cost and high emotion classification accuracy. △ Less

Submitted 9 February, 2021; originally announced February 2021.

Comments: 17 pages, 10 figures

arXiv:2010.09345 [pdf, other]

A Framework to Learn with Interpretation

Authors: Jayneel Parekh, Pavlo Mozharovskyi, Florence d'Alché-Buc

Abstract: To tackle interpretability in deep learning, we present a novel framework to jointly learn a predictive model and its associated interpretation model. The interpreter provides both local and global interpretability about the predictive model in terms of human-understandable high level attribute functions, with minimal loss of accuracy. This is achieved by a dedicated architecture and well chosen r… ▽ More To tackle interpretability in deep learning, we present a novel framework to jointly learn a predictive model and its associated interpretation model. The interpreter provides both local and global interpretability about the predictive model in terms of human-understandable high level attribute functions, with minimal loss of accuracy. This is achieved by a dedicated architecture and well chosen regularization penalties. We seek for a small-size dictionary of high level attribute functions that take as inputs the outputs of selected hidden layers and whose outputs feed a linear classifier. We impose strong conciseness on the activation of attributes with an entropy-based criterion while enforcing fidelity to both inputs and outputs of the predictive model. A detailed pipeline to visualize the learnt features is also developed. Moreover, besides generating interpretable models by design, our approach can be specialized to provide post-hoc interpretations for a pre-trained neural network. We validate our approach against several state-of-the-art methods on multiple datasets and show its efficacy on both kinds of tasks. △ Less

Submitted 23 February, 2022; v1 submitted 19 October, 2020; originally announced October 2020.

arXiv:2007.14703 [pdf, other]

Learning Output Embeddings in Structured Prediction

Authors: Luc Brogat-Motte, Alessandro Rudi, Céline Brouard, Juho Rousu, Florence d'Alché-Buc

Abstract: A powerful and flexible approach to structured prediction consists in embedding the structured objects to be predicted into a feature space of possibly infinite dimension by means of output kernels, and then, solving a regression problem in this output space. A prediction in the original space is computed by solving a pre-image problem. In such an approach, the embedding, linked to the target loss… ▽ More A powerful and flexible approach to structured prediction consists in embedding the structured objects to be predicted into a feature space of possibly infinite dimension by means of output kernels, and then, solving a regression problem in this output space. A prediction in the original space is computed by solving a pre-image problem. In such an approach, the embedding, linked to the target loss, is defined prior to the learning phase. In this work, we propose to jointly learn a finite approximation of the output embedding and the regression function into the new feature space. For that purpose, we leverage a priori information on the outputs and also unexploited unsupervised output data, which are both often available in structured prediction problems. We prove that the resulting structured predictor is a consistent estimator, and derive an excess risk bound. Moreover, the novel structured prediction tool enjoys a significantly smaller computational complexity than former output kernel methods. The approach empirically tested on various structured prediction problems reveals to be versatile and able to handle large datasets. △ Less

Submitted 2 November, 2020; v1 submitted 29 July, 2020; originally announced July 2020.

arXiv:2006.10325 [pdf, other]

When OT meets MoM: Robust estimation of Wasserstein Distance

Authors: Guillaume Staerman, Pierre Laforgue, Pavlo Mozharovskyi, Florence d'Alché-Buc

Abstract: Issued from Optimal Transport, the Wasserstein distance has gained importance in Machine Learning due to its appealing geometrical properties and the increasing availability of efficient approximations. In this work, we consider the problem of estimating the Wasserstein distance between two probability distributions when observations are polluted by outliers. To that end, we investigate how to lev… ▽ More Issued from Optimal Transport, the Wasserstein distance has gained importance in Machine Learning due to its appealing geometrical properties and the increasing availability of efficient approximations. In this work, we consider the problem of estimating the Wasserstein distance between two probability distributions when observations are polluted by outliers. To that end, we investigate how to leverage Medians of Means (MoM) estimators to robustify the estimation of Wasserstein distance. Exploiting the dual Kantorovitch formulation of Wasserstein distance, we introduce and discuss novel MoM-based robust estimators whose consistency is studied under a data contamination model and for which convergence rates are provided. These MoM estimators enable to make Wasserstein Generative Adversarial Network (WGAN) robust to outliers, as witnessed by an empirical study on two benchmarks CIFAR10 and Fashion MNIST. Eventually, we discuss how to combine MoM with the entropy-regularized approximation of the Wasserstein distance and propose a simple MoM-based re-weighting scheme that could be used in conjunction with the Sinkhorn algorithm. △ Less

Submitted 18 February, 2022; v1 submitted 18 June, 2020; originally announced June 2020.

Journal ref: Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021

arXiv:2003.12206 [pdf, other]

Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program)

Authors: Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d'Alché-Buc, Emily Fox, Hugo Larochelle

Abstract: One of the challenges in machine learning research is to ensure that presented and published results are sound and reliable. Reproducibility, that is obtaining similar results as presented in a paper or talk, using the same code and data (when available), is a necessary step to verify the reliability of research findings. Reproducibility is also an important step to promote open and accessible res… ▽ More One of the challenges in machine learning research is to ensure that presented and published results are sound and reliable. Reproducibility, that is obtaining similar results as presented in a paper or talk, using the same code and data (when available), is a necessary step to verify the reliability of research findings. Reproducibility is also an important step to promote open and accessible research, thereby allowing the scientific community to quickly integrate new findings and convert ideas to practice. Reproducibility also promotes the use of robust experimental workflows, which potentially reduce unintentional errors. In 2019, the Neural Information Processing Systems (NeurIPS) conference, the premier international conference for research in machine learning, introduced a reproducibility program, designed to improve the standards across the community for how we conduct, communicate, and evaluate machine learning research. The program contained three components: a code submission policy, a community-wide reproducibility challenge, and the inclusion of the Machine Learning Reproducibility checklist as part of the paper submission process. In this paper, we describe each of these components, how it was deployed, as well as what we were able to learn from this initiative. △ Less

Submitted 30 December, 2020; v1 submitted 26 March, 2020; originally announced March 2020.

Comments: To appear at JMLR, 16 pages + Appendix

arXiv:2003.01432 [pdf, other]

Nonlinear Functional Output Regression: a Dictionary Approach

Authors: Dimitri Bouche, Marianne Clausel, François Roueff, Florence d'Alché-Buc

Abstract: To address functional-output regression, we introduce projection learning (PL), a novel dictionary-based approach that learns to predict a function that is expanded on a dictionary while minimizing an empirical risk based on a functional loss. PL makes it possible to use non orthogonal dictionaries and can then be combined with dictionary learning; it is thus much more flexible than expansion-base… ▽ More To address functional-output regression, we introduce projection learning (PL), a novel dictionary-based approach that learns to predict a function that is expanded on a dictionary while minimizing an empirical risk based on a functional loss. PL makes it possible to use non orthogonal dictionaries and can then be combined with dictionary learning; it is thus much more flexible than expansion-based approaches relying on vectorial losses. This general method is instantiated with reproducing kernel Hilbert spaces of vector-valued functions as kernel-based projection learning (KPL). For the functional square loss, two closed-form estimators are proposed, one for fully observed output functions and the other for partially observed ones. Both are backed theoretically by an excess risk analysis. Then, in the more general setting of integral losses based on differentiable ground losses, KPL is implemented using first-order optimization for both fully and partially observed output functions. Eventually, several robustness aspects of the proposed algorithms are highlighted on a toy dataset; and a study on two real datasets shows that they are competitive compared to other nonlinear approaches. Notably, using the square loss and a learnt dictionary, KPL enjoys a particularily attractive trade-off between computational cost and performances. △ Less

Submitted 26 February, 2021; v1 submitted 3 March, 2020; originally announced March 2020.

arXiv:1910.04621 [pdf, other]

Duality in RKHSs with Infinite Dimensional Outputs: Application to Robust Losses

Authors: Pierre Laforgue, Alex Lambert, Luc Brogat-Motte, Florence d'Alché-Buc

Abstract: Operator-Valued Kernels (OVKs) and associated vector-valued Reproducing Kernel Hilbert Spaces provide an elegant way to extend scalar kernel methods when the output space is a Hilbert space. Although primarily used in finite dimension for problems like multi-task regression, the ability of this framework to deal with infinite dimensional output spaces unlocks many more applications, such as functi… ▽ More Operator-Valued Kernels (OVKs) and associated vector-valued Reproducing Kernel Hilbert Spaces provide an elegant way to extend scalar kernel methods when the output space is a Hilbert space. Although primarily used in finite dimension for problems like multi-task regression, the ability of this framework to deal with infinite dimensional output spaces unlocks many more applications, such as functional regression, structured output prediction, and structured data representation. However, these sophisticated schemes crucially rely on the kernel trick in the output space, so that most of previous works have focused on the square norm loss function, completely neglecting robustness issues that may arise in such surrogate problems. To overcome this limitation, this paper develops a duality approach that allows to solve OVK machines for a wide range of loss functions. The infinite dimensional Lagrange multipliers are handled through a Double Representer Theorem, and algorithms for $ε$-insensitive losses and the Huber loss are thoroughly detailed. Robustness benefits are emphasized by a theoretical stability analysis, as well as empirical improvements on structured data applications. △ Less

Submitted 21 August, 2020; v1 submitted 10 October, 2019; originally announced October 2019.

arXiv:1904.04573 [pdf, other]

Functional Isolation Forest

Authors: Guillaume Staerman, Pavlo Mozharovskyi, Stephan Clémençon, Florence d'Alché-Buc

Abstract: For the purpose of monitoring the behavior of complex infrastructures (e.g. aircrafts, transport or energy networks), high-rate sensors are deployed to capture multivariate data, generally unlabeled, in quasi continuous-time to detect quickly the occurrence of anomalies that may jeopardize the smooth operation of the system of interest. The statistical analysis of such massive data of functional n… ▽ More For the purpose of monitoring the behavior of complex infrastructures (e.g. aircrafts, transport or energy networks), high-rate sensors are deployed to capture multivariate data, generally unlabeled, in quasi continuous-time to detect quickly the occurrence of anomalies that may jeopardize the smooth operation of the system of interest. The statistical analysis of such massive data of functional nature raises many challenging methodological questions. The primary goal of this paper is to extend the popular Isolation Forest (IF) approach to Anomaly Detection, originally dedicated to finite dimensional observations, to functional data. The major difficulty lies in the wide variety of topological structures that may equip a space of functions and the great variety of patterns that may characterize abnormal curves. We address the issue of (randomly) splitting the functional space in a flexible manner in order to isolate progressively any trajectory from the others, a key ingredient to the efficiency of the algorithm. Beyond a detailed description of the algorithm, computational complexity and stability issues are investigated at length. From the scoring function measuring the degree of abnormality of an observation provided by the proposed variant of the IF algorithm, a Functional Statistical Depth function is defined and discussed as well as a multivariate functional extension. Numerical experiments provide strong empirical evidence of the accuracy of the extension proposed. △ Less

Submitted 9 October, 2019; v1 submitted 9 April, 2019; originally announced April 2019.

arXiv:1805.11028 [pdf, other]

Autoencoding any Data through Kernel Autoencoders

Authors: Pierre Laforgue, Stephan Clémençon, Florence d'Alché-Buc

Abstract: This paper investigates a novel algorithmic approach to data representation based on kernel methods. Assuming that the observations lie in a Hilbert space X, the introduced Kernel Autoencoder (KAE) is the composition of map**s from vector-valued Reproducing Kernel Hilbert Spaces (vv-RKHSs) that minimizes the expected reconstruction error. Beyond a first extension of the autoencoding scheme to po… ▽ More This paper investigates a novel algorithmic approach to data representation based on kernel methods. Assuming that the observations lie in a Hilbert space X, the introduced Kernel Autoencoder (KAE) is the composition of map**s from vector-valued Reproducing Kernel Hilbert Spaces (vv-RKHSs) that minimizes the expected reconstruction error. Beyond a first extension of the autoencoding scheme to possibly infinite dimensional Hilbert spaces, KAE further allows to autoencode any kind of data by choosing X to be itself a RKHS. A theoretical analysis of the model is carried out, providing a generalization bound, and shedding light on its connection with Kernel Principal Component Analysis. The proposed algorithms are then detailed at length: they crucially rely on the form taken by the minimizers, revealed by a dedicated Representer Theorem. Finally, numerical experiments on both simulated data and real labeled graphs (molecules) provide empirical evidence of the KAE performances. △ Less

Submitted 2 December, 2020; v1 submitted 28 May, 2018; originally announced May 2018.

arXiv:1805.08809 [pdf, other]

Infinite-Task Learning with RKHSs

Authors: Romain Brault, Alex Lambert, Zoltán Szabó, Maxime Sangnier, Florence d'Alché-Buc

Abstract: Machine learning has witnessed tremendous success in solving tasks depending on a single hyperparameter. When considering simultaneously a finite number of tasks, multi-task learning enables one to account for the similarities of the tasks via appropriate regularizers. A step further consists of learning a continuum of tasks for various loss functions. A promising approach, called \emph{Parametric… ▽ More Machine learning has witnessed tremendous success in solving tasks depending on a single hyperparameter. When considering simultaneously a finite number of tasks, multi-task learning enables one to account for the similarities of the tasks via appropriate regularizers. A step further consists of learning a continuum of tasks for various loss functions. A promising approach, called \emph{Parametric Task Learning}, has paved the way in the continuum setting for affine models and piecewise-linear loss functions. In this work, we introduce a novel approach called \emph{Infinite Task Learning} whose goal is to learn a function whose output is a function over the hyperparameter space. We leverage tools from operator-valued kernels and the associated vector-valued RKHSs that provide an explicit control over the role of the hyperparameters, and also allows us to consider new type of constraints. We provide generalization guarantees to the suggested scheme and illustrate its efficiency in cost-sensitive classification, quantile regression and density level set estimation. △ Less

Submitted 11 October, 2018; v1 submitted 22 May, 2018; originally announced May 2018.

Comments: 23 pages, 6 figures, 3 tables

MSC Class: 46E22; 62G08; 65D15; 47B32 ACM Class: I.2.6

arXiv:1803.08355 [pdf, other]

Structured Output Learning with Abstention: Application to Accurate Opinion Prediction

Authors: Alexandre Garcia, Slim Essid, Chloé Clavel, Florence d'Alché-Buc

Abstract: Motivated by Supervised Opinion Analysis, we propose a novel framework devoted to Structured Output Learning with Abstention (SOLA). The structure prediction model is able to abstain from predicting some labels in the structured output at a cost chosen by the user in a flexible way. For that purpose, we decompose the problem into the learning of a pair of predictors, one devoted to structured abst… ▽ More Motivated by Supervised Opinion Analysis, we propose a novel framework devoted to Structured Output Learning with Abstention (SOLA). The structure prediction model is able to abstain from predicting some labels in the structured output at a cost chosen by the user in a flexible way. For that purpose, we decompose the problem into the learning of a pair of predictors, one devoted to structured abstention and the other, to structured output prediction. To compare fully labeled training data with predictions potentially containing abstentions, we define a wide class of asymmetric abstention-aware losses. Learning is achieved by surrogate regression in an appropriate feature space while prediction with abstention is performed by solving a new pre-image problem. Thus, SOLA extends recent ideas about Structured Output Prediction via surrogate problems and calibration theory and enjoys statistical guarantees on the resulting excess risk. Instantiated on a hierarchical abstention-aware loss, SOLA is shown to be relevant for fine-grained opinion mining and gives state-of-the-art results on this task. Moreover, the abstention-aware representations can be used to competitively predict user-review ratings based on a sentence-level opinion predictor. △ Less

Submitted 8 June, 2018; v1 submitted 22 March, 2018; originally announced March 2018.

Journal ref: Proceedings of Machine Learning Research 80 (2018) 1695-1703

arXiv:1605.02536 [pdf, other]

Random Fourier Features for Operator-Valued Kernels

Authors: Romain Brault, Florence d'Alché-Buc, Markus Heinonen

Abstract: Devoted to multi-task learning and structured output learning, operator-valued kernels provide a flexible tool to build vector-valued functions in the context of Reproducing Kernel Hilbert Spaces. To scale up these methods, we extend the celebrated Random Fourier Feature methodology to get an approximation of operator-valued kernels. We propose a general principle for Operator-valued Random Fourie… ▽ More Devoted to multi-task learning and structured output learning, operator-valued kernels provide a flexible tool to build vector-valued functions in the context of Reproducing Kernel Hilbert Spaces. To scale up these methods, we extend the celebrated Random Fourier Feature methodology to get an approximation of operator-valued kernels. We propose a general principle for Operator-valued Random Fourier Feature construction relying on a generalization of Bochner's theorem for translation-invariant operator-valued Mercer kernels. We prove the uniform convergence of the kernel approximation for bounded and unbounded operator random Fourier features using appropriate Bernstein matrix concentration inequality. An experimental proof-of-concept shows the quality of the approximation and the efficiency of the corresponding linear models on example datasets. △ Less

Submitted 24 May, 2016; v1 submitted 9 May, 2016; originally announced May 2016.

Comments: 32 pages, 6 figures

Report number: PMLR 63:110-125

Journal ref: ACML, Hamilton, New-Zealand, JMLR Workshop and Conference Proceedings, November 2016, vol. 63, pp. 110-125

arXiv:1411.5172 [pdf, other]

Learning nonparametric differential equations with operator-valued kernels and gradient matching

Authors: Markus Heinonen, Florence d'Alché-Buc

Abstract: Modeling dynamical systems with ordinary differential equations implies a mechanistic view of the process underlying the dynamics. However in many cases, this knowledge is not available. To overcome this issue, we introduce a general framework for nonparametric ODE models using penalized regression in Reproducing Kernel Hilbert Spaces (RKHS) based on operator-valued kernels. Moreover, we extend th… ▽ More Modeling dynamical systems with ordinary differential equations implies a mechanistic view of the process underlying the dynamics. However in many cases, this knowledge is not available. To overcome this issue, we introduce a general framework for nonparametric ODE models using penalized regression in Reproducing Kernel Hilbert Spaces (RKHS) based on operator-valued kernels. Moreover, we extend the scope of gradient matching approaches to nonparametric ODE. A smooth estimate of the solution ODE is built to provide an approximation of the derivative of the ODE solution which is in turn used to learn the nonparametric ODE model. This approach benefits from the flexibility of penalized regression in RKHS allowing for ridge or (structured) sparse regression as well. Very good results are shown on 3 different ODE systems. △ Less

Submitted 19 November, 2014; originally announced November 2014.

arXiv:1410.7566 [pdf, other]

doi 10.1080/01621459.2013.841583

Parametric Estimation of Ordinary Differential Equations with Orthogonality Conditions

Authors: Nicolas J-B Brunel, Quentin Clairon, Florence d'Alche-Buc

Abstract: Differential equations are commonly used to model dynamical deterministic systems in applications. When statistical parameter estimation is required to calibrate theoretical models to data, classical statistical estimators are often confronted to complex and potentially ill-posed optimization problem. As a consequence, alternative estimators to classical parametric estimators are needed for obtain… ▽ More Differential equations are commonly used to model dynamical deterministic systems in applications. When statistical parameter estimation is required to calibrate theoretical models to data, classical statistical estimators are often confronted to complex and potentially ill-posed optimization problem. As a consequence, alternative estimators to classical parametric estimators are needed for obtaining reliable estimates. We propose a gradient matching approach for the estimation of parametric Ordinary Differential Equations observed with noise. Starting from a nonparametric proxy of a true solution of the ODE, we build a parametric estimator based on a variational characterization of the solution. As a Generalized Moment Estimator, our estimator must satisfy a set of orthogonal conditions that are solved in the least squares sense. Despite the use of a nonparametric estimator, we prove the root-$n$ consistency and asymptotic normality of the Orthogonal Conditions estimator. We can derive confidence sets thanks to a closed-form expression for the asymptotic variance. Finally, the OC estimator is compared to classical estimators in several (simulated and real) experiments and ODE models in order to show its versatility and relevance with respect to classical Gradient Matching and Nonlinear Least Squares estimators. In particular, we show on a real dataset of influenza infection that the approach gives reliable estimates. Moreover, we show that our approach can deal directly with more elaborated models such as Delay Differential Equation (DDE). △ Less

Submitted 28 October, 2014; originally announced October 2014.

Comments: 35 pages, 5 figures

Journal ref: Brunel, N. JB, Clairon, Q., d Alche Buc, F. (2014). Parametric Estimation of Ordinary Differential Equations With Orthogonality Conditions. Journal of the American Statistical Association, 109(505), 173-185

Showing 1–26 of 26 results for author: d'Alché-Buc, F