Search | arXiv e-print repository

Leveraging PAC-Bayes Theory and Gibbs Distributions for Generalization Bounds with Complexity Measures

Authors: Paul Viallard, Rémi Emonet, Amaury Habrard, Emilie Morvant, Valentina Zantedeschi

Abstract: In statistical learning theory, a generalization bound usually involves a complexity measure imposed by the considered theoretical framework. This limits the scope of such bounds, as other forms of capacity measures or regularizations are used in algorithms. In this paper, we leverage the framework of disintegrated PAC-Bayes bounds to derive a general generalization bound instantiable with arbitra… ▽ More In statistical learning theory, a generalization bound usually involves a complexity measure imposed by the considered theoretical framework. This limits the scope of such bounds, as other forms of capacity measures or regularizations are used in algorithms. In this paper, we leverage the framework of disintegrated PAC-Bayes bounds to derive a general generalization bound instantiable with arbitrary complexity measures. One trick to prove such a result involves considering a commonly used family of distributions: the Gibbs distributions. Our bound stands in probability jointly over the hypothesis and the learning sample, which allows the complexity to be adapted to the generalization gap as it can be customized to fit both the hypothesis class and the task. △ Less

Submitted 19 February, 2024; originally announced February 2024.

Comments: AISTATS 2024

arXiv:2310.19936 [pdf, other]

doi 10.1109/WACV56688.2023.00016

Towards Few-Annotation Learning for Object Detection: Are Transformer-based Models More Efficient ?

Authors: Quentin Bouniot, Angélique Loesch, Romaric Audigier, Amaury Habrard

Abstract: For specialized and dense downstream tasks such as object detection, labeling data requires expertise and can be very expensive, making few-shot and semi-supervised models much more attractive alternatives. While in the few-shot setup we observe that transformer-based object detectors perform better than convolution-based two-stage models for a similar amount of parameters, they are not as effecti… ▽ More For specialized and dense downstream tasks such as object detection, labeling data requires expertise and can be very expensive, making few-shot and semi-supervised models much more attractive alternatives. While in the few-shot setup we observe that transformer-based object detectors perform better than convolution-based two-stage models for a similar amount of parameters, they are not as effective when used with recent approaches in the semi-supervised setting. In this paper, we propose a semi-supervised method tailored for the current state-of-the-art object detector Deformable DETR in the few-annotation learning setup using a student-teacher architecture, which avoids relying on a sensitive post-processing of the pseudo-labels generated by the teacher model. We evaluate our method on the semi-supervised object detection benchmarks COCO and Pascal VOC, and it outperforms previous methods, especially when annotations are scarce. We believe that our contributions open new possibilities to adapt similar object detection methods in this setup as well. △ Less

Submitted 30 October, 2023; originally announced October 2023.

Comments: Published at WACV 2023

arXiv:2310.16835 [pdf, other]

Proposal-Contrastive Pretraining for Object Detection from Fewer Data

Authors: Quentin Bouniot, Romaric Audigier, Angélique Loesch, Amaury Habrard

Abstract: The use of pretrained deep neural networks represents an attractive way to achieve strong results with few data available. When specialized in dense problems such as object detection, learning local rather than global information in images has proven to be more efficient. However, for unsupervised pretraining, the popular contrastive learning requires a large batch size and, therefore, a lot of re… ▽ More The use of pretrained deep neural networks represents an attractive way to achieve strong results with few data available. When specialized in dense problems such as object detection, learning local rather than global information in images has proven to be more efficient. However, for unsupervised pretraining, the popular contrastive learning requires a large batch size and, therefore, a lot of resources. To address this problem, we are interested in transformer-based object detectors that have recently gained traction in the community with good performance and with the particularity of generating many diverse object proposals. In this work, we present Proposal Selection Contrast (ProSeCo), a novel unsupervised overall pretraining approach that leverages this property. ProSeCo uses the large number of object proposals generated by the detector for contrastive learning, which allows the use of a smaller batch size, combined with object-level features to learn local information in the images. To improve the effectiveness of the contrastive loss, we introduce the object location information in the selection of positive examples to take into account multiple overlap** object proposals. When reusing pretrained backbone, we advocate for consistency in learning local information between the backbone and the detection head. We show that our method outperforms state of the art in unsupervised pretraining for object detection on standard and novel benchmarks in learning with fewer data. △ Less

Submitted 25 October, 2023; originally announced October 2023.

Comments: Published as a conference paper at ICLR 2023

arXiv:2209.12727 [pdf, other]

A Simple Way to Learn Metrics Between Attributed Graphs

Authors: Yacouba Kaloga, Pierre Borgnat, Amaury Habrard

Abstract: The choice of good distances and similarity measures between objects is important for many machine learning methods. Therefore, many metric learning algorithms have been developed in recent years, mainly for Euclidean data in order to improve performance of classification or clustering methods. However, due to difficulties in establishing computable, efficient and differentiable distances between… ▽ More The choice of good distances and similarity measures between objects is important for many machine learning methods. Therefore, many metric learning algorithms have been developed in recent years, mainly for Euclidean data in order to improve performance of classification or clustering methods. However, due to difficulties in establishing computable, efficient and differentiable distances between attributed graphs, few metric learning algorithms adapted to graphs have been developed despite the strong interest of the community. In this paper, we address this issue by proposing a new Simple Graph Metric Learning - SGML - model with few trainable parameters based on Simple Graph Convolutional Neural Networks - SGCN - and elements of Optimal Transport theory. This model allows us to build an appropriate distance from a database of labeled (attributed) graphs to improve the performance of simple classification algorithms such as $k$-NN. This distance can be quickly trained while maintaining good performances as illustrated by the experimental study presented in this paper. △ Less

Submitted 21 December, 2022; v1 submitted 26 September, 2022; originally announced September 2022.

arXiv:2106.12535 [pdf, other]

Learning Stochastic Majority Votes by Minimizing a PAC-Bayes Generalization Bound

Authors: Valentina Zantedeschi, Paul Viallard, Emilie Morvant, Rémi Emonet, Amaury Habrard, Pascal Germain, Benjamin Guedj

Abstract: We investigate a stochastic counterpart of majority votes over finite ensembles of classifiers, and study its generalization properties. While our approach holds for arbitrary distributions, we instantiate it with Dirichlet distributions: this allows for a closed-form and differentiable expression for the expected risk, which then turns the generalization bound into a tractable training objective.… ▽ More We investigate a stochastic counterpart of majority votes over finite ensembles of classifiers, and study its generalization properties. While our approach holds for arbitrary distributions, we instantiate it with Dirichlet distributions: this allows for a closed-form and differentiable expression for the expected risk, which then turns the generalization bound into a tractable training objective. The resulting stochastic majority vote learning algorithm achieves state-of-the-art accuracy and benefits from (non-vacuous) tight generalization bounds, in a series of numerical experiments when compared to competing algorithms which also minimize PAC-Bayes objectives -- both with uninformed (data-independent) and informed (data-dependent) priors. △ Less

Submitted 19 October, 2021; v1 submitted 23 June, 2021; originally announced June 2021.

Journal ref: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021)

arXiv:2104.13626 [pdf, other]

Self-Bounding Majority Vote Learning Algorithms by the Direct Minimization of a Tight PAC-Bayesian C-Bound

Authors: Paul Viallard, Pascal Germain, Amaury Habrard, Emilie Morvant

Abstract: In the PAC-Bayesian literature, the C-Bound refers to an insightful relation between the risk of a majority vote classifier (under the zero-one loss) and the first two moments of its margin (i.e., the expected margin and the voters' diversity). Until now, learning algorithms developed in this framework minimize the empirical version of the C-Bound, instead of explicit PAC-Bayesian generalization b… ▽ More In the PAC-Bayesian literature, the C-Bound refers to an insightful relation between the risk of a majority vote classifier (under the zero-one loss) and the first two moments of its margin (i.e., the expected margin and the voters' diversity). Until now, learning algorithms developed in this framework minimize the empirical version of the C-Bound, instead of explicit PAC-Bayesian generalization bounds. In this paper, by directly optimizing PAC-Bayesian guarantees on the C-Bound, we derive self-bounding majority vote learning algorithms. Moreover, our algorithms based on gradient descent are scalable and lead to accurate predictors paired with non-vacuous guarantees. △ Less

Submitted 31 August, 2021; v1 submitted 28 April, 2021; originally announced April 2021.

Journal ref: ECML PKDD 2021, Sep 2021, Bilbao, Spain

arXiv:2102.11069 [pdf, other]

A PAC-Bayes Analysis of Adversarial Robustness

Authors: Paul Viallard, Guillaume Vidot, Amaury Habrard, Emilie Morvant

Abstract: We propose the first general PAC-Bayesian generalization bounds for adversarial robustness, that estimate, at test time, how much a model will be invariant to imperceptible perturbations in the input. Instead of deriving a worst-case analysis of the risk of a hypothesis over all the possible perturbations, we leverage the PAC-Bayesian framework to bound the averaged risk on the perturbations for m… ▽ More We propose the first general PAC-Bayesian generalization bounds for adversarial robustness, that estimate, at test time, how much a model will be invariant to imperceptible perturbations in the input. Instead of deriving a worst-case analysis of the risk of a hypothesis over all the possible perturbations, we leverage the PAC-Bayesian framework to bound the averaged risk on the perturbations for majority votes (over the whole class of hypotheses). Our theoretically founded analysis has the advantage to provide general bounds (i) that are valid for any kind of attacks (i.e., the adversarial attacks), (ii) that are tight thanks to the PAC-Bayesian framework, (iii) that can be directly minimized during the learning phase to obtain a robust model on different attacks at test time. △ Less

Submitted 27 October, 2021; v1 submitted 19 February, 2021; originally announced February 2021.

Journal ref: NeurIPS 2021, Dec 2021, Sydney, Australia

arXiv:2102.08649 [pdf, other]

A General Framework for the Practical Disintegration of PAC-Bayesian Bounds

Authors: Paul Viallard, Pascal Germain, Amaury Habrard, Emilie Morvant

Abstract: PAC-Bayesian bounds are known to be tight and informative when studying the generalization ability of randomized classifiers. However, they require a loose and costly derandomization step when applied to some families of deterministic models such as neural networks. As an alternative to this step, we introduce new PAC-Bayesian generalization bounds that have the originality to provide disintegrate… ▽ More PAC-Bayesian bounds are known to be tight and informative when studying the generalization ability of randomized classifiers. However, they require a loose and costly derandomization step when applied to some families of deterministic models such as neural networks. As an alternative to this step, we introduce new PAC-Bayesian generalization bounds that have the originality to provide disintegrated bounds, i.e., they give guarantees over one single hypothesis instead of the usual averaged analysis. Our bounds are easily optimizable and can be used to design learning algorithms. We illustrate this behavior on neural networks, and we show a significant practical improvement over the state-of-the-art framework. △ Less

Submitted 18 September, 2023; v1 submitted 17 February, 2021; originally announced February 2021.

Comments: Machine Learning, In press

arXiv:2010.16132 [pdf, other]

Multiview Variational Graph Autoencoders for Canonical Correlation Analysis

Authors: Yacouba Kaloga, Pierre Borgnat, Sundeep Prabhakar Chepuri, Patrice Abry, Amaury Habrard

Abstract: We present a novel multiview canonical correlation analysis model based on a variational approach. This is the first nonlinear model that takes into account the available graph-based geometric constraints while being scalable for processing large scale datasets with multiple views. It is based on an autoencoder architecture with graph convolutional neural network layers. We experiment with our app… ▽ More We present a novel multiview canonical correlation analysis model based on a variational approach. This is the first nonlinear model that takes into account the available graph-based geometric constraints while being scalable for processing large scale datasets with multiple views. It is based on an autoencoder architecture with graph convolutional neural network layers. We experiment with our approach on classification, clustering, and recommendation tasks on real datasets. The algorithm is competitive with state-of-the-art multiview representation learning techniques. △ Less

Submitted 4 October, 2021; v1 submitted 30 October, 2020; originally announced October 2020.

Comments: 4 pages, 3 figures, submitted

arXiv:2010.01992 [pdf, other]

Improving Few-Shot Learning through Multi-task Representation Learning Theory

Authors: Quentin Bouniot, Ievgen Redko, Romaric Audigier, Angélique Loesch, Amaury Habrard

Abstract: In this paper, we consider the framework of multi-task representation (MTR) learning where the goal is to use source tasks to learn a representation that reduces the sample complexity of solving a target task. We start by reviewing recent advances in MTR theory and show that they can provide novel insights for popular meta-learning algorithms when analyzed within this framework. In particular, we… ▽ More In this paper, we consider the framework of multi-task representation (MTR) learning where the goal is to use source tasks to learn a representation that reduces the sample complexity of solving a target task. We start by reviewing recent advances in MTR theory and show that they can provide novel insights for popular meta-learning algorithms when analyzed within this framework. In particular, we highlight a fundamental difference between gradient-based and metric-based algorithms in practice and put forward a theoretical analysis to explain it. Finally, we use the derived insights to improve the performance of meta-learning methods via a new spectral-based regularization term and confirm its efficiency through experimental studies on few-shot classification benchmarks. To the best of our knowledge, this is the first contribution that puts the most recent learning bounds of MTR theory into practice for the task of few-shot classification. △ Less

Submitted 2 August, 2022; v1 submitted 5 October, 2020; originally announced October 2020.

Comments: Accepted at ECCV2022. Code: https://github.com/CEA-LIST/MetaMTReg

arXiv:2007.03373 [pdf, other]

Hierarchical and Unsupervised Graph Representation Learning with Loukas's Coarsening

Authors: Louis Béthune, Yacouba Kaloga, Pierre Borgnat, Aurélien Garivier, Amaury Habrard

Abstract: We propose a novel algorithm for unsupervised graph representation learning with attributed graphs. It combines three advantages addressing some current limitations of the literature: i) The model is inductive: it can embed new graphs without re-training in the presence of new data; ii) The method takes into account both micro-structures and macro-structures by looking at the attributed graphs at… ▽ More We propose a novel algorithm for unsupervised graph representation learning with attributed graphs. It combines three advantages addressing some current limitations of the literature: i) The model is inductive: it can embed new graphs without re-training in the presence of new data; ii) The method takes into account both micro-structures and macro-structures by looking at the attributed graphs at different scales; iii) The model is end-to-end differentiable: it is a building block that can be plugged into deep learning pipelines and allows for back-propagation. We show that combining a coarsening method having strong theoretical guarantees with mutual information maximization suffices to produce high quality embeddings. We evaluate them on classification tasks with common benchmarks of the literature. We show that our algorithm is competitive with state of the art among unsupervised graph representation learning methods. △ Less

Submitted 17 August, 2020; v1 submitted 7 July, 2020; originally announced July 2020.

Comments: 19 pages, 15 figures, submitted

arXiv:2004.11829 [pdf, other]

A survey on domain adaptation theory: learning bounds and theoretical guarantees

Authors: Ievgen Redko, Emilie Morvant, Amaury Habrard, Marc Sebban, Younès Bennani

Abstract: All famous machine learning algorithms that comprise both supervised and semi-supervised learning work well only under a common assumption: the training and test data follow the same distribution. When the distribution changes, most statistical models must be reconstructed from newly collected data, which for some applications can be costly or impossible to obtain. Therefore, it has become necessa… ▽ More All famous machine learning algorithms that comprise both supervised and semi-supervised learning work well only under a common assumption: the training and test data follow the same distribution. When the distribution changes, most statistical models must be reconstructed from newly collected data, which for some applications can be costly or impossible to obtain. Therefore, it has become necessary to develop approaches that reduce the need and the effort to obtain new labeled samples by exploiting data that are available in related areas, and using these further across similar fields. This has given rise to a new machine learning framework known as transfer learning: a learning setting inspired by the capability of a human being to extrapolate knowledge across tasks to learn more efficiently. Despite a large amount of different transfer learning scenarios, the main objective of this survey is to provide an overview of the state-of-the-art theoretical results in a specific, and arguably the most popular, sub-field of transfer learning, called domain adaptation. In this sub-field, the data distribution is assumed to change across the training and the test data, while the learning task remains the same. We provide a first up-to-date description of existing results related to domain adaptation problem that cover learning bounds based on different statistical learning frameworks. △ Less

Submitted 13 July, 2022; v1 submitted 24 April, 2020; originally announced April 2020.

arXiv:1909.01651 [pdf, other]

Metric Learning from Imbalanced Data

Authors: Léo Gautheron, Emilie Morvant, Amaury Habrard, Marc Sebban

Abstract: A key element of any machine learning algorithm is the use of a function that measures the dis/similarity between data points. Given a task, such a function can be optimized with a metric learning algorithm. Although this research field has received a lot of attention during the past decade, very few approaches have focused on learning a metric in an imbalanced scenario where the number of positiv… ▽ More A key element of any machine learning algorithm is the use of a function that measures the dis/similarity between data points. Given a task, such a function can be optimized with a metric learning algorithm. Although this research field has received a lot of attention during the past decade, very few approaches have focused on learning a metric in an imbalanced scenario where the number of positive examples is much smaller than the negatives. Here, we address this challenging task by designing a new Mahalanobis metric learning algorithm (IML) which deals with class imbalance. The empirical study performed shows the efficiency of IML. △ Less

Submitted 4 September, 2019; originally announced September 2019.

arXiv:1909.00693 [pdf, other]

An Adjusted Nearest Neighbor Algorithm Maximizing the F-Measure from Imbalanced Data

Authors: Rémi Viola, Rémi Emonet, Amaury Habrard, Guillaume Metzler, Sébastien Riou, Marc Sebban

Abstract: In this paper, we address the challenging problem of learning from imbalanced data using a Nearest-Neighbor (NN) algorithm. In this setting, the minority examples typically belong to the class of interest requiring the optimization of specific criteria, like the F-Measure. Based on simple geometrical ideas, we introduce an algorithm that reweights the distance between a query sample and any positi… ▽ More In this paper, we address the challenging problem of learning from imbalanced data using a Nearest-Neighbor (NN) algorithm. In this setting, the minority examples typically belong to the class of interest requiring the optimization of specific criteria, like the F-Measure. Based on simple geometrical ideas, we introduce an algorithm that reweights the distance between a query sample and any positive training example. This leads to a modification of the Voronoi regions and thus of the decision boundaries of the NN algorithm. We provide a theoretical justification about the weighting scheme needed to reduce the False Negative rate while controlling the number of False Positives. We perform an extensive experimental study on many public imbalanced datasets, but also on large scale non public data from the French Ministry of Economy and Finance on a tax fraud detection task, showing that our method is very effective and, interestingly, yields the best performance when combined with state of the art sampling methods. △ Less

Submitted 22 January, 2020; v1 submitted 2 September, 2019; originally announced September 2019.

Comments: In Proceedings of the 31 International Conference on Tools with Artificial Intelligence (ICTAI 2019)

arXiv:1906.06203 [pdf, other]

Learning Landmark-Based Ensembles with Random Fourier Features and Gradient Boosting

Authors: Léo Gautheron, Pascal Germain, Amaury Habrard, Emilie Morvant, Marc Sebban, Valentina Zantedeschi

Abstract: We propose a Gradient Boosting algorithm for learning an ensemble of kernel functions adapted to the task at hand. Unlike state-of-the-art Multiple Kernel Learning techniques that make use of a pre-computed dictionary of kernel functions to select from, at each iteration we fit a kernel by approximating it as a weighted sum of Random Fourier Features (RFF) and by optimizing their barycenter. This… ▽ More We propose a Gradient Boosting algorithm for learning an ensemble of kernel functions adapted to the task at hand. Unlike state-of-the-art Multiple Kernel Learning techniques that make use of a pre-computed dictionary of kernel functions to select from, at each iteration we fit a kernel by approximating it as a weighted sum of Random Fourier Features (RFF) and by optimizing their barycenter. This allows us to obtain a more versatile method, easier to setup and likely to have better performance. Our study builds on a recent result showing one can learn a kernel from RFF by computing the minimum of a PAC-Bayesian bound on the kernel alignment generalization loss, which is obtained efficiently from a closed-form solution. We conduct an experimental analysis to highlight the advantages of our method w.r.t. both Boosting-based and kernel-learning state-of-the-art methods. △ Less

Submitted 14 June, 2019; originally announced June 2019.

arXiv:1803.09065 [pdf, other]

doi 10.1609/aaai.v33i01.33017104

Near-lossless Binarization of Word Embeddings

Authors: Julien Tissier, Christophe Gravier, Amaury Habrard

Abstract: Word embeddings are commonly used as a starting point in many NLP models to achieve state-of-the-art performances. However, with a large vocabulary and many dimensions, these floating-point representations are expensive both in terms of memory and calculations which makes them unsuitable for use on low-resource devices. The method proposed in this paper transforms real-valued embeddings into binar… ▽ More Word embeddings are commonly used as a starting point in many NLP models to achieve state-of-the-art performances. However, with a large vocabulary and many dimensions, these floating-point representations are expensive both in terms of memory and calculations which makes them unsuitable for use on low-resource devices. The method proposed in this paper transforms real-valued embeddings into binary embeddings while preserving semantic information, requiring only 128 or 256 bits for each vector. This leads to a small memory footprint and fast vector operations. The model is based on an autoencoder architecture, which also allows to reconstruct original vectors from the binary ones. Experimental results on semantic similarity, text classification and sentiment analysis tasks show that the binarization of word embeddings only leads to a loss of ~2% in accuracy while vector size is reduced by 97%. Furthermore, a top-k benchmark demonstrates that using these binary vectors is 30 times faster than using real-valued vectors. △ Less

Submitted 15 November, 2018; v1 submitted 24 March, 2018; originally announced March 2018.

Comments: Accepted as a long paper at AAAI 2019

arXiv:1705.08848 [pdf, other]

Joint Distribution Optimal Transportation for Domain Adaptation

Authors: Nicolas Courty, Rémi Flamary, Amaury Habrard, Alain Rakotomamonjy

Abstract: This paper deals with the unsupervised domain adaptation problem, where one wants to estimate a prediction function $f$ in a given target domain without any labeled sample by exploiting the knowledge available from a source domain where labels are known. Our work makes the following assumption: there exists a non-linear transformation between the joint feature/label space distributions of the two… ▽ More This paper deals with the unsupervised domain adaptation problem, where one wants to estimate a prediction function $f$ in a given target domain without any labeled sample by exploiting the knowledge available from a source domain where labels are known. Our work makes the following assumption: there exists a non-linear transformation between the joint feature/label space distributions of the two domain $\mathcal{P}_s$ and $\mathcal{P}_t$. We propose a solution of this problem with optimal transport, that allows to recover an estimated target $\mathcal{P}^f_t=(X,f(X))$ by optimizing simultaneously the optimal coupling and $f$. We show that our method corresponds to the minimization of a bound on the target error, and provide an efficient algorithmic solution, for which convergence is proved. The versatility of our approach, both in terms of class of hypothesis or loss functions is demonstrated with real world classification and regression problems, for which we reach or surpass state-of-the-art results. △ Less

Submitted 22 October, 2017; v1 submitted 24 May, 2017; originally announced May 2017.

Comments: Accepted for publication at NIPS 2017

arXiv:1610.04783 [pdf, other]

Similarity Learning for Time Series Classification

Authors: Maria-Irina Nicolae, Éric Gaussier, Amaury Habrard, Marc Sebban

Abstract: Multivariate time series naturally exist in many fields, like energy, bioinformatics, signal processing, and finance. Most of these applications need to be able to compare these structured data. In this context, dynamic time war** (DTW) is probably the most common comparison measure. However, not much research effort has been put into improving it by learning. In this paper, we propose a novel m… ▽ More Multivariate time series naturally exist in many fields, like energy, bioinformatics, signal processing, and finance. Most of these applications need to be able to compare these structured data. In this context, dynamic time war** (DTW) is probably the most common comparison measure. However, not much research effort has been put into improving it by learning. In this paper, we propose a novel method for learning similarities based on DTW, in order to improve time series classification. Making use of the uniform stability framework, we provide the first theoretical guarantees in the form of a generalization bound for linear classification. The experimental study shows that the proposed approach is efficient, while yielding sparse classifiers. △ Less

Submitted 15 October, 2016; originally announced October 2016.

Comments: Techreport

arXiv:1610.04420 [pdf, ps, other]

Theoretical Analysis of Domain Adaptation with Optimal Transport

Authors: Ievgen Redko, Amaury Habrard, Marc Sebban

Abstract: Domain adaptation (DA) is an important and emerging field of machine learning that tackles the problem occurring when the distributions of training (source domain) and test (target domain) data are similar but different. Current theoretical results show that the efficiency of DA algorithms depends on their capacity of minimizing the divergence between source and target probability distributions. I… ▽ More Domain adaptation (DA) is an important and emerging field of machine learning that tackles the problem occurring when the distributions of training (source domain) and test (target domain) data are similar but different. Current theoretical results show that the efficiency of DA algorithms depends on their capacity of minimizing the divergence between source and target probability distributions. In this paper, we provide a theoretical study on the advantages that concepts borrowed from optimal transportation theory can bring to DA. In particular, we show that the Wasserstein metric can be used as a divergence measure between distributions to obtain generalization guarantees for three different learning settings: (i) classic DA with unsupervised target data (ii) DA combining source and target labeled data, (iii) multiple source DA. Based on the obtained results, we provide some insights showing when this analysis can be tighter than other existing frameworks. △ Less

Submitted 28 July, 2017; v1 submitted 14 October, 2016; originally announced October 2016.

arXiv:1506.04573 [pdf, other]

A New PAC-Bayesian Perspective on Domain Adaptation

Authors: Pascal Germain, Amaury Habrard, François Laviolette, Emilie Morvant

Abstract: We study the issue of PAC-Bayesian domain adaptation: We want to learn, from a source domain, a majority vote model dedicated to a target one. Our theoretical contribution brings a new perspective by deriving an upper-bound on the target risk where the distributions' divergence---expressed as a ratio---controls the trade-off between a source error measure and the target voters' disagreement. Our b… ▽ More We study the issue of PAC-Bayesian domain adaptation: We want to learn, from a source domain, a majority vote model dedicated to a target one. Our theoretical contribution brings a new perspective by deriving an upper-bound on the target risk where the distributions' divergence---expressed as a ratio---controls the trade-off between a source error measure and the target voters' disagreement. Our bound suggests that one has to focus on regions where the source data is informative.From this result, we derive a PAC-Bayesian generalization bound, and specialize it to linear classifiers. Then, we infer a learning algorithmand perform experiments on real data. △ Less

Submitted 26 July, 2016; v1 submitted 15 June, 2015; originally announced June 2015.

Comments: Published at ICML 2016

arXiv:1503.06944 [pdf, other]

PAC-Bayesian Theorems for Domain Adaptation with Specialization to Linear Classifiers

Authors: Pascal Germain, Amaury Habrard, François Laviolette, Emilie Morvant

Abstract: In this paper, we provide two main contributions in PAC-Bayesian theory for domain adaptation where the objective is to learn, from a source distribution, a well-performing majority vote on a different target distribution. On the one hand, we propose an improvement of the previous approach proposed by Germain et al. (2013), that relies on a novel distribution pseudodistance based on a disagreement… ▽ More In this paper, we provide two main contributions in PAC-Bayesian theory for domain adaptation where the objective is to learn, from a source distribution, a well-performing majority vote on a different target distribution. On the one hand, we propose an improvement of the previous approach proposed by Germain et al. (2013), that relies on a novel distribution pseudodistance based on a disagreement averaging, allowing us to derive a new tighter PAC-Bayesian domain adaptation bound for the stochastic Gibbs classifier. We specialize it to linear classifiers, and design a learning algorithm which shows interesting results on a synthetic problem and on a popular sentiment annotation task. On the other hand, we generalize these results to multisource domain adaptation allowing us to take into account different source domains. This study opens the door to tackle domain adaptation tasks by making use of all the PAC-Bayesian tools. △ Less

Submitted 9 August, 2016; v1 submitted 24 March, 2015; originally announced March 2015.

Comments: This report is a long version of our paper entitled A PAC-Bayesian Approach for Domain Adaptation with Specialization to Linear Classifiers published in the proceedings of the International Conference on Machine Learning (ICML) 2013. We improved our main results, extended our experiments, and proposed an extension to multisource domain adaptation

arXiv:1501.03002 [pdf, ps, other]

An Improvement to the Domain Adaptation Bound in a PAC-Bayesian context

Authors: Pascal Germain, Amaury Habrard, Francois Laviolette, Emilie Morvant

Abstract: This paper provides a theoretical analysis of domain adaptation based on the PAC-Bayesian theory. We propose an improvement of the previous domain adaptation bound obtained by Germain et al. in two ways. We first give another generalization bound tighter and easier to interpret. Moreover, we provide a new analysis of the constant term appearing in the bound that can be of high interest for develop… ▽ More This paper provides a theoretical analysis of domain adaptation based on the PAC-Bayesian theory. We propose an improvement of the previous domain adaptation bound obtained by Germain et al. in two ways. We first give another generalization bound tighter and easier to interpret. Moreover, we provide a new analysis of the constant term appearing in the bound that can be of high interest for develo** new algorithmic solutions. △ Less

Submitted 13 January, 2015; originally announced January 2015.

Comments: NIPS 2014 Workshop on Transfer and Multi-task learning: Theory Meets Practice, Dec 2014, Montr{é}al, Canada

arXiv:1412.6452 [pdf, ps, other]

Algorithmic Robustness for Learning via $(ε, γ, τ)$-Good Similarity Functions

Authors: Maria-Irina Nicolae, Marc Sebban, Amaury Habrard, Éric Gaussier, Massih-Reza Amini

Abstract: The notion of metric plays a key role in machine learning problems such as classification, clustering or ranking. However, it is worth noting that there is a severe lack of theoretical guarantees that can be expected on the generalization capacity of the classifier associated to a given metric. The theoretical framework of $(ε, γ, τ)$-good similarity functions (Balcan et al., 2008) has been one of… ▽ More The notion of metric plays a key role in machine learning problems such as classification, clustering or ranking. However, it is worth noting that there is a severe lack of theoretical guarantees that can be expected on the generalization capacity of the classifier associated to a given metric. The theoretical framework of $(ε, γ, τ)$-good similarity functions (Balcan et al., 2008) has been one of the first attempts to draw a link between the properties of a similarity function and those of a linear classifier making use of it. In this paper, we extend and complete this theory by providing a new generalization bound for the associated classifier based on the algorithmic robustness framework. △ Less

Submitted 31 March, 2015; v1 submitted 19 December, 2014; originally announced December 2014.

Comments: ICLR 2015 Workshop - accepted

arXiv:1409.5241 [pdf, other]

Subspace Alignment For Domain Adaptation

Authors: Basura Fernando, Amaury Habrard, Marc Sebban, Tinne Tuytelaars

Abstract: In this paper, we introduce a new domain adaptation (DA) algorithm where the source and target domains are represented by subspaces spanned by eigenvectors. Our method seeks a domain invariant feature space by learning a map** function which aligns the source subspace with the target one. We show that the solution of the corresponding optimization problem can be obtained in a simple closed form,… ▽ More In this paper, we introduce a new domain adaptation (DA) algorithm where the source and target domains are represented by subspaces spanned by eigenvectors. Our method seeks a domain invariant feature space by learning a map** function which aligns the source subspace with the target one. We show that the solution of the corresponding optimization problem can be obtained in a simple closed form, leading to an extremely fast algorithm. We present two approaches to determine the only hyper-parameter in our method corresponding to the size of the subspaces. In the first approach we tune the size of subspaces using a theoretical bound on the stability of the obtained result. In the second approach, we use maximum likelihood estimation to determine the subspace size, which is particularly useful for high dimensional data. Apart from PCA, we propose a subspace creation method that outperform partial least squares (PLS) and linear discriminant analysis (LDA) in domain adaptation. We test our method on various datasets and show that, despite its intrinsic simplicity, it outperforms state of the art DA methods. △ Less

Submitted 23 October, 2014; v1 submitted 18 September, 2014; originally announced September 2014.

arXiv:1404.7796 [pdf, other]

Majority Vote of Diverse Classifiers for Late Fusion

Authors: Emilie Morvant, Amaury Habrard, Stéphane Ayache

Abstract: In the past few years, a lot of attention has been devoted to multimedia indexing by fusing multimodal informations. Two kinds of fusion schemes are generally considered: The early fusion and the late fusion. We focus on late classifier fusion, where one combines the scores of each modality at the decision level. To tackle this problem, we investigate a recent and elegant well-founded quadratic pr… ▽ More In the past few years, a lot of attention has been devoted to multimedia indexing by fusing multimodal informations. Two kinds of fusion schemes are generally considered: The early fusion and the late fusion. We focus on late classifier fusion, where one combines the scores of each modality at the decision level. To tackle this problem, we investigate a recent and elegant well-founded quadratic program named MinCq coming from the machine learning PAC-Bayesian theory. MinCq looks for the weighted combination, over a set of real-valued functions seen as voters, leading to the lowest misclassification rate, while maximizing the voters' diversity. We propose an extension of MinCq tailored to multimedia indexing. Our method is based on an order-preserving pairwise loss adapted to ranking that allows us to improve Mean Averaged Precision measure while taking into account the diversity of the voters that we want to fuse. We provide evidence that this method is naturally adapted to late fusion procedures and confirm the good behavior of our approach on the challenging PASCAL VOC'07 benchmark. △ Less

Submitted 19 June, 2014; v1 submitted 30 April, 2014; originally announced April 2014.

Comments: IAPR Joint International Workshops on Statistical Techniques in Pattern Recognition and Structural and Syntactic Pattern Recignition, Joensuu : Finland (2014)

arXiv:1312.6282 [pdf, other]

Dimension-free Concentration Bounds on Hankel Matrices for Spectral Learning

Authors: François Denis, Mattias Gybels, Amaury Habrard

Abstract: Learning probabilistic models over strings is an important issue for many applications. Spectral methods propose elegant solutions to the problem of inferring weighted automata from finite samples of variable-length strings drawn from an unknown target distribution. These methods rely on a singular value decomposition of a matrix $H_S$, called the Hankel matrix, that records the frequencies of (so… ▽ More Learning probabilistic models over strings is an important issue for many applications. Spectral methods propose elegant solutions to the problem of inferring weighted automata from finite samples of variable-length strings drawn from an unknown target distribution. These methods rely on a singular value decomposition of a matrix $H_S$, called the Hankel matrix, that records the frequencies of (some of) the observed strings. The accuracy of the learned distribution depends both on the quantity of information embedded in $H_S$ and on the distance between $H_S$ and its mean $H_r$. Existing concentration bounds seem to indicate that the concentration over $H_r$ gets looser with the size of $H_r$, suggesting to make a trade-off between the quantity of used information and the size of $H_r$. We propose new dimension-free concentration bounds for several variants of Hankel matrices. Experiments demonstrate that these bounds are tight and that they significantly improve existing bounds. These results suggest that the concentration rate of the Hankel matrix around its mean does not constitute an argument for limiting its size. △ Less

Submitted 21 December, 2013; originally announced December 2013.

Comments: Extended version of a paper to appear at ICML 2014

arXiv:1306.6709 [pdf, ps, other]

A Survey on Metric Learning for Feature Vectors and Structured Data

Authors: Aurélien Bellet, Amaury Habrard, Marc Sebban

Abstract: The need for appropriate ways to measure the distance or similarity between data is ubiquitous in machine learning, pattern recognition and data mining, but handcrafting such good metrics for specific problems is generally difficult. This has led to the emergence of metric learning, which aims at automatically learning a metric from data and has attracted a lot of interest in machine learning and… ▽ More The need for appropriate ways to measure the distance or similarity between data is ubiquitous in machine learning, pattern recognition and data mining, but handcrafting such good metrics for specific problems is generally difficult. This has led to the emergence of metric learning, which aims at automatically learning a metric from data and has attracted a lot of interest in machine learning and related fields for the past ten years. This survey paper proposes a systematic review of the metric learning literature, highlighting the pros and cons of each approach. We pay particular attention to Mahalanobis distance metric learning, a well-studied and successful framework, but additionally present a wide range of methods that have recently emerged as powerful alternatives, including nonlinear metric learning, similarity learning and local metric learning. Recent trends and extensions, such as semi-supervised metric learning, metric learning for histogram data and the derivation of generalization guarantees, are also covered. Finally, this survey addresses metric learning for structured data, in particular edit distance learning, and attempts to give an overview of the remaining challenges in metric learning for the years to come. △ Less

Submitted 12 February, 2014; v1 submitted 27 June, 2013; originally announced June 2013.

Comments: Technical report, 59 pages. Changes in v2: fixed typos and improved presentation. Changes in v3: fixed typos. Changes in v4: fixed typos and new methods

arXiv:1212.2340 [pdf, other]

PAC-Bayesian Learning and Domain Adaptation

Authors: Pascal Germain, Amaury Habrard, François Laviolette, Emilie Morvant

Abstract: In machine learning, Domain Adaptation (DA) arises when the distribution gen- erating the test (target) data differs from the one generating the learning (source) data. It is well known that DA is an hard task even under strong assumptions, among which the covariate-shift where the source and target distributions diverge only in their marginals, i.e. they have the same labeling function. Another p… ▽ More In machine learning, Domain Adaptation (DA) arises when the distribution gen- erating the test (target) data differs from the one generating the learning (source) data. It is well known that DA is an hard task even under strong assumptions, among which the covariate-shift where the source and target distributions diverge only in their marginals, i.e. they have the same labeling function. Another popular approach is to consider an hypothesis class that moves closer the two distributions while implying a low-error for both tasks. This is a VC-dim approach that restricts the complexity of an hypothesis class in order to get good generalization. Instead, we propose a PAC-Bayesian approach that seeks for suitable weights to be given to each hypothesis in order to build a majority vote. We prove a new DA bound in the PAC-Bayesian context. This leads us to design the first DA-PAC-Bayesian algorithm based on the minimization of the proposed bound. Doing so, we seek for a ρ-weighted majority vote that takes into account a trade-off between three quantities. The first two quantities being, as usual in the PAC-Bayesian approach, (a) the complexity of the majority vote (measured by a Kullback-Leibler divergence) and (b) its empirical risk (measured by the ρ-average errors on the source sample). The third quantity is (c) the capacity of the majority vote to distinguish some structural difference between the source and target samples. △ Less

Submitted 11 December, 2012; originally announced December 2012.

Comments: https://sites.google.com/site/multitradeoffs2012/

Journal ref: Multi-Trade-offs in Machine Learning, NIPS 2012 Workshop, Lake Tahoe : United States (2012)

arXiv:1209.1086 [pdf, ps, other]

doi 10.1016/j.neucom.2014.09.044

Robustness and Generalization for Metric Learning

Authors: Aurélien Bellet, Amaury Habrard

Abstract: Metric learning has attracted a lot of interest over the last decade, but the generalization ability of such methods has not been thoroughly studied. In this paper, we introduce an adaptation of the notion of algorithmic robustness (previously introduced by Xu and Mannor) that can be used to derive generalization bounds for metric learning. We further show that a weak notion of robustness is in fa… ▽ More Metric learning has attracted a lot of interest over the last decade, but the generalization ability of such methods has not been thoroughly studied. In this paper, we introduce an adaptation of the notion of algorithmic robustness (previously introduced by Xu and Mannor) that can be used to derive generalization bounds for metric learning. We further show that a weak notion of robustness is in fact a necessary and sufficient condition for a metric learning algorithm to generalize. To illustrate the applicability of the proposed framework, we derive generalization results for a large family of existing metric learning algorithms, including some sparse formulations that are not covered by previous results. △ Less

Submitted 29 September, 2014; v1 submitted 5 September, 2012; originally announced September 2012.

Comments: 16 pages, to appear in Neurocomputing

Journal ref: Neurocomputing,151(1):259-267, 2015

arXiv:1207.1019 [pdf, ps, other]

PAC-Bayesian Majority Vote for Late Classifier Fusion

Authors: Emilie Morvant, Amaury Habrard, Stéphane Ayache

Abstract: A lot of attention has been devoted to multimedia indexing over the past few years. In the literature, we often consider two kinds of fusion schemes: The early fusion and the late fusion. In this paper we focus on late classifier fusion, where one combines the scores of each modality at the decision level. To tackle this problem, we investigate a recent and elegant well-founded quadratic program n… ▽ More A lot of attention has been devoted to multimedia indexing over the past few years. In the literature, we often consider two kinds of fusion schemes: The early fusion and the late fusion. In this paper we focus on late classifier fusion, where one combines the scores of each modality at the decision level. To tackle this problem, we investigate a recent and elegant well-founded quadratic program named MinCq coming from the Machine Learning PAC-Bayes theory. MinCq looks for the weighted combination, over a set of real-valued functions seen as voters, leading to the lowest misclassification rate, while making use of the voters' diversity. We provide evidence that this method is naturally adapted to late fusion procedure. We propose an extension of MinCq by adding an order- preserving pairwise loss for ranking, hel** to improve Mean Averaged Precision measure. We confirm the good behavior of the MinCq-based fusion approaches with experiments on a real image benchmark. △ Less

Submitted 4 July, 2012; originally announced July 2012.

Comments: 7 pages, Research report

arXiv:1206.6476 [pdf]

Similarity Learning for Provably Accurate Sparse Linear Classification

Authors: Aurelien Bellet, Amaury Habrard, Marc Sebban

Abstract: In recent years, the crucial importance of metrics in machine learning algorithms has led to an increasing interest for optimizing distance and similarity functions. Most of the state of the art focus on learning Mahalanobis distances (requiring to fulfill a constraint of positive semi-definiteness) for use in a local k-NN algorithm. However, no theoretical link is established between the learned… ▽ More In recent years, the crucial importance of metrics in machine learning algorithms has led to an increasing interest for optimizing distance and similarity functions. Most of the state of the art focus on learning Mahalanobis distances (requiring to fulfill a constraint of positive semi-definiteness) for use in a local k-NN algorithm. However, no theoretical link is established between the learned metrics and their performance in classification. In this paper, we make use of the formal framework of good similarities introduced by Balcan et al. to design an algorithm for learning a non PSD linear similarity optimized in a nonlinear feature space, which is then used to build a global linear classifier. We show that our approach has uniform stability and derive a generalization bound on the classification error. Experiments performed on various datasets confirm the effectiveness of our approach compared to state-of-the-art methods and provide evidence that (i) it is fast, (ii) robust to overfitting and (iii) produces very sparse classifiers. △ Less

Submitted 27 June, 2012; originally announced June 2012.

Comments: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)

arXiv:0807.2983 [pdf, ps, other]

On Probability Distributions for Trees: Representations, Inference and Learning

Authors: François Denis, Amaury Habrard, Rémi Gilleron, Marc Tommasi, Édouard Gilbert

Abstract: We study probability distributions over free algebras of trees. Probability distributions can be seen as particular (formal power) tree series [Berstel et al 82, Esik et al 03], i.e. map**s from trees to a semiring K . A widely studied class of tree series is the class of rational (or recognizable) tree series which can be defined either in an algebraic way or by means of multiplicity tree aut… ▽ More We study probability distributions over free algebras of trees. Probability distributions can be seen as particular (formal power) tree series [Berstel et al 82, Esik et al 03], i.e. map**s from trees to a semiring K . A widely studied class of tree series is the class of rational (or recognizable) tree series which can be defined either in an algebraic way or by means of multiplicity tree automata. We argue that the algebraic representation is very convenient to model probability distributions over a free algebra of trees. First, as in the string case, the algebraic representation allows to design learning algorithms for the whole class of probability distributions defined by rational tree series. Note that learning algorithms for rational tree series correspond to learning algorithms for weighted tree automata where both the structure and the weights are learned. Second, the algebraic representation can be easily extended to deal with unranked trees (like XML trees where a symbol may have an unbounded number of children). Both properties are particularly relevant for applications: nondeterministic automata are required for the inference problem to be relevant (recall that Hidden Markov Models are equivalent to nondeterministic string automata); nowadays applications for Web Information Extraction, Web Services and document processing consider unranked trees. △ Less

Submitted 18 July, 2008; originally announced July 2008.

Journal ref: Dans NIPS Workshop on Representations and Inference on Probability Distributions (2007)

arXiv:cs/0607085 [pdf, ps, other]

Using Pseudo-Stochastic Rational Languages in Probabilistic Grammatical Inference

Authors: Amaury Habrard, Francois Denis, Yann Esposito

Abstract: In probabilistic grammatical inference, a usual goal is to infer a good approximation of an unknown distribution P called a stochastic language. The estimate of P stands in some class of probabilistic models such as probabilistic automata (PA). In this paper, we focus on probabilistic models based on multiplicity automata (MA). The stochastic languages generated by MA are called rational stochas… ▽ More In probabilistic grammatical inference, a usual goal is to infer a good approximation of an unknown distribution P called a stochastic language. The estimate of P stands in some class of probabilistic models such as probabilistic automata (PA). In this paper, we focus on probabilistic models based on multiplicity automata (MA). The stochastic languages generated by MA are called rational stochastic languages; they strictly include stochastic languages generated by PA; they also admit a very concise canonical representation. Despite the fact that this class is not recursively enumerable, it is efficiently identifiable in the limit by using the algorithm DEES, introduced by the authors in a previous paper. However, the identification is not proper and before the convergence of the algorithm, DEES can produce MA that do not define stochastic languages. Nevertheless, it is possible to use these MA to define stochastic languages. We show that they belong to a broader class of rational series, that we call pseudo-stochastic rational languages. The aim of this paper is twofold. First we provide a theoretical study of pseudo-stochastic rational languages, the languages output by DEES, showing for example that this class is decidable within polynomial time. Second, we have carried out a lot of experiments in order to compare DEES to classical inference algorithms such as ALERGIA and MDI. They show that DEES outperforms them in most cases. △ Less

Submitted 7 November, 2008; v1 submitted 18 July, 2006; originally announced July 2006.

Journal ref: 8th International Colloquium on Grammatical Inference (ICGI'06), Japan (2006)

arXiv:cs/0602062 [pdf, ps, other]

Learning rational stochastic languages

Authors: François Denis, Yann Esposito, Amaury Habrard

Abstract: Given a finite set of words w1,...,wn independently drawn according to a fixed unknown distribution law P called a stochastic language, an usual goal in Grammatical Inference is to infer an estimate of P in some class of probabilistic models, such as Probabilistic Automata (PA). Here, we study the class of rational stochastic languages, which consists in stochastic languages that can be generate… ▽ More Given a finite set of words w1,...,wn independently drawn according to a fixed unknown distribution law P called a stochastic language, an usual goal in Grammatical Inference is to infer an estimate of P in some class of probabilistic models, such as Probabilistic Automata (PA). Here, we study the class of rational stochastic languages, which consists in stochastic languages that can be generated by Multiplicity Automata (MA) and which strictly includes the class of stochastic languages generated by PA. Rational stochastic languages have minimal normal representation which may be very concise, and whose parameters can be efficiently estimated from stochastic samples. We design an efficient inference algorithm DEES which aims at building a minimal normal representation of the target. Despite the fact that no recursively enumerable class of MA computes exactly the set of rational stochastic languages over Q, we show that DEES strongly identifies tis set in the limit. We study the intermediary MA output by DEES and show that they compute rational series which converge absolutely to one and which can be used to provide stochastic languages which closely estimate the target. △ Less

Submitted 17 February, 2006; originally announced February 2006.

Comments: 15 pages

Showing 1–34 of 34 results for author: Habrard, A