-
Compressive Mahalanobis Metric Learning Adapts to Intrinsic Dimension
Authors:
Efstratios Palias,
Ata Kabán
Abstract:
Metric learning aims at finding a suitable distance metric over the input space, to improve the performance of distance-based learning algorithms. In high-dimensional settings, it can also serve as dimensionality reduction by imposing a low-rank restriction to the learnt metric. In this paper, we consider the problem of learning a Mahalanobis metric, and instead of training a low-rank metric on high-dimensional data, we use a randomly compressed version of the data to train a full-rank metric in this reduced feature space. We give theoretical guarantees on the error for Mahalanobis metric learning, which depend on the stable dimension of the data support, but not on the ambient dimension. Our bounds make no assumptions aside from i.i.d. data sampling from a bounded support, and automatically tighten when benign geometrical structures are present. An important ingredient is an extension of Gordon's theorem, which may be of independent interest. We also corroborate our findings by numerical experiments.
Submitted 13 April, 2024; v1 submitted 11 September, 2023;
originally announced September 2023.
-
Approximability and Generalisation
Authors:
Andrew J. Turner,
Ata Kabán
Abstract:
Approximate learning machines have become popular in the era of small devices, including quantised, factorised, hashed, or otherwise compressed predictors, and the quest to explain and guarantee good generalisation abilities for such methods has just begun. In this paper we study the role of approximability in learning, both in the full precision and the approximated settings of the predictor that is learned from the data, through a notion of sensitivity of predictors to the action of the approximation operator at hand. We prove upper bounds on the generalisation of such predictors, yielding the following main findings, for any PAC-learnable class and any given approximation operator. 1) We show that under mild conditions, approximable target concepts are learnable from a smaller labelled sample, provided sufficient unlabelled data. 2) We give algorithms that guarantee a good predictor whose approximation also enjoys the same generalisation guarantees. 3) We highlight natural examples of structure in the class of sensitivities, which reduce, and possibly even eliminate the otherwise abundant requirement of additional unlabelled data, and henceforth shed new light onto what makes one problem instance easier to learn than another. These results embed the scope of modern model compression approaches into the general goal of statistical learning theory, which in return suggests appropriate algorithms through minimising uniform bounds.
Submitted 15 March, 2022;
originally announced March 2022.
-
Statistical optimality conditions for compressive ensembles
Authors:
Henry W. J. Reeve,
Ata Kaban
Abstract:
We present a framework for the theoretical analysis of ensembles of low-complexity empirical risk minimisers trained on independent random compressions of high-dimensional data. First we introduce a general distribution-dependent upper-bound on the excess risk, framed in terms of a natural notion of compressibility. This bound is independent of the dimension of the original data representation, and explains the in-built regularisation effect of the compressive approach. We then instantiate this general bound to classification and regression tasks, considering Johnson-Lindenstrauss mappings as the compression scheme. For each of these tasks, our strategy is to develop a tight upper bound on the compressibility function, and by doing so we discover distributional conditions of geometric nature under which the compressive algorithm attains minimax-optimal rates up to at most poly-logarithmic factors. In the case of compressive classification, this is achieved with a mild geometric margin condition along with a flexible moment condition that is significantly more general than the assumption of bounded domain. In the case of regression with strongly convex smooth loss functions we find that compressive regression is capable of exploiting spectral decay with near-optimal guarantees. In addition, a key ingredient for our central upper bound is a high probability uniform upper bound on the integrated deviation of dependent empirical processes, which may be of independent interest.
Submitted 2 June, 2021;
originally announced June 2021.
-
Optimistic bounds for multi-output prediction
Authors:
Henry WJ Reeve,
Ata Kaban
Abstract:
We investigate the challenge of multi-output learning, where the goal is to learn a vector-valued function based on a supervised data set. This includes a range of important problems in Machine Learning including multi-target regression, multi-class classification and multi-label classification. We begin our analysis by introducing the self-bounding Lipschitz condition for multi-output loss functions, which interpolates continuously between a classical Lipschitz condition and a multi-dimensional analogue of a smoothness condition. We then show that the self-bounding Lipschitz condition gives rise to optimistic bounds for multi-output learning, which are minimax optimal up to logarithmic factors. The proof exploits local Rademacher complexity combined with a powerful minoration inequality due to Srebro, Sridharan and Tewari. As an application we derive a state-of-the-art generalization bound for multi-class gradient boosting.
Submitted 22 February, 2020;
originally announced February 2020.
-
Fast Rates for a kNN Classifier Robust to Unknown Asymmetric Label Noise
Authors:
Henry W. J. Reeve,
Ata Kaban
Abstract:
We consider classification in the presence of class-dependent asymmetric label noise with unknown noise probabilities. In this setting, identifiability conditions are known, but additional assumptions were shown to be required for finite sample rates, and so far only the parametric rate has been obtained. Assuming these identifiability conditions, together with a measure-smoothness condition on the regression function and Tsybakov's margin condition, we show that the Robust kNN classifier of Gao et al. attains the minimax optimal rates of the noise-free setting, up to a log factor, even when trained on data with unknown asymmetric label noise. Hence, our results provide a solid theoretical backing for this empirically successful algorithm. By contrast the standard kNN is not even consistent in the setting of asymmetric label noise. A key idea in our analysis is a simple kNN based method for estimating the maximum of a function that requires far fewer assumptions than existing mode estimators do, and which may be of independent interest for noise proportion estimation and randomised optimisation problems.
Submitted 11 June, 2019;
originally announced June 2019.
-
Classification with unknown class-conditional label noise on non-compact feature spaces
Authors:
Henry W J Reeve,
Ata Kaban
Abstract:
We investigate the problem of classification in the presence of unknown class-conditional label noise in which the labels observed by the learner have been corrupted with some unknown class dependent probability. In order to obtain finite sample rates, previous approaches to classification with unknown class-conditional label noise have required that the regression function is close to its extrema on sets of large measure. We shall consider this problem in the setting of non-compact metric spaces, where the regression function need not attain its extrema.
In this setting we determine the minimax optimal learning rates (up to logarithmic factors). The rate displays interesting threshold behaviour: When the regression function approaches its extrema at a sufficient rate, the optimal learning rates are of the same order as those obtained in the label-noise free setting. If the regression function approaches its extrema more gradually then classification performance necessarily degrades. In addition, we present an adaptive algorithm which attains these rates without prior knowledge of either the distributional parameters or the local density. This identifies for the first time a scenario in which finite sample rates are achievable in the label noise setting, but they differ from the optimal rates without label noise.
Submitted 9 June, 2019; v1 submitted 14 February, 2019;
originally announced February 2019.
-
Structure-aware error bounds for linear classification with the zero-one loss
Authors:
Ata Kaban,
Robert J. Durrant
Abstract:
We prove risk bounds for binary classification in high-dimensional settings when the sample size is allowed to be smaller than the dimensionality of the training set observations. In particular, we prove upper bounds for both 'compressive learning' by empirical risk minimization (ERM) (that is when the ERM classifier is learned from data that have been projected from high-dimensions onto a randomly selected low-dimensional subspace) as well as uniform upper bounds in the full high-dimensional space. A novel tool we employ in both settings is the 'flipping probability' of Durrant and Kaban (ICML 2013) which we use to capture benign geometric structures that make a classification problem 'easy' in the sense of demanding a relatively low sample size for guarantees of good generalization. Furthermore our bounds also enable us to explain or draw connections between several existing successful classification algorithms. Finally we show empirically that our bounds are informative enough in practice to serve as the objective function for learning a classifier (by using them to do so).
Submitted 27 September, 2017;
originally announced September 2017.
-
Boosting in the presence of label noise
Authors:
Jakramate Bootkrajang,
Ata Kaban
Abstract:
Boosting is known to be sensitive to label noise. We studied two approaches to improve AdaBoost's robustness against labelling errors. One is to employ a label-noise robust classifier as a base learner, while the other is to modify the AdaBoost algorithm to be more robust. Empirical evaluation shows that a committee of robust classifiers, although it converges faster than non label-noise aware AdaBoost, is still susceptible to label noise. However, pairing it with the new robust Boosting algorithm we propose here results in a more resilient algorithm under mislabelling.
Submitted 26 September, 2013;
originally announced September 2013.
-
Robust mixtures in the presence of measurement errors
Authors:
Jianyong Sun,
Ata Kaban,
Somak Raychaudhury
Abstract:
We develop a mixture-based approach to robust density modeling and outlier detection for experimental multivariate data that includes measurement error information. Our model is designed to infer atypical measurements that are not due to errors, aiming to retrieve potentially interesting peculiar objects. Since exact inference is not possible in this model, we develop a tree-structured variational EM solution. This compares favorably against a fully factorial approximation scheme, approaching the accuracy of a Markov-Chain-EM, while maintaining computational simplicity. We demonstrate the benefits of including measurement errors in the model, in terms of improved outlier detection rates in varying measurement uncertainty conditions. We then use this approach in detecting peculiar quasars from an astrophysical survey, given photometric measurements with errors.
Submitted 6 September, 2007;
originally announced September 2007.
-
On class visualisation for high dimensional data: Exploring scientific datasets
Authors:
Ata Kaban,
Jianyong Sun,
Somak Raychaudhury,
Louisa Nolan
Abstract:
Parametric Embedding (PE) has recently been proposed as a general-purpose algorithm for class visualisation. It takes class posteriors produced by a mixture-based clustering algorithm and projects them in 2D for visualisation. However, although this fully modularised combination of objectives (clustering and projection) is attractive for its conceptual simplicity, in the case of high dimensional data, we show that a more optimal combination of these objectives can be achieved by integrating them both into a consistent probabilistic model. In this way, the projection step will fulfil a role of regularisation, guarding against the curse of dimensionality. As a result, the tradeoff between clustering and visualisation turns out to enhance the predictive abilities of the overall model. We present results on both synthetic data and two real-world high-dimensional data sets: observed spectra of early-type galaxies and gene expression arrays.
Submitted 4 September, 2006;
originally announced September 2006.
-
Young stellar populations in early-type galaxies in the Sloan Digital Sky Survey
Authors:
Louisa A. Nolan,
Somak Raychaudhury,
Ata Kaban
Abstract:
We use a purely data-driven rectified factor analysis to identify early-type galaxies with recent star formation in DR4 of the SDSS Spectroscopic Catalogue. We compare the spectra and environment of these galaxies with `normal' early-types, and a sample of independently selected E+A galaxies. We calculate the projected local galaxy surface density (Sigma_5 and Sigma_10) for each galaxy in our sample, and find that the dependence, on projected local density, of the properties of E+As is not significantly different from that of early-types with young stellar populations, dropping off rapidly towards denser environments, and flattening off at densities < 0.1-0.3 Mpc^-2. The dearth of E+A galaxies in dense environments confirms that E+As are most likely the products of galaxy-galaxy merging or interactions, rather than star-forming galaxies whose star formation has been quenched by processes unique to dense environments. We see a tentative peak in the number of E+A galaxies at Sigma_10 ~ 0.1-0.3 Mpc^-2, which may represent the local galaxy density at which the rate of galaxy-galaxy merging or interaction peaks. Analysis of the spectra of our early-types with young stellar populations suggests that they have a stellar component dominated by F stars, ~ 1-4 Gyr old, together with a mature, metal-rich population characteristic of `typical' early-types. The young stars represent > 10% of the stellar mass in these galaxies. This, together with the similarity of the environments in which this `E+F' population and the E+A galaxy sample are found, suggests that E+F galaxies used to be E+A galaxies, but have evolved by a further ~ one to a few Gyr. Our factor analysis is sensitive enough to identify this hidden population. (Abridged)
Submitted 13 November, 2006; v1 submitted 29 August, 2006;
originally announced August 2006.
-
A data-driven Bayesian approach for finding young stellar populations in early-type galaxies from their UV-optical spectra
Authors:
L. A. Nolan,
M. O. Harva,
A Kaban,
S. Raychaudhury
Abstract:
We present the results of a novel application of Bayesian modelling techniques, which, although purely data driven, have a physically interpretable result, and will be useful as an efficient data mining tool. We base our studies on the UV-to-optical spectra (observed and synthetic) of early-type galaxies. A probabilistic latent variable architecture is formulated, and a rigorous Bayesian methodology is employed for solving the inverse modelling problem from the available data. A powerful aspect of our formalism is that it allows us to recover a limited fraction of missing data due to incomplete spectral coverage, as well as to handle observational errors in a principled way. We apply this method to a sample of 21 well-studied early-type spectra, with known star-formation histories. We find that our data-driven Bayesian modelling allows us to identify those early-types which contain a significant stellar population <~ 1 Gyr old. This method would therefore be a very useful tool for automatically discovering various interesting sub-classes of galaxies. (abridged)
Submitted 16 November, 2005;
originally announced November 2005.
-
Finding Young Stellar Populations in Elliptical Galaxies from Independent Components of Optical Spectra
Authors:
Ata Kaban,
Louisa A. Nolan,
Somak Raychaudhury
Abstract:
Elliptical galaxies are believed to consist of a single population of old stars formed together at an early epoch in the Universe, yet recent analyses of galaxy spectra seem to indicate the presence of significant younger populations of stars in them. The detailed physical modelling of such populations is computationally expensive, inhibiting the detailed analysis of the several million galaxy spectra becoming available over the next few years. Here we present a data mining application aimed at decomposing the spectra of elliptical galaxies into several coeval stellar populations, without the use of detailed physical models. This is achieved by performing a linear independent basis transformation that essentially decouples the initial problem of joint processing of a set of correlated spectral measurements into that of the independent processing of a small set of prototypical spectra. Two methods are investigated: (1) A fast projection approach is derived by exploiting the correlation structure of neighboring wavelength bins within the spectral data. (2) A factorisation method that takes advantage of the positivity of the spectra is also investigated. The preliminary results show that typical features observed in stellar population spectra of different evolutionary histories can be convincingly disentangled by these methods, despite the absence of input physics. The success of this basis transformation analysis in recovering physically interpretable representations indicates that this technique is a potentially powerful tool for astronomical data mining.
Submitted 3 May, 2005;
originally announced May 2005.