
Compressive Mahalanobis Metric Learning Adapts to Intrinsic Dimension

Abstract: Metric learning aims at finding a suitable distance metric over the input space, to improve the performance of distance-based learning algorithms. In high-dimensional settings, it can also serve as dimensionality reduction by imposing a low-rank restriction on the learnt metric. In this paper, we consider the problem of learning a Mahalanobis metric, and instead of training a low-rank metric on high-dimensional data, we use a randomly compressed version of the data to train a full-rank metric in this reduced feature space. We give theoretical guarantees on the error for Mahalanobis metric learning, which depend on the stable dimension of the data support, but not on the ambient dimension. Our bounds make no assumptions aside from i.i.d. data sampling from a bounded support, and automatically tighten when benign geometrical structures are present. An important ingredient is an extension of Gordon's theorem, which may be of independent interest. We also corroborate our findings by numerical experiments.

Submitted 13 April, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

Comments: 8 pages, 2 figures

arXiv:2203.07989 [pdf, ps, other]

Approximability and Generalisation

Authors: Andrew J. Turner, Ata Kabán

Abstract: Approximate learning machines have become popular in the era of small devices, including quantised, factorised, hashed, or otherwise compressed predictors, and the quest to explain and guarantee good generalisation abilities for such methods has just begun. In this paper we study the role of approximability in learning, both in the full precision and the approximated settings of the predictor that is learned from the data, through a notion of sensitivity of predictors to the action of the approximation operator at hand. We prove upper bounds on the generalisation of such predictors, yielding the following main findings, for any PAC-learnable class and any given approximation operator. 1) We show that under mild conditions, approximable target concepts are learnable from a smaller labelled sample, provided sufficient unlabelled data. 2) We give algorithms that guarantee a good predictor whose approximation also enjoys the same generalisation guarantees. 3) We highlight natural examples of structure in the class of sensitivities, which reduce, and possibly even eliminate, the otherwise abundant requirement of additional unlabelled data, and hence shed new light onto what makes one problem instance easier to learn than another. These results embed the scope of modern model compression approaches into the general goal of statistical learning theory, which in turn suggests appropriate algorithms through minimising uniform bounds.

Submitted 15 March, 2022; originally announced March 2022.

Comments: 25 pages

arXiv:2106.01092 [pdf, ps, other]

Statistical optimality conditions for compressive ensembles

Authors: Henry W. J. Reeve, Ata Kaban

Abstract: We present a framework for the theoretical analysis of ensembles of low-complexity empirical risk minimisers trained on independent random compressions of high-dimensional data. First we introduce a general distribution-dependent upper bound on the excess risk, framed in terms of a natural notion of compressibility. This bound is independent of the dimension of the original data representation, and explains the in-built regularisation effect of the compressive approach. We then instantiate this general bound to classification and regression tasks, considering Johnson-Lindenstrauss mappings as the compression scheme. For each of these tasks, our strategy is to develop a tight upper bound on the compressibility function, and by doing so we discover distributional conditions of geometric nature under which the compressive algorithm attains minimax-optimal rates up to at most poly-logarithmic factors. In the case of compressive classification, this is achieved with a mild geometric margin condition along with a flexible moment condition that is significantly more general than the assumption of bounded domain. In the case of regression with strongly convex smooth loss functions we find that compressive regression is capable of exploiting spectral decay with near-optimal guarantees. In addition, a key ingredient for our central upper bound is a high probability uniform upper bound on the integrated deviation of dependent empirical processes, which may be of independent interest.

Submitted 2 June, 2021; originally announced June 2021.

MSC Class: 62-08

arXiv:2002.09769 [pdf, ps, other]

Optimistic bounds for multi-output prediction

Authors: Henry WJ Reeve, Ata Kaban

Abstract: We investigate the challenge of multi-output learning, where the goal is to learn a vector-valued function based on a supervised data set. This includes a range of important problems in Machine Learning including multi-target regression, multi-class classification and multi-label classification. We begin our analysis by introducing the self-bounding Lipschitz condition for multi-output loss functions, which interpolates continuously between a classical Lipschitz condition and a multi-dimensional analogue of a smoothness condition. We then show that the self-bounding Lipschitz condition gives rise to optimistic bounds for multi-output learning, which are minimax optimal up to logarithmic factors. The proof exploits local Rademacher complexity combined with a powerful minoration inequality due to Srebro, Sridharan and Tewari. As an application we derive a state-of-the-art generalization bound for multi-class gradient boosting.

Submitted 22 February, 2020; originally announced February 2020.

arXiv:1906.04542 [pdf, other]

Fast Rates for a kNN Classifier Robust to Unknown Asymmetric Label Noise

Authors: Henry W. J. Reeve, Ata Kaban

Abstract: We consider classification in the presence of class-dependent asymmetric label noise with unknown noise probabilities. In this setting, identifiability conditions are known, but additional assumptions were shown to be required for finite sample rates, and so far only the parametric rate has been obtained. Assuming these identifiability conditions, together with a measure-smoothness condition on the regression function and Tsybakov's margin condition, we show that the Robust kNN classifier of Gao et al. attains the minimax optimal rates of the noise-free setting, up to a log factor, even when trained on data with unknown asymmetric label noise. Hence, our results provide a solid theoretical backing for this empirically successful algorithm. By contrast, the standard kNN is not even consistent in the setting of asymmetric label noise. A key idea in our analysis is a simple kNN based method for estimating the maximum of a function that requires far fewer assumptions than existing mode estimators do, and which may be of independent interest for noise proportion estimation and randomised optimisation problems.

Submitted 11 June, 2019; originally announced June 2019.

Comments: ICML 2019

arXiv:1902.05627 [pdf, other]

Classification with unknown class-conditional label noise on non-compact feature spaces

Authors: Henry W J Reeve, Ata Kaban

Abstract: We investigate the problem of classification in the presence of unknown class-conditional label noise in which the labels observed by the learner have been corrupted with some unknown class-dependent probability. In order to obtain finite sample rates, previous approaches to classification with unknown class-conditional label noise have required that the regression function is close to its extrema on sets of large measure. We shall consider this problem in the setting of non-compact metric spaces, where the regression function need not attain its extrema. In this setting we determine the minimax optimal learning rates (up to logarithmic factors). The rate displays interesting threshold behaviour: When the regression function approaches its extrema at a sufficient rate, the optimal learning rates are of the same order as those obtained in the label-noise free setting. If the regression function approaches its extrema more gradually, then classification performance necessarily degrades. In addition, we present an adaptive algorithm which attains these rates without prior knowledge of either the distributional parameters or the local density. This identifies for the first time a scenario in which finite sample rates are achievable in the label noise setting, but they differ from the optimal rates without label noise.

Submitted 9 June, 2019; v1 submitted 14 February, 2019; originally announced February 2019.

arXiv:1709.09782 [pdf, ps, other]

Structure-aware error bounds for linear classification with the zero-one loss

Authors: Ata Kaban, Robert J. Durrant

Abstract: We prove risk bounds for binary classification in high-dimensional settings when the sample size is allowed to be smaller than the dimensionality of the training set observations. In particular, we prove upper bounds for both 'compressive learning' by empirical risk minimization (ERM) (that is when the ERM classifier is learned from data that have been projected from high dimensions onto a randomly selected low-dimensional subspace) as well as uniform upper bounds in the full high-dimensional space. A novel tool we employ in both settings is the 'flipping probability' of Durrant and Kaban (ICML 2013) which we use to capture benign geometric structures that make a classification problem 'easy' in the sense of demanding a relatively low sample size for guarantees of good generalization. Furthermore, our bounds also enable us to explain or draw connections between several existing successful classification algorithms. Finally we show empirically that our bounds are informative enough in practice to serve as the objective function for learning a classifier (by using them to do so).

Submitted 27 September, 2017; originally announced September 2017.

MSC Class: 62G05; 68Q32; 62H05; 68W25 ACM Class: I.2.6

arXiv:1309.6818 [pdf]

Boosting in the presence of label noise

Authors: Jakramate Bootkrajang, Ata Kaban

Abstract: Boosting is known to be sensitive to label noise. We studied two approaches to improve AdaBoost's robustness against labelling errors. One is to employ a label-noise robust classifier as a base learner, while the other is to modify the AdaBoost algorithm to be more robust. Empirical evaluation shows that a committee of robust classifiers, although it converges faster than non-label-noise-aware AdaBoost, is still susceptible to label noise. However, pairing it with the new robust Boosting algorithm we propose here results in a more resilient algorithm under mislabelling.

Submitted 26 September, 2013; originally announced September 2013.

Comments: Appears in Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI2013)

Report number: UAI-P-2013-PG-82-91

arXiv:0709.0928 [pdf, ps, other]

Robust mixtures in the presence of measurement errors

Authors: Jianyong Sun, Ata Kaban, Somak Raychaudhury

Abstract: We develop a mixture-based approach to robust density modeling and outlier detection for experimental multivariate data that includes measurement error information. Our model is designed to infer atypical measurements that are not due to errors, aiming to retrieve potentially interesting peculiar objects. Since exact inference is not possible in this model, we develop a tree-structured variational EM solution. This compares favorably against a fully factorial approximation scheme, approaching the accuracy of a Markov-Chain-EM, while maintaining computational simplicity. We demonstrate the benefits of including measurement errors in the model, in terms of improved outlier detection rates in varying measurement uncertainty conditions. We then use this approach in detecting peculiar quasars from an astrophysical survey, given photometric measurements with errors.

Submitted 6 September, 2007; originally announced September 2007.

Comments: (Refereed) Proceedings of the 24-th Annual International Conference on Machine Learning 2007 (ICML07), (Ed.) Z. Ghahramani. June 20-24, 2007, Oregon State University, Corvallis, OR, USA, pp. 847-854; Omnipress. ISBN 978-1-59593-793-3; 8 pages, 6 figures

arXiv:astro-ph/0609094 [pdf, ps, other]

doi 10.1007/11893318_15

On class visualisation for high dimensional data: Exploring scientific datasets

Authors: Ata Kaban, Jianyong Sun, Somak Raychaudhury, Louisa Nolan

Abstract: Parametric Embedding (PE) has recently been proposed as a general-purpose algorithm for class visualisation. It takes class posteriors produced by a mixture-based clustering algorithm and projects them in 2D for visualisation. However, although this fully modularised combination of objectives (clustering and projection) is attractive for its conceptual simplicity, in the case of high dimensional data, we show that a more optimal combination of these objectives can be achieved by integrating them both into a consistent probabilistic model. In this way, the projection step will fulfil a role of regularisation, guarding against the curse of dimensionality. As a result, the tradeoff between clustering and visualisation turns out to enhance the predictive abilities of the overall model. We present results on both synthetic data and two real-world high-dimensional data sets: observed spectra of early-type galaxies and gene expression arrays.

Submitted 4 September, 2006; originally announced September 2006.

Comments: to appear in Lecture notes in Artificial Intelligence vol. 4265, the (refereed) proceedings of the Ninth International conference on Discovery Science (DS-2006), October 2006, Barcelona, Spain. 12 pages, 8 figures

arXiv:astro-ph/0608623 [pdf, ps, other]

doi 10.1111/j.1365-2966.2006.11326.x

Young stellar populations in early-type galaxies in the Sloan Digital Sky Survey

Authors: Louisa A. Nolan, Somak Raychaudhury, Ata Kaban

Abstract: We use a purely data-driven rectified factor analysis to identify early-type galaxies with recent star formation in DR4 of the SDSS Spectroscopic Catalogue. We compare the spectra and environment of these galaxies with 'normal' early-types, and a sample of independently selected E+A galaxies. We calculate the projected local galaxy surface density (Sigma_5 and Sigma_10) for each galaxy in our sample, and find that the dependence, on projected local density, of the properties of E+As is not significantly different from that of early-types with young stellar populations, dropping off rapidly towards denser environments, and flattening off at densities < 0.1-0.3 Mpc^-2. The dearth of E+A galaxies in dense environments confirms that E+As are most likely the products of galaxy-galaxy merging or interactions, rather than star-forming galaxies whose star formation has been quenched by processes unique to dense environments. We see a tentative peak in the number of E+A galaxies at Sigma_10 ~ 0.1-0.3 Mpc^-2, which may represent the local galaxy density at which the rate of galaxy-galaxy merging or interaction peaks. Analysis of the spectra of our early-types with young stellar populations suggests that they have a stellar component dominated by F stars, ~ 1-4 Gyr old, together with a mature, metal-rich population characteristic of 'typical' early-types. The young stars represent > 10% of the stellar mass in these galaxies.
This, together with the similarity of the environments in which this 'E+F' population and the E+A galaxy sample are found, suggests that E+F galaxies used to be E+A galaxies, but have evolved by a further ~ one to a few Gyr. Our factor analysis is sensitive enough to identify this hidden population. (Abridged)

Submitted 13 November, 2006; v1 submitted 29 August, 2006; originally announced August 2006.

Comments: 7 pages, 5 figures, submitted to MNRAS, minor revision

arXiv:astro-ph/0511503 [pdf, ps, other]

doi 10.1111/j.1365-2966.2005.09868.x

A data-driven Bayesian approach for finding young stellar populations in early-type galaxies from their UV-optical spectra

Authors: L. A. Nolan, M. O. Harva, A Kaban, S. Raychaudhury

Abstract: We present the results of a novel application of Bayesian modelling techniques, which, although purely data driven, have a physically interpretable result, and will be useful as an efficient data mining tool. We base our studies on the UV-to-optical spectra (observed and synthetic) of early-type galaxies. A probabilistic latent variable architecture is formulated, and a rigorous Bayesian methodology is employed for solving the inverse modelling problem from the available data. A powerful aspect of our formalism is that it allows us to recover a limited fraction of missing data due to incomplete spectral coverage, as well as to handle observational errors in a principled way. We apply this method to a sample of 21 well-studied early-type spectra, with known star-formation histories. We find that our data-driven Bayesian modelling allows us to identify those early-types which contain a significant stellar population <~ 1 Gyr old. This method would therefore be a very useful tool for automatically discovering various interesting sub-classes of galaxies. (abridged)

Submitted 16 November, 2005; originally announced November 2005.

Comments: 19 pages, 15 figures, accepted for publication MNRAS

Journal ref: Mon.Not.Roy.Astron.Soc.366:321-338,2006

arXiv:astro-ph/0505059 [pdf, ps, other]

Finding Young Stellar Populations in Elliptical Galaxies from Independent Components of Optical Spectra

Authors: Ata Kaban, Louisa A. Nolan, Somak Raychaudhury

Abstract: Elliptical galaxies are believed to consist of a single population of old stars formed together at an early epoch in the Universe, yet recent analyses of galaxy spectra seem to indicate the presence of significant younger populations of stars in them. The detailed physical modelling of such populations is computationally expensive, inhibiting the detailed analysis of the several million galaxy spectra becoming available over the next few years. Here we present a data mining application aimed at decomposing the spectra of elliptical galaxies into several coeval stellar populations, without the use of detailed physical models. This is achieved by performing a linear independent basis transformation that essentially decouples the initial problem of joint processing of a set of correlated spectral measurements into that of the independent processing of a small set of prototypical spectra. Two methods are investigated: (1) A fast projection approach is derived by exploiting the correlation structure of neighboring wavelength bins within the spectral data. (2) A factorisation method that takes advantage of the positivity of the spectra is also investigated. The preliminary results show that typical features observed in stellar population spectra of different evolutionary histories can be convincingly disentangled by these methods, despite the absence of input physics. The success of this basis transformation analysis in recovering physically interpretable representations indicates that this technique is a potentially powerful tool for astronomical data mining.

Submitted 3 May, 2005; originally announced May 2005.

Comments: 12 Pages, 7 figures; accepted in SIAM 2005 International Conference on Data Mining, Newport Beach, CA, April 2005

Showing 1–13 of 13 results for author: Kaban, A