-
Using the Sinkhorn divergence in permutation tests for the multivariate two-sample problem
Authors:
E. del Barrio,
J. S. Osorio,
A. J. Quiroz
Abstract:
In order to adapt the Wasserstein distance to the large sample multivariate non-parametric two-sample problem, making its application computationally feasible, permutation tests based on the Sinkhorn divergence between probability vectors associated with data-dependent partitions are considered. Different ways of implementing these tests are evaluated and the asymptotic distribution of the underlying statistic is established in some cases. The statistics proposed are compared, in simulated examples, with Schilling's test, one of the best non-parametric tests available in the literature.
Submitted 28 September, 2022;
originally announced September 2022.
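The testing scheme described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Voronoi partition built from randomly chosen pooled points, the regularization scale `eps`, the use of the plan cost `<P, C>` inside the debiased divergence, and all function names are assumptions made for illustration.

```python
import numpy as np

def sinkhorn_cost(a, b, C, eps, n_iter=200):
    """Transport cost <P, C> of the entropic plan between histograms a and b."""
    K = np.exp(-C / eps)              # Gibbs kernel, strictly positive
    u = np.ones_like(a)
    for _ in range(n_iter):           # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # approximate entropic coupling
    return float(np.sum(P * C))

def sinkhorn_divergence(a, b, C, eps):
    """Debiased form: cost(a, b) - (cost(a, a) + cost(b, b)) / 2."""
    return (sinkhorn_cost(a, b, C, eps)
            - 0.5 * sinkhorn_cost(a, a, C, eps)
            - 0.5 * sinkhorn_cost(b, b, C, eps))

def permutation_test(X, Y, n_cells=10, n_perm=199, seed=0):
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    n, N = len(X), len(X) + len(Y)
    # Assumed data-dependent partition: Voronoi cells of centers
    # drawn at random from the pooled sample.
    centers = Z[rng.choice(N, size=n_cells, replace=False)]
    cell = np.argmin(((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    C = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    eps = np.median(C)                # heuristic regularization scale (assumption)

    def stat(idx):
        # Probability vectors of the two (possibly permuted) samples over the cells.
        a = np.bincount(cell[idx[:n]], minlength=n_cells) / n
        b = np.bincount(cell[idx[n:]], minlength=n_cells) / (N - n)
        return sinkhorn_divergence(a, b, C, eps)

    observed = stat(np.arange(N))
    exceed = sum(stat(rng.permutation(N)) >= observed for _ in range(n_perm))
    return observed, (1 + exceed) / (n_perm + 1)
```

Because the partition is built from the pooled sample, it does not change under relabelling, so each permutation only requires recomputing two small probability vectors.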
-
A bagging and importance sampling approach to Support Vector Machines
Authors:
R. Bárcenas,
M. D. González-Lima,
A. J. Quiroz
Abstract:
An importance sampling and bagging approach to solving the support vector machine (SVM) problem in the context of large databases is presented and evaluated. Our algorithm builds on the nearest-neighbor ideas presented in Camelo et al. (2015). As in that reference, the goal of the present proposal is to achieve a faster solution of the SVM problem without a significant loss in the prediction error. The performance of the methodology is evaluated in benchmark examples and theoretical aspects of subsample methods are discussed.
Submitted 17 August, 2018;
originally announced August 2018.
-
Local angles and dimension estimation from data on manifolds
Authors:
Mateo Díaz,
Adolfo J. Quiroz,
Mauricio Velasco
Abstract:
For data living in a manifold $M\subseteq \mathbb{R}^m$ and a point $p\in M$ we consider a statistic $U_{k,n}$ which estimates the variance of the angle between pairs of vectors $X_i-p$ and $X_j-p$, for data points $X_i$, $X_j$, near $p$, and evaluate this statistic as a tool for estimation of the intrinsic dimension of $M$ at $p$. Consistency of the local dimension estimator is established and the asymptotic distribution of $U_{k,n}$ is found under minimal regularity assumptions. Performance of the proposed methodology is compared against state-of-the-art methods on simulated data.
Submitted 3 May, 2018;
originally announced May 2018.
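A rough sketch of the angle-variance idea in this abstract follows. It assumes, for illustration only, that the statistic is the mean squared deviation of pairwise angles from $\pi/2$ among the $k$ nearest neighbours of $p$, and that the dimension is chosen by matching this value to the angle variance of uniform directions on $S^{d-1}$; the exact definition of $U_{k,n}$ and the decision rule in the paper may differ, and all function names are hypothetical.

```python
import numpy as np

def angle_variance_statistic(X, p, k):
    """Mean of (angle - pi/2)^2 over pairs among the k nearest neighbours of p."""
    V = X - p
    dist = np.linalg.norm(V, axis=1)
    keep = np.argsort(dist)[:k]
    keep = keep[dist[keep] > 0]                  # drop p itself if it is a data point
    U = V[keep] / dist[keep, None]               # unit directions from p
    i, j = np.triu_indices(len(U), k=1)          # all unordered pairs
    cosines = np.clip(np.sum(U[i] * U[j], axis=1), -1.0, 1.0)
    return float(np.mean((np.arccos(cosines) - np.pi / 2) ** 2))

def reference_variance(d, grid=20001):
    """Variance of the angle between two independent uniform directions on S^(d-1)."""
    if d == 1:
        return np.pi ** 2 / 4                    # angle is 0 or pi with equal probability
    theta = np.linspace(0.0, np.pi, grid)
    w = np.sin(theta) ** (d - 2)                 # angle density is proportional to sin^(d-2)
    return float(np.sum(w * (theta - np.pi / 2) ** 2) / np.sum(w))

def estimate_dimension(X, p, k=50):
    """Pick the candidate dimension whose reference variance is closest (assumed rule)."""
    stat = angle_variance_statistic(X, p, k)
    candidates = np.arange(1, X.shape[1] + 1)
    refs = np.array([reference_variance(d) for d in candidates])
    return int(candidates[np.argmin(np.abs(refs - stat))])
```

The reference variances decrease with $d$ (for example $\pi^2/12$ for $d=2$ and $\pi^2/4-2$ for $d=3$), which is what makes the angle variance informative about the local dimension.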
-
Machine learning techniques to select Be star candidates. An application in the OGLE-IV Gaia south ecliptic pole field
Authors:
M. F. Pérez-Ortiz,
A. García-Varela,
A. J. Quiroz,
B. E. Sabogal,
J. Hernández
Abstract:
Statistical pattern recognition methods have provided competitive solutions for variable star classification at a relatively low computational cost. In order to perform supervised classification, a set of features is proposed and used to train an automatic classification system. Quantities related to the magnitude density of the light curves and their Fourier coefficients have been chosen as features in previous studies. However, some of these features are not robust to the presence of outliers and the calculation of Fourier coefficients is computationally expensive for large data sets. We propose and evaluate the performance of a new robust set of features using supervised classifiers in order to look for new Be star candidates in the OGLE-IV Gaia south ecliptic pole field. We calculated the proposed set of features on six types of variable stars and on a set of Be star candidates reported in the literature. We evaluated the performance of these features using classification trees and random forests along with K-nearest neighbours, support vector machines, and gradient boosted trees methods. We tuned the classifiers with a 10-fold cross-validation and grid search. We validated the performance of the best classifier on a set of OGLE-IV light curves and applied this to find new Be star candidates. The random forest classifier outperformed the others. By using the random forest classifier and colour criteria we found 50 Be star candidates in the direction of the Gaia south ecliptic pole field, four of which have infrared colours consistent with Herbig Ae/Be stars. Supervised methods are very useful for obtaining preliminary samples of variable stars extracted from large databases. As usual, the stars classified as Be star candidates must be checked for the colours and spectroscopic characteristics expected for them.
Submitted 14 July, 2017;
originally announced July 2017.
-
Permutation tests in the two-sample problem for functional data
Authors:
Alejandra Cabaña,
Ana Maria Estrada,
Jairo I. Peña,
Adolfo J. Quiroz
Abstract:
Three different permutation test schemes are discussed and compared in the context of the two-sample problem for functional data. One of the procedures was essentially introduced by Lopez-Pintado and Romo (2009), using notions of functional data depth to adapt the ideas originally proposed by Liu and Singh (1993) for multivariate data. Of the new methods introduced here, one is also based on functional data depths, but uses a different way (inspired by Meta-Analysis) to assess the significance of the depth differences. The second new method presented here adapts, to the functional data setting, the k-nearest-neighbors statistic of Schilling (1986). The three methods are compared with one another and against the test of Horvath and Kokoszka (2012) in simulated examples and real data. The comparison considers the performance of the statistics in terms of statistical power and in terms of computational cost.
Submitted 21 October, 2016;
originally announced October 2016.
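The k-nearest-neighbors permutation scheme mentioned in this abstract can be sketched as below. This is an illustration under assumptions, not the procedure from the paper: curves are taken to be observed on a common equispaced grid with spacing `dt`, the L2 distance is approximated by a Riemann sum, and the function names are hypothetical.

```python
import numpy as np

def same_sample_fraction(D, labels, k):
    """Fraction of k-nearest-neighbour pairs whose two members share a sample label."""
    D = D.copy()
    np.fill_diagonal(D, np.inf)                  # a curve is not its own neighbour
    nn = np.argsort(D, axis=1)[:, :k]            # indices of the k nearest curves
    return float(np.mean(labels[nn] == labels[:, None]))

def knn_permutation_test(X, Y, dt, k=3, n_perm=199, seed=0):
    """X, Y: arrays of shape (n_curves, n_grid_points) on a common grid."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    labels = np.r_[np.zeros(len(X), dtype=int), np.ones(len(Y), dtype=int)]
    # Approximate L2 distances between curves (Riemann sum on the grid).
    diff = Z[:, None, :] - Z[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1) * dt)
    observed = same_sample_fraction(D, labels, k)
    # Distances do not depend on the labels, so only the labels are permuted.
    exceed = sum(
        same_sample_fraction(D, rng.permutation(labels), k) >= observed
        for _ in range(n_perm)
    )
    return observed, (1 + exceed) / (n_perm + 1)
```

Large values of the statistic indicate that curves tend to have neighbours from their own sample, which is evidence against the hypothesis of equal distributions.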
-
Metric Entropy estimation using o-minimality Theory
Authors:
Alf Onshuus,
Adolfo J. Quiroz
Abstract:
It is shown how tools from the area of Model Theory, specifically from the Theory of o-minimality, can be used to prove that a class of functions is VC-subgraph (in the sense of Dudley, 1987), and therefore satisfies a uniform polynomial metric entropy bound. We give examples where the use of these methods significantly improves the existing metric entropy bounds. The methods proposed here can be applied to finite dimensional parametric families of functions without the need for the parameters to live in a compact set, as is sometimes required in theorems that produce similar entropy bounds (for instance Theorem 19.7 of van der Vaart, 1998).
Submitted 22 November, 2015;
originally announced November 2015.
-
Quadratic forms of the empirical processes for the two sample problem for functional data
Authors:
R. Bárcenas,
J. Ortega,
A. J. Quiroz
Abstract:
The use of quadratic forms of the empirical process for the two-sample problem in the context of functional data is considered. The convergence of the family of statistics proposed to a Gaussian limit is established under metric entropy conditions for smooth functional data. The applicability of the proposed methodology is evaluated in examples.
Submitted 3 July, 2015;
originally announced July 2015.