-
Informative regularization for a multi-layer perceptron RR Lyrae classifier under data shift
Authors:
Francisco Pérez-Galarce,
Karim Pichara,
Pablo Huijse,
Márcio Catelan,
Domingo Mery
Abstract:
In recent decades, machine learning has provided valuable models and algorithms for processing and extracting knowledge from time-series surveys. Different classifiers have been proposed and performed to an excellent standard. Nevertheless, few papers have tackled the data shift problem in labeled training sets, which occurs when there is a mismatch between the data distribution in the training se…
▽ More
In recent decades, machine learning has provided valuable models and algorithms for processing and extracting knowledge from time-series surveys. Different classifiers have been proposed and performed to an excellent standard. Nevertheless, few papers have tackled the data shift problem in labeled training sets, which occurs when there is a mismatch between the data distribution in the training set and the testing set. This drawback can damage the prediction performance in unseen data. Consequently, we propose a scalable and easily adaptable approach based on an informative regularization and an ad-hoc training procedure to mitigate the shift problem during the training of a multi-layer perceptron for RR Lyrae classification. We collect ranges for characteristic features to construct a symbolic representation of prior knowledge, which was used to model the informative regularizer component. Simultaneously, we design a two-step back-propagation algorithm to integrate this knowledge into the neural network, whereby one step is applied in each epoch to minimize classification error, while another is applied to ensure regularization. Our algorithm defines a subset of parameters (a mask) for each loss function. This approach handles the forgetting effect, which stems from a trade-off between these loss functions (learning from data versus learning expert knowledge) during training. Experiments were conducted using recently proposed shifted benchmark sets for RR Lyrae stars, outperforming baseline models by up to 3\% through a more reliable classifier. Our method provides a new path to incorporate knowledge from characteristic features into artificial neural networks to manage the underlying data shift problem.
△ Less
Submitted 11 March, 2023;
originally announced March 2023.
-
Informative Bayesian model selection for RR Lyrae star classifiers
Authors:
F. Pérez-Galarce,
K. Pichara,
P. Huijse,
M. Catelan,
D. Mery
Abstract:
Machine learning has achieved an important role in the automatic classification of variable stars, and several classifiers have been proposed over the last decade. These classifiers have achieved impressive performance in several astronomical catalogues. However, some scientific articles have also shown that the training data therein contain multiple sources of bias. Hence, the performance of thos…
▽ More
Machine learning has achieved an important role in the automatic classification of variable stars, and several classifiers have been proposed over the last decade. These classifiers have achieved impressive performance in several astronomical catalogues. However, some scientific articles have also shown that the training data therein contain multiple sources of bias. Hence, the performance of those classifiers on objects not belonging to the training data is uncertain, potentially resulting in the selection of incorrect models. Besides, it gives rise to the deployment of misleading classifiers. An example of the latter is the creation of open-source labelled catalogues with biased predictions. In this paper, we develop a method based on an informative marginal likelihood to evaluate variable star classifiers. We collect deterministic rules that are based on physical descriptors of RR Lyrae stars, and then, to mitigate the biases, we introduce those rules into the marginal likelihood estimation. We perform experiments with a set of Bayesian Logistic Regressions, which are trained to classify RR Lyraes, and we found that our method outperforms traditional non-informative cross-validation strategies, even when penalized models are assessed. Our methodology provides a more rigorous alternative to assess machine learning models using astronomical knowledge. From this approach, applications to other classes of variable stars and algorithmic improvements can be developed.
△ Less
Submitted 24 May, 2021;
originally announced May 2021.
-
MPCC: Matching Priors and Conditionals for Clustering
Authors:
Nicolás Astorga,
Pablo Huijse,
Pavlos Protopapas,
Pablo Estévez
Abstract:
Clustering is a fundamental task in unsupervised learning that depends heavily on the data representation that is used. Deep generative models have appeared as a promising tool to learn informative low-dimensional data representations. We propose Matching Priors and Conditionals for Clustering (MPCC), a GAN-based model with an encoder to infer latent variables and cluster categories from data, and…
▽ More
Clustering is a fundamental task in unsupervised learning that depends heavily on the data representation that is used. Deep generative models have appeared as a promising tool to learn informative low-dimensional data representations. We propose Matching Priors and Conditionals for Clustering (MPCC), a GAN-based model with an encoder to infer latent variables and cluster categories from data, and a flexible decoder to generate samples from a conditional latent space. With MPCC we demonstrate that a deep generative model can be competitive/superior against discriminative methods in clustering tasks surpassing the state of the art over a diverse set of benchmark datasets. Our experiments show that adding a learnable prior and augmenting the number of encoder updates improve the quality of the generated samples, obtaining an inception score of 9.49 $\pm$ 0.15 and improving the Fréchet inception distance over the state of the art by a 46.9% in CIFAR10.
△ Less
Submitted 21 August, 2020;
originally announced August 2020.
-
An Information Theory Approach on Deciding Spectroscopic Follow Ups
Authors:
Javiera Astudillo,
Pavlos Protopapas,
Karim Pichara,
Pablo Huijse
Abstract:
Classification and characterization of variable phenomena and transient phenomena are critical for astrophysics and cosmology. These objects are commonly studied using photometric time series or spectroscopic data. Given that many ongoing and future surveys are in time-domain and given that adding spectra provide further insights but requires more observational resources, it would be valuable to k…
▽ More
Classification and characterization of variable phenomena and transient phenomena are critical for astrophysics and cosmology. These objects are commonly studied using photometric time series or spectroscopic data. Given that many ongoing and future surveys are in time-domain and given that adding spectra provide further insights but requires more observational resources, it would be valuable to know which objects should we prioritize to have spectrum in addition to time series. We propose a methodology in a probabilistic setting that determines a-priory which objects are worth taking spectrum to obtain better insights, where we focus 'insight' as the type of the object (classification). Objects for which we query its spectrum are reclassified using their full spectrum information. We first train two classifiers, one that uses photometric data and another that uses photometric and spectroscopic data together. Then for each photometric object we estimate the probability of each possible spectrum outcome. We combine these models in various probabilistic frameworks (strategies) which are used to guide the selection of follow up observations. The best strategy depends on the intended use, whether it is getting more confidence or accuracy. For a given number of candidate objects (127, equal to 5% of the dataset) for taking spectra, we improve 37% class prediction accuracy as opposed to 20% of a non-naive (non-random) best base-line strategy. Our approach provides a general framework for follow-up strategies and can be extended beyond classification and to include other forms of follow-ups beyond spectroscopy.
△ Less
Submitted 6 November, 2019;
originally announced November 2019.
-
Robust period estimation using mutual information for multi-band light curves in the synoptic survey era
Authors:
Pablo Huijse,
Pablo A. Estevez,
Francisco Forster,
Scott F. Daniel,
Andrew J. Connolly,
Pavlos Protopapas,
Rodrigo Carrasco,
Jose C. Principe
Abstract:
The Large Synoptic Survey Telescope (LSST) will produce an unprecedented amount of light curves using six optical bands. Robust and efficient methods that can aggregate data from multidimensional sparsely-sampled time series are needed. In this paper we present a new method for light curve period estimation based on the quadratic mutual information (QMI). The proposed method does not assume a part…
▽ More
The Large Synoptic Survey Telescope (LSST) will produce an unprecedented amount of light curves using six optical bands. Robust and efficient methods that can aggregate data from multidimensional sparsely-sampled time series are needed. In this paper we present a new method for light curve period estimation based on the quadratic mutual information (QMI). The proposed method does not assume a particular model for the light curve nor its underlying probability density and it is robust to non-Gaussian noise and outliers. By combining the QMI from several bands the true period can be estimated even when no single-band QMI yields the period. Period recovery performance as a function of average magnitude and sample size is measured using 30,000 synthetic multi-band light curves of RR Lyrae and Cepheid variables generated by the LSST Operations and Catalog simulators. The results show that aggregating information from several bands is highly beneficial in LSST sparsely-sampled time series, obtaining an absolute increase in period recovery rate up to 50%. We also show that the QMI is more robust to noise and light curve length (sample size) than the multiband generalizations of the Lomb Scargle and Analysis of Variance periodograms, recovering the true period in 10-30% more cases than its competitors. A python package containing efficient Cython implementations of the QMI and other methods is provided.
△ Less
Submitted 11 September, 2017;
originally announced September 2017.
-
Computational Intelligence Challenges and Applications on Large-Scale Astronomical Time Series Databases
Authors:
Pablo Huijse,
Pablo A. Estevez,
Pavlos Protopapas,
Jose C. Principe,
Pablo Zegers
Abstract:
Time-domain astronomy (TDA) is facing a paradigm shift caused by the exponential growth of the sample size, data complexity and data generation rates of new astronomical sky surveys. For example, the Large Synoptic Survey Telescope (LSST), which will begin operations in northern Chile in 2022, will generate a nearly 150 Petabyte imaging dataset of the southern hemisphere sky. The LSST will stream…
▽ More
Time-domain astronomy (TDA) is facing a paradigm shift caused by the exponential growth of the sample size, data complexity and data generation rates of new astronomical sky surveys. For example, the Large Synoptic Survey Telescope (LSST), which will begin operations in northern Chile in 2022, will generate a nearly 150 Petabyte imaging dataset of the southern hemisphere sky. The LSST will stream data at rates of 2 Terabytes per hour, effectively capturing an unprecedented movie of the sky. The LSST is expected not only to improve our understanding of time-varying astrophysical objects, but also to reveal a plethora of yet unknown faint and fast-varying phenomena. To cope with a change of paradigm to data-driven astronomy, the fields of astroinformatics and astrostatistics have been created recently. The new data-oriented paradigms for astronomy combine statistics, data mining, knowledge discovery, machine learning and computational intelligence, in order to provide the automated and robust methods needed for the rapid detection and classification of known astrophysical objects as well as the unsupervised characterization of novel phenomena. In this article we present an overview of machine learning and computational intelligence applications to TDA. Future big data challenges and new lines of research in TDA, focusing on the LSST, are identified and discussed from the viewpoint of computational intelligence/machine learning. Interdisciplinary collaboration will be required to cope with the challenges posed by the deluge of astronomical data coming from the LSST.
△ Less
Submitted 25 September, 2015;
originally announced September 2015.
-
A Novel, Fully Automated Pipeline for Period Estimation in the EROS 2 Data Set
Authors:
Pavlos Protopapas,
Pablo Huijse,
Pablo A. Estevez,
Pablo Zegers,
Jose C. Principe
Abstract:
We present a new method to discriminate periodic from non-periodic irregularly sampled lightcurves. We introduce a periodic kernel and maximize a similarity measure derived from information theory to estimate the periods and a discriminator factor. We tested the method on a dataset containing 100,000 synthetic periodic and non-periodic lightcurves with various periods, amplitudes and shapes genera…
▽ More
We present a new method to discriminate periodic from non-periodic irregularly sampled lightcurves. We introduce a periodic kernel and maximize a similarity measure derived from information theory to estimate the periods and a discriminator factor. We tested the method on a dataset containing 100,000 synthetic periodic and non-periodic lightcurves with various periods, amplitudes and shapes generated using a multivariate generative model. We correctly identified periodic and non-periodic lightcurves with a completeness of 90% and a precision of 95%, for lightcurves with a signal-to-noise ratio (SNR) larger than 0.5. We characterize the efficiency and reliability of the model using these synthetic lightcurves and applied the method on the EROS-2 dataset. A crucial consideration is the speed at which the method can be executed. Using hierarchical search and some simplification on the parameter search we were able to analyze 32.8 million lightcurves in 18 hours on a cluster of GPGPUs. Using the sensitivity analysis on the synthetic dataset, we infer that 0.42% in the LMC and 0.61% in the SMC of the sources show periodic behavior. The training set, the catalogs and source code are all available in http://timemachine.iic.harvard.edu.
△ Less
Submitted 4 December, 2014;
originally announced December 2014.
-
An Information Theoretic Algorithm for Finding Periodicities in Stellar Light Curves
Authors:
Pablo Huijse,
Pablo A. Estevez,
Pavlos Protopapas,
Pablo Zegers,
Jose C. Principe
Abstract:
We propose a new information theoretic metric for finding periodicities in stellar light curves. Light curves are astronomical time series of brightness over time, and are characterized as being noisy and unevenly sampled. The proposed metric combines correntropy (generalized correlation) with a periodic kernel to measure similarity among samples separated by a given period. The new metric provide…
▽ More
We propose a new information theoretic metric for finding periodicities in stellar light curves. Light curves are astronomical time series of brightness over time, and are characterized as being noisy and unevenly sampled. The proposed metric combines correntropy (generalized correlation) with a periodic kernel to measure similarity among samples separated by a given period. The new metric provides a periodogram, called Correntropy Kernelized Periodogram (CKP), whose peaks are associated with the fundamental frequencies present in the data. The CKP does not require any resampling, slotting or folding scheme as it is computed directly from the available samples. CKP is the main part of a fully-automated pipeline for periodic light curve discrimination to be used in astronomical survey databases. We show that the CKP method outperformed the slotted correntropy, and conventional methods used in astronomy for periodicity discrimination and period estimation tasks, using a set of light curves drawn from the MACHO survey. The proposed metric achieved 97.2% of true positives with 0% of false positives at the confidence level of 99% for the periodicity discrimination task; and 88% of hits with 11.6% of multiples and 0.4% of misses in the period estimation task.
△ Less
Submitted 11 December, 2012;
originally announced December 2012.
-
Period Estimation in Astronomical Time Series Using Slotted Correntropy
Authors:
Pablo Huijse,
Pablo A. Estévez,
Pablo Zegers,
José Príncipe,
Pavlos Protopapas
Abstract:
In this letter, we propose a method for period estimation in light curves from periodic variable stars using correntropy. Light curves are astronomical time series of stellar brightness over time, and are characterized as being noisy and unevenly sampled. We propose to use slotted time lags in order to estimate correntropy directly from irregularly sampled time series. A new information theoretic…
▽ More
In this letter, we propose a method for period estimation in light curves from periodic variable stars using correntropy. Light curves are astronomical time series of stellar brightness over time, and are characterized as being noisy and unevenly sampled. We propose to use slotted time lags in order to estimate correntropy directly from irregularly sampled time series. A new information theoretic metric is proposed for discriminating among the peaks of the correntropy spectral density. The slotted correntropy method outperformed slotted correlation, string length, VarTools (Lomb-Scargle periodogram and Analysis of Variance), and SigSpec applications on a set of light curves drawn from the MACHO survey.
△ Less
Submitted 13 December, 2011;
originally announced December 2011.