-
Distinguishing a planetary transit from false positives: a Transformer-based classification for planetary transit signals
Authors:
Helem Salinas,
Karim Pichara,
Rafael Brahm,
Francisco Pérez-Galarce,
Domingo Mery
Abstract:
Current space-based missions, such as the Transiting Exoplanet Survey Satellite (TESS), provide a large database of light curves that must be analysed efficiently and systematically. In recent years, deep learning (DL) methods, particularly convolutional neural networks (CNN), have been used to classify transit signals of candidate exoplanets automatically. However, CNNs have some drawbacks; for example, they require many layers to capture dependencies in sequential data, such as light curves, making the network so large that it eventually becomes impractical. The self-attention mechanism is a DL technique that attempts to mimic the action of selectively focusing on some relevant things while ignoring others. Models such as the Transformer architecture were recently proposed for sequential data with successful results. Based on these successful models, we present a new architecture for the automatic classification of transit signals. Our proposed architecture is designed to capture the most significant features of a transit signal and stellar parameters through the self-attention mechanism. In addition to model prediction, we take advantage of attention map inspection, obtaining a more interpretable DL approach. Thus, we can identify the relevance of each element to differentiate a transit signal from false positives, simplifying the manual examination of candidates. We show that our architecture achieves results competitive with those of the CNNs applied to recognizing exoplanetary transit signals in data from the TESS telescope. Based on these results, we demonstrate that applying this state-of-the-art DL model to light curves can be a powerful technique for transit signal detection while offering a level of interpretability.
Submitted 27 April, 2023;
originally announced April 2023.
-
Informative regularization for a multi-layer perceptron RR Lyrae classifier under data shift
Authors:
Francisco Pérez-Galarce,
Karim Pichara,
Pablo Huijse,
Márcio Catelan,
Domingo Mery
Abstract:
In recent decades, machine learning has provided valuable models and algorithms for processing and extracting knowledge from time-series surveys. Many classifiers have been proposed and perform to an excellent standard. Nevertheless, few papers have tackled the data shift problem in labeled training sets, which occurs when there is a mismatch between the data distribution in the training set and the testing set. This drawback can damage the prediction performance on unseen data. Consequently, we propose a scalable and easily adaptable approach based on an informative regularization and an ad-hoc training procedure to mitigate the shift problem during the training of a multi-layer perceptron for RR Lyrae classification. We collect ranges for characteristic features to construct a symbolic representation of prior knowledge, which is used to model the informative regularizer component. Simultaneously, we design a two-step back-propagation algorithm to integrate this knowledge into the neural network, whereby one step is applied in each epoch to minimize classification error, while another is applied to ensure regularization. Our algorithm defines a subset of parameters (a mask) for each loss function. This approach handles the forgetting effect, which stems from a trade-off between these loss functions (learning from data versus learning expert knowledge) during training. Experiments were conducted using recently proposed shifted benchmark sets for RR Lyrae stars, outperforming baseline models by up to 3% through a more reliable classifier. Our method provides a new path to incorporate knowledge from characteristic features into artificial neural networks to manage the underlying data shift problem.
Submitted 11 March, 2023;
originally announced March 2023.
-
Error-Aware B-PINNs: Improving Uncertainty Quantification in Bayesian Physics-Informed Neural Networks
Authors:
Olga Graf,
Pablo Flores,
Pavlos Protopapas,
Karim Pichara
Abstract:
Physics-Informed Neural Networks (PINNs) are gaining popularity as a method for solving differential equations. While being more feasible in some contexts than the classical numerical techniques, PINNs still lack credibility. A remedy for that can be found in Uncertainty Quantification (UQ), which is just beginning to emerge in the context of PINNs. Assessing how well the trained PINN complies with the imposed differential equation is the key to tackling uncertainty, yet there is a lack of a comprehensive methodology for this task. We propose a framework for UQ in Bayesian PINNs (B-PINNs) that incorporates the discrepancy between the B-PINN solution and the unknown true solution. We exploit recent results on error bounds for PINNs on linear dynamical systems and demonstrate the predictive uncertainty on a class of linear ODEs.
Submitted 13 December, 2022;
originally announced December 2022.
-
Semi-Supervised Classification and Clustering Analysis for Variable Stars
Authors:
R. Pantoja,
M. Catelan,
K. Pichara,
P. Protopapas
Abstract:
The immense amount of time series data produced by astronomical surveys has called for the use of machine learning algorithms to discover and classify several million celestial sources. In the case of variable stars, supervised learning approaches have become commonplace. However, these approaches need a considerable collection of expert-labeled light curves to achieve adequate performance, which is costly to construct. To solve this problem, we introduce two approaches. First, a semi-supervised hierarchical method, which requires substantially less training data than supervised methods. Second, a clustering analysis procedure that finds groups that may correspond to classes or sub-classes of variable stars. Both methods are primarily supported by dimensionality reduction of the data for visualization and to avoid the curse of dimensionality. We tested our methods with catalogs collected from the OGLE, CSS, and Gaia surveys. The semi-supervised method reaches a performance of around 90% for all three of our selected catalogs of variable stars using only 5% of the data in the training. This method is suitable for classifying the main classes of variable stars when there is only a small amount of training data. Our clustering analysis confirms that most of the clusters found have a purity over 90% with respect to classes and 80% with respect to sub-classes, suggesting that this type of analysis can be used in large-scale variability surveys as an initial step to identify which classes or sub-classes of variable stars are present in the data and/or to build training sets, among many other possible applications.
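As a rough illustration of the purity figures quoted in this abstract, the sketch below computes the purity of one cluster under the common definition: the fraction of members that share the cluster's most frequent label. The abstract does not state which definition the authors use, so this definition, the function name `cluster_purity`, and the example labels are all assumptions.

```python
from collections import Counter


def cluster_purity(labels):
    """Fraction of cluster members carrying the cluster's most common label.

    `labels` holds the known class (or sub-class) of each object assigned
    to a single cluster. This is the usual textbook definition of purity,
    assumed here; it is not taken from the paper.
    """
    if not labels:
        raise ValueError("cluster must contain at least one labeled object")
    # most_common(1) returns [(label, count)] for the majority label
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)


# Hypothetical cluster: nine RR Lyrae stars and one Cepheid
example = ["RRL"] * 9 + ["CEP"]
print(cluster_purity(example))
```

Under this definition, a cluster clears the 90% threshold when at least nine of every ten members belong to the same class.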
Submitted 20 September, 2022;
originally announced September 2022.
-
Uncertainty Quantification in Neural Differential Equations
Authors:
Olga Graf,
Pablo Flores,
Pavlos Protopapas,
Karim Pichara
Abstract:
Uncertainty quantification (UQ) helps to make trustworthy predictions based on collected observations and uncertain domain knowledge. With increased usage of deep learning in various applications, the need for efficient UQ methods that can make deep models more reliable has increased as well. Among applications that can benefit from effective handling of uncertainty are the deep learning based differential equation (DE) solvers. We adapt several state-of-the-art UQ methods to get the predictive uncertainty for DE solutions and show the results on four different DE types.
Submitted 7 November, 2021;
originally announced November 2021.
-
Informative Bayesian model selection for RR Lyrae star classifiers
Authors:
F. Pérez-Galarce,
K. Pichara,
P. Huijse,
M. Catelan,
D. Mery
Abstract:
Machine learning has achieved an important role in the automatic classification of variable stars, and several classifiers have been proposed over the last decade. These classifiers have achieved impressive performance in several astronomical catalogues. However, some scientific articles have also shown that the training data therein contain multiple sources of bias. Hence, the performance of those classifiers on objects not belonging to the training data is uncertain, potentially resulting in the selection of incorrect models. Besides, it gives rise to the deployment of misleading classifiers. An example of the latter is the creation of open-source labelled catalogues with biased predictions. In this paper, we develop a method based on an informative marginal likelihood to evaluate variable star classifiers. We collect deterministic rules that are based on physical descriptors of RR Lyrae stars, and then, to mitigate the biases, we introduce those rules into the marginal likelihood estimation. We perform experiments with a set of Bayesian Logistic Regressions, which are trained to classify RR Lyrae stars, and we find that our method outperforms traditional non-informative cross-validation strategies, even when penalized models are assessed. Our methodology provides a more rigorous alternative to assess machine learning models using astronomical knowledge. From this approach, applications to other classes of variable stars and algorithmic improvements can be developed.
Submitted 24 May, 2021;
originally announced May 2021.
-
The VVV Infrared Variability Catalog (VIVA-I)
Authors:
C. E. Ferreira Lopes,
N. J. G. Cross,
M. Catelan,
D. Minniti,
M. Hempel,
P. W. Lucas,
R. Angeloni,
F. Jablonsky,
V. F. Braga,
I. C. Leao,
F. R. Herpich,
J. Alonso-Garcia,
A. Papageorgiou,
K. Pichara,
R. K. Saito,
A. Bradley,
J. C. Beamin,
C. Cortes,
J. R. De Medeiros,
Christopher M. P. Russell
Abstract:
Thanks to the VISTA Variables in the Via Lactea (VVV) ESO Public Survey it is now possible to explore a large number of objects in those regions. This paper addresses the variability analysis of all VVV point sources having more than 10 observations in VVVDR4 using a novel approach. In total, the near-IR light curves of 288,378,769 sources were analysed using methods developed in the New Insight Into Time Series Analysis project. As a result, we present a complete sample having 44,998,752 variable star candidates (VVV-CVSC), which include accurate individual coordinates, near-IR magnitudes (ZYJHKs), extinctions A(Ks), variability indices, periods, and amplitudes, among other parameters to assess the science. Unfortunately, a side effect of having a highly complete sample is a high level of contamination by non-variables (the contamination ratio of non-variables to variables is slightly over 10:1). To deal with this, we also provide some flags and parameters that the community can use to decrease the number of variable candidates without heavily decreasing the completeness of the sample. In particular, we cross-identified 339,601 of our sources with the Simbad and AAVSO databases, which provide us with information for these objects at other wavelengths. This sub-sample constitutes a unique resource to study the corresponding near-IR variability of known sources as well as to assess the IR variability related to X-ray and Gamma-Ray sources. On the other hand, the other 99.5% of sources in our sample constitute a number of potentially new objects with variability information for the heavily crowded and reddened regions of the Galactic Plane and Bulge. The present results also provide an important queryable resource to perform variability analysis and to characterize ongoing and future surveys like TESS and LSST.
Submitted 11 May, 2020;
originally announced May 2020.
-
Classifying CMB time-ordered data through deep neural networks
Authors:
Felipe Rojas,
Loïc Maurin,
Rolando Dünner,
Karim Pichara
Abstract:
The Cosmic Microwave Background (CMB) has been measured over a wide range of multipoles. Experiments with arc-minute resolution like the Atacama Cosmology Telescope (ACT) have contributed to the measurement of primary and secondary anisotropies, leading to remarkable scientific discoveries. Such findings require careful data selection in order to remove poorly-behaved detectors and unwanted contaminants. The current data classification methodology used by ACT relies on several statistical parameters that are assessed and fine-tuned by an expert. This method is highly time-consuming and band- or season-specific, which makes it less scalable and efficient for future CMB experiments. In this work, we propose a supervised machine learning model to classify detectors of CMB experiments. The model corresponds to a deep convolutional neural network. We tested our method on real ACT data, using the 2008 season, 148 GHz, as the training set with labels provided by the ACT data selection software. The model learns to classify time-streams starting directly from the raw data. For the season and frequency considered during the training, we find that our classifier reaches a precision of 99.8%. For 220 and 280 GHz data, season 2008, we obtained precisions of 99.4% and 97.5%, respectively. Finally, we performed a cross-season test over 148 GHz data from 2009 and 2010, for which our model reaches a precision of 99.8% and 99.5%, respectively. Our model is about 10x faster than the current pipeline, making it potentially suitable for real-time implementations.
Submitted 13 April, 2020;
originally announced April 2020.
-
Scalable End-to-end Recurrent Neural Network for Variable star classification
Authors:
Ignacio Becker,
Karim Pichara,
Márcio Catelan,
Pavlos Protopapas,
Carlos Aguirre,
Fatemeh Nikzat
Abstract:
During the last decade, considerable effort has been made to perform automatic classification of variable stars using machine learning techniques. Traditionally, light curves are represented as a vector of descriptors or features used as input for many algorithms. Some features are computationally expensive and cannot be updated quickly, and hence cannot be applied to large datasets such as the LSST. Previous work has been done to develop alternative unsupervised feature extraction algorithms for light curves, but the cost of doing so still remains high. In this work, we propose an end-to-end algorithm that automatically learns the representation of light curves that allows an accurate automatic classification. We study a series of deep learning architectures based on Recurrent Neural Networks and test them in automated classification scenarios. Our method uses minimal data preprocessing, can be updated with a low computational cost for new observations and light curves, and can scale up to massive datasets. We transform each light curve into an input matrix representation whose elements are the differences in time and magnitude, and the outputs are classification probabilities. We test our method in three surveys: OGLE-III, Gaia and WISE. We obtain accuracies of about 95% in the main classes and 75% in the majority of subclasses. We compare our results with the Random Forest classifier and obtain competitive accuracies while being faster and scalable. The analysis shows that the computational complexity of our approach grows linearly with the light curve size, while the cost of the traditional approach grows as $N\log{(N)}$.
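The input representation mentioned in this abstract (differences in time and magnitude) can be sketched as follows. This is a minimal illustration that assumes the differences are taken between consecutive observations; the abstract does not give the exact layout of the matrix, and the function name `to_difference_matrix` and the sample values are hypothetical.

```python
def to_difference_matrix(times, mags):
    """Turn a light curve into rows of (delta_time, delta_magnitude).

    Assumes differences between consecutive observations; the paper's
    actual matrix layout may differ. A curve with N points yields N - 1
    rows, computed in a single pass over the observations.
    """
    if len(times) != len(mags):
        raise ValueError("times and mags must have the same length")
    points = list(zip(times, mags))
    # Pair each observation with the next one and subtract
    return [(t1 - t0, m1 - m0) for (t0, m0), (t1, m1) in zip(points, points[1:])]


# Hypothetical unevenly sampled light curve (days, magnitudes)
rows = to_difference_matrix([0.0, 1.0, 3.0], [10.0, 10.5, 10.0])
print(rows)
```

Because each row depends only on two neighbouring observations, a new observation appends one row without recomputing the earlier ones.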
Submitted 3 February, 2020;
originally announced February 2020.
-
Streaming Classification of Variable Stars
Authors:
Lukas Zorich,
Karim Pichara,
Pavlos Protopapas
Abstract:
In recent years, automatic classification of variable stars has received substantial attention. Using machine learning techniques for this task has proven to be quite useful. Typically, machine learning classifiers used for this task require a fixed training set, and the training process is performed offline. Upcoming surveys such as the Large Synoptic Survey Telescope (LSST) will generate new observations daily, so an automatic classification system able to create alerts online will be mandatory. A system with those characteristics must be able to update itself incrementally. Unfortunately, after training, most machine learning classifiers do not support the inclusion of new observations in light curves; they need to be re-trained from scratch. Naively re-training from scratch is not an option in streaming settings, mainly because of the expensive pre-processing routines required to obtain a vector representation of light curves (features) each time we include new observations. In this work, we propose a streaming probabilistic classification model; it uses a set of newly designed features that work incrementally. With this model, we can have a machine learning classifier that updates itself in real time with new observations. To test our approach, we simulate a streaming scenario with light curves from the CoRot, OGLE and MACHO catalogs. Results show that our model achieves high classification performance while being an order of magnitude faster than traditional classification approaches.
Submitted 4 December, 2019;
originally announced December 2019.
-
An Information Theory Approach on Deciding Spectroscopic Follow Ups
Authors:
Javiera Astudillo,
Pavlos Protopapas,
Karim Pichara,
Pablo Huijse
Abstract:
Classification and characterization of variable phenomena and transient phenomena are critical for astrophysics and cosmology. These objects are commonly studied using photometric time series or spectroscopic data. Given that many ongoing and future surveys are in the time domain, and given that adding spectra provides further insights but requires more observational resources, it would be valuable to know which objects we should prioritize for obtaining a spectrum in addition to the time series. We propose a methodology in a probabilistic setting that determines a priori which objects are worth taking a spectrum of to obtain better insights, where we define 'insight' as the type of the object (classification). Objects whose spectra we query are reclassified using their full spectrum information. We first train two classifiers, one that uses photometric data and another that uses photometric and spectroscopic data together. Then, for each photometric object, we estimate the probability of each possible spectrum outcome. We combine these models in various probabilistic frameworks (strategies) which are used to guide the selection of follow-up observations. The best strategy depends on the intended use, whether it is getting more confidence or accuracy. For a given number of candidate objects (127, equal to 5% of the dataset) for taking spectra, we improve class prediction accuracy by 37%, as opposed to 20% for the best non-naive (non-random) baseline strategy. Our approach provides a general framework for follow-up strategies and can be extended beyond classification and to include other forms of follow-ups beyond spectroscopy.
Submitted 6 November, 2019;
originally announced November 2019.
-
An Algorithm for the Visualization of Relevant Patterns in Astronomical Light Curves
Authors:
Christian Pieringer,
Karim Pichara,
Márcio Catelán,
Pavlos Protopapas
Abstract:
In recent years, the classification of variable stars with Machine Learning has become a mainstream area of research. Recently, visualization of time series has been attracting more attention in data science as a tool to help scientists visually recognize significant patterns in complex dynamics. Within the Machine Learning literature, dictionary-based methods have been widely used to encode relevant parts of image data. These methods intrinsically assign a degree of importance to patches in pictures, according to their contribution to the image reconstruction. Inspired by dictionary-based techniques, we present an approach that naturally provides the visualization of salient parts in astronomical light curves, making the analogy between image patches and relevant pieces in time series. Our approach encodes the most meaningful patterns such that we can approximately reconstruct light curves by just using the encoded information. We test our method on light curves from the OGLE-III and StarLight databases. Our results show that the proposed model delivers an automatic and intuitive visualization of relevant light curve parts, such as local peaks and drops in magnitude.
Submitted 7 March, 2019;
originally announced March 2019.
-
A Full Probabilistic Model for Yes/No Type Crowdsourcing in Multi-Class Classification
Authors:
Belen Saldias,
Pavlos Protopapas,
Karim Pichara
Abstract:
Crowdsourcing has become widely used in supervised scenarios where training sets are scarce and difficult to obtain. Most crowdsourcing models in the literature assume labelers can provide answers to full questions. In classification contexts, full questions require a labeler to discern among all possible classes. Unfortunately, discernment is not always easy in realistic scenarios. Labelers may not be experts in differentiating all classes. In this work, we provide a full probabilistic model for a shorter type of queries. Our shorter queries only require "yes" or "no" responses. Our model estimates a joint posterior distribution of matrices related to labelers' confusions and the posterior probability of the class of every object. We developed an approximate inference approach, using Monte Carlo Sampling and Black Box Variational Inference, which provides the derivation of the necessary gradients. We built two realistic crowdsourcing scenarios to test our model. The first scenario queries for irregular astronomical time-series. The second scenario relies on the image classification of animals. We achieved results that are comparable with those of full query crowdsourcing. Furthermore, we show that modeling labelers' failures plays an important role in estimating true classes. Finally, we provide the community with two real datasets obtained from our crowdsourcing experiments. All our code is publicly available.
Submitted 13 August, 2019; v1 submitted 2 January, 2019;
originally announced January 2019.
-
Deep multi-survey classification of variable stars
Authors:
Carlos Aguirre,
Karim Pichara,
Ignacio Becker
Abstract:
During the last decade, a considerable amount of effort has been made to classify variable stars using different machine learning techniques. Typically, light curves are represented as vectors of statistical descriptors or features that are used to train various algorithms. These features demand substantial computation that can take from hours to days, making it impossible to create scalable and efficient ways of automatically classifying variable stars. Also, light curves from different surveys cannot be integrated and analyzed together when using features, because of observational differences. For example, with variations in cadence and filters, feature distributions become biased and require expensive data-calibration models. The vast amount of data that will be generated soon makes it necessary to develop scalable machine learning architectures without expensive integration techniques. Convolutional Neural Networks have shown impressive results in raw image classification and representation within the machine learning literature. In this work, we present a novel Deep Learning model for light curve classification, mainly based on convolutional units. Our architecture receives as input the differences between time and magnitude of light curves. It captures the essential classification patterns regardless of cadence and filter. In addition, we introduce a novel data augmentation schema for unevenly sampled time series. We test our method using three different surveys: OGLE-III, Corot, and VVV, which differ in filters, cadence, and area of the sky. We show that besides the benefit of scalability, our model obtains state-of-the-art accuracy in light curve classification benchmarks.
Submitted 21 October, 2018;
originally announced October 2018.
-
New variable Stars from the Photographic Archive: Semi-automated Discoveries, Attempts of Automatic Classification, and the New Field 104 Her
Authors:
S. V. Antipin,
I. Becker,
A. A. Belinski,
D. M. Kolesnikova,
K. Pichara,
N. N. Samus,
K. V. Sokolovsky,
A. V. Zharova,
A. M. Zubareva
Abstract:
Using 172 plates taken with the 40-cm astrograph of the Sternberg Astronomical Institute (Lomonosov Moscow University) in 1976-1994 and digitized at a resolution of 2400 dpi, we discovered and studied 275 new variable stars. We present the list of our new variables with all necessary information concerning their brightness variations. As in our earlier studies, the new discoveries show a rather large number of high-amplitude Delta Scuti variables, predicting that many stars of this type remain undetected across the whole sky. We also performed automated classification of the newly discovered variable stars based on the Random Forest algorithm. The results of the automated classification were compared to traditional classification and showed that automated classification was possible even with noisy photographic data. However, further improvement of automated techniques is needed, which is especially important given the very large numbers of new discoveries expected from all-sky surveys.
Submitted 7 February, 2018;
originally announced February 2018.
-
Automatic Survey-Invariant Variable Star Classification
Authors:
Patricio Benavente,
Pavlos Protopapas,
Karim Pichara
Abstract:
Machine learning techniques have been successfully used to classify variable stars on widely-studied astronomical surveys. These datasets have been available to astronomers long enough to allow deep analysis of many variable sources and the generation of useful catalogs of identified variable stars. The products of these studies are labeled data that enable supervised learning models to be trained successfully. However, when these models are blindly applied to data from new sky surveys their performance drops significantly. Furthermore, unlabeled data becomes available at a much higher rate than its labeled counterpart, since labeling is a manual and time-consuming effort. Domain adaptation techniques aim to learn from a domain where labeled data is available, the source domain, and through some adaptation perform well on a different domain, the target domain. We propose a fully probabilistic model that represents the joint distribution of features from two surveys as well as a probabilistic transformation of the features from one survey to the other. This allows us to transfer labeled data to a study where it is not available and to effectively run a variable star classification model in a new survey. Our model represents the features of each domain as a Gaussian mixture and models the transformation as a translation, rotation and scaling of each separate component. We perform tests using three different variability catalogs: EROS, MACHO, and HiTS, which differ in the number of observations per star, cadence, observational time and optical bands observed, among others.
Submitted 29 January, 2018;
originally announced January 2018.
-
Uncertain classification of Variable Stars: handling observational GAPS and noise
Authors:
Nicolas Castro,
Pavlos Protopapas,
Karim Pichara
Abstract:
Automatic classification methods applied to sky surveys have revolutionized the astronomical target selection process. Most surveys generate a vast amount of time series, or "lightcurves", that represent the brightness variability of stellar objects in time. Unfortunately, lightcurve observations take several years to complete, producing truncated time series to which automatic classifiers are generally not applied until the observations are finished. This happens because state-of-the-art methods rely on a variety of statistical descriptors or features that present an increasing degree of dispersion when the number of observations decreases, which reduces their precision. In this paper we propose a novel method that increases the performance of automatic classifiers of variable stars by incorporating the deviations that scarcity of observations produces. Our method uses Gaussian Process Regression to form a probabilistic model of each lightcurve's observations. Then, based on this model, bootstrapped samples of the time series features are generated. Finally, a bagging approach is used to improve the overall performance of the classification. We perform tests on the MACHO and OGLE catalogs; the results show that our method effectively classifies some variability classes using a small fraction of the original observations. For example, we found that RR Lyrae stars can be classified with around 80% accuracy just by observing the first 5% of each lightcurve's observations in the MACHO and OGLE catalogs. We believe these results prove that, when studying lightcurves, it is important to consider the features' error and how the measurement process impacts it.
Submitted 29 January, 2018;
originally announced January 2018.
-
Unsupervised Classification of Variable Stars
Authors:
Lucas Valenzuela,
Karim Pichara
Abstract:
During the last ten years, a considerable amount of effort has been made to develop algorithms for automatic classification of variable stars. That has been primarily achieved by applying machine learning methods to photometric datasets where objects are represented as light curves. Classifiers require training sets to learn the underlying patterns that allow the separation among classes. Unfortunately, building training sets is an expensive process that demands considerable human effort. Whenever data come from a new survey, the only available training instances are those that cross-match with previously labelled objects, which yields training sets that are insufficient compared with the large number of unlabelled sources. In this work, we present an algorithm that performs unsupervised classification of variable stars, relying only on the similarity among light curves. We tackle the unsupervised classification problem with an unconventional approach. Instead of trying to match classes of stars with clusters found by a clustering algorithm, we propose a query-based method where astronomers can find groups of variable stars ranked by similarity. We also develop a fast similarity function specific to light curves, based on a novel data structure that allows scaling the search over the entire dataset of unlabelled objects. Experiments show that our unsupervised model achieves high accuracy in the classification of different types of variable stars and that the proposed algorithm scales up to massive amounts of light curves.
Submitted 29 January, 2018;
originally announced January 2018.
-
Clustering Based Feature Learning on Variable Stars
Authors:
Cristóbal Mackenzie,
Karim Pichara,
Pavlos Protopapas
Abstract:
The success of automatic classification of variable stars strongly depends on the lightcurve representation. Usually, lightcurves are represented as a vector of many statistical descriptors designed by astronomers called features. These descriptors commonly demand significant computational power to calculate, require substantial research effort to develop and do not guarantee good performance on the final classification task. Today, lightcurve representation is not entirely automatic; algorithms that extract lightcurve features are designed by humans and must be manually tuned for every survey. The vast amounts of data that will be generated in future surveys like LSST mean astronomers must develop analysis pipelines that are both scalable and automated. Recently, substantial efforts have been made in the machine learning community to develop methods that replace expert-designed and manually tuned features with features that are automatically learned from data. In this work we present what is, to our knowledge, the first unsupervised feature learning algorithm designed for variable stars. Our method first extracts a large number of lightcurve subsequences from a given set of photometric data, which are then clustered to find common local patterns in the time series. Representatives of these patterns, called exemplars, are then used to transform lightcurves of a labeled set into a new representation that can then be used to train an automatic classifier. The proposed algorithm learns the features from both labeled and unlabeled lightcurves, overcoming the bias generated when the learning process is done only with labeled data. We test our method on MACHO and OGLE datasets; the results show that the classification performance we achieve is as good as, and in some cases better than, the performance achieved using traditional features, while the computational cost is significantly lower.
Submitted 29 February, 2016;
originally announced February 2016.
-
Meta Classification for Variable Stars
Authors:
Karim Pichara,
Pavlos Protopapas,
Daniel León
Abstract:
The need for the development of automatic tools to explore astronomical databases has been recognized since the inception of CCDs and modern computers. Astronomers have already developed solutions to tackle several science problems, such as automatic classification of stellar objects, outlier detection, and globular cluster identification, among others. New science problems emerge and it is critical to be able to re-use the models learned before, without rebuilding everything from the beginning when the science problem changes. In this paper, we propose a new meta-model that automatically integrates existing classification models of variable stars. The proposed meta-model incorporates existing models that are trained in a different context, answering different questions and using different representations of data. Conventional mixture-of-experts algorithms in the machine learning literature cannot be used since each expert (model) uses different inputs. We also account for the computational complexity of the model by using the most expensive models only when necessary. We test our model with EROS-2 and MACHO datasets, and we show that we solve most of the classification challenges only by training a meta-model to learn how to integrate the previous experts.
Submitted 12 January, 2016;
originally announced January 2016.
-
FATS: Feature Analysis for Time Series
Authors:
Isadora Nun,
Pavlos Protopapas,
Brandon Sim,
Ming Zhu,
Rahul Dave,
Nicolas Castro,
Karim Pichara
Abstract:
In this paper, we present the FATS (Feature Analysis for Time Series) library. FATS is a Python library which facilitates and standardizes feature extraction for time series data. In particular, we focus on one application: feature extraction for astronomical light curve data, although the library is generalizable for other uses. We detail the methods and features implemented for light curve analysis, and present examples for its usage.
Submitted 31 August, 2015; v1 submitted 29 May, 2015;
originally announced June 2015.
-
Photometric Classification of quasars from RCS-2 using Random Forest
Authors:
D. Carrasco,
L. F. Barrientos,
K. Pichara,
T. Anguita,
D. N. A. Murphy,
D. G. Gilbank,
M. D. Gladders,
H. K. C. Yee,
B. C. Hsieh,
S. López
Abstract:
Aims. Construction of a new quasar candidate catalog from the Red-Sequence Cluster Survey 2 (RCS-2), identified solely from photometric information using an automated algorithm suitable for large surveys. The algorithm performance is tested using a well-defined SDSS spectroscopic sample of quasars and stars. Methods. The Random Forest algorithm constructs the catalog from RCS-2 point sources using SDSS spectroscopically-confirmed stars and quasars. The algorithm identifies putative quasars from broadband magnitudes (g, r, i, z) and colours. Exploiting NUV GALEX measurements for a subset of the objects, we refine the classifier by adding new information. An additional subset of the data with WISE W1 and W2 bands is also studied. Results. Upon analyzing 542,897 RCS-2 point sources, the algorithm identified 21,501 quasar candidates, with a training-set-derived precision (the fraction of true positives within the group assigned quasar status) of 89.5% and recall (the fraction of true positives relative to all sources that actually are quasars) of 88.4%. These performance metrics improve for the GALEX subset; 6,530 quasar candidates are identified from 16,898 sources, with a precision and recall respectively of 97.0% and 97.5%. Algorithm performance is further improved when WISE data are included, with precision and recall increasing to 99.3% and 99.1% respectively for 21,834 quasar candidates from 242,902 sources. We compile our final catalog (38,257) by merging these samples and removing duplicates. An observational follow up of 17 bright (r < 19) candidates with long-slit spectroscopy at DuPont telescope (LCO) yields 14 confirmed quasars. Conclusions. The results signal encouraging progress in the classification of point sources with Random Forest algorithms to search for quasars within current and future large-area photometric surveys.
Submitted 24 August, 2015; v1 submitted 21 May, 2014;
originally announced May 2014.
-
The VVV Templates Project. Towards an Automated Classification of VVV Light-Curves. I. Building a database of stellar variability in the near-infrared
Authors:
R. Angeloni,
R. Contreras Ramos,
M. Catelan,
I. Dékány,
F. Gran,
J. Alonso-García,
M. Hempel,
C. Navarrete,
H. Andrews,
A. Aparicio,
J. C. Beamín,
C. Berger,
J. Borissova,
C. Contreras Peña,
A. Cunial,
R. de Grijs,
N. Espinoza,
S. Eyheramendy,
C. E. Ferreira Lopes,
M. Fiaschi,
G. Hajdu,
J. Han,
K. G. Hełminiak,
A. Hempel,
S. L. Hidalgo
, et al. (28 additional authors not shown)
Abstract:
Context. The Vista Variables in the Vía Láctea (VVV) ESO Public Survey is a variability survey of the Milky Way bulge and an adjacent section of the disk carried out from 2010 on ESO Visible and Infrared Survey Telescope for Astronomy (VISTA). VVV will eventually deliver a deep near-IR atlas with photometry and positions in five passbands (ZYJHK_S) and a catalogue of 1-10 million variable point sources - mostly unknown - which require classifications. Aims. The main goal of the VVV Templates Project, that we introduce in this work, is to develop and test the machine-learning algorithms for the automated classification of the VVV light-curves. As VVV is the first massive, multi-epoch survey of stellar variability in the near-infrared, the template light-curves that are required for training the classification algorithms are not available. In the first paper of the series we describe the construction of this comprehensive database of infrared stellar variability. Methods. First we performed a systematic search in the literature and public data archives, second, we coordinated a worldwide observational campaign, and third we exploited the VVV variability database itself on (optically) well-known stars to gather high-quality infrared light-curves of several hundreds of variable stars. Results. We have now collected a significant (and still increasing) number of infrared template light-curves. This database will be used as a training-set for the machine-learning algorithms that will automatically classify the light-curves produced by VVV. The results of such an automated classification will be covered in forthcoming papers of the series.
Submitted 3 June, 2014; v1 submitted 18 May, 2014;
originally announced May 2014.
-
Supervised detection of anomalous light-curves in massive astronomical catalogs
Authors:
Isadora Nun,
Karim Pichara,
Pavlos Protopapas,
Dae-Won Kim
Abstract:
The development of synoptic sky surveys has led to a massive amount of data for which resources needed for analysis are beyond human capabilities. To process this information and to extract all possible knowledge, machine learning techniques become necessary. Here we present a new method to automatically discover unknown variable objects in large astronomical catalogs. With the aim of taking full advantage of all the information we have about known objects, our method is based on a supervised algorithm. In particular, we train a random forest classifier using known variability classes of objects and obtain votes for each of the objects in the training set. We then model this voting distribution with a Bayesian network and obtain the joint voting distribution among the training objects. Consequently, an unknown object is considered an outlier insofar as it has a low joint probability. Our method is suitable for exploring massive datasets given that the training process is performed offline. We tested our algorithm on 20 million light-curves from the MACHO catalog and generated a list of anomalous candidates. We divided the candidates into two main classes of outliers: artifacts and intrinsic outliers. Artifacts were principally due to air mass variation, seasonal variation, bad calibration or instrumental errors and were consequently removed from our outlier list and added to the training set. After retraining, we selected about 4000 objects, which we passed to a post-analysis stage by performing a cross-match with all publicly available catalogs. Within these candidates we identified certain known but rare objects such as eclipsing Cepheids, blue variables, cataclysmic variables and X-ray sources. For some outliers there was no additional information. Among them we identified three unknown variability types and a few individual outliers that will be followed up for a deeper analysis.
Submitted 27 May, 2015; v1 submitted 18 April, 2014;
originally announced April 2014.
-
Automatic Classification of Variable Stars in Catalogs with missing data
Authors:
Karim Pichara,
Pavlos Protopapas
Abstract:
We present an automatic classification method for astronomical catalogs with missing data. We use Bayesian networks, a type of probabilistic graphical model, which allow us to perform inference to predict missing values given observed data and dependency relationships between variables. To learn a Bayesian network from incomplete data, we use an iterative algorithm that utilises sampling methods and expectation maximization to estimate the distributions and probabilistic dependencies of variables from data with missing values. To test our model we use three catalogs with missing data (SAGE, 2MASS and UBVI) and one complete catalog (MACHO). We examine how classification accuracy changes when information from missing data catalogs is included, how our method compares to traditional missing data approaches and at what computational cost. Integrating these catalogs with missing data, we find that classification of variable objects improves by a few percent and by 15% for quasar detection while keeping the computational cost the same.
Submitted 29 October, 2013;
originally announced October 2013.
-
Stellar Variability in the VVV survey
Authors:
M. Catelan,
D. Minniti,
P. W. Lucas,
I. Dékány,
R. K. Saito,
R. Angeloni,
J. Alonso-García,
M. Hempel,
K. Helminiak,
A. Jordán,
R. Contreras Ramos,
C. Navarrete,
J. C. Beamín,
A. F. Rojas,
F. Gran,
C. E. Ferreira Lopes,
C. Contreras Peña,
E. Kerins,
L. Huckvale,
M. Rejkuba,
R. Cohen,
F. Mauro,
J. Borissova,
P. Amigo,
S. Eyheramendy
, et al. (12 additional authors not shown)
Abstract:
The Vista Variables in the Vía Láctea (VVV) ESO Public Survey is an ongoing time-series, near-infrared (IR) survey of the Galactic bulge and an adjacent portion of the inner disk, covering 562 square degrees of the sky, using ESO's VISTA telescope. The survey has provided superb multi-color photometry in 5 broadband filters ($Z$, $Y$, $J$, $H$, and $K_s$), leading to the best map of the inner Milky Way ever obtained, particularly in the near-IR. The main variability part of the survey, which is focused on $K_s$-band observations, is currently underway, with bulge fields having been observed between 31 and 70 times, and disk fields between 17 and 36 times. When the survey is complete, bulge (disk) fields will have been observed up to a total of 100 (60) times, providing unprecedented depth and time coverage. Here we provide a first overview of stellar variability in the VVV data, including examples of the light curves that have been collected thus far, scientific applications, and our efforts towards the automated classification of VVV light curves.
Submitted 4 November, 2013; v1 submitted 7 October, 2013;
originally announced October 2013.
-
An improved quasar detection method in EROS-2 and MACHO LMC datasets
Authors:
Karim Pichara,
Pavlos Protopapas,
Dae-Won Kim,
Jean-Baptiste Marquette,
Patrick Tisserand
Abstract:
We present a new classification method for quasar identification in the EROS-2 and MACHO datasets based on a boosted version of the Random Forest classifier. We use a set of variability features including parameters of a continuous autoregressive model. We show that continuous autoregressive parameters are very important discriminators in the classification process. We create two training sets (one for EROS-2 and one for MACHO datasets) using known quasars found in the LMC. Our model achieves about 90% precision and 86% recall on both the EROS-2 and MACHO training sets, improving on the accuracy of state-of-the-art models in quasar detection. We apply the model to the complete EROS-2 and MACHO LMC datasets, comprising 28 million objects, finding 1160 and 2551 candidates respectively. To further validate our list of candidates, we crossmatched it with a previous list of 663 known strong candidates, obtaining 74% of matches for MACHO and 40% for EROS-2. The difference in matching level arises mainly because EROS-2 is a slightly shallower survey, which translates to significantly lower signal-to-noise ratio lightcurves.
Submitted 1 April, 2013;
originally announced April 2013.
-
The Vista Variables in the Via Lactea (VVV) ESO Public Survey: Current Status and First Results
Authors:
M. Catelan,
D. Minniti,
P. W. Lucas,
J. Alonso-Garcia,
R. Angeloni,
J. C. Beamin,
C. Bonatto,
J. Borissova,
C. Contreras,
N. Cross,
I. Dekany,
J. P. Emerson,
S. Eyheramendy,
D. Geisler,
E. Gonzalez-Solares,
K. G. Helminiak,
M. Hempel,
M. J. Irwin,
V. D. Ivanov,
A. Jordan,
E. Kerins,
R. Kurtev,
F. Mauro,
C. Moni Bidin,
C. Navarrete
, et al. (7 additional authors not shown)
Abstract:
Vista Variables in the Via Lactea (VVV) is an ESO Public Survey that is performing a variability survey of the Galactic bulge and part of the inner disk using ESO's Visible and Infrared Survey Telescope for Astronomy (VISTA). The survey covers 520 deg^2 of sky area in the ZYJHK_S filters, for a total observing time of 1929 hours, including ~ 10^9 point sources and an estimated ~ 10^6 variable stars. Here we describe the current status of the VVV Survey, in addition to a variety of new results based on VVV data, including light curves for variable stars, newly discovered globular clusters, open clusters, and associations. A set of reddening-free indices based on the ZYJHK_S system is also introduced. Finally, we provide an overview of the VVV Templates Project, whose main goal is to derive well-defined light curve templates in the near-IR, for the automated classification of VVV light curves.
Submitted 7 June, 2011; v1 submitted 5 May, 2011;
originally announced May 2011.