-
Identifying type II quasars at intermediate redshift with few-shot learning photometric classification
Authors:
P. A. C. Cunha,
A. Humphrey,
J. Brinchmann,
S. G. Morais,
R. Carvajal,
J. M. Gomes,
I. Matute,
A. Paulino-Afonso
Abstract:
We aim to identify QSO2 candidates in the redshift desert using optical and infrared photometry. At this intermediate redshift range, most of the prominent optical emission lines in QSO2 sources (e.g. CIV1549; [OIII]4959,5008) fall either outside the wavelength range of the SDSS optical spectra or in particularly noisy wavelength ranges, making QSO2 identification challenging. Therefore, we adopte…
▽ More
We aim to identify QSO2 candidates in the redshift desert using optical and infrared photometry. At this intermediate redshift range, most of the prominent optical emission lines in QSO2 sources (e.g. CIV1549; [OIII]4959,5008) fall either outside the wavelength range of the SDSS optical spectra or in particularly noisy wavelength ranges, making QSO2 identification challenging. Therefore, we adopted a semi-supervised machine learning approach to select candidates in the SDSS galaxy sample. Recent applications of machine learning in astronomy focus on problems involving large data sets, with small data sets often being overlooked. We developed a few-shot learning approach for the identification and classification of rare-object classes using limited training data (200 sources). The new AMELIA pipeline uses a transfer-learning based approach with decision trees, distance-based, and deep learning methods to build a classifier capable of identifying rare objects on the basis of an observational training data set. We validated the performance of AMELIA by addressing the problem of identifying QSO2s at 1 $\leq$ z $\leq$ 2 using SDSS and WISE photometry, obtaining an F1-score above 0.8 in a supervised approach. We then used AMELIA to select new QSO2 candidates in the redshift desert and examined the nature of the candidates using SDSS spectra, when available. In particular, we identified a sub-population of [NeV]3426 emitters at z $\sim$ 1.1, which are highly likely to contain obscured AGNs. We used X-ray and radio cross-matching to validate our classification and investigated the performance of photometric criteria from the literature showing that our candidates have an inherent dusty nature. Finally, we derived physical properties for our QSO2 sample using photoionisation models and verified the AGN classification using an SED fitting.
△ Less
Submitted 27 May, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Euclid. I. Overview of the Euclid mission
Authors:
Euclid Collaboration,
Y. Mellier,
Abdurro'uf,
J. A. Acevedo Barroso,
A. Achúcarro,
J. Adamek,
R. Adam,
G. E. Addison,
N. Aghanim,
M. Aguena,
V. Ajani,
Y. Akrami,
A. Al-Bahlawan,
A. Alavi,
I. S. Albuquerque,
G. Alestas,
G. Alguero,
A. Allaoui,
S. W. Allen,
V. Allevato,
A. V. Alonso-Tetilla,
B. Altieri,
A. Alvarez-Candal,
A. Amara,
L. Amendola
, et al. (1086 additional authors not shown)
Abstract:
The current standard model of cosmology successfully describes a variety of measurements, but the nature of its main ingredients, dark matter and dark energy, remains unknown. Euclid is a medium-class mission in the Cosmic Vision 2015-2025 programme of the European Space Agency (ESA) that will provide high-resolution optical imaging, as well as near-infrared imaging and spectroscopy, over about 14…
▽ More
The current standard model of cosmology successfully describes a variety of measurements, but the nature of its main ingredients, dark matter and dark energy, remains unknown. Euclid is a medium-class mission in the Cosmic Vision 2015-2025 programme of the European Space Agency (ESA) that will provide high-resolution optical imaging, as well as near-infrared imaging and spectroscopy, over about 14,000 deg^2 of extragalactic sky. In addition to accurate weak lensing and clustering measurements that probe structure formation over half of the age of the Universe, its primary probes for cosmology, these exquisite data will enable a wide range of science. This paper provides a high-level overview of the mission, summarising the survey characteristics, the various data-processing steps, and data products. We also highlight the main science objectives and expected performance.
△ Less
Submitted 22 May, 2024;
originally announced May 2024.
-
Selection of powerful radio galaxies with machine learning
Authors:
R. Carvajal,
I. Matute,
J. Afonso,
R. P. Norris,
K. J. Luken,
P. Sánchez-Sáez,
P. A. C. Cunha,
A. Humphrey,
H. Messias,
S. Amarantidis,
D. Barbosa,
H. A. Cruz,
H. Miranda,
A. Paulino-Afonso,
C. Pappalardo
Abstract:
We developed and trained a pipeline of three machine learning (ML) models than can predict which sources are more likely to be an AGN and to be detected in specific radio surveys. Also, it can estimate redshift values for predicted radio-detectable AGNs. These models, which combine predictions from tree-based and gradient-boosting algorithms, have been trained with multi-wavelength data from near-…
▽ More
We developed and trained a pipeline of three machine learning (ML) models than can predict which sources are more likely to be an AGN and to be detected in specific radio surveys. Also, it can estimate redshift values for predicted radio-detectable AGNs. These models, which combine predictions from tree-based and gradient-boosting algorithms, have been trained with multi-wavelength data from near-infrared-selected sources in the Hobby-Eberly Telescope Dark Energy Experiment (HETDEX) Spring field. Training, testing, calibration, and validation were carried out in the HETDEX field. Further validation was performed on near-infrared-selected sources in the Stripe 82 field. In the HETDEX validation subset, our pipeline recovers 96% of the initially labelled AGNs and, from AGNs candidates, we recover 50% of previously detected radio sources. For Stripe 82, these numbers are 94% and 55%. Compared to random selection, these rates are two and four times better for HETDEX, and 1.2 and 12 times better for Stripe 82. The pipeline can also recover the redshift distribution of these sources with $σ_{\mathrm{NMAD}}$ = 0.07 for HETDEX ($σ_{\mathrm{NMAD}}$ = 0.09 for Stripe 82) and an outlier fraction of 19% (25% for Stripe 82), compatible with previous results based on broad-band photometry. Feature importance analysis stresses the relevance of near- and mid-infrared colours to select AGNs and identify their radio and redshift nature. Combining different algorithms in ML models shows an improvement in the prediction power of our pipeline over a random selection of sources. Tree-based ML models (in contrast to deep learning techniques) facilitate the analysis of the impact that features have on the predictions. This prediction can give insight into the potential physical interplay between the properties of radio AGNs (e.g. mass of black hole and accretion rate).
△ Less
Submitted 1 December, 2023; v1 submitted 20 September, 2023;
originally announced September 2023.
-
Improving machine learning-derived photometric redshifts and physical property estimates using unlabelled observations
Authors:
A. Humphrey,
P. A. C. Cunha,
A. Paulino-Afonso,
S. Amarantidis,
R. Carvajal,
J. M. Gomes,
I. Matute,
P. Papaderos
Abstract:
In the era of huge astronomical surveys, machine learning offers promising solutions for the efficient estimation of galaxy properties. The traditional, `supervised' paradigm for the application of machine learning involves training a model on labelled data, and using this model to predict the labels of previously unlabelled data. The semi-supervised `pseudo-labelling' technique offers an alternat…
▽ More
In the era of huge astronomical surveys, machine learning offers promising solutions for the efficient estimation of galaxy properties. The traditional, `supervised' paradigm for the application of machine learning involves training a model on labelled data, and using this model to predict the labels of previously unlabelled data. The semi-supervised `pseudo-labelling' technique offers an alternative paradigm, allowing the model training algorithm to learn from both labelled data and as-yet unlabelled data. We test the pseudo-labelling method on the problems of estimating redshift, stellar mass, and star formation rate, using COSMOS2015 broad band photometry and one of several publicly available machine learning algorithms, and we obtain significant improvements compared to purely supervised learning. We find that the gradient-boosting tree methods CatBoost, XGBoost, and LightGBM benefit the most, with reductions of up to ~15% in metrics of absolute error. We also find similar improvements in the photometric redshift catastrophic outlier fraction. We argue that the pseudo-labellng technique will be useful for the estimation of redshift and physical properties of galaxies in upcoming large imaging surveys such as Euclid and LSST, which will provide photometric data for billions of sources.
△ Less
Submitted 5 December, 2022;
originally announced December 2022.
-
Machine-learning classification of astronomical sources: estimating F1-score in the absence of ground truth
Authors:
A. Humphrey,
W. Kuberski,
J. Bialek,
N. Perrakis,
W. Cools,
N. Nuyttens,
H. Elakhrass,
P. A. C. Cunha
Abstract:
Machine-learning based classifiers have become indispensable in the field of astrophysics, allowing separation of astronomical sources into various classes, with computational efficiency suitable for application to the enormous data volumes that wide-area surveys now typically produce. In the standard supervised classification paradigm, a model is typically trained and validated using data from re…
▽ More
Machine-learning based classifiers have become indispensable in the field of astrophysics, allowing separation of astronomical sources into various classes, with computational efficiency suitable for application to the enormous data volumes that wide-area surveys now typically produce. In the standard supervised classification paradigm, a model is typically trained and validated using data from relatively small areas of sky, before being used to classify sources in other areas of the sky. However, population shifts between the training examples and the sources to be classified can lead to `silent' degradation in model performance, which can be challenging to identify when the ground-truth is not available. In this Letter, we present a novel methodology using the NannyML Confidence-Based Performance Estimation (CBPE) method to predict classifier F1-score in the presence of population shifts, but without ground-truth labels. We apply CBPE to the selection of quasars with decision-tree ensemble models, using broad-band photometry, and show that the F1-scores are predicted remarkably well (MAPE ~ 10%; R^2 = 0.74-0.92). We discuss potential use-cases in the domain of astronomy, including machine-learning model and/or hyperparameter selection, and evaluation of the suitability of training datasets for a particular classification problem.
△ Less
Submitted 29 September, 2022;
originally announced September 2022.
-
Euclid preparation: XXII. Selection of Quiescent Galaxies from Mock Photometry using Machine Learning
Authors:
Euclid Collaboration,
A. Humphrey,
L. Bisigello,
P. A. C. Cunha,
M. Bolzonella,
S. Fotopoulou,
K. Caputi,
C. Tortora,
G. Zamorani,
P. Papaderos,
D. Vergani,
J. Brinchmann,
M. Moresco,
A. Amara,
N. Auricchio,
M. Baldi,
R. Bender,
D. Bonino,
E. Branchini,
M. Brescia,
S. Camera,
V. Capobianco,
C. Carbone,
J. Carretero,
F. J. Castander
, et al. (184 additional authors not shown)
Abstract:
The Euclid Space Telescope will provide deep imaging at optical and near-infrared wavelengths, along with slitless near-infrared spectroscopy, across ~15,000 sq deg of the sky. Euclid is expected to detect ~12 billion astronomical sources, facilitating new insights into cosmology, galaxy evolution, and various other topics. To optimally exploit the expected very large data set, there is the need t…
▽ More
The Euclid Space Telescope will provide deep imaging at optical and near-infrared wavelengths, along with slitless near-infrared spectroscopy, across ~15,000 sq deg of the sky. Euclid is expected to detect ~12 billion astronomical sources, facilitating new insights into cosmology, galaxy evolution, and various other topics. To optimally exploit the expected very large data set, there is the need to develop appropriate methods and software. Here we present a novel machine-learning based methodology for selection of quiescent galaxies using broad-band Euclid I_E, Y_E, J_E, H_E photometry, in combination with multiwavelength photometry from other surveys. The ARIADNE pipeline uses meta-learning to fuse decision-tree ensembles, nearest-neighbours, and deep-learning methods into a single classifier that yields significantly higher accuracy than any of the individual learning methods separately. The pipeline has `sparsity-awareness', so that missing photometry values are still informative for the classification. Our pipeline derives photometric redshifts for galaxies selected as quiescent, aided by the `pseudo-labelling' semi-supervised method. After application of the outlier filter, our pipeline achieves a normalized mean absolute deviation of ~< 0.03 and a fraction of catastrophic outliers of ~< 0.02 when measured against the COSMOS2015 photometric redshifts. We apply our classification pipeline to mock galaxy photometry catalogues corresponding to three main scenarios: (i) Euclid Deep Survey with ancillary ugriz, WISE, and radio data; (ii) Euclid Wide Survey with ancillary ugriz, WISE, and radio data; (iii) Euclid Wide Survey only. Our classification pipeline outperforms UVJ selection, in addition to the Euclid I_E-Y_E, J_E-H_E and u-I_E,I_E-J_E colour-colour methods, with improvements in completeness and the F1-score of up to a factor of 2. (Abridged)
△ Less
Submitted 5 December, 2022; v1 submitted 26 September, 2022;
originally announced September 2022.
-
Photometric redshift-aided classification using ensemble learning
Authors:
P. A. C. Cunha,
A. Humphrey
Abstract:
We present SHEEP, a new machine learning approach to the classic problem of astronomical source classification, which combines the outputs from the XGBoost, LightGBM, and CatBoost learning algorithms to create stronger classifiers. A novel step in our pipeline is that prior to performing the classification, SHEEP first estimates photometric redshifts, which are then placed into the data set as an…
▽ More
We present SHEEP, a new machine learning approach to the classic problem of astronomical source classification, which combines the outputs from the XGBoost, LightGBM, and CatBoost learning algorithms to create stronger classifiers. A novel step in our pipeline is that prior to performing the classification, SHEEP first estimates photometric redshifts, which are then placed into the data set as an additional feature for classification model training; this results in significant improvements in the subsequent classification performance. SHEEP contains two distinct classification methodologies: (i) Multi-class and (ii) one versus all with correction by a meta-learner. We demonstrate the performance of SHEEP for the classification of stars, galaxies, and quasars using a data set composed of SDSS and WISE photometry of 3.5 million astronomical sources. The resulting F1-scores are as follows: 0.992 for galaxies; 0.967 for quasars; and 0.985 for stars. In terms of the F1-scores for the three classes, SHEEP is found to outperform a recent RandomForest-based classification approach using an essentially identical data set. Our methodology also facilitates model and data set explainability via feature importances; it also allows the selection of sources whose uncertain classifications may make them interesting sources for follow-up observations.
△ Less
Submitted 19 May, 2022; v1 submitted 5 April, 2022;
originally announced April 2022.