XAMI - A Benchmark Dataset for Artefact Detection in XMM-Newton Optical Images

Elisabeta-Iulia Dima (corresponding author; email: [email protected]), Department of Computers and Information Technology, Politehnica University of Timişoara, Blvd. V. Pârvan, No. 2, 300223 Timişoara, Romania
Pablo Gómez, European Space Agency (ESA), European Space Astronomy Centre (ESAC), Camino Bajo del Castillo s/n, 28692 Villanueva de la Cañada, Madrid, Spain
Sandor Kruk, European Space Agency (ESA), European Space Astronomy Centre (ESAC), Camino Bajo del Castillo s/n, 28692 Villanueva de la Cañada, Madrid, Spain
Peter Kretschmar, European Space Agency (ESA), European Space Astronomy Centre (ESAC), Camino Bajo del Castillo s/n, 28692 Villanueva de la Cañada, Madrid, Spain
Simon Rosen, Serco Ltd., ESAC, Camino Bajo del Castillo s/n, 28692 Villanueva de la Cañada, Madrid, Spain
Călin-Adrian Popa, Department of Computers and Information Technology, Politehnica University of Timişoara, Blvd. V. Pârvan, No. 2, 300223 Timişoara, Romania
Abstract

Reflected or scattered light produces artefacts in astronomical observations that can negatively impact scientific analysis. Hence, automated detection of these artefacts is highly beneficial, especially with the increasing amounts of data gathered. Machine learning methods are well-suited to this problem, but there is currently a lack of annotated data for training such approaches to detect artefacts in astronomical observations. In this work, we present a dataset of images from the XMM-Newton space telescope Optical Monitoring camera showing different types of artefacts. We hand-annotated a sample of 1000 images with artefacts, which we use to train automated ML methods. We further demonstrate techniques tailored for accurate detection and masking of artefacts using instance segmentation. We adopt a hybrid approach that combines convolutional neural networks (CNNs) with transformer-based models to exploit the strengths of each in segmentation.

The presented method and dataset will advance artefact detection in astronomical observations by providing a reproducible baseline. All code and data are made publicly available (https://github.com/ESA-Datalabs/XAMI-model, https://github.com/ESA-Datalabs/XAMI-dataset).


1 Introduction

Astronomical surveys and space missions (e.g., LSST [Ivezi__2019] and the European Space Agency’s Euclid mission [laureijs2009euclid]) will enhance our understanding of the cosmos by delivering unprecedented images, measurements and insights into billions of stars and galaxies, the expansion of the Universe, dark energy and dark matter. Such surveys will produce enormous amounts of data daily, so processing and analysing the large image volumes produced by space missions effectively requires automated methods. The presence of artefacts such as ghost reflections, star loops and read-out streaks (Figure 1) poses challenges, potentially leading to false detections or affecting the photometric measurements of genuine sources.

Figure 1: Examples of artefacts in various space missions. (upper left) An optical ghost detected in Euclid’s First Light near-infrared images. (upper right) Ghost rays and stray light patterns present in NuSTAR mission. (bottom left) Star loops and dragon’s breath artefacts appearing in the Hubble Space Telescope images. (bottom right) Star loops and streaks present in the XMM-Newton Optical Monitor.

XMM-Newton Optical Monitor. ESA’s X-ray Multi-Mirror Mission (XMM-Newton) [xmm_newton_2000, Schartel_2022] is an orbiting observatory whose principal goal is to conduct detailed X-ray spectroscopy of various celestial objects. The XMM-Newton Optical Monitor (XMM-OM) [Mason_2001, Cordova1989, Lumb1991] extends the simultaneous observational capability of the three main X-ray telescopes into the ultraviolet and optical bands. The XMM-OM source catalogue is a valuable resource containing approximately 9 million detections of around 6 million distinct sources. It plays a pivotal role in individual object analyses [Soria_2001, refId0, 10.1111/j.1365-2966.2004.07660.x] and contributes significantly to survey science. However, source detection in the XMM-OM data analysis process would benefit significantly from improved artefact recognition.

Current non-AI approaches to detecting artefacts [Mukhin_2023, article_nustar_straycats, DESAI201667] often struggle due to their reliance on generalised physical models. These models, while broadly applicable, fail to address specific scenarios effectively, leading to limitations in their practical utility.

AI methods based on CNN and Vision Transformer (ViT) models have achieved notable success and have benefited real-world applications in tasks such as object detection [wang2022yolov7, 10.1007/978-3-031-20053-3_27, maaz2022classagnostic, zong2023detrs] and segmentation [srivastava2023omnivec, wang2022image, hümmer2023vltseg, https://doi.org/10.48550/arxiv.2401.15741, fang2022eva, wang2023internimage, liu2021swin, he2018mask, rs13234779]. Instance segmentation techniques for astronomical sources have progressed significantly [10.1093/mnras/stad2785, Sortino_2023, hausen2022partialattribution], yet there has been limited focus on artefact detection [tanoglidis2021deepghostbusters]. ViT models are increasingly preferred in computer vision due to their self-attention mechanisms. The Segment Anything Model (SAM) [kirillov2023segment], a ViT-based architecture, excels in class-agnostic instance segmentation and zero-shot learning, allowing it to identify objects not seen during training.

We introduce XAMI (XMM-Newton optical Artefact Mapping for astronomical Instance segmentation), a hybrid CNN and ViT-based model, and XAMI-Dataset, a high-precision instance segmentation dataset for astronomical images. Together, they provide a first baseline demonstrating ML-based artefact detection on astronomical images, as well as a benchmark and starting point for other researchers to build on.

2 Methods

2.1 Dataset

We use 1000 single-channel images at various wavelengths (see Table 1 and [xmmom_filters_handbook]) from the XMM-OM as the baseline artefact dataset. Each image comprises a stack of all available windows in a given filter of an observation that, together, cover the full $17^{\prime} \times 17^{\prime}$ field of view. This corresponds to a full frame of $2048 \times 2048$ px, with an effective resolution of $0.477^{\prime\prime}$/pixel. We rebinned the full-frame images to $512 \times 512$ px for computational efficiency. We normalised the images using the ZScaleInterval algorithm and enhanced them with Asinh stretching to increase dynamic range without negatively affecting contrast.
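The preprocessing steps can be illustrated with a minimal sketch. It is not the pipeline used for the dataset: the function names (`rebin`, `asinh_stretch`, `normalise_and_stretch`) are hypothetical, block averaging is assumed for the rebinning, a plain min-max normalisation stands in for the ZScale limits, and the softening parameter `a = 0.1` is an assumed value.

```python
import math


def rebin(image, factor):
    """Block-average a square 2D list by an integer factor (assumed rebinning scheme)."""
    size = len(image)
    if size % factor != 0:
        raise ValueError("image size must be divisible by factor")
    out_size = size // factor
    out = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            # Collect the factor x factor block that maps onto output pixel (r, c).
            block = [
                image[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def asinh_stretch(value, a=0.1):
    """Asinh stretch of a value already normalised to [0, 1]."""
    return math.asinh(value / a) / math.asinh(1.0 / a)


def normalise_and_stretch(image, a=0.1):
    """Min-max normalise (a stand-in for ZScale limits), then apply the asinh stretch."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant image
    return [[asinh_stretch((v - lo) / span, a) for v in row] for row in image]


# Toy 8x8 frame standing in for a 2048x2048 full-frame image.
frame = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
small = rebin(frame, 4)  # 8 -> 2 pixels per side, as 2048 -> 512 uses a factor of 4
stretched = normalise_and_stretch(small)
```

The asinh stretch is close to linear for faint pixels and logarithmic for bright ones, which is why it suits frames that mix faint scattered light with saturated stars.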

The XAMI dataset consists of 7021 annotated artefacts which can be divided into the following categories (Figure 3):

  1. Read-Out-Streaks (ROS) - arising from shutterless camera and continuous Charge-Coupled Device (CCD) photon recording during readout.

  2. Smoke rings (SR) - resulting from internal reflections of starlight within the detector.

  3. Central ring (CR) - appearing in the centre of the detector, approximately $2^{\prime}$ in diameter, resulting from background light scattering from a chamfer on the detector window mounting ring.

  4. Star loops (SL) - elongated scattered light features caused by light from bright stars within a $12^{\prime}$–$15^{\prime}$ off-axis range, scattered from the chamfer.

  5. Other - other types of artefacts which usually represent scattered light spread over large areas.

Filter     λ (nm)  Width (nm)  # images  # masks
V          543     70          102       880
B          450     105         116       1259
U          344     84          193       1837
UVW1(L)    291     83          403       2127
UVM2(M)    231     48          175       681
UVW2(S)    212     50          63        226
White(W)   406     347         3         11
Table 1: Dataset information per observing filter, together with their central wavelength and width (nm).
Figure 2: Artefacts appearing in the XMM-OM observation S0148740701 of the QSO 1939+7000 field (U filter).
Class Train Validation
CR 500 (9.43%) 168 (9.75%)
SR 1267 (23.91%) 402 (23.33%)
SL 1377 (25.99%) 467 (27.10%)
ROS 2122 (40.05%) 677 (39.29%)
Other 32 (0.60%) 9 (0.52%)
Table 2: Dataset distribution across splits, given class labels.
Figure 3: Distribution of annotation bounding boxes across different classes in the XAMI dataset.

We use the stratified k-fold technique to maintain consistent class proportions across dataset splits, thus ensuring accurate performance estimation. Resulting class distributions can be seen in Table 2.
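As a quick consistency check on the stratification, the class proportions can be recomputed from the counts in Table 2. The sketch below uses only those counts; the helper name `proportions` is hypothetical.

```python
# Annotation counts per class, copied from Table 2.
train = {"CR": 500, "SR": 1267, "SL": 1377, "ROS": 2122, "Other": 32}
val = {"CR": 168, "SR": 402, "SL": 467, "ROS": 677, "Other": 9}


def proportions(counts):
    """Percentage share of each class within one split."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


train_pct = proportions(train)
val_pct = proportions(val)

# Largest train/validation difference in class share, in percentage points.
max_gap = max(abs(train_pct[name] - val_pct[name]) for name in train)

# Both splits together should account for every annotated artefact.
total_masks = sum(train.values()) + sum(val.values())

for name in train:
    print(f"{name:>5}: train {train_pct[name]:5.2f}%  val {val_pct[name]:5.2f}%")
```

The two splits sum to the 7021 annotations of the full dataset, and no class share differs between splits by much more than one percentage point, which is the behaviour stratification is meant to produce.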

2.2 Baseline Model

We propose a class-aware approach for instance segmentation that integrates an object detector, specifically the YOLOv8 model [reis2023realtime], into our SAM prediction logic to facilitate auto-generated input prompts.

Unlike CNNs, which strictly delineate object masks by bounding boxes, transformer-based models like SAM integrate self-attention to potentially extend beyond these initial margins. However, spatial invariance and accurate segmentation of faint objects remain a challenge for ViTs, in contrast with CNN approaches. By utilising SAM for smooth masks and YOLOv8 for faint objects of certain classes, we aim to overcome these limitations.

3 Results

Figure 4: (left) Cumulative distribution of IoUs between predicted and true masks on training and validation sets. (right) Comparison of IoU distributions with higher median and consistency in training data and greater variability in validation data.
Figure 5: Detected masks across eight fields within the validation set, with increasing mean IoU between predicted and ground-truth masks. The mean IoU on the validation set is $0.696 \pm 0.289$.

Our methodology initially involves training SAM with ground-truth annotations using a distilled image encoder from MobileSAM [zhang2023faster]. For SAM, images are resized to $1024 \times 1024$ px and have their colours normalized. We use a batch size of 8, a warmup learning rate scheduler ($\mathrm{lr}_{\mathrm{init}} = 3 \times 10^{-4}$, $\mathrm{lr}_{\mathrm{final}} = 6 \times 10^{-5}$) for 16 steps, a weight decay of $10^{-5}$ and the AdamW optimizer. We train the Mask Decoder only, while freezing the Image Encoder and Prompt Embedding layers.

Following recommendations in [kirillov2023segment], we utilise the focal loss and dice loss in a 20:1 weighted scheme. At this stage, predicted and actual masks can be directly compared. Unlike usual SAM implementations, we choose to train the Intersection-over-Union (IoU) head to provide more representative mean Average Precision (mAP) metrics (see Eq. 1). Also, when generating masks, we configure the model to allow three predicted mask outputs and select the final mask based on the highest IoU score. The overall loss calculation integrates both segmentation and IoU loss.
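The loss weighting and mask selection described above can be sketched in plain Python. This is an illustrative sketch rather than the training code: the function names are hypothetical, masks are flattened lists of probabilities, and the focal parameters `gamma = 2.0` and `alpha = 0.25` are assumed defaults that the text does not specify.

```python
import math


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over flattened mask probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        p_t = p if t == 1 else 1.0 - p
        alpha_t = alpha if t == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss; 0 when prediction and target coincide."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def segmentation_loss(probs, targets, focal_weight=20.0, dice_weight=1.0):
    """Focal and Dice losses combined with the 20:1 weighting."""
    return (focal_weight * focal_loss(probs, targets)
            + dice_weight * dice_loss(probs, targets))


def pick_best_mask(candidates):
    """From (mask, predicted_iou) pairs, keep the mask with the highest score."""
    return max(candidates, key=lambda pair: pair[1])[0]


target = [1, 0, 1, 0]
good = segmentation_loss([0.9, 0.1, 0.8, 0.2], target)
bad = segmentation_loss([0.1, 0.9, 0.2, 0.8], target)
chosen = pick_best_mask([("mask_a", 0.52), ("mask_b", 0.91), ("mask_c", 0.70)])
```

Because the final mask is chosen by the predicted IoU score, the quality of that score directly affects which of the three candidates is kept, which is one reason to train the IoU head alongside the segmentation loss.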

After training the YOLO and SAM models separately to optimise their individual performances, we freeze the YOLO layers, couple its predicted bounding boxes to the SAM Prompt Encoder and continue training the SAM Mask Decoder for 10 additional epochs to refine the segmentation. Predicted and ground-truth masks are matched using the Kuhn-Munkres assignment algorithm [https://doi.org/10.1002/nav.3800020109], which minimises the total cost over an IoU-based cost matrix. Due to the higher spatial complexity of certain classes, particularly SL and Other, we select YOLO masks for faint objects of such classes at the $1\sigma$ background level, as these predictions are more stable for low-intensity artefacts. We provide the segmentation mAP (see Table 3) using a fixed seed for reproducibility. The mAP formula for instance segmentation is given by:
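The matching step can be illustrated on a toy case. The sketch below is not the Kuhn-Munkres algorithm itself: it solves the same assignment problem by exhaustive search, which is only practical for a handful of masks, and it assumes a cost of 1 − IoU. The names `mask_iou`, `match_masks` and `box` are hypothetical, and masks are represented as sets of pixel coordinates.

```python
from itertools import permutations


def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_masks(predicted, truth):
    """Assign each predicted mask to a distinct true mask, minimising sum(1 - IoU).

    Exhaustive search over assignments; assumes len(predicted) <= len(truth).
    """
    if len(predicted) > len(truth):
        raise ValueError("this sketch assumes no more predictions than true masks")
    cost = [[1.0 - mask_iou(p, t) for t in truth] for p in predicted]
    best_pairs, best_cost = [], float("inf")
    for perm in permutations(range(len(truth)), len(predicted)):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_cost:
            best_cost, best_pairs = total, list(enumerate(perm))
    return best_pairs, best_cost


def box(r0, c0, r1, c1):
    """Rectangular toy mask covering rows r0..r1-1 and columns c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


truth = [box(0, 0, 4, 4), box(10, 10, 14, 14)]
predicted = [box(10, 10, 14, 13), box(0, 0, 4, 4)]  # listed in a different order
pairs, total_cost = match_masks(predicted, truth)
```

Each pair is `(predicted_index, truth_index)`. For realistic numbers of masks, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` applied to the same cost matrix would replace the exhaustive loop.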

$$\text{mAP} = \frac{1}{Q} \sum_{q=1}^{Q} \text{AP}_{q} \tag{1}$$

where $Q$ is the number of classes, and $\text{AP}_{q}$ is the average precision for the $q$-th class, calculated as the area under the precision-recall curve at different IoU thresholds.

Category mAP50 mAP75 mAP50-90
Overall 90.1 73.5 55.4
Small 82.6 59.6 48.3
Medium 91.2 75.3 57.1
Large 84.6 63.4 58.0
CR 97.6 94.9 70.2
SR 94.8 77.7 50.6
SL 91.8 75.5 57.3
ROS 82.3 55.2 48.2
Other 83.9 64.1 50.6
Table 3: Mask mAP at different IoU thresholds resulting from SAM predictions on the validation set. While the smaller mAP for the Other class may be caused by its under-representation, the CR class shows the highest scores, which may be attributed to its predictable location. Category pixel areas are: small $[0, 32^{2})$, medium $[32^{2}, 96^{2})$ and large $[96^{2}, \infty)$.
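As a worked instance of Eq. 1, the per-class values in Table 3 can be averaged with equal weight per class ($Q = 5$). The sketch below copies the numbers from the table; the helper name `mean_ap` is hypothetical.

```python
# Per-class average precision, copied from Table 3.
per_class_ap = {
    "mAP50": {"CR": 97.6, "SR": 94.8, "SL": 91.8, "ROS": 82.3, "Other": 83.9},
    "mAP75": {"CR": 94.9, "SR": 77.7, "SL": 75.5, "ROS": 55.2, "Other": 64.1},
    "mAP50-90": {"CR": 70.2, "SR": 50.6, "SL": 57.3, "ROS": 48.2, "Other": 50.6},
}


def mean_ap(ap_by_class):
    """Eq. 1: unweighted mean of the per-class average precision."""
    return sum(ap_by_class.values()) / len(ap_by_class)


overall = {metric: round(mean_ap(aps), 1) for metric, aps in per_class_ap.items()}
print(overall)
```

To the precision shown, the unweighted class means reproduce the Overall row of Table 3, so the sparsely populated Other class counts as much towards the overall score as ROS does.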

4 Discussion

In our study, we enhanced artefact detection and segmentation in XMM-Newton images by integrating CNNs with ViT-based models, significantly boosting accuracy and reducing false positives in astronomical analysis. We combined traditional YOLO models for bounding box predictions with advanced SAM models for zero-shot segmentation, demonstrating the benefits of diverse neural network strategies in addressing complex image processing challenges. Despite these improvements, the high variation in exposure times and intensity levels in space imagery necessitates further model refinement tailored for astronomical missions. These variations also pose challenges in dataset annotation, making it difficult to establish clear thresholds for distinguishing artefacts from the background. The XAMI average end-to-end inference time per image containing annotations is $100\,\text{ms}$, suitable for medium to large image datasets (up to hundreds of thousands of images similar to ours) and for applications that do not require real-time processing. The heavy SAM architecture remains the bottleneck for prediction, at $70$–$80\,\text{ms}$/image.
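To put the quoted inference time in context, the following back-of-the-envelope sketch converts the per-image cost into total processing time. It assumes images are processed strictly one after another on a single device at the quoted 100 ms average; the function name `processing_hours` is hypothetical.

```python
def processing_hours(n_images, ms_per_image=100.0):
    """Wall-clock hours for sequential inference at a fixed per-image cost."""
    return n_images * ms_per_image / 1000.0 / 3600.0


for n in (10_000, 100_000, 500_000):
    print(f"{n:>7} images: {processing_hours(n):6.2f} h")
```

Under these assumptions a catalogue of a hundred thousand images takes a few hours, which is compatible with offline reprocessing but not with real-time use.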

Acknowledgements. The authors acknowledge the contribution of Inès Perez, Léa Zuili and Simon Astarita to the dataset annotations. This publication uses data generated via the Roboflow.com and Zooniverse.org platforms.
