XAMI - A Benchmark Dataset for Artefact Detection in XMM-Newton Optical Images
Abstract
Reflected or scattered light produces artefacts in astronomical observations that can negatively impact scientific analysis. Automated detection of these artefacts is therefore highly beneficial, especially given the increasing volume of data gathered. Machine learning methods are well suited to this problem, but annotated data for training such approaches to detect artefacts in astronomical observations is currently lacking. In this work, we present a dataset of images from the XMM-Newton space telescope Optical Monitoring camera showing different types of artefacts. We hand-annotated a sample of 1000 images with artefacts, which we use to train automated ML methods. We further demonstrate techniques tailored for accurate detection and masking of artefacts using instance segmentation. We adopt a hybrid approach that combines convolutional neural networks (CNNs) and transformer-based models to exploit the strengths of both in segmentation.
The presented method and dataset will advance artefact detection in astronomical observations by providing a reproducible baseline. All code and data are publicly available at https://github.com/ESA-Datalabs/XAMI-model and https://github.com/ESA-Datalabs/XAMI-dataset.
1 Introduction
Astronomical surveys and space missions (e.g., LSST [Ivezi__2019] and the European Space Agency’s Euclid mission [laureijs2009euclid]) will enhance our understanding of the cosmos by delivering unprecedented images, measurements and insights into billions of stars and galaxies, the expansion of the Universe, dark energy and dark matter. Such surveys will produce enormous amounts of data daily, so effective processing and analysis of the large image data produced by space missions requires automated methodologies. The presence of artefacts (e.g., ghost reflections, star loops and read-out streaks; see Figure 1) poses challenges, potentially leading to false detections or affecting the photometric measurements of genuine sources.
![Refer to caption](x1.png)
XMM-Newton Optical Monitor. ESA’s X-ray Multi-Mirror Mission (XMM-Newton) [xmm_newton_2000, Schartel_2022] is an orbiting observatory whose principal goal is to conduct detailed X-ray spectroscopy of various celestial objects. The XMM-Newton Optical Monitor (XMM-OM) [Mason_2001, Cordova1989, Lumb1991] extends the simultaneous observational capability of the three main X-ray telescopes into the ultraviolet and optical bands. The XMM-OM source catalogue is a valuable resource containing approximately 9 million detections of around 6 million distinct sources. It plays a pivotal role in individual object analyses [Soria_2001, refId0, 10.1111/j.1365-2966.2004.07660.x] and contributes significantly to survey science. However, source detection within the XMM-OM data analysis pipeline would benefit significantly from improved artefact recognition.
Current non-AI approaches to detecting artefacts [Mukhin_2023, article_nustar_straycats, DESAI201667] often struggle due to their reliance on generalised physical models. These models, while broadly applicable, fail to address specific scenarios effectively, leading to limitations in their practical utility.
AI methods based on CNN and Vision Transformer (ViT) models have achieved notable success and have benefited real-world applications in tasks such as object detection [wang2022yolov7, 10.1007/978-3-031-20053-3_27, maaz2022classagnostic, zong2023detrs] and segmentation [srivastava2023omnivec, wang2022image, hümmer2023vltseg, https://doi.org/10.48550/arxiv.2401.15741, fang2022eva, wang2023internimage, liu2021swin, he2018mask, rs13234779]. Instance segmentation techniques for astronomical sources have seen significant progress [10.1093/mnras/stad2785, Sortino_2023, hausen2022partialattribution], yet there has been limited focus on artefact detection [tanoglidis2021deepghostbusters]. ViT models are increasingly preferred in computer vision due to their self-attention mechanisms. The Segment Anything Model (SAM) [kirillov2023segment], a ViT-based architecture, excels in class-agnostic instance segmentation and zero-shot learning, allowing it to identify objects not seen during training.
We introduce XAMI (XMM-Newton optical Artefact Mapping for astronomical Instance segmentation), a hybrid CNN and ViT-based model, and XAMI-Dataset, a high-precision instance segmentation dataset for astronomical images. Together, they provide a first baseline demonstrating ML-based artefact detection on astronomical images, as well as a benchmark and starting point for other researchers to build on.
2 Methods
2.1 Dataset
We use 1000 single-channel images at various wavelengths (see Table 1 and [xmmom_filters_handbook]) from the XMM-OM as the baseline artefacts dataset. Each image comprises a stack of all available windows in a given filter of an observation that, together, cover the full field of view. This corresponds to a full frame of px resolution, with an effective resolution of /pixel. We rebinned the full-frame images to px for computational efficiency. We normalised images using ZScaleInterval algorithm and enhanced them with Asinh stretching to increase dynamic range without negatively affecting contrast.
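The preprocessing chain described above (rebinning, normalisation, asinh stretching) can be illustrated with a minimal standard-library sketch. Several things here are assumptions rather than the authors' implementation: the rebinning factor, the choice of summing (rather than averaging) pixel blocks, and the use of plain min-max limits as a simple stand-in for the ZScale interval. The stretch uses the common form y = asinh(x / a) / asinh(1 / a), with `a = 0.1` chosen for illustration. In practice, `astropy.visualization` provides `ZScaleInterval` and `AsinhStretch` for these steps.

```python
import math


def rebin_sum(image, factor):
    """Rebin a 2D image (list of rows) by summing factor x factor pixel blocks."""
    rows, cols = len(image), len(image[0])
    if rows % factor or cols % factor:
        raise ValueError("image dimensions must be divisible by factor")
    return [
        [
            sum(image[r + dr][c + dc] for dr in range(factor) for dc in range(factor))
            for c in range(0, cols, factor)
        ]
        for r in range(0, rows, factor)
    ]


def minmax_normalise(image):
    """Scale pixel values to [0, 1]. Assumption: stands in for ZScale limits."""
    flat = [value for row in image for value in row]
    low, high = min(flat), max(flat)
    span = (high - low) or 1.0  # avoid division by zero on constant images
    return [[(value - low) / span for value in row] for row in image]


def asinh_stretch(image, a=0.1):
    """Apply y = asinh(x / a) / asinh(1 / a) to values already in [0, 1]."""
    norm = math.asinh(1.0 / a)
    return [[math.asinh(value / a) / norm for value in row] for row in image]


# Toy 4x4 frame, rebinned by a hypothetical factor of 2.
raw = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
rebinned = rebin_sum(raw, 2)
stretched = asinh_stretch(minmax_normalise(rebinned))
```

The stretch maps 0 to 0 and 1 to 1 while lifting intermediate values, which is why faint structures become more visible without clipping bright ones.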
The XAMI dataset consists of 7021 annotated artefacts which can be divided into the following categories (Figure 3):
1. Read-Out-Streaks (ROS) - arising from shutterless camera and continuous Charge-Coupled Device (CCD) photon recording during readout.
2. Smoke rings (SR) - resulting from internal reflections of starlight within the detector.
3. Central ring (CR) - appearing in the centre of the detector, approximately in diameter, resulting from background light scattering from a chamfer on the detector window mounting ring.
4. Star loops (SL) - elongated scattered light features caused by light from bright stars within a off-axis range, scattered from the chamfer.
5. Other - other types of artefacts which usually represent scattered light spread over large areas.

Filter | Wavelength (nm) | Width (nm) | Images | Masks |
---|---|---|---|---|
V | 543 | 70 | 102 | 880 |
B | 450 | 105 | 116 | 1259 |
U | 344 | 84 | 193 | 1837 |
UVW1(L) | 291 | 83 | 403 | 2127 |
UVM2(M) | 231 | 48 | 175 | 681 |
UVW2(S) | 212 | 50 | 63 | 226 |
White(W) | 406 | 347 | 3 | 11 |
![Refer to caption](x2.png)
Class | Train | Validation |
---|---|---|
CR | 500 (9.43%) | 168 (9.75%) |
SR | 1267 (23.91%) | 402 (23.33%) |
SL | 1377 (25.99%) | 467 (27.10%) |
ROS | 2122 (40.05%) | 677 (39.29%) |
Other | 32 (0.60%) | 9 (0.52%) |
![Refer to caption](x3.png)
We use the stratified k-fold technique to maintain consistent class proportions across dataset splits, thus ensuring accurate performance estimation. Resulting class distributions can be seen in Table 2.
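The idea behind stratified splitting can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes each item carries a single class label (real images may contain several artefact classes), and the fold count `k = 4`, the toy class counts and the helper name `stratified_k_fold` are all hypothetical. Items of each class are shuffled and dealt round-robin across folds, so every fold inherits the overall class proportions.

```python
import random
from collections import Counter, defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Return k lists of indices whose class proportions mirror the full set."""
    rng = random.Random(seed)  # fixed seed for reproducible splits
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal this class round-robin, continuing where the previous class
        # stopped so that fold sizes stay balanced.
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset += len(indices)
    return folds


# Toy label set with made-up counts (not the counts from Table 2).
labels = ["ROS"] * 40 + ["SL"] * 28 + ["SR"] * 24 + ["CR"] * 8
folds = stratified_k_fold(labels, k=4)
validation = folds[0]
train = [index for fold in folds[1:] for index in fold]
fold_counts = [Counter(labels[index] for index in fold) for fold in folds]
```

Using one fold for validation and the rest for training gives a 75/25 split in this toy case, with identical class proportions on both sides.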
2.2 Baseline Model
We propose a class-aware approach for instance segmentation that integrates an object detector, specifically the YOLOv8 model [reis2023realtime], into our SAM prediction logic to facilitate auto-generated input prompts.
Unlike CNNs, which strictly delineate object masks by bounding boxes, transformer-based models like SAM use self-attention and can therefore extend masks beyond these initial margins. However, spatial invariance and accurate segmentation of faint objects remain a challenge for ViTs, in contrast with CNN approaches. By using SAM for smooth masks and YOLOv8 for faint objects of certain classes, we aim to overcome these limitations.
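A minimal sketch of this division of labour is given below. The Results section names SL and Other as the classes for which YOLO masks are preferred when objects are faint; everything else here is an assumption made for illustration, including the helper names, the use of mean intensity inside the mask as the faintness measure, and the 10% tolerance around a caller-supplied background level.

```python
FAINT_PRONE_CLASSES = {"SL", "Other"}  # classes named in the text


def mean_intensity(image, mask):
    """Mean pixel value of `image` inside a binary `mask` (lists of rows)."""
    values = [
        image[r][c]
        for r, row in enumerate(mask)
        for c, inside in enumerate(row)
        if inside
    ]
    return sum(values) / len(values) if values else 0.0


def choose_mask(image, class_name, yolo_mask, sam_mask, background_level, tolerance=0.1):
    """Prefer the YOLO mask for faint SL/Other objects, the SAM mask otherwise.

    Assumption: an object counts as faint when its mean intensity is within
    `tolerance` (fractional) of the background level.
    """
    is_faint = mean_intensity(image, yolo_mask) <= background_level * (1 + tolerance)
    if class_name in FAINT_PRONE_CLASSES and is_faint:
        return "yolo", yolo_mask
    return "sam", sam_mask


# Toy 3x3 image: background near 1.0 with a bright feature in the last row.
image = [[1.0, 1.0, 1.0], [1.0, 1.05, 1.0], [1.0, 9.0, 9.0]]
faint_region = [[0, 0, 0], [1, 1, 0], [0, 0, 0]]
bright_region = [[0, 0, 0], [0, 0, 0], [0, 1, 1]]
sam_region = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]

faint_loop = choose_mask(image, "SL", faint_region, sam_region, background_level=1.0)
faint_ring = choose_mask(image, "SR", faint_region, sam_region, background_level=1.0)
bright_other = choose_mask(image, "Other", bright_region, sam_region, background_level=1.0)
```

Only the faint star loop falls back to the YOLO mask; the smoke ring and the bright object keep the SAM mask.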
3 Results
![Refer to caption](x4.png)
![Refer to caption](x5.png)
Our methodology initially involves training SAM with ground-truth annotations using a distilled image encoder from MobileSAM [zhang2023faster]. For SAM, images are resized to px and have their colours normalized. We use a batch size of 8, a warmup learning rate scheduler (, ) for 16 steps, weight decay of and AdamW optimizer. We train the Mask Decoder only, while freezing the Image Encoder and Prompt Embedding layers.
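As an illustration of what a 16-step warmup schedule does, the sketch below ramps the learning rate linearly and then holds it. The linear shape is an assumption, since the text does not specify the ramp, and the function name `warmup_lr` is hypothetical. Rates are expressed as fractions of the base rate so that no actual learning-rate values are implied.

```python
def warmup_lr(step, base_lr, start_lr, warmup_steps=16):
    """Linearly ramp from start_lr to base_lr over warmup_steps, then hold.

    Assumption: linear ramp; the source only states that warmup lasts 16 steps.
    """
    if step >= warmup_steps:
        return base_lr
    return start_lr + (base_lr - start_lr) * step / warmup_steps


# Relative schedule: 0.0 means "start rate", 1.0 means "base rate".
schedule = [warmup_lr(step, base_lr=1.0, start_lr=0.0) for step in range(20)]
```

The gradual ramp avoids large early updates to the Mask Decoder while the optimiser statistics are still settling.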
Following recommendations in [kirillov2023segment], we utilise the focal loss and dice loss in a 20:1 weighted scheme. At this stage, predicted and actual masks can be directly compared. Unlike usual SAM implementations, we choose to train the Intersection-over-Union (IoU) head to provide more representative mean Average Precision (mAP) metrics (see Eq. 1). Also, when generating masks, we configure the model to allow three predicted mask outputs and select the final mask based on the highest IoU score. The overall loss calculation integrates both segmentation and IoU loss.
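The 20:1 weighting and the choice among three candidate masks can be sketched in plain Python on flattened masks. Only the 20:1 ratio and the "highest IoU score wins" rule come from the text; the focal parameters `gamma = 2` and `alpha = 0.25`, the Dice smoothing term, and all function names are assumptions chosen for illustration.

```python
import math


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over flattened mask probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        p_t = p if t == 1 else 1.0 - p
        alpha_t = alpha if t == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + s) / (sum(p) + sum(t) + s)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)


def segmentation_loss(probs, targets, focal_weight=20.0, dice_weight=1.0):
    """Focal and Dice losses combined in the 20:1 ratio stated in the text."""
    return focal_weight * focal_loss(probs, targets) + dice_weight * dice_loss(probs, targets)


def select_best_mask(candidates):
    """Pick the (predicted_iou, mask) pair with the highest predicted IoU."""
    return max(candidates, key=lambda candidate: candidate[0])


targets = [1, 1, 0, 0]
loss_good = segmentation_loss([1.0, 1.0, 0.0, 0.0], targets)
loss_mid = segmentation_loss([0.8, 0.7, 0.2, 0.3], targets)
loss_bad = segmentation_loss([0.0, 0.0, 1.0, 1.0], targets)

# Three candidate masks with hypothetical predicted IoU scores.
best_score, best_mask = select_best_mask([(0.62, "mask_a"), (0.91, "mask_b"), (0.75, "mask_c")])
```

The heavy focal weight emphasises hard, misclassified pixels, while the Dice term keeps the loss sensitive to overall mask overlap.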
After training the YOLO and SAM models separately to optimise their individual performance, we freeze the YOLO layers, couple its predicted bounding boxes to the SAM Prompt Encoder and continue training the SAM Mask Decoder for 10 additional epochs to refine the segmentation. Predicted and ground-truth masks are matched with the Kuhn-Munkres assignment algorithm [https://doi.org/10.1002/nav.3800020109], which minimises the total cost over an IoU-based cost matrix. Because certain classes, particularly SL and Other, have higher spatial complexity, we select the YOLO masks for faint objects of these classes that lie at background level, as these predictions are more stable for low-intensity artefacts. We provide the segmentation mAP (see Table 3) using a fixed seed for reproducibility. The mAP for instance segmentation is given by:
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i \qquad (1)$$

where $N$ is the number of classes, and $\mathrm{AP}_i$ is the average precision for the $i$-th class, calculated as the area under the precision-recall curve at different IoU thresholds.
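The mask alignment step mentioned before Eq. 1 can be illustrated with a small sketch. Two assumptions are made: the cost of pairing two masks is taken as 1 − IoU, and an exhaustive search over permutations stands in for the Kuhn-Munkres algorithm. Both give the same optimal assignment, but the exhaustive version is only practical for a handful of masks; for real workloads, `scipy.optimize.linear_sum_assignment` solves the same problem efficiently. Function names are hypothetical.

```python
from itertools import permutations


def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_masks(predicted, ground_truth):
    """Return (pairs, total_cost) minimising the summed cost 1 - IoU.

    Exhaustive stand-in for Kuhn-Munkres; assumes
    len(predicted) <= len(ground_truth).
    """
    cost = [[1.0 - mask_iou(p, g) for g in ground_truth] for p in predicted]
    best_pairs, best_cost = [], float("inf")
    for columns in permutations(range(len(ground_truth)), len(predicted)):
        total = sum(cost[row][col] for row, col in enumerate(columns))
        if total < best_cost:
            best_pairs, best_cost = list(enumerate(columns)), total
    return best_pairs, best_cost


ground_truth = [{(0, 0), (0, 1)}, {(2, 2), (2, 3), (3, 3)}]
predicted = [{(2, 2), (2, 3)}, {(0, 0), (0, 1), (1, 1)}]
pairs, total_cost = match_masks(predicted, ground_truth)
```

Each pair is (predicted index, ground-truth index), so the first prediction is matched to the second annotation and vice versa, each with an IoU of 2/3.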
Category | mAP50 | mAP75 | mAP50-90 |
---|---|---|---|
Overall | 90.1 | 73.5 | 55.4 |
Small | 82.6 | 59.6 | 48.3 |
Medium | 91.2 | 75.3 | 57.1 |
Large | 84.6 | 63.4 | 58.0 |
CR | 97.6 | 94.9 | 70.2 |
SR | 94.8 | 77.7 | 50.6 |
SL | 91.8 | 75.5 | 57.3 |
ROS | 82.3 | 55.2 | 48.2 |
Other | 83.9 | 64.1 | 50.6 |
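As a quick consistency check of Eq. 1 against Table 3, the unweighted mean of the five per-class values can be compared with the Overall row. The numbers below are copied from the table; the variable names are illustrative.

```python
# Per-class (mAP50, mAP75, mAP50-90) values copied from Table 3.
per_class = {
    "CR": (97.6, 94.9, 70.2),
    "SR": (94.8, 77.7, 50.6),
    "SL": (91.8, 75.5, 57.3),
    "ROS": (82.3, 55.2, 48.2),
    "Other": (83.9, 64.1, 50.6),
}


def mean_ap(values):
    """Eq. 1: unweighted mean of the per-class average precisions."""
    return sum(values) / len(values)


overall = tuple(
    round(mean_ap([scores[column] for scores in per_class.values()]), 1)
    for column in range(3)
)
```

The result, (90.1, 73.5, 55.4), reproduces the Overall row, which suggests that the reported overall scores are unweighted class means, so the rare Other class counts as much as ROS.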
4 Discussion
In our study, we enhanced artefact detection and segmentation in XMM-Newton images by integrating CNNs with ViT-based models, significantly boosting accuracy and reducing false positives in astronomical analysis. We combined traditional YOLO models for bounding box predictions with advanced SAM models for zero-shot segmentation, demonstrating the benefits of diverse neural network strategies in addressing complex image processing challenges. Despite these improvements, the high variation in exposure times and intensity levels in space imagery necessitates further model refinement tailored for astronomical missions. These variations also pose challenges for dataset annotation, making it difficult to establish clear thresholds for distinguishing artefacts from the background. The XAMI average end-to-end inference time per image containing annotations is , suitable for medium to large image datasets (up to hundreds of thousands of images similar to ours) and applications that do not require real-time processing. The heavy SAM architecture still represents a bottleneck for prediction, with image.
Acknowledgements. The authors acknowledge the contribution of Inès Perez, Léa Zuili and Simon Astarita to the dataset annotations. This publication uses data generated via the Roboflow.com and Zooniverse.org platforms.