Automated Identification and Segmentation of Hi Sources in CRAFTS Using Deep Learning Method

Zihao Song1, Huaxi Chen1, Donghui Quan1, Di Li2, Yinghui Zheng2, Shulei Ni1, Yunchuan Chen1, Yun Zheng1
1Zhejiang Lab, Hangzhou, Zhejiang 311121, China
2National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100101, People’s Republic of China
E-mail: [email protected]; [email protected]
(Accepted XXX. Received YYY; in original form ZZZ)
Abstract

We introduce a machine learning-based method for extracting Hi sources from 3D spectral data, and construct a dedicated dataset of Hi sources from CRAFTS. Our custom dataset provides comprehensive resources for Hi source detection. Utilizing the 3D-Unet segmentation architecture, our method reliably identifies and segments Hi sources, achieving a recall of 91.6% and a precision of 95.7% on the test set. These outcomes substantiate the value of our custom dataset and the efficacy of our proposed network in identifying Hi sources. Our code is publicly available at https://github.com/fishszh/HISF.

keywords:
methods: data analysis – techniques: image processing – methods: observational

1 Introduction

Neutral hydrogen (Hi) is a crucial constituent of the interstellar medium. Via the 21 cm emission line of Hi, researchers can study the evolution of galaxies and the distribution of matter in the universe (Cheng et al., 2020). Hi emission lines provide vital information on the density and velocity structure of neutral gas within galaxies (Springob et al., 2005). Consequently, over the past few decades, numerous Hi surveys have been conducted to detect Hi in the local universe. Key surveys include the Hi Parkes All Sky Survey (HIPASS) (Barnes et al., 2001), which identified over 5000 galaxies across approximately 30,000 deg², and ALFALFA (Giovanelli et al., 2005), which covered approximately 7,000 deg² and cataloged 31,502 galaxies. The FAST All Sky Hi survey (FASHI) (Zhang et al., 2024) aims to survey the entire sky visible to FAST, covering a declination range of -14° to +66°. Its first data release detected 41,741 extragalactic Hi sources. Additionally, the Commensal Radio Astronomy FAST Survey (CRAFTS) (Li et al., 2018) is another Hi survey led by FAST, sharing the sky coverage, frequency range, sensitivity and resolution of FAST.

With advances in observational technology and equipment, sky surveys have generated a substantial volume of high-quality astronomical data. Processing these vast datasets imposes stringent requirements on both efficiency and accuracy, which conventional methods struggle to meet.

In response to this challenge, scientists have begun integrating machine learning into the processing of astronomical observations (Baron, 2019). Machine learning methods have been applied to diverse tasks in astronomy, such as detecting tidal features (Desmons et al., 2023), light curve classification (Cui et al., 2023; Tey et al., 2023), source detection (Liang et al., 2023), spectrum classification (Tan et al., 2022), and Radio Frequency Interference (RFI) mitigation (Akeret et al., 2017; Sun et al., 2022; Xiao et al., 2022).

In this study, we used CRAFTS observational data to systematically organize and construct a dedicated dataset of Hi sources. To achieve high-precision identification and segmentation of Hi sources, we implemented a 3D-Unet deep learning model, which is capable of effectively extracting and segmenting Hi sources from complex spectral data cubes. The primary objective of this work is to enhance the accuracy and efficiency of Hi source detection by utilizing deep learning. It also aims to demonstrate the potential of deep learning in astronomical data processing and to provide new insights and approaches for future astronomical observations and data analysis.

The paper is structured as follows: Section 2 introduces related Hi surveys and Hi source-finding work. Section 3 describes the dataset selection and preparation. Section 4 details our model pipeline and experimental results. Our summary is given in Section 5.

Figure 1: Data processing pipeline for Hi source identification.

2 Related Work

In previous Hi surveys, for instance HIPASS, ALFALFA and FASHI, research teams conventionally developed their own automated algorithms or employed software like SoFiA (https://github.com/SoFiA-Admin/SoFiA-2) to identify Hi sources. These detections were subjected to further manual analysis and verification, culminating in the release of comprehensive Hi source catalogs. These catalogs encompassed essential attributes, including spatial coordinate ranges, frequency ranges, redshifts, and signal-to-noise ratios (SNR), among other key parameters. Based on this spatial and frequency domain data, researchers were generally able to conduct effective analyses of the characteristics of Hi emission lines.

Hi source detection has been the subject of extensive research. In SKA Science Data Challenge 2, several teams devised a range of methods to identify Hi sources within a simulated dataset (Hartley et al., 2023). These approaches not only encompass conventional methodologies like SoFiA, but also integrate machine learning techniques, such as 3D Unet for segmentation, CNN for classification, and object detection algorithms like YOLO for Hi source characterization. Liang et al. (2023) tentatively employed the Mask R-CNN model and the PointRend approach to identify Hi signals, with encouraging outcomes on a simulated 2D dataset. These exploratory works hint at the potential of advanced deep learning frameworks to refine and streamline Hi source detection.

3 Data

Previous efforts have commonly either relied on conventional algorithms coupled with manual identification, or been based on simulated data alone without validation against real observational datasets. Furthermore, a lack of openly accessible annotated observational datasets has hindered progress. To address this deficiency, we use observational data from CRAFTS to systematically compile a new Hi source dataset with accompanying masks. It is intended to provide a much-needed benchmark for evaluating and improving Hi source identification techniques on real observations.

The construction of Hi source spectral data cubes from CRAFTS raw data follows a meticulously designed pipeline depicted in Figure 1, including a series of critical steps such as RFI flagging, ripple removal, baseline removal, and other essential processing measures. However, due to the inherent difficulty in completely removing RFI, only a portion of the prominently discernible RFI has been excised here. Consequently, the generated spectral data cubes still contain a substantial amount of RFI, which poses a significant challenge for identifying Hi sources.

To confirm the Hi sources within the CRAFTS spectral data cube, we integrate expert verification and cross-validation methods utilizing other Hi surveys and observations in different wavebands. Currently, we have annotated data for two sky regions, as depicted in Figure 2.


Figure 2: The region depicted between the red lines reveals the overall sky coverage of CRAFTS. Notably, the regions designated as R1 and R2 are the distinct areas where we have performed data annotation.

In Region R1, we analyzed 646 3D spectral data cubes and confirmed 2050 Hi sources through expert verification and ALFALFA cross-validation. Among these, 1749 Hi sources correspond to those detected by ALFALFA. The R1 cubes nonetheless still contain unprocessed RFI signals. Due to differences in frequency coverage and sensitivity between ALFALFA and CRAFTS, we also referred to the FASHI Hi source catalog.

For Region R2, we manually eliminated additional RFI signals, rendering this portion of the data comparatively cleaner. To identify Hi sources, we manually selected potential Hi signals, subjected them to expert voting, and referenced ALFALFA and information from other wavebands, resulting in 469 Hi sources (see Table 1).

Starting from the approximate coordinates (frequency, R.A., Dec.) of the Hi sources obtained in the identification process, we used 3D Slicer (Fedorov et al., 2012) to visualize the Hi signals on three orthogonal planes. We annotated each Hi source on the R.A.-frequency plane and checked the annotation on the other two planes. The signal boundaries were not strictly defined: annotation focused on regions with distinct signal characteristics, and a minor inclusion of non-Hi signal areas was tolerated.

Note that, because of the difference in frequency coverage between ALFALFA and CRAFTS, and despite our meticulous manual verification, the dataset might still contain a small number of instances where Hi sources are either falsely identified or inadvertently omitted.

Table 1: Hi source dataset overview.

Region   No. Cube (Train / Valid / Test)   No. Source   Shape
R1       540 / 36 / 70                     2050         (3930-3932, 23, 231-261)
R2       100 / 15 / 42                     469          (3275-4325, 158-181, 191-248)

We conducted a basic statistical analysis on all Hi sources, considering parameters such as source size and the SNR of the top 10% of values within the mask regions (see Figure 3). Concurrently with the annotation process, we assessed the ease of identifying Hi source signals and classified them into three categories: C1 (easiest), C2 (intermediate), and C3 (most challenging). The observed difficulty levels were found to be generally consistent with the distribution of the top 10% SNR values.


Figure 3: Hi source size distribution and top 10% SNR distribution for C1, C2, C3. The frequency span of Hi sources is significantly broader than their spatial extent.

In accordance with the data release policy of FAST, the data from Region R1 is anticipated to be made publicly available in the near future, while the data from Region R2 will be released at a later date.

4 Model and Experiments

In the CRAFTS spectral data cube, identifying Hi sources can be treated as a target detection problem in deep learning. Given Unet’s outstanding performance in 3D image segmentation and object detection tasks (Çiçek et al., 2016; ** et al., 2020; Isensee et al., 2020), we employ a 3D-Unet architecture as the fundamental framework for this task, since its precise segmentation facilitates subsequent Hi emission line analysis.

Because the frequency resolution of CRAFTS is much finer than its spatial resolution, Hi sources span approximately ten pixels in space but hundreds of pixels along the frequency axis (see Figure 3). To address this disparity, we implement two strategies: (a) using a large convolution kernel (7,3,3), elongated along the frequency axis, to capture more contextual information; (b) rebinning: during data preprocessing, an average pooling layer with a kernel of (6,1,1) and a stride of (4,1,1) is applied along the frequency axis to reduce its dimensionality, which allows a larger batch size and thereby improves training efficiency.
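As a rough illustration of strategy (b), the sketch below applies average pooling with kernel 6 and stride 4 along the frequency axis only, leaving the two spatial axes untouched. It is a dependency-free approximation rather than the pipeline's actual code: the function names, the nested-list layout `cube[freq][ra][dec]` and the absence of padding are all assumptions.

```python
def rebin_spectrum(spectrum, kernel=6, stride=4):
    """Average-pool a 1D spectrum along frequency (no padding, an assumption)."""
    n_out = max(0, (len(spectrum) - kernel) // stride + 1)
    return [
        sum(spectrum[i * stride : i * stride + kernel]) / kernel
        for i in range(n_out)
    ]


def rebin_cube(cube, kernel=6, stride=4):
    """Apply the same pooling to every spatial pixel of cube[freq][ra][dec]."""
    n_ra, n_dec = len(cube[0]), len(cube[0][0])
    # Pool each pixel's spectrum independently: kernel (6,1,1), stride (4,1,1).
    pooled = [
        [rebin_spectrum([plane[r][d] for plane in cube], kernel, stride)
         for d in range(n_dec)]
        for r in range(n_ra)
    ]
    n_out = len(pooled[0][0])
    # Transpose back to the [freq][ra][dec] layout.
    return [
        [[pooled[r][d][f] for d in range(n_dec)] for r in range(n_ra)]
        for f in range(n_out)
    ]


spectrum = [float(i) for i in range(1024)]
rebinned = rebin_spectrum(spectrum)
print(len(spectrum), "->", len(rebinned))  # 1024 -> 255
```

With kernel 6 and stride 4, adjacent windows overlap by two channels and the frequency axis shrinks by roughly a factor of four, so 1024 channels become 255.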

We also employ random flipping and noise for data augmentation. To improve the recognition of weak Hi signals, we randomly transform high-SNR Hi sources into weaker signals (cut mix, i.e. crop the Hi source, randomly transform its intensity, and mix it with a new background). Throughout the training process, we use the Adam optimizer and combine the dice loss and binary cross-entropy loss functions (Equation 1) to train the network. The model is trained for a total of 600 epochs with a batch size of 2 on an NVIDIA A40 GPU.
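The cut-mix step described above can be sketched as follows. This is only one plausible reading of the procedure: the additive mixing rule, the uniform scale range, the flattened voxel lists and the name `cut_mix` are assumptions made for illustration, not details taken from the released code.

```python
import random


def cut_mix(source, mask, background, scale_range=(0.2, 0.8), rng=random):
    """Paste a dimmed copy of a masked Hi source onto a new background.

    source, mask and background are flattened voxel lists of equal length;
    mask is 1 inside the annotated source and 0 elsewhere.
    """
    # Hypothetical intensity transform: a single random dimming factor.
    scale = rng.uniform(*scale_range)
    mixed = [
        b + scale * s if m else b  # add the weakened source only inside the mask
        for s, m, b in zip(source, mask, background)
    ]
    return mixed, scale


rng = random.Random(0)
source = [0.0, 5.0, 10.0, 5.0, 0.0]
mask = [0, 1, 1, 1, 0]
background = [0.1, -0.2, 0.3, 0.0, -0.1]
mixed, scale = cut_mix(source, mask, background, rng=rng)
```

The original mask is reused unchanged as the label of the mixed sample, so the network sees the same source shape at a lower SNR.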

$$Loss = L_{Dice} + 0.5 L_{Cross} \tag{1}$$
$$L_{Dice} = 1 - \frac{2}{N} \frac{\sum_{i}^{N} y_{i} \cdot \hat{y}_{i} + \epsilon}{\sum_{i}^{N} y_{i} + \sum_{i}^{N} \hat{y}_{i} + \epsilon} \tag{2}$$
$$IoU(y, \hat{y}) = \frac{\sum y \cdot \hat{y}}{\sum y + \sum \hat{y} - \sum y \cdot \hat{y}} \tag{3}$$
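A minimal sketch of Equations 1 and 3 on flattened voxel lists is given below. The Dice term uses the common soft-Dice form with prefactor 2; how the 2/N prefactor of Equation 2 is normalized in the authors' code is not stated here, so that choice is an assumption, as are the function names and the clipping constant in the cross-entropy term.

```python
import math


def dice_loss(y_true, y_pred, eps=1e-6):
    """Soft Dice loss, 1 - (2*sum(y*yhat) + eps) / (sum(y) + sum(yhat) + eps)."""
    inter = sum(t * p for t, p in zip(y_true, y_pred))
    return 1.0 - (2.0 * inter + eps) / (sum(y_true) + sum(y_pred) + eps)


def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, with predictions clipped away from 0 and 1."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


def combined_loss(y_true, y_pred):
    """Equation 1: Dice loss plus half the cross-entropy loss."""
    return dice_loss(y_true, y_pred) + 0.5 * bce_loss(y_true, y_pred)


def iou(y_true, y_pred):
    """Equation 3: intersection over union of two masks."""
    inter = sum(t * p for t, p in zip(y_true, y_pred))
    union = sum(y_true) + sum(y_pred) - inter
    return inter / union if union > 0 else 0.0


def dice_from_iou(value):
    """For one pair of binary masks, Dice = 2*IoU / (1 + IoU)."""
    return 2.0 * value / (1.0 + value)


y = [1, 1, 0, 0]
y_hat = [1, 0, 1, 0]
print(iou(y, y_hat), combined_loss(y, y_hat))
```

The Dice-IoU identity gives a quick sanity check on reported numbers: the Unet-LK entries of Table 2 (IoU 59.1%, Dice 74.3%) are consistent with it.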

For a more complete comparison, we also ran SoFiA on the same dataset with the following setup: the detection threshold is 5σ; the smoothing kernels are kernelsXY = 0, 3, 6 and kernelsZ = 0, 3, 7, 15; the minimum source size is 5 pixels in both XY and Z, while the maximum size is 50 pixels in XY and unlimited in Z.

In addition, we evaluated two SOTA frameworks for 3D medical image segmentation, namely Swin-UNETR (Hatamizadeh et al., 2022) and UX-Net (Lee et al., 2022). We conduct segmentation in a sliding-window manner, adopting a patch size of (1024, 32, 64) with a stride equal to half the patch size. Both native and rebinned resolutions were considered for the input volume data. We maintain a 1:1 ratio of positive and negative samples, ensuring that each class is adequately represented during training. Specifically, positive samples are cropped so that they contain at least half the area of the Hi source, whereas negative samples are randomly extracted from within the spectral data cube. This strategy allows the model to learn more effectively from the target areas and enhances its segmentation performance.
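The sliding-window tiling can be sketched as below, assuming the axis order (frequency, R.A., Dec.) and a stride of half the patch size on each axis. The edge handling (one extra window flush with the far edge, and a single padded window when an axis is shorter than the patch) is a hypothetical choice, and the example cube shape is simply the upper end of the R2 shape ranges in Table 1.

```python
from itertools import product


def window_starts(length, patch, stride):
    """Start indices of windows covering one axis."""
    if length <= patch:
        return [0]  # a single window; padding up to `patch` is assumed
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] + patch < length:
        starts.append(length - patch)  # extra window flush with the far edge
    return starts


def sliding_windows(shape, patch=(1024, 32, 64)):
    """Corner coordinates of all patches, with stride = patch // 2 per axis."""
    strides = tuple(p // 2 for p in patch)
    axes = [window_starts(n, p, s) for n, p, s in zip(shape, patch, strides)]
    return list(product(*axes))


corners = sliding_windows((4325, 181, 248))
print(len(corners))  # 616
```

Because neighbouring windows overlap by half a patch, every interior voxel is covered by several windows, and the overlapping predictions have to be merged (for example by averaging) before thresholding.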

Figure 4: A visualization of the predicted segmentation for one example from the Hi source test dataset. The bottom right panel presents a comparison of the smoothed Hi emission lines. While SoFiA gets finer segmentation with a 5σ detection threshold setting, the rebinning of input data results in better segmentation outcomes.

As illustrated in Figure 4 and Table 2, our method attains a recall of 91.6% and a precision of 95.7%, distinctly surpassing the commonly employed SoFiA. Beyond detection, our approach also achieves an acceptable level of segmentation quality. The dice coefficients for our method reach 78.4% on the training set, 74.3% on the validation set, and 72.6% on the test set.

Relative to Swin-UNETR and UX-Net, our proposed method achieves better recall and precision, possibly because the elongated morphology of Hi sources requires a larger global receptive field. By addressing this need, our approach seems better suited to such structures, leading to improved recognition and segmentation outcomes. These results support the stability and generalization capabilities of our network in both source identification and data segmentation.

Table 2: A performance comparison of SoFiA, Unet-LK, Swin-UNETR and UX-Net on the test set. For the high threshold configuration within SoFiA, in this study, we adopt IoU (Equation 3) ≥ 0.2 as the detection threshold criterion.

Methods                     Detection              Segmentation
                            Recall    Precision    IoU      Dice
SoFiA                       64.2%     2.3%         1.5%     2.9%
UX-Net (rebin)              89.4%     93.5%        56.0%    71.8%
UX-Net (crop)               89.7%     78.9%        49.7%    66.4%
Swin-UNETR (crop)           90.5%     51.2%        45.6%    62.7%
Swin-UNETR (rebin+crop)     85.9%     90.3%        47.8%    64.7%
Unet-LK (rebin)             91.6%     95.7%        59.1%    74.3%
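To make the detection columns concrete, the sketch below shows one way recall and precision could be computed from segmented sources using the IoU ≥ 0.2 criterion. The greedy one-to-one matching, the representation of each source as a set of voxel indices and the function names are assumptions; the matching procedure actually used for Table 2 is not described in the text.

```python
def mask_iou(a, b):
    """IoU of two sources given as sets of voxel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def detection_scores(truth, preds, threshold=0.2):
    """Recall and precision with greedy one-to-one matching at IoU >= threshold."""
    matched = set()
    true_pos = 0
    for pred in preds:
        best_iou, best_j = 0.0, None
        for j, ref in enumerate(truth):
            if j in matched:
                continue  # each annotated source can be matched only once
            value = mask_iou(pred, ref)
            if value > best_iou:
                best_iou, best_j = value, j
        if best_j is not None and best_iou >= threshold:
            matched.add(best_j)
            true_pos += 1
    recall = true_pos / len(truth) if truth else 0.0
    precision = true_pos / len(preds) if preds else 0.0
    return recall, precision


truth = [set(range(0, 10)), set(range(20, 30))]
preds = [set(range(0, 8)), set(range(28, 40)), set(range(50, 55))]
recall, precision = detection_scores(truth, preds)
```

In this toy case the first prediction overlaps its source with IoU 0.8 and counts as a detection, the second reaches only IoU 0.1 and is rejected, and the third matches nothing, giving a recall of 1/2 and a precision of 1/3.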

5 Summary

In this work, we propose a method for Hi source detection that uses a 3D-Unet segmentation network to accurately identify and segment Hi sources. Experimental results demonstrate strong performance on our custom test set, with high recall (91.6%) and precision (95.7%), while maintaining good consistency across different datasets.

Compared to the SoFiA software, our proposed approach exhibits a significant improvement in recognition precision and attains satisfactory segmentation outcomes within the context of our dataset. Comparative analysis with state-of-the-art network architectures such as Swin-UNETR and UX-Net indicates that customizing the network architecture in accordance with the specific attributes of the data and target features is indeed a critical factor in optimizing the overall functionality and performance of the model. This not only validates the efficacy of our adopted method but also highlights the profound value of our tailored dataset in enhancing the precision and efficiency of Hi source detection tasks.

Additionally, the meticulously constructed and annotated custom dataset we have developed plays a pivotal role in future identification tasks concerning Hi sources. This dataset encompasses a rich array of Hi source examples, covering a wide range of observing conditions and signal strengths, with a particular emphasis on cases where Hi sources are difficult to discern amidst complex background noise and low signal-to-noise ratio environments. Through diligent manual annotation, we ensure the authenticity and integrity of every source in the dataset, which is essential for training and validating identification algorithms.

Despite its promising performance, the proposed method has potential for refinement. Improving the model’s sensitivity to low SNR Hi sources is a notable aspect. Additionally, noise and data variability in Hi datasets might affect generalizability across diverse environments. Future work could thus focus on refining pre-processing techniques to handle these complexities and enhancing network resilience to SNR variations.

Furthermore, given the success with our custom dataset and architecture, future directions include expanding the dataset diversity, developing adaptive learning strategies, and exploring ways to integrate extra contextual information to boost the accuracy of Hi source identification and segmentation.

In conclusion, the promising outcomes of this research have not only made a substantial contribution to the advancement of Hi source detection methodologies, but also revealed an expanded scope of potential applications within the critical task of extracting and analyzing complex sources in the realms of radio astronomy and its associated domains.

Acknowledgements

This work was supported by the National Key R&D Program of China No. 2022YFB4501405 and the National Natural Science Foundation of China grant No. 12373026.

Data Availability

The labeled CRAFTS data used in this paper will be made available in the near future, and the data access URLs will be posted on GitHub at https://github.com/fishszh/HISF.

References