Automated Identification and Segmentation of Hi Sources in CRAFTS Using Deep Learning Method

Zihao Song¹, Huaxi Chen¹, Donghui Quan¹, Di Li², Yinghui Zheng², Shulei Ni¹, Yunchuan Chen¹, Yun Zheng¹
¹Zhejiang Lab, Hangzhou, Zhejiang 311121, China
²National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100101, People’s Republic of China

(Accepted XXX. Received YYY; in original form ZZZ)

Abstract

We introduce a machine learning-based method for extracting Hi sources from 3D spectral data, and construct a dedicated dataset of Hi sources from CRAFTS. The dataset provides annotated resources for Hi source detection. Using a 3D-Unet segmentation architecture, our method identifies and segments Hi sources, reaching a recall of 91.6% and a precision of 95.7% on the test set. These results support the value of the dataset and the effectiveness of the proposed network in identifying Hi sources. Our code is publicly available at https://github.com/fishszh/HISF.

keywords:

methods: data analysis – techniques: image processing – methods: observational


1 Introduction

Neutral hydrogen (Hi) is a crucial constituent of the interstellar medium. Via the 21 cm emission line of Hi, researchers can study the evolution of galaxies and the distribution of matter in the universe (Cheng et al., 2020). Hi emission lines provide vital information on the density and velocity structure of neutral gas within galaxies (Springob et al., 2005). Consequently, over the past few decades, numerous Hi surveys have been conducted to detect Hi in the local universe. Key surveys include the Hi Parkes All Sky Survey (HIPASS) (Barnes et al., 2001), which identified over 5,000 galaxies across approximately 30,000 deg², and ALFALFA (Giovanelli et al., 2005), which covered an area of approximately 7,000 deg², cataloging 31,502 galaxies. The FAST All Sky Hi survey (FASHI) (Zhang et al., 2024) aims to survey the entire sky visible to FAST, covering a declination range of -14° to +66°. Its first data release detected 41,741 extragalactic Hi sources. Additionally, the Commensal Radio Astronomy FAST Survey (CRAFTS) (Li et al., 2018) is another Hi survey led by FAST, sharing the sky coverage, frequency range, sensitivity and resolution of FAST.

With the progression of observational technologies and equipment upgrades, a substantial volume of high-quality astronomical observation data has been generated through various sky surveys. However, processing these vast datasets imposes stringent requirements on both efficiency and accuracy, which conventional methodologies struggle to fulfill.

In response to this challenge, scientists have begun integrating machine learning into the processing of astronomical observations (Baron, 2019). A variety of machine learning-driven methods have been employed in diverse applications within astronomy, such as detecting tidal features (Desmons et al., 2023), light curve classification (Cui et al., 2023; Tey et al., 2023), source detection (Liang et al., 2023), spectrum classification (Tan et al., 2022), and Radio Frequency Interference (RFI) mitigation (Akeret et al., 2017; Sun et al., 2022; Xiao et al., 2022).

In this study, we used CRAFTS observational data to systematically organize and construct a dedicated dataset of Hi sources. To achieve high-precision identification and segmentation of Hi sources, we implemented a 3D-Unet deep learning model, which is capable of extracting and segmenting Hi sources from complex spectral data cubes. The primary objective of this work is to improve the accuracy and efficiency of Hi source detection using deep learning. It also aims to demonstrate the potential of deep learning in astronomical data processing and to provide new approaches for future astronomical observations and data analysis.

The paper is structured as follows: Section 2 introduces related Hi surveys and Hi source-finding work. Section 3 describes the dataset selection and preparation. Section 4 details our model pipeline and experimental results. Our summary is given in Section 5.

Figure 1: Data processing pipeline for Hi source identification.

2 Related Work

In previous Hi surveys, for instance HIPASS, ALFALFA and FASHI, research teams conventionally developed their own automated algorithms or employed software like SoFiA (https://github.com/SoFiA-Admin/SoFiA-2) to identify Hi sources. These detections were subjected to further manual analysis and verification, culminating in the release of comprehensive Hi source catalogs. These catalogs include essential attributes such as spatial coordinate ranges, frequency ranges, redshifts, and signal-to-noise ratios (SNR), among other key parameters. Based on this spatial and frequency domain data, researchers were generally able to conduct effective analyses of the characteristics of Hi emission lines.

Extensive research has been conducted on Hi source detection. In SKA Science Data Challenge 2, several teams devised a range of methods to identify Hi sources within a simulated dataset (Hartley et al., 2023). These approaches not only include conventional tools like SoFiA, but also integrate machine learning techniques, such as 3D Unet for segmentation, CNNs for classification, and object detection algorithms like YOLO for Hi source characterization. Liang et al. (2023) tentatively employed the Mask R-CNN model and the PointRend approach to identify Hi signals, with encouraging outcomes on a simulated 2D dataset. These exploratory works hint at the potential of advanced deep learning frameworks to refine and streamline Hi source detection.

3 Data

Previous endeavors have commonly either employed conventional algorithms coupled with manual identification or relied on simulated data alone, without validation against real observational datasets. Furthermore, a dearth of openly accessible annotated observational datasets has hindered progress. To address this deficiency, we use the observational data from CRAFTS to systematically compile a new Hi source dataset with accompanying masks. This effort is intended to provide a much-needed benchmark for evaluating and improving Hi source identification techniques on real observations.

The construction of Hi source spectral data cubes from CRAFTS raw data follows a meticulously designed pipeline depicted in Figure 1, including a series of critical steps such as RFI flagging, ripple removal, baseline removal, and other essential processing measures. However, due to the inherent difficulty in completely removing RFI, only a portion of the prominently discernible RFI has been excised here. Consequently, the generated spectral data cubes still contain a substantial amount of RFI, which poses a significant challenge for identifying Hi sources.

To confirm the Hi sources within the CRAFTS spectral data cube, we integrate expert verification and cross-validation methods utilizing other Hi surveys and observations in different wavebands. Currently, we have annotated data for two sky regions, as depicted in Figure 2.

In Region R1, we analyzed 646 3D spectral data cubes and confirmed 2050 Hi sources through expert verification and ALFALFA cross-validation. Among these, 1749 Hi sources correspond to those detected by ALFALFA. Nonetheless, these cubes still contain unprocessed RFI signals. Due to differences in frequency coverage and sensitivity between ALFALFA and CRAFTS, we also referred to the FASHI Hi source catalog.

For Region R2, we manually eliminated additional RFI signals, making this portion of the data comparatively cleaner. To identify Hi sources, we manually selected potential Hi signals, subjected them to expert voting, and referenced ALFALFA and other waveband information, resulting in 469 Hi sources (see Table 1).

Starting from the approximate coordinates (frequency, R.A., Dec.) of the Hi sources obtained in the identification process, we used 3D Slicer (Fedorov et al., 2012) to visualize Hi signals on three orthogonal planes. We annotated each Hi source on the R.A.-frequency plane and checked the annotation on the other two planes. The signal boundaries were not strictly defined: we focused on regions with distinct signal characteristics and tolerated a minor inclusion of non-Hi signal areas.

It is worth noting that, due to the difference in frequency coverage between ALFALFA and CRAFTS, and despite our meticulous manual verification, the dataset might still contain a small number of instances where Hi sources are either falsely identified or inadvertently omitted.

Table 1: Hi source dataset overview.

Region	No. Cube (Train)	No. Cube (Valid)	No. Cube (Test)	No. Source	Shape
R1	540	36	70	2050	(3930-3932,23,231-261)
R2	100	15	42	469	(3275-4325,158-181,191-248)

We conducted a basic statistical analysis of all Hi sources, considering parameters such as source size and the SNR of the top 10% of values within the mask regions (see Figure 3). During annotation, we also assessed how easily each Hi source signal could be identified and classified the sources into three categories: C1 (easiest), C2 (intermediate), and C3 (most challenging). The observed difficulty levels were found to be generally consistent with the distribution of the top 10% SNR values.
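As an illustration of the top 10% SNR statistic, the following is a minimal sketch of one way such a value could be computed for a single annotated source. The function name `top10_snr` and the noise estimate (the standard deviation of voxels outside the mask) are assumptions made for this example; the text does not specify how the noise level was measured.

```python
import numpy as np

def top10_snr(cube, mask):
    """Mean of the brightest 10% of masked voxels divided by an off-source noise estimate."""
    cube = np.asarray(cube, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    signal = np.sort(cube[mask])
    n_top = max(1, int(np.ceil(0.1 * signal.size)))
    top_mean = signal[-n_top:].mean()
    noise = cube[~mask].std()  # assumption: noise = std of voxels outside the mask
    return top_mean / noise

# Synthetic example: unit-variance noise with a source of amplitude 5 inside the mask.
rng = np.random.default_rng(0)
cube = rng.normal(0.0, 1.0, size=(64, 8, 8))
mask = np.zeros(cube.shape, dtype=bool)
mask[20:40, 2:6, 2:6] = True
cube[mask] += 5.0
snr = top10_snr(cube, mask)
print(f"top-10% SNR: {snr:.2f}")
```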

In accordance with the data release policy of FAST, the data from Region R1 is anticipated to be made publicly available in the near future, while the data from Region R2 will be released at a later date.

4 Model and Experiments

In the CRAFTS spectral data cubes, we treat the identification of Hi sources as an object detection problem in deep learning. Given Unet’s strong performance in 3D image segmentation and object detection tasks (Çiçek et al., 2016; Jin et al., 2020; Isensee et al., 2020), we employ a 3D-Unet architecture as the basic framework for this task; its precise segmentation facilitates subsequent Hi emission line analysis.

Because the frequency resolution of CRAFTS is much finer than its spatial resolution, Hi sources span approximately ten pixels in space but hundreds of pixels along the frequency axis (see Figure 3). To address this disparity, we implement two strategies: (a) using a large convolution kernel (7,3,3), elongated along the frequency axis, to capture more contextual information; (b) rebinning: during data preprocessing, we apply average pooling with a kernel of (6,1,1) and a stride of (4,1,1) to shorten the frequency axis, which allows a larger batch size and thereby improves training efficiency.
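The rebinning step in (b) can be sketched with plain NumPy as follows. This is an illustrative sketch, not the project's implementation: the function name `rebin_frequency` is hypothetical, the cube is assumed to be ordered (frequency, y, x), and no padding is applied, so trailing channels that do not fill a complete window are dropped.

```python
import numpy as np

def rebin_frequency(cube, kernel=6, stride=4):
    """Average-pool a (frequency, y, x) cube along the frequency axis only."""
    cube = np.asarray(cube, dtype=float)
    n_out = (cube.shape[0] - kernel) // stride + 1
    out = np.empty((n_out,) + cube.shape[1:], dtype=float)
    for i in range(n_out):
        start = i * stride
        # Each output channel is the mean of `kernel` consecutive input channels.
        out[i] = cube[start:start + kernel].mean(axis=0)
    return out

# 3930 channels as in Region R1; the spatial size is reduced to keep the example small.
cube = np.arange(3930 * 2 * 3, dtype=float).reshape(3930, 2, 3)
rebinned = rebin_frequency(cube)
print(rebinned.shape)  # (982, 2, 3)
```

With a kernel of 6 and a stride of 4, neighbouring windows overlap by two channels and the frequency axis shrinks by roughly a factor of four, while the spatial axes are untouched.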

Simultaneously, we employ random flipping and noise for data augmentation. To improve the recognition of weak Hi signals, we randomly transform high-SNR Hi sources into weaker signals (cut mix, i.e., crop the Hi source, randomly transform its intensity, and mix it with a new background). Throughout training, we use the Adam optimizer and combine the Dice loss and the binary cross-entropy loss (Equation 1). The model is trained for a total of 600 epochs with a batch size of 2 on an NVIDIA A40 GPU.
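The augmentations described above might look roughly like the following NumPy sketch. All function names, the noise level `sigma`, and the intensity range `scale_range` are hypothetical choices for illustration; the text does not state which axes are flipped or how the intensity is transformed, so the sketch flips every axis with probability 0.5 and simply rescales the source by a random factor.

```python
import numpy as np

def random_flip(cube, mask, rng):
    """Flip cube and mask together along randomly chosen axes."""
    for axis in range(cube.ndim):
        if rng.random() < 0.5:
            cube = np.flip(cube, axis=axis)
            mask = np.flip(mask, axis=axis)
    return cube, mask

def add_noise(cube, rng, sigma=0.1):
    """Add zero-mean Gaussian noise (sigma is an assumed setting)."""
    return cube + rng.normal(0.0, sigma, size=cube.shape)

def cut_mix(cube, mask, background, rng, scale_range=(0.2, 0.8)):
    """Crop the masked source, dim it by a random factor, and add it to a new background."""
    scale = rng.uniform(*scale_range)
    mixed = background.copy()
    mixed[mask] += scale * cube[mask]
    return mixed, mask.copy()

rng = np.random.default_rng(0)
cube = np.zeros((32, 8, 8))
mask = np.zeros(cube.shape, dtype=bool)
mask[10:20, 3:6, 3:6] = True
cube[mask] = 10.0                                  # a bright synthetic source
background = rng.normal(0.0, 1.0, size=cube.shape)  # a source-free background cube

weak_cube, weak_mask = cut_mix(cube, mask, background, rng)
flipped_cube, flipped_mask = random_flip(weak_cube, weak_mask, rng)
noisy_cube = add_noise(flipped_cube, rng)
```

The mask is transformed together with the cube in every step so that the segmentation target stays aligned with the augmented input.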

$$\mathrm{Loss}=L_{Dice}+0.5\,L_{Cross} \tag{1}$$

$$L_{Dice}=1-\frac{2}{N}\frac{\sum_{i}^{N}y_{i}\cdot\hat{y}_{i}+\epsilon}{\sum_{i}^{N}y_{i}+\sum_{i}^{N}\hat{y}_{i}+\epsilon} \tag{2}$$

$$IoU(y,\hat{y})=\frac{\sum y\cdot\hat{y}}{\sum y+\sum\hat{y}-\sum y\cdot\hat{y}} \tag{3}$$
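The loss and IoU definitions in Equations 1 to 3 can be sketched in NumPy as follows. This is a minimal illustration rather than the training code: the function names are hypothetical, predictions are assumed to be probabilities in [0, 1], and the Dice term uses the common soft-Dice form $1-(2\sum y\hat{y}+\epsilon)/(\sum y+\sum\hat{y}+\epsilon)$ without the $1/N$ prefactor written in Equation 2, which is an assumption about its intended meaning.

```python
import numpy as np

def dice_loss(y, y_hat, eps=1e-6):
    """Soft Dice loss between a binary target y and predicted probabilities y_hat."""
    y = np.asarray(y, dtype=float).ravel()
    y_hat = np.asarray(y_hat, dtype=float).ravel()
    inter = np.sum(y * y_hat)
    return 1.0 - (2.0 * inter + eps) / (np.sum(y) + np.sum(y_hat) + eps)

def bce_loss(y, y_hat, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y = np.asarray(y, dtype=float).ravel()
    p = np.clip(np.asarray(y_hat, dtype=float).ravel(), eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

def total_loss(y, y_hat):
    """Equation 1: Dice loss plus half the cross-entropy loss."""
    return dice_loss(y, y_hat) + 0.5 * bce_loss(y, y_hat)

def iou(y, y_hat):
    """Equation 3: intersection over union."""
    y = np.asarray(y, dtype=float).ravel()
    y_hat = np.asarray(y_hat, dtype=float).ravel()
    inter = np.sum(y * y_hat)
    return inter / (np.sum(y) + np.sum(y_hat) - inter)

y = np.array([1.0, 1.0, 0.0, 0.0])
y_hat = np.array([1.0, 0.0, 1.0, 0.0])
print(iou(y, y_hat), 1.0 - dice_loss(y, y_hat))
```

For binary masks the Dice coefficient and the IoU are related by Dice = 2 IoU / (1 + IoU), so the two segmentation metrics rank methods in the same order when computed on the same masks.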

For comparison, we also ran SoFiA on the same dataset with the following setup: the detection threshold is 5$\sigma$; the smoothing kernels are kernelsXY = 0, 3, 6 and kernelsZ = 0, 3, 7, 15; the minimum source size is 5 pixels in both XY and Z; and the maximum size is 50 pixels in XY, with no limit in Z.
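For readers who want to reproduce a comparable SoFiA run, the settings quoted above could be written to a parameter file along the following lines. This is a sketch under assumptions: the key names are believed to follow SoFiA-2's `scfind.*` and `linker.*` parameter naming and should be checked against the SoFiA-2 documentation, the helper `render_parameter_file` is hypothetical, and all other parameters (input cube, output options, and so on) are left out.

```python
# Settings taken from the text; the key names are assumed to match SoFiA-2 parameter names.
sofia_settings = {
    "scfind.threshold": "5.0",
    "scfind.kernelsXY": "0, 3, 6",
    "scfind.kernelsZ": "0, 3, 7, 15",
    "linker.minSizeXY": "5",
    "linker.minSizeZ": "5",
    "linker.maxSizeXY": "50",
    "linker.maxSizeZ": "0",  # assumption: 0 means no upper size limit along Z
}

def render_parameter_file(settings):
    """Render settings as aligned 'key = value' lines."""
    width = max(len(key) for key in settings)
    return "\n".join(f"{key.ljust(width)} = {value}" for key, value in settings.items()) + "\n"

par_text = render_parameter_file(sofia_settings)
print(par_text)
```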

In addition, we evaluated two frameworks that are state of the art in 3D medical image segmentation, namely Swin-UNETR (Hatamizadeh et al., 2022) and UX-Net (Lee et al., 2022). We conduct segmentation in a sliding-window manner, adopting a patch size of (1024, 32, 64) with a stride equal to half the patch size. Both native and rebinned resolutions were considered for the input volume data. We maintain a 1:1 ratio of positive and negative samples, ensuring each class is adequately represented during training. Specifically, positive samples are cropped so that they contain at least half of the area of the Hi source, whereas negative samples are randomly extracted from within the spectral data cube. This strategy allows the model to learn more effectively from the target areas and enhances its segmentation performance.
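The sliding-window tiling described above can be illustrated with a short sketch that enumerates window origins for a cube. The function names are hypothetical, and two details are assumptions not stated in the text: the last window on each axis is shifted back so that it ends at the cube edge, and an axis shorter than the patch yields a single window starting at 0 (which would need padding in practice).

```python
def window_starts(length, patch, stride):
    """Start indices of sliding windows covering one axis of the given length."""
    if length <= patch:
        return [0]  # assumption: a short axis is handled by one (padded) window
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # shift the final window back to touch the edge
    return starts

def sliding_windows(shape, patch=(1024, 32, 64)):
    """All (frequency, y, x) window origins for a stride of half the patch size."""
    strides = tuple(p // 2 for p in patch)
    per_axis = [window_starts(n, p, s) for n, p, s in zip(shape, patch, strides)]
    return [(f, y, x) for f in per_axis[0] for y in per_axis[1] for x in per_axis[2]]

# A cube with the smallest Region R1 shape listed in Table 1.
windows = sliding_windows((3930, 23, 231))
print(len(windows), windows[0], windows[-1])
```

Because the stride is half the patch size, every voxel away from the cube boundary is covered by more than one window, and overlapping predictions have to be merged, for example by averaging.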

As illustrated in Figure 4 and Table 2, our method attains a recall of 91.6% and a precision of 95.7%, clearly surpassing the commonly employed SoFiA. Our approach performs well in detection and also achieves an acceptable level of segmentation quality. The Dice coefficients for our method reach 78.4% on the training set, 74.3% on the validation set, and 72.6% on the test set.

Relative to Swin-UNETR and UX-Net, our proposed method achieves better recall and precision, possibly because the elongated morphology of Hi sources requires a larger receptive field along the frequency axis. By addressing this need, our approach appears better suited to such structures, leading to improved recognition and segmentation outcomes. These results support the stability and generalization capability of our network in both source identification and segmentation.

Table 2: A performance comparison of SoFiA, Unet-LK, Swin-UNETR and UX-Net on the test set. For the high threshold configuration within SoFiA, in this study, we adopt IoU (Equation 3) $\geq$ 0.2 as the detection threshold criterion.

Methods	Recall (Detection)	Precision (Detection)	IoU (Segmentation)	Dice (Segmentation)
SoFiA	64.2%	2.3%	1.5%	2.9%
UX-Net (rebin)	89.4%	93.5%	56.0%	71.8%
UX-Net (crop)	89.7%	78.9%	49.7%	66.4%
Swin-UNETR (crop)	90.5%	51.2%	45.6%	62.7%
Swin-UNETR (rebin+crop)	85.9%	90.3%	47.8%	64.7%
Unet-LK (rebin)	91.6%	95.7%	59.1%	74.3%
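To make the detection criterion of Table 2 concrete, the sketch below counts a source as detected when some predicted mask overlaps it with IoU of at least 0.2, and computes recall and precision from those matches. The function names and the matching rule (each mask is matched independently, with no one-to-one assignment) are assumptions for illustration; the exact matching procedure is not described in the text.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def detection_scores(true_masks, pred_masks, threshold=0.2):
    """Recall and precision with IoU >= threshold as the matching criterion."""
    matched_true = sum(any(iou(t, p) >= threshold for p in pred_masks) for t in true_masks)
    matched_pred = sum(any(iou(t, p) >= threshold for t in true_masks) for p in pred_masks)
    recall = matched_true / len(true_masks) if true_masks else 0.0
    precision = matched_pred / len(pred_masks) if pred_masks else 0.0
    return recall, precision

def box(start, stop, shape=(64, 4, 4)):
    """Toy per-source mask occupying channels start..stop-1 of a small cube."""
    m = np.zeros(shape, dtype=bool)
    m[start:stop] = True
    return m

true_masks = [box(0, 10), box(30, 40)]
pred_masks = [box(2, 10), box(39, 49), box(50, 60)]
recall, precision = detection_scores(true_masks, pred_masks)
print(recall, precision)
```

In this toy case only the first prediction passes the threshold (IoU 0.8); the second overlaps a true source too weakly and the third overlaps nothing, so one of two sources is recovered and one of three predictions is correct.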

5 Summary

In this work, we propose a method for Hi source detection that uses a 3D-Unet segmentation network to identify and segment Hi sources. Experimental results on our custom test set show high recall (91.6%) and precision (95.7%), while maintaining good consistency across different datasets.

Compared to the SoFiA software, our proposed approach exhibits a significant improvement in recognition precision and attains satisfactory segmentation outcomes within the context of our dataset. Comparative analysis with state-of-the-art network architectures such as Swin-UNETR and UX-Net indicates that customizing the network architecture in accordance with the specific attributes of the data and target features is indeed a critical factor in optimizing the overall functionality and performance of the model. This not only validates the efficacy of our adopted method but also highlights the profound value of our tailored dataset in enhancing the precision and efficiency of Hi source detection tasks.

Additionally, the meticulously constructed and annotated custom dataset we have developed plays a pivotal role in future identification tasks concerning Hi sources. This dataset encompasses a rich array of Hi source examples, covering a wide range of observing conditions and signal strengths, with a particular emphasis on cases where Hi sources are difficult to discern amidst complex background noise and low signal-to-noise ratio environments. Through diligent manual annotation, we ensure the authenticity and integrity of every source in the dataset, which is essential for training and validating identification algorithms.

Despite its promising performance, the proposed method has potential for refinement. Improving the model’s sensitivity to low SNR Hi sources is a notable aspect. Additionally, noise and data variability in Hi datasets might affect generalizability across diverse environments. Future work could thus focus on refining pre-processing techniques to handle these complexities and enhancing network resilience to SNR variations.

Furthermore, given the success with our custom dataset and architecture, future directions include expanding the dataset diversity, developing adaptive learning strategies, and exploring ways to integrate extra contextual information to boost the accuracy of Hi source identification and segmentation.

In conclusion, the promising outcomes of this research have not only made a substantial contribution to the advancement of Hi source detection methodologies, but also revealed an expanded scope of potential applications within the critical task of extracting and analyzing complex sources in the realms of radio astronomy and its associated domains.

Acknowledgements

This work was supported by the National Key R&D Program of China No. 2022YFB4501405 and the National Natural Science Foundation of China grant No. 12373026.

Data Availability

The labeled CRAFTS Data used in this paper will be available in the near future, and the data access URLs will be synchronized on GitHub https://github.com/fishszh/HISF.

References

Akeret et al. (2017) Akeret J., Chang C., Lucchi A., Refregier A., 2017, Astronomy and Computing, 18, 35
Barnes et al. (2001) Barnes D. G., et al., 2001, MNRAS, 322, 486
Baron (2019) Baron D., 2019, arXiv e-prints, p. arXiv:1904.07248
Cheng et al. (2020) Cheng C., et al., 2020, A&A, 638, L14
Cui et al. (2023) Cui K., Armstrong D. J., Feng F., 2023, arXiv e-prints, p. arXiv:2311.08080
Desmons et al. (2023) Desmons A., Brough S., Lanusse F., 2023, arXiv e-prints, p. arXiv:2307.04967
Fedorov et al. (2012) Fedorov A., et al., 2012, Magnetic Resonance Imaging, 30, 1323
Giovanelli et al. (2005) Giovanelli R., et al., 2005, AJ, 130, 2598
Hartley et al. (2023) Hartley P., et al., 2023, MNRAS, 523, 1967
Hatamizadeh et al. (2022) Hatamizadeh A., Nath V., Tang Y., Yang D., Roth H., Xu D., 2022, arXiv e-prints, p. arXiv:2201.01266
Isensee et al. (2020) Isensee F., Jaeger P. F., Kohl S. A. A., Petersen J., Maier-Hein K. H., 2020, Nature Methods, 18, 203–211
Jin et al. (2020) Jin L., et al., 2020, EBioMedicine
Lee et al. (2022) Lee H. H., Bao S., Huo Y., Landman B. A., 2022, arXiv e-prints, p. arXiv:2209.15076
Li et al. (2018) Li D., et al., 2018, IEEE Microwave Magazine, 19, 112
Liang et al. (2023) Liang R., et al., 2023, Research in Astronomy and Astrophysics, 23, 115006
Springob et al. (2005) Springob C. M., Haynes M. P., Giovanelli R., Kent B. R., 2005, ApJS, 160, 149
Sun et al. (2022) Sun H., Deng H., Wang F., Mei Y., Xu T., Smirnov O., Deng L., Wei S., 2022, MNRAS, 512, 2025
Tan et al. (2022) Tan L., Mei Y., Liu Z., Luo Y., Deng H., Wang F., Deng L., Liu C., 2022, arXiv e-prints, p. arXiv:2201.08967
Tey et al. (2023) Tey E., et al., 2023, The Astronomical Journal, 165, 95
Xiao et al. (2022) Xiao J., Zhang Y., Zhang B., Yang Z., Yu C., Cui C., 2022, New Astron., 96, 101825
Zhang et al. (2024) Zhang C.-P., et al., 2024, Science China Physics, Mechanics, and Astronomy, 67, 219511
Çiçek et al. (2016) Çiçek Ö., Abdulkadir A., Lienkamp S. S., Brox T., Ronneberger O., 2016, arXiv e-prints, p. arXiv:1606.06650