Automated Identification and Segmentation of Hi Sources in CRAFTS Using Deep Learning Method
Abstract
We introduce a machine learning-based method for extracting Hi sources from 3D spectral data and construct a dedicated dataset of Hi sources from CRAFTS. The dataset provides annotated observational data for Hi source detection. Using a 3D-Unet segmentation architecture, our method identifies and segments Hi sources with a recall of 91.6% and a precision of 95.7%. These results demonstrate the value of the dataset and the efficacy of the proposed network in identifying Hi sources. Our code is publicly available at https://github.com/fishszh/HISF.
keywords:
methods: data analysis – techniques: image processing – methods: observational
1 Introduction
Neutral hydrogen (Hi) is a crucial constituent of the interstellar medium. Via the 21 cm emission line of Hi, researchers can study the evolution of galaxies and the distribution of matter in the universe (Cheng et al., 2020). Hi emission lines provide vital information on the density and velocity structure of neutral gas within galaxies (Springob et al., 2005). Consequently, over the past few decades, numerous Hi surveys have been conducted to detect Hi in the local universe. Key surveys include the Hi Parkes All Sky Survey (HIPASS) (Barnes et al., 2001), which identified over 5,000 galaxies across approximately 30,000 deg², and ALFALFA (Giovanelli et al., 2005), which covered approximately 7,000 deg² and cataloged 31,502 galaxies. The FAST All Sky Hi survey (FASHI) (Zhang et al., 2024) aims to survey the entire sky visible to FAST, covering a declination range of -14° to +66°. Its first data release detected 41,741 extragalactic Hi sources. Additionally, the Commensal Radio Astronomy FAST Survey (CRAFTS) (Li et al., 2018) is another Hi survey conducted with FAST, sharing the sky coverage, frequency range, sensitivity, and resolution of FAST.
With the progression of observational technologies and equipment upgrades, a substantial volume of high-quality astronomical observation data has been generated through various sky surveys. However, processing these vast datasets imposes stringent requirements on both efficiency and accuracy, which conventional methodologies struggle to fulfill.
In response to this challenge, researchers have begun integrating machine learning into the processing of astronomical observations (Baron, 2019). Machine learning methods have been applied to a variety of tasks in astronomy, such as tidal feature detection (Desmons et al., 2023), light curve classification (Cui et al., 2023; Tey et al., 2023), source detection (Liang et al., 2023), spectrum classification (Tan et al., 2022), and Radio Frequency Interference (RFI) mitigation (Akeret et al., 2017; Sun et al., 2022; Xiao et al., 2022).
In this study, we used CRAFTS observational data to systematically organize and construct a dedicated dataset of Hi sources. To achieve high-precision identification and segmentation of Hi sources, we implemented a 3D Unet deep learning model, which is capable of extracting and segmenting Hi sources from complex spectral data cubes. The primary objective of this work is to improve the accuracy and efficiency of Hi source detection using deep learning. It also aims to demonstrate the potential of deep learning in astronomical data processing and to provide new approaches for future astronomical observations and data analysis.
2 Related Work
In previous Hi surveys, for instance HIPASS, ALFALFA and FASHI, research teams conventionally developed their own automated algorithms or employed software such as SoFiA (https://github.com/SoFiA-Admin/SoFiA-2) to identify Hi sources. These detections were subjected to further manual analysis and verification, culminating in the release of comprehensive Hi source catalogs. These catalogs encompassed essential attributes, including spatial coordinate ranges, frequency ranges, redshifts, and signal-to-noise ratios (SNR), among other key parameters. Based on this spatial and frequency domain data, researchers were generally able to conduct effective analyses of the characteristics of Hi emission lines.
In the realm of Hi source detection, extensive research efforts have been conducted. In the SKA Science Data Challenge 2, several teams devised a range of methods to identify Hi sources within a simulated dataset (Hartley et al., 2023). These approaches encompass not only conventional tools like SoFiA but also machine learning techniques, such as 3D Unet for segmentation, CNNs for classification, and object detection algorithms like YOLO for Hi source characterization. Liang et al. (2023) tentatively employed the Mask R-CNN model and the PointRend approach to identify Hi signals, with encouraging outcomes on a simulated 2D dataset. These exploratory works hint at the potential of advanced deep learning frameworks to refine and streamline Hi source detection.
3 Data
Previous endeavors have commonly either employed conventional algorithms coupled with manual identification or relied on simulated data alone, without validation against real observational datasets. Furthermore, a dearth of openly accessible annotated observational datasets has hindered progress. To address this deficiency, we use the observational data from CRAFTS to systematically compile a new Hi source dataset with accompanying masks. This effort is intended to provide a benchmark for evaluating and improving Hi source identification techniques on real observations.
The construction of Hi source spectral data cubes from CRAFTS raw data follows a meticulously designed pipeline depicted in Figure 1, including a series of critical steps such as RFI flagging, ripple removal, baseline removal, and other essential processing measures. However, due to the inherent difficulty in completely removing RFI, only a portion of the prominently discernible RFI has been excised here. Consequently, the generated spectral data cubes still contain a substantial amount of RFI, which poses a significant challenge for identifying Hi sources.
To confirm the Hi sources within the CRAFTS spectral data cube, we integrate expert verification and cross-validation methods utilizing other Hi surveys and observations in different wavebands. Currently, we have annotated data for two sky regions, as depicted in Figure 2.
In Region R1, we analyzed 646 3D spectral data cubes and confirmed 2050 Hi sources through expert verification and ALFALFA cross-validation. Among these, 1749 Hi sources correspond to those detected by ALFALFA. The cubes in this region still contain unprocessed RFI signals. Due to differences in frequency coverage and sensitivity between ALFALFA and CRAFTS, we also referred to the FASHI Hi source catalog.
For Region R2, we manually eliminated additional RFI signals, rendering this portion of the data comparatively cleaner. To identify Hi sources, we manually selected potential Hi signals, subjected them to expert voting, and referenced ALFALFA and other waveband information, resulting in 469 Hi sources, see Table 1.
Starting from the approximate coordinates (frequency, R.A., Dec.) of the Hi sources obtained in the identification process, we used 3D Slicer (Fedorov et al., 2012) to visualize Hi signals on three orthogonal planes. We annotated each Hi source on the R.A.-frequency plane and checked the annotation on the other two planes. The signal boundaries were not strictly defined: annotation focused on regions with distinct signal characteristics, and a minor inclusion of non-Hi signal areas was tolerated.
It is worth noting that, due to the difference in frequency coverage between ALFALFA and CRAFTS, and despite our meticulous manual verification, the dataset might still contain a small number of instances where Hi sources are either falsely identified or inadvertently omitted.
Region | No. Cube (Train) | No. Cube (Valid) | No. Cube (Test) | No. Source | Shape |
---|---|---|---|---|---|
R1 | 540 | 36 | 70 | 2050 | (3930-3932,23,231-261) |
R2 | 100 | 15 | 42 | 469 | (3275-4325,158-181,191-248) |
We conducted a basic statistical analysis on all Hi sources, considering parameters such as source size and the SNR of the top 10% of values within the mask regions, see Figure 3. During annotation, we also assessed how easily each Hi source signal could be identified and classified the sources into three categories: C1 (easiest), C2 (intermediate), and C3 (most challenging). The assigned difficulty levels were generally consistent with the distribution of the top 10% SNR values.
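The top 10% SNR statistic can be illustrated with a minimal sketch. It rests on assumptions not stated in the text: the statistic is taken to be the mean of the brightest 10% of voxels inside the mask divided by a noise level, and the noise level is estimated from the median absolute deviation of voxels outside the mask. The function names `robust_sigma` and `top10_snr` are hypothetical.

```python
import numpy as np

def robust_sigma(values):
    """Noise estimate from the median absolute deviation (assumed estimator)."""
    values = np.asarray(values, dtype=float)
    mad = np.median(np.abs(values - np.median(values)))
    return 1.4826 * mad  # scales MAD to a Gaussian standard deviation

def top10_snr(cube, mask, noise_sigma):
    """Mean of the brightest 10% of masked voxels divided by the noise level."""
    values = np.asarray(cube, dtype=float)[np.asarray(mask, dtype=bool)]
    threshold = np.percentile(values, 90)   # 90th percentile of masked voxels
    top = values[values >= threshold]       # brightest ~10% of the source
    return float(top.mean() / noise_sigma)

# Toy example: 100 masked voxels with values 1..100 and an assumed noise of 2.
cube = np.arange(1, 101, dtype=float).reshape(100, 1, 1)
mask = np.ones_like(cube, dtype=bool)
snr = top10_snr(cube, mask, noise_sigma=2.0)
```

In practice `noise_sigma` could be obtained as `robust_sigma(cube[~mask])` on a cube that contains source-free voxels; that choice is likewise an assumption.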
In accordance with the data release policy of FAST, the data from Region R1 is anticipated to be made publicly available in the near future, while the data from Region R2 will be released at a later date.
4 Model and Experiments
In the CRAFTS spectral data cube, identifying Hi sources is regarded as a target detection problem within deep learning. Given Unet’s strong performance in 3D image segmentation and object detection tasks (Çiçek et al., 2016; ** et al., 2020; Isensee et al., 2020), we employ a 3D-Unet architecture as the fundamental framework for this task; its precise segmentation facilitates subsequent Hi emission line analysis.
Because the frequency resolution of CRAFTS is significantly higher than its spatial resolution, Hi sources span approximately ten pixels in space but hundreds of pixels along the frequency axis, see Figure 3. To address this disparity, we implement two strategies: (a) utilizing a large convolution kernel (7,3,3) along the frequency axis to capture more contextual information; (b) rebin: during data preprocessing, applying an average pooling layer with a kernel of (6,1,1) and a stride of (4,1,1) along the frequency axis to reduce its dimensionality, which allows a larger batch size and thereby enhances training efficiency.
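The rebin step in strategy (b) can be sketched as average pooling along the frequency axis. This is a minimal NumPy illustration, not the pipeline's implementation: it assumes the cube is ordered (frequency, spatial, spatial) and that no padding is applied, and the function name `rebin_frequency` is hypothetical.

```python
import numpy as np

def rebin_frequency(cube, kernel=6, stride=4):
    """Average-pool a (freq, y, x) cube along axis 0 with no padding."""
    n_freq = cube.shape[0]
    # Window start indices along the frequency axis: 0, stride, 2*stride, ...
    starts = range(0, n_freq - kernel + 1, stride)
    # Each output channel is the mean of `kernel` consecutive input channels.
    return np.stack([cube[s:s + kernel].mean(axis=0) for s in starts])

# Toy cube with 14 frequency channels and a single spatial pixel.
cube = np.arange(14, dtype=float).reshape(14, 1, 1)
rebinned = rebin_frequency(cube)
```

With a kernel of 6 and a stride of 4, adjacent windows overlap by two channels and the frequency axis shrinks by roughly a factor of four, while the spatial axes are left untouched.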
Simultaneously, we employ random flipping and noise for data augmentation. To improve the recognition of weak Hi signals, we randomly transform high-SNR Hi sources into weaker signals (cut mix, i.e. crop the Hi source, randomly transform its intensity, and mix it with a new background). Throughout training, we use the Adam optimizer and combine the dice loss and binary cross-entropy loss functions (Equation 1). The model is trained for a total of 600 epochs with a batch size of 2 on an NVIDIA A40 GPU.
(1) |
(2) |
(3) |
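A combined dice and binary cross-entropy loss can be sketched as follows. This is a minimal sketch under stated assumptions: the two terms are summed with equal weights, and the weights `w_dice` and `w_bce` and the smoothing constants are illustrative choices, not values taken from this work. The inputs are predicted probabilities and a binary target mask of the same shape.

```python
import numpy as np

def dice_loss(prob, target, eps=1e-6):
    """Soft dice loss: 1 - 2|P.T| / (|P| + |T|), smoothed by eps."""
    inter = np.sum(prob * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(prob) + np.sum(target) + eps))

def bce_loss(prob, target, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))

def combined_loss(prob, target, w_dice=1.0, w_bce=1.0):
    """Weighted sum of the two terms (equal weights are an assumption)."""
    return w_dice * dice_loss(prob, target) + w_bce * bce_loss(prob, target)

target = np.array([1.0, 0.0, 1.0, 0.0])
uncertain = np.full(4, 0.5)                 # maximally uncertain prediction
loss_uncertain = combined_loss(uncertain, target)
loss_perfect = combined_loss(target, target)
```

The dice term responds to overlap and is less affected by the large number of background voxels, while the cross-entropy term provides a per-voxel gradient; combining the two is a common choice for segmenting small targets.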
To broaden the comparison, we also ran SoFiA on the same dataset with the following setup: a detection threshold of 5; smoothing kernels kernelsXY = 0, 3, 6 and kernelsZ = 0, 3, 7, 15; a minimum source size of 5 pixels in both XY and Z; and a maximum size of 50 pixels in XY, with no limit in Z.
In addition, we evaluated two SOTA frameworks for 3D medical image segmentation, namely Swin-UNETR (Hatamizadeh et al., 2022) and UX-Net (Lee et al., 2022). We conduct segmentation in a sliding window manner, adopting a patch size of (1024, 32, 64) with a stride equal to half the patch size. Both native and rebinned resolutions were considered for the input volume data. We maintain a 1:1 ratio of positive to negative samples so that each class is adequately represented during training. Specifically, positive samples are cropped so that they contain at least half the area of the Hi source, whereas negative samples are randomly extracted from within the spectral data cube. This strategy allows the model to learn more effectively from the target areas and enhances its segmentation performance.
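The sliding-window tiling described above can be sketched as follows. The sketch assumes that the last window on each axis is shifted back so that it ends at the cube boundary, and that an axis shorter than the patch yields a single window starting at zero (such an axis would need padding before inference). Neither detail is specified in the text, and the helper names `window_starts` and `sliding_patches` are hypothetical.

```python
from itertools import product

def window_starts(length, patch, stride):
    """Start indices of sliding windows along one axis."""
    if length <= patch:
        return [0]  # axis shorter than the patch: one window (padding assumed)
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # extra window flush with the end
    return starts

def sliding_patches(shape, patch=(1024, 32, 64)):
    """Slices covering a (freq, y, x) volume with a stride of half the patch."""
    strides = tuple(p // 2 for p in patch)
    per_axis = [window_starts(n, p, s) for n, p, s in zip(shape, patch, strides)]
    return [
        tuple(slice(start, start + p) for start, p in zip(corner, patch))
        for corner in product(*per_axis)
    ]

patches = sliding_patches((2048, 32, 64))
```

With a half-patch stride, every interior voxel is covered by more than one window, so overlapping predictions can be averaged when the patches are stitched back together.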
As illustrated in Figure 4 and Table 2, our method attains a recall of 91.6% and a precision of 95.7%, clearly surpassing the commonly employed SoFiA. Our approach is strong in detection and also achieves an acceptable level of segmentation quality. The dice coefficients for our method reach 78.4% on the training set, 74.3% on the validation set, and 72.6% on the test set.
Relative to Swin-UNETR and UX-Net, our method achieves higher recall and precision, possibly because the elongated morphology of Hi sources necessitates a larger global receptive field. By addressing this need, our approach seems to be more adept at handling such structures, leading to improved recognition and segmentation outcomes. These results support the stability and generalization capabilities of our network in both source identification and segmentation.
Methods | Detection: Recall | Detection: Precision | Segmentation: IoU | Segmentation: Dice |
---|---|---|---|---|
SoFiA | 64.2% | 2.3% | 1.5% | 2.9% |
UX-Net (rebin) | 89.4% | 93.5% | 56.0% | 71.8% |
UX-Net (crop) | 89.7% | 78.9% | 49.7% | 66.4% |
Swin-UNETR (crop) | 90.5% | 51.2% | 45.6% | 62.7% |
Swin-UNETR (rebin+crop) | 85.9% | 90.3% | 47.8% | 64.7% |
Unet-LK (rebin) | 91.6% | 95.7% | 59.1% | 74.3% |
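The four metrics in the table can be computed as in the sketch below. The segmentation metrics are evaluated voxel-wise on binary masks; the detection metrics take counts of true positives, false positives, and false negatives. How a predicted source is matched to a labeled source is not specified here, so the counts are left as inputs, and the function names are hypothetical.

```python
import numpy as np

def iou_and_dice(pred_mask, true_mask):
    """Voxel-wise IoU and dice coefficient between two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    inter = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    total = pred.sum() + true.sum()
    iou = inter / union if union > 0 else 1.0      # two empty masks agree
    dice = 2.0 * inter / total if total > 0 else 1.0
    return float(iou), float(dice)

def recall_and_precision(n_true_pos, n_false_pos, n_false_neg):
    """Source-level recall and precision from detection counts."""
    n_labeled = n_true_pos + n_false_neg
    n_detected = n_true_pos + n_false_pos
    recall = n_true_pos / n_labeled if n_labeled else 0.0
    precision = n_true_pos / n_detected if n_detected else 0.0
    return recall, precision

# Two 32-voxel masks that overlap in 16 voxels.
pred = np.zeros((4, 4, 4), dtype=bool)
pred[0:2] = True
true = np.zeros((4, 4, 4), dtype=bool)
true[1:3] = True
iou, dice = iou_and_dice(pred, true)
recall, precision = recall_and_precision(n_true_pos=9, n_false_pos=1, n_false_neg=3)
```

For a single pair of masks the two segmentation metrics are related by Dice = 2 IoU / (1 + IoU); for example, an IoU of 59.1% corresponds to a dice coefficient of about 74.3%.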
5 Summary
In this work, we propose a method for Hi source detection that uses a 3D-Unet segmentation network to identify and segment Hi sources. Experimental results on our custom test set show high recall (91.6%) and precision (95.7%), while maintaining good consistency across different datasets.
Compared to the SoFiA software, our proposed approach exhibits a significant improvement in recognition precision and attains satisfactory segmentation outcomes within the context of our dataset. Comparative analysis with state-of-the-art network architectures such as Swin-UNETR and UX-Net indicates that customizing the network architecture in accordance with the specific attributes of the data and target features is indeed a critical factor in optimizing the overall functionality and performance of the model. This not only validates the efficacy of our adopted method but also highlights the profound value of our tailored dataset in enhancing the precision and efficiency of Hi source detection tasks.
Additionally, the meticulously constructed and annotated custom dataset we have developed plays a pivotal role in future identification tasks concerning Hi sources. This dataset encompasses a rich array of Hi source examples, covering a wide range of observing conditions and signal strengths, with a particular emphasis on cases where Hi sources are difficult to discern amidst complex background noise and low signal-to-noise ratio environments. Through diligent manual annotation, we ensure the authenticity and integrity of every source in the dataset, which is essential for training and validating identification algorithms.
Despite its promising performance, the proposed method has potential for refinement. Improving the model’s sensitivity to low SNR Hi sources is a notable aspect. Additionally, noise and data variability in Hi datasets might affect generalizability across diverse environments. Future work could thus focus on refining pre-processing techniques to handle these complexities and enhancing network resilience to SNR variations.
Furthermore, given the success with our custom dataset and architecture, future directions include expanding the dataset diversity, developing adaptive learning strategies, and exploring ways to integrate extra contextual information to boost the accuracy of Hi source identification and segmentation.
In conclusion, the promising outcomes of this research have not only made a substantial contribution to the advancement of Hi source detection methodologies, but also revealed an expanded scope of potential applications within the critical task of extracting and analyzing complex sources in the realms of radio astronomy and its associated domains.
Acknowledgements
This work was supported by the National Key R&D Program of China No. 2022YFB4501405 and the National Natural Science Foundation of China grant No. 12373026.
Data Availability
The labeled CRAFTS Data used in this paper will be available in the near future, and the data access URLs will be synchronized on GitHub https://github.com/fishszh/HISF.
References
- Akeret et al. (2017) Akeret J., Chang C., Lucchi A., Refregier A., 2017, Astronomy and Computing, 18, 35
- Barnes et al. (2001) Barnes D. G., et al., 2001, MNRAS, 322, 486
- Baron (2019) Baron D., 2019, arXiv e-prints, p. arXiv:1904.07248
- Cheng et al. (2020) Cheng C., et al., 2020, A&A, 638, L14
- Cui et al. (2023) Cui K., Armstrong D. J., Feng F., 2023, arXiv e-prints, p. arXiv:2311.08080
- Desmons et al. (2023) Desmons A., Brough S., Lanusse F., 2023, arXiv e-prints, p. arXiv:2307.04967
- Fedorov et al. (2012) Fedorov A., et al., 2012, Magnetic Resonance Imaging, 30, 1323
- Giovanelli et al. (2005) Giovanelli R., et al., 2005, AJ, 130, 2598
- Hartley et al. (2023) Hartley P., et al., 2023, MNRAS, 523, 1967
- Hatamizadeh et al. (2022) Hatamizadeh A., Nath V., Tang Y., Yang D., Roth H., Xu D., 2022, arXiv e-prints, p. arXiv:2201.01266
- Isensee et al. (2020) Isensee F., Jaeger P. F., Kohl S. A. A., Petersen J., Maier-Hein K. H., 2020, Nature Methods, 18, 203
- ** et al. (2020) ** L., et al., 2020, EBioMedicine
- Lee et al. (2022) Lee H. H., Bao S., Huo Y., Landman B. A., 2022, arXiv e-prints, p. arXiv:2209.15076
- Li et al. (2018) Li D., et al., 2018, IEEE Microwave Magazine, 19, 112
- Liang et al. (2023) Liang R., et al., 2023, Research in Astronomy and Astrophysics, 23, 115006
- Springob et al. (2005) Springob C. M., Haynes M. P., Giovanelli R., Kent B. R., 2005, ApJS, 160, 149
- Sun et al. (2022) Sun H., Deng H., Wang F., Mei Y., Xu T., Smirnov O., Deng L., Wei S., 2022, MNRAS, 512, 2025
- Tan et al. (2022) Tan L., Mei Y., Liu Z., Luo Y., Deng H., Wang F., Deng L., Liu C., 2022, arXiv e-prints, p. arXiv:2201.08967
- Tey et al. (2023) Tey E., et al., 2023, AJ, 165, 95
- Xiao et al. (2022) Xiao J., Zhang Y., Zhang B., Yang Z., Yu C., Cui C., 2022, New Astron., 96, 101825
- Zhang et al. (2024) Zhang C.-P., et al., 2024, Science China Physics, Mechanics, and Astronomy, 67, 219511
- Çiçek et al. (2016) Çiçek Ö., Abdulkadir A., Lienkamp S. S., Brox T., Ronneberger O., 2016, arXiv e-prints, p. arXiv:1606.06650