Search | arXiv e-print repository

A Novel Generative AI-Based Framework for Anomaly Detection in Multicast Messages in Smart Grid Communications

Authors: Aydin Zaboli, Seong Lok Choi, Tai-** Song, Junho Hong

Abstract: Cybersecurity breaches in digital substations can pose significant challenges to the stability and reliability of power system operations. To address these challenges, defense and mitigation techniques are required. Identifying and detecting anomalies in information and communication technology (ICT) is crucial to ensure secure device interactions within digital substations. This paper proposes a… ▽ More Cybersecurity breaches in digital substations can pose significant challenges to the stability and reliability of power system operations. To address these challenges, defense and mitigation techniques are required. Identifying and detecting anomalies in information and communication technology (ICT) is crucial to ensure secure device interactions within digital substations. This paper proposes a task-oriented dialogue (ToD) system for anomaly detection (AD) in datasets of multicast messages e.g., generic object oriented substation event (GOOSE) and sampled value (SV) in digital substations using large language models (LLMs). This model has a lower potential error and better scalability and adaptability than a process that considers the cybersecurity guidelines recommended by humans, known as the human-in-the-loop (HITL) process. Also, this methodology significantly reduces the effort required when addressing new cyber threats or anomalies compared with machine learning (ML) techniques, since it leaves the models complexity and precision unaffected and offers a faster implementation. These findings present a comparative assessment, conducted utilizing standard and advanced performance evaluation metrics for the proposed AD framework and the HITL process. To generate and extract datasets of IEC 61850 communications, a hardware-in-the-loop (HIL) testbed was employed. △ Less

Submitted 8 June, 2024; originally announced June 2024.

Comments: 10 pages, 10 figures, Submitted to IEEE Transactions on Information Forensics and Security

arXiv:2403.13254 [pdf, other]

Onset and offset weighted loss function for sound event detection

Authors: Tao Song

Abstract: In a typical sound event detection (SED) system, the existence of a sound event is detected at a frame level, and consecutive frames with the same event detected are combined as one sound event. The median filter is applied as a post-processing step to remove detection errors as much as possible. However, detection errors occurring around the onset and offset of a sound event are beyond the capaci… ▽ More In a typical sound event detection (SED) system, the existence of a sound event is detected at a frame level, and consecutive frames with the same event detected are combined as one sound event. The median filter is applied as a post-processing step to remove detection errors as much as possible. However, detection errors occurring around the onset and offset of a sound event are beyond the capacity of the median filter. To address this issue, an onset and offset weighted binary cross-entropy (OWBCE) loss function is proposed in this paper, which trains the DNN model to be more robust on frames around (a) onsets and offsets. Experiments are carried out in the context of DCASE 2022 task 4. Results show that OWBCE outperforms BCE when different models are considered. For a basic CRNN, relative improvements of 6.43% in event-F1, 1.96% in PSDS1, and 2.43% in PSDS2 can be achieved by OWBCE. △ Less

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.13252 [pdf, other]

Frequency-aware convolution for sound event detection

Authors: Tao Song

Abstract: In sound event detection (SED), convolution neural networks (CNNs) are widely used to extract time-frequency patterns from the input spectrogram. However, features extracted by CNN can be insensitive to the shift of time-frequency patterns along the frequency axis. To address this issue, frequency dynamic convolution (FDY) has been proposed, which applies different kernels to different frequency c… ▽ More In sound event detection (SED), convolution neural networks (CNNs) are widely used to extract time-frequency patterns from the input spectrogram. However, features extracted by CNN can be insensitive to the shift of time-frequency patterns along the frequency axis. To address this issue, frequency dynamic convolution (FDY) has been proposed, which applies different kernels to different frequency components. Compared to the vannila CNN, FDY requires several times more parameters. In this paper, a more efficient solution named frequency-aware convolution (FAC) is proposed. In FAC, frequency-positional information is encoded in a vector and added to the input spectrogram. To match the amplitude of input, the encoding vector is scaled adaptively and channel-independently. Experiments are carried out in the context of DCASE 2022 task 4, and the results demonstrate that FAC can achieve comparable performance to that of FDY with only 515 additional parameters, while FDY requires 8.02 million additional parameters. The ablation study shows that scaling the encoding vector adaptively and channel-independently is critical to the performance of FAC. △ Less

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2402.18923 [pdf, other]

Inappropriate Pause Detection In Dysarthric Speech Using Large-Scale Speech Recognition

Authors: Jeehyun Lee, Yerin Choi, Tae-** Song, Myoung-Wan Koo

Abstract: Dysarthria, a common issue among stroke patients, severely impacts speech intelligibility. Inappropriate pauses are crucial indicators in severity assessment and speech-language therapy. We propose to extend a large-scale speech recognition model for inappropriate pause detection in dysarthric speech. To this end, we propose task design, labeling strategy, and a speech recognition model with an in… ▽ More Dysarthria, a common issue among stroke patients, severely impacts speech intelligibility. Inappropriate pauses are crucial indicators in severity assessment and speech-language therapy. We propose to extend a large-scale speech recognition model for inappropriate pause detection in dysarthric speech. To this end, we propose task design, labeling strategy, and a speech recognition model with an inappropriate pause prediction layer. First, we treat pause detection as speech recognition, using an automatic speech recognition (ASR) model to convert speech into text with pause tags. According to the newly designed task, we label pause locations at the text level and their appropriateness. We collaborate with speech-language pathologists to establish labeling criteria, ensuring high-quality annotated data. Finally, we extend the ASR model with an inappropriate pause prediction layer for end-to-end inappropriate pause detection. Moreover, we propose a task-tailored metric for evaluating inappropriate pause detection independent of ASR performance. Our experiments show that the proposed method better detects inappropriate pauses in dysarthric speech than baselines. (Inappropriate Pause Error Rate: 14.47%) △ Less

Submitted 29 February, 2024; originally announced February 2024.

Comments: Accepted to ICASSP 2024

arXiv:2311.08024 [pdf, other]

MD-IQA: Learning Multi-scale Distributed Image Quality Assessment with Semi Supervised Learning for Low Dose CT

Authors: Tao Song, Ruizhi Hou, Lisong Dai, Lei Xiang

Abstract: Image quality assessment (IQA) plays a critical role in optimizing radiation dose and develo** novel medical imaging techniques in computed tomography (CT). Traditional IQA methods relying on hand-crafted features have limitations in summarizing the subjective perceptual experience of image quality. Recent deep learning-based approaches have demonstrated strong modeling capabilities and potentia… ▽ More Image quality assessment (IQA) plays a critical role in optimizing radiation dose and develo** novel medical imaging techniques in computed tomography (CT). Traditional IQA methods relying on hand-crafted features have limitations in summarizing the subjective perceptual experience of image quality. Recent deep learning-based approaches have demonstrated strong modeling capabilities and potential for medical IQA, but challenges remain regarding model generalization and perceptual accuracy. In this work, we propose a multi-scale distributions regression approach to predict quality scores by constraining the output distribution, thereby improving model generalization. Furthermore, we design a dual-branch alignment network to enhance feature extraction capabilities. Additionally, semi-supervised learning is introduced by utilizing pseudo-labels for unlabeled data to guide model training. Extensive qualitative experiments demonstrate the effectiveness of our proposed method for advancing the state-of-the-art in deep learning-based medical IQA. Code is available at: https://github.com/zunzhumu/MD-IQA. △ Less

Submitted 14 November, 2023; originally announced November 2023.

arXiv:2311.05462 [pdf, other]

ChatGPT and Other Large Language Models for Cybersecurity of Smart Grid Applications

Authors: Aydin Zaboli, Seong Lok Choi, Tai-** Song, Junho Hong

Abstract: Cybersecurity breaches targeting electrical substations constitute a significant threat to the integrity of the power grid, necessitating comprehensive defense and mitigation strategies. Any anomaly in information and communication technology (ICT) should be detected for secure communications between devices in digital substations. This paper proposes large language models (LLM), e.g., ChatGPT, fo… ▽ More Cybersecurity breaches targeting electrical substations constitute a significant threat to the integrity of the power grid, necessitating comprehensive defense and mitigation strategies. Any anomaly in information and communication technology (ICT) should be detected for secure communications between devices in digital substations. This paper proposes large language models (LLM), e.g., ChatGPT, for the cybersecurity of IEC 61850-based digital substation communications. Multicast messages such as generic object oriented system event (GOOSE) and sampled value (SV) are used for case studies. The proposed LLM-based cybersecurity framework includes, for the first time, data pre-processing of communication systems and human-in-the-loop (HITL) training (considering the cybersecurity guidelines recommended by humans). The results show a comparative analysis of detected anomaly data carried out based on the performance evaluation metrics for different LLMs. A hardware-in-the-loop (HIL) testbed is used to generate and extract dataset of IEC 61850 communications. △ Less

Submitted 25 February, 2024; v1 submitted 9 November, 2023; originally announced November 2023.

Comments: 5 pages, 2 figures, Accepted, 2024 IEEE Power & Energy Society General Meeting (PESGM), Seattle, WA, USA

arXiv:2310.19202 [pdf]

Improved Motor Imagery Classification Using Adaptive Spatial Filters Based on Particle Swarm Optimization Algorithm

Authors: Xiong Xiong, Ying Wang, Tianyuan Song, **guo Huang, Guixia Kang

Abstract: As a typical self-paced brain-computer interface (BCI) system, the motor imagery (MI) BCI has been widely applied in fields such as robot control, stroke rehabilitation, and assistance for patients with stroke or spinal cord injury. Many studies have focused on the traditional spatial filters obtained through the common spatial pattern (CSP) method. However, the CSP method can only obtain fixed sp… ▽ More As a typical self-paced brain-computer interface (BCI) system, the motor imagery (MI) BCI has been widely applied in fields such as robot control, stroke rehabilitation, and assistance for patients with stroke or spinal cord injury. Many studies have focused on the traditional spatial filters obtained through the common spatial pattern (CSP) method. However, the CSP method can only obtain fixed spatial filters for specific input signals. Besides, CSP method only focuses on the variance difference of two types of electroencephalogram (EEG) signals, so the decoding ability of EEG signals is limited. To obtain more effective spatial filters for better extraction of spatial features that can improve classification to MI-EEG, this paper proposes an adaptive spatial filter solving method based on particle swarm optimization algorithm (PSO). A training and testing framework based on filter bank and spatial filters (FBCSP-ASP) is designed for MI EEG signal classification. Comparative experiments are conducted on two public datasets (2a and 2b) from BCI competition IV, which show the outstanding average recognition accuracy of FBCSP-ASP. The proposed method has achieved significant performance improvement on MI-BCI. The classification accuracy of the proposed method has reached 74.61% and 81.19% on datasets 2a and 2b, respectively. Compared with the baseline algorithm (FBCSP), the proposed algorithm improves 11.44% and 7.11% on two datasets respectively. Furthermore, the analysis based on mutual information, t-SNE and Shapley values further proves that ASP features have excellent decoding ability for MI-EEG signals, and explains the improvement of classification performance by the introduction of ASP features. △ Less

Submitted 29 October, 2023; originally announced October 2023.

Comments: 25 pages, 8 figures

arXiv:2309.11715

Deshadow-Anything: When Segment Anything Model Meets Zero-shot shadow removal

Authors: Xiao Feng Zhang, Tian Yi Song, Jia Wei Yao

Abstract: Segment Anything (SAM), an advanced universal image segmentation model trained on an expansive visual dataset, has set a new benchmark in image segmentation and computer vision. However, it faced challenges when it came to distinguishing between shadows and their backgrounds. To address this, we developed Deshadow-Anything, considering the generalization of large-scale datasets, and we performed F… ▽ More Segment Anything (SAM), an advanced universal image segmentation model trained on an expansive visual dataset, has set a new benchmark in image segmentation and computer vision. However, it faced challenges when it came to distinguishing between shadows and their backgrounds. To address this, we developed Deshadow-Anything, considering the generalization of large-scale datasets, and we performed Fine-tuning on large-scale datasets to achieve image shadow removal. The diffusion model can diffuse along the edges and textures of an image, hel** to remove shadows while preserving the details of the image. Furthermore, we design Multi-Self-Attention Guidance (MSAG) and adaptive input perturbation (DDPM-AIP) to accelerate the iterative training speed of diffusion. Experiments on shadow removal tasks demonstrate that these methods can effectively improve image restoration performance. △ Less

Submitted 2 January, 2024; v1 submitted 20 September, 2023; originally announced September 2023.

Comments: it needs revised

arXiv:2307.00969 [pdf, other]

High Altitude Platform Stations: the New Network Energy Efficiency Enabler in the 6G Era

Authors: Tailai Song, David Lopez, Michela Meo, Nicola Piovesan, Daniela Renga

Abstract: The rapidly evolving communication landscape, with the advent of 6G technology, brings new challenges to the design and operation of wireless networks. One of the key concerns is the energy efficiency of the Radio Access Network (RAN), as the exponential growth in wireless traffic demands increasingly higher energy consumption. In this paper, we assess the potential of integrating a High Altitude… ▽ More The rapidly evolving communication landscape, with the advent of 6G technology, brings new challenges to the design and operation of wireless networks. One of the key concerns is the energy efficiency of the Radio Access Network (RAN), as the exponential growth in wireless traffic demands increasingly higher energy consumption. In this paper, we assess the potential of integrating a High Altitude Platform Station (HAPS) to improve the energy efficiency of a RAN, and quantify the potential energy conservation through meticulously designed simulations. We propose a quantitative framework based on real traffic patterns to estimate the energy consumption of the HAPS integrated RAN and compare it with the conventional terrestrial RAN. Our simulation results elucidate that HAPS can significantly reduce energy consumption by up to almost 30\% by exploiting the unique advantages of HAPS, such as its self-sustainability, high altitude, and wide coverage. We further analyze the impact of different system parameters on performance, and provide insights for the design and optimization of future 6G networks. Our work sheds light on the potential of HAPS integrated RAN to mitigate the energy challenges in the 6G era, and contributes to the sustainable development of wireless communications. △ Less

Submitted 3 July, 2023; originally announced July 2023.

arXiv:2305.17860 [pdf, other]

speech and noise dual-stream spectrogram refine network with speech distortion loss for robust speech recognition

Authors: Haoyu Lu, Nan Li, Tongtong Song, Longbiao Wang, Jianwu Dang, Xiaobao Wang, Shiliang Zhang

Abstract: In recent years, the joint training of speech enhancement front-end and automatic speech recognition (ASR) back-end has been widely used to improve the robustness of ASR systems. Traditional joint training methods only use enhanced speech as input for the backend. However, it is difficult for speech enhancement systems to directly separate speech from input due to the diverse types of noise with d… ▽ More In recent years, the joint training of speech enhancement front-end and automatic speech recognition (ASR) back-end has been widely used to improve the robustness of ASR systems. Traditional joint training methods only use enhanced speech as input for the backend. However, it is difficult for speech enhancement systems to directly separate speech from input due to the diverse types of noise with different intensities. Furthermore, speech distortion and residual noise are often observed in enhanced speech, and the distortion of speech and noise is different. Most existing methods focus on fusing enhanced and noisy features to address this issue. In this paper, we propose a dual-stream spectrogram refine network to simultaneously refine the speech and noise and decouple the noise from the noisy input. Our proposed method can achieve better performance with a relative 8.6% CER reduction. △ Less

Submitted 30 May, 2023; v1 submitted 28 May, 2023; originally announced May 2023.

arXiv:2305.16789 [pdf, other]

Modulate Your Spectrum in Self-Supervised Learning

Authors: Xi Weng, Yunhao Ni, Tengwei Song, Jie Luo, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan, Lei Huang

Abstract: Whitening loss offers a theoretical guarantee against feature collapse in self-supervised learning (SSL) with joint embedding architectures. Typically, it involves a hard whitening approach, transforming the embedding and applying loss to the whitened output. In this work, we introduce Spectral Transformation (ST), a framework to modulate the spectrum of embedding and to seek for functions beyond… ▽ More Whitening loss offers a theoretical guarantee against feature collapse in self-supervised learning (SSL) with joint embedding architectures. Typically, it involves a hard whitening approach, transforming the embedding and applying loss to the whitened output. In this work, we introduce Spectral Transformation (ST), a framework to modulate the spectrum of embedding and to seek for functions beyond whitening that can avoid dimensional collapse. We show that whitening is a special instance of ST by definition, and our empirical investigations unveil other ST instances capable of preventing collapse. Additionally, we propose a novel ST instance named IterNorm with trace loss (INTL). Theoretical analysis confirms INTL's efficacy in preventing collapse and modulating the spectrum of embedding toward equal-eigenvalues during optimization. Our experiments on ImageNet classification and COCO object detection demonstrate INTL's potential in learning superior representations. The code is available at https://github.com/winci-ai/INTL. △ Less

Submitted 21 January, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

Comments: Accepted at ICLR 2024. The code is available at https://github.com/winci-ai/intl

arXiv:2211.08530 [pdf, ps, other]

Cyber-Attack Event Analysis for EV Charging Stations

Authors: Mansi Girdhar, Junho Hong, Yongsik You, Tai-** Song, Manimaran Govindarasu

Abstract: Safe and secure electric vehicle charging stations (EVCSs) are important in smart transportation infrastructure. The prevalence of EVCSs has rapidly increased over time in response to the rising demand for EV charging. However, developments in information and communication technologies (ICT) have made the cyber-physical system (CPS) of EVCSs susceptible to cyber-attacks, which might destabilize th… ▽ More Safe and secure electric vehicle charging stations (EVCSs) are important in smart transportation infrastructure. The prevalence of EVCSs has rapidly increased over time in response to the rising demand for EV charging. However, developments in information and communication technologies (ICT) have made the cyber-physical system (CPS) of EVCSs susceptible to cyber-attacks, which might destabilize the infrastructure of the electric grid as well as the environment for charging. This study suggests a 5Ws \& 1H-based investigation approach to deal with cyber-attack-related incidents due to the incapacity of the current investigation frameworks to comprehend and handle these mishaps. Also, a stochastic anomaly detection system (ADS) is proposed to identify the anomalies, abnormal activities, and unusual operations of the station entities as a post cyber event analysis. △ Less

Submitted 15 November, 2022; originally announced November 2022.

Comments: 5 Pages, 2 Figures, 2 Tables, 10 Mathematical Equations, PES GM Conference Paper

arXiv:2211.06136 [pdf, other]

Fleet Rebalancing for Expanding Shared e-Mobility Systems: A Multi-agent Deep Reinforcement Learning Approach

Authors: Man Luo, Bowen Du, Wenzhe Zhang, Tianyou Song, Kun Li, Hongming Zhu, Mark Birkin, Hongkai Wen

Abstract: The electrification of shared mobility has become popular across the globe. Many cities have their new shared e-mobility systems deployed, with continuously expanding coverage from central areas to the city edges. A key challenge in the operation of these systems is fleet rebalancing, i.e., how EVs should be repositioned to better satisfy future demand. This is particularly challenging in the cont… ▽ More The electrification of shared mobility has become popular across the globe. Many cities have their new shared e-mobility systems deployed, with continuously expanding coverage from central areas to the city edges. A key challenge in the operation of these systems is fleet rebalancing, i.e., how EVs should be repositioned to better satisfy future demand. This is particularly challenging in the context of expanding systems, because i) the range of the EVs is limited while charging time is typically long, which constrain the viable rebalancing operations; and ii) the EV stations in the system are dynamically changing, i.e., the legitimate targets for rebalancing operations can vary over time. We tackle these challenges by first investigating rich sets of data collected from a real-world shared e-mobility system for one year, analyzing the operation model, usage patterns and expansion dynamics of this new mobility mode. With the learned knowledge we design a high-fidelity simulator, which is able to abstract key operation details of EV sharing at fine granularity. Then we model the rebalancing task for shared e-mobility systems under continuous expansion as a Multi-Agent Reinforcement Learning (MARL) problem, which directly takes the range and charging properties of the EVs into account. We further propose a novel policy optimization approach with action cascading, which is able to cope with the expansion dynamics and solve the formulated MARL. We evaluate the proposed approach extensively, and experimental results show that our approach outperforms the state-of-the-art, offering significant performance gain in both satisfied demand and net revenue. △ Less

Submitted 11 November, 2022; originally announced November 2022.

arXiv:2211.01046 [pdf, other]

Monolingual Recognizers Fusion for Code-switching Speech Recognition

Authors: Tongtong Song, Qiang Xu, Haoyu Lu, Longbiao Wang, Hao Shi, Yuqin Lin, Yanbing Yang, Jianwu Dang

Abstract: The bi-encoder structure has been intensively investigated in code-switching (CS) automatic speech recognition (ASR). However, most existing methods require the structures of two monolingual ASR models (MAMs) should be the same and only use the encoder of MAMs. This leads to the problem that pre-trained MAMs cannot be timely and fully used for CS ASR. In this paper, we propose a monolingual recogn… ▽ More The bi-encoder structure has been intensively investigated in code-switching (CS) automatic speech recognition (ASR). However, most existing methods require the structures of two monolingual ASR models (MAMs) should be the same and only use the encoder of MAMs. This leads to the problem that pre-trained MAMs cannot be timely and fully used for CS ASR. In this paper, we propose a monolingual recognizers fusion method for CS ASR. It has two stages: the speech awareness (SA) stage and the language fusion (LF) stage. In the SA stage, acoustic features are mapped to two language-specific predictions by two independent MAMs. To keep the MAMs focused on their own language, we further extend the language-aware training strategy for the MAMs. In the LF stage, the BELM fuses two language-specific predictions to get the final prediction. Moreover, we propose a text simulation strategy to simplify the training process of the BELM and reduce reliance on CS data. Experiments on a Mandarin-English corpus show the efficiency of the proposed method. The mix error rate is significantly reduced on the test set after using open-source pre-trained MAMs. △ Less

Submitted 2 November, 2022; originally announced November 2022.

Comments: Submitted to ICASSP2023

arXiv:2208.10644 [pdf, other]

Machine Learning-Enabled Cyber Attack Prediction and Mitigation for EV Charging Stations

Authors: Mansi Girdhar, Junho Hong, Yongsik Yoo, Tai-** Song

Abstract: Safe and reliable electric vehicle charging stations (EVCSs) have become imperative in an intelligent transportation infrastructure. Over the years, there has been a rapid increase in the deployment of EVCSs to address the upsurging charging demands. However, advances in information and communication technologies (ICT) have rendered this cyber-physical system (CPS) vulnerable to suffering cyber th… ▽ More Safe and reliable electric vehicle charging stations (EVCSs) have become imperative in an intelligent transportation infrastructure. Over the years, there has been a rapid increase in the deployment of EVCSs to address the upsurging charging demands. However, advances in information and communication technologies (ICT) have rendered this cyber-physical system (CPS) vulnerable to suffering cyber threats, thereby destabilizing the charging ecosystem and even the entire electric grid infrastructure. This paper develops an advanced cybersecurity framework, where STRIDE threat modeling is used to identify potential vulnerabilities in an EVCS. Further, the weighted attack defense tree approach is employed to create multiple attack scenarios, followed by develo** Hidden Markov Model (HMM) and Partially Observable Monte-Carlo Planning (POMCP) algorithms for modeling the security attacks. Also, potential mitigation strategies are suggested for the identified threats. △ Less

Submitted 22 August, 2022; originally announced August 2022.

Comments: 5 pages, 4 figures, 11 mathematical equations

arXiv:2206.14580 [pdf, other]

Language-specific Characteristic Assistance for Code-switching Speech Recognition

Authors: Tongtong Song, Qiang Xu, Meng Ge, Longbiao Wang, Hao Shi, Yongjie Lv, Yuqin Lin, Jianwu Dang

Abstract: Dual-encoder structure successfully utilizes two language-specific encoders (LSEs) for code-switching speech recognition. Because LSEs are initialized by two pre-trained language-specific models (LSMs), the dual-encoder structure can exploit sufficient monolingual data and capture the individual language attributes. However, most existing methods have no language constraints on LSEs and underutili… ▽ More Dual-encoder structure successfully utilizes two language-specific encoders (LSEs) for code-switching speech recognition. Because LSEs are initialized by two pre-trained language-specific models (LSMs), the dual-encoder structure can exploit sufficient monolingual data and capture the individual language attributes. However, most existing methods have no language constraints on LSEs and underutilize language-specific knowledge of LSMs. In this paper, we propose a language-specific characteristic assistance (LSCA) method to mitigate the above problems. Specifically, during training, we introduce two language-specific losses as language constraints and generate corresponding language-specific targets for them. During decoding, we take the decoding abilities of LSMs into account by combining the output probabilities of two LSMs and the mixture model to obtain the final predictions. Experiments show that either the training or decoding method of LSCA can improve the model's performance. Furthermore, the best result can obtain up to 15.4% relative error reduction on the code-switching test set by combining the training and decoding methods of LSCA. Moreover, the system can process code-switching speech recognition tasks well without extra shared parameters or even retraining based on two pre-trained LSMs by using our method. △ Less

Submitted 11 July, 2022; v1 submitted 29 June, 2022; originally announced June 2022.

Comments: Accepted by Interspeech 2022

arXiv:2203.02106 [pdf, other]

Scribble-Supervised Medical Image Segmentation via Dual-Branch Network and Dynamically Mixed Pseudo Labels Supervision

Authors: Xiangde Luo, Minhao Hu, Wenjun Liao, Shuwei Zhai, Tao Song, Guotai Wang, Shaoting Zhang

Abstract: Medical image segmentation plays an irreplaceable role in computer-assisted diagnosis, treatment planning, and following-up. Collecting and annotating a large-scale dataset is crucial to training a powerful segmentation model, but producing high-quality segmentation masks is an expensive and time-consuming procedure. Recently, weakly-supervised learning that uses sparse annotations (points, scribb… ▽ More Medical image segmentation plays an irreplaceable role in computer-assisted diagnosis, treatment planning, and following-up. Collecting and annotating a large-scale dataset is crucial to training a powerful segmentation model, but producing high-quality segmentation masks is an expensive and time-consuming procedure. Recently, weakly-supervised learning that uses sparse annotations (points, scribbles, bounding boxes) for network training has achieved encouraging performance and shown the potential for annotation cost reduction. However, due to the limited supervision signal of sparse annotations, it is still challenging to employ them for networks training directly. In this work, we propose a simple yet efficient scribble-supervised image segmentation method and apply it to cardiac MRI segmentation. Specifically, we employ a dual-branch network with one encoder and two slightly different decoders for image segmentation and dynamically mix the two decoders' predictions to generate pseudo labels for auxiliary supervision. By combining the scribble supervision and auxiliary pseudo labels supervision, the dual-branch network can efficiently learn from scribble annotations end-to-end. Experiments on the public ACDC dataset show that our method performs better than current scribble-supervised segmentation methods and also outperforms several semi-supervised segmentation methods. △ Less

Submitted 3 March, 2022; originally announced March 2022.

Comments: 11 pages, 4 figures,code is available: https://github.com/HiLab-git/WSL4MIS.This is a comprehensive study about scribble-supervised medical image segmentation based on the ACDC dataset

arXiv:2112.04894 [pdf, other]

Semi-Supervised Medical Image Segmentation via Cross Teaching between CNN and Transformer

Authors: Xiangde Luo, Minhao Hu, Tao Song, Guotai Wang, Shaoting Zhang

Abstract: Recently, deep learning with Convolutional Neural Networks (CNNs) and Transformers has shown encouraging results in fully supervised medical image segmentation. However, it is still challenging for them to achieve good performance with limited annotations for training. In this work, we present a very simple yet efficient framework for semi-supervised medical image segmentation by introducing the c… ▽ More Recently, deep learning with Convolutional Neural Networks (CNNs) and Transformers has shown encouraging results in fully supervised medical image segmentation. However, it is still challenging for them to achieve good performance with limited annotations for training. In this work, we present a very simple yet efficient framework for semi-supervised medical image segmentation by introducing the cross teaching between CNN and Transformer. Specifically, we simplify the classical deep co-training from consistency regularization to cross teaching, where the prediction of a network is used as the pseudo label to supervise the other network directly end-to-end. Considering the difference in learning paradigm between CNN and Transformer, we introduce the Cross Teaching between CNN and Transformer rather than just using CNNs. Experiments on a public benchmark show that our method outperforms eight existing semi-supervised learning methods just with a simpler framework. Notably, this work may be the first attempt to combine CNN and transformer for semi-supervised medical image segmentation and achieve promising results on a public benchmark. The code will be released at: https://github.com/HiLab-git/SSL4MIS. △ Less

Submitted 1 March, 2022; v1 submitted 9 December, 2021; originally announced December 2021.

Comments: accepted to MIDL2022, code in SSL4MIS:https://github.com/HiLab-git/SSL4MIS

arXiv:2111.02403 [pdf, other]

doi 10.1016/j.media.2022.102642

WORD: A large scale dataset, benchmark and clinical applicable study for abdominal organ segmentation from CT image

Authors: Xiangde Luo, Wenjun Liao, Jianghong Xiao, Jieneng Chen, Tao Song, Xiaofan Zhang, Kang Li, Dimitris N. Metaxas, Guotai Wang, Shaoting Zhang

Abstract: Whole abdominal organ segmentation is important in diagnosing abdomen lesions, radiotherapy, and follow-up. However, oncologists' delineating all abdominal organs from 3D volumes is time-consuming and very expensive. Deep learning-based medical image segmentation has shown the potential to reduce manual delineation efforts, but it still requires a large-scale fine annotated dataset for training, a… ▽ More Whole abdominal organ segmentation is important in diagnosing abdomen lesions, radiotherapy, and follow-up. However, oncologists' delineating all abdominal organs from 3D volumes is time-consuming and very expensive. Deep learning-based medical image segmentation has shown the potential to reduce manual delineation efforts, but it still requires a large-scale fine annotated dataset for training, and there is a lack of large-scale datasets covering the whole abdomen region with accurate and detailed annotations for the whole abdominal organ segmentation. In this work, we establish a new large-scale \textit{W}hole abdominal \textit{OR}gan \textit{D}ataset (\textit{WORD}) for algorithm research and clinical application development. This dataset contains 150 abdominal CT volumes (30495 slices). Each volume has 16 organs with fine pixel-level annotations and scribble-based sparse annotations, which may be the largest dataset with whole abdominal organ annotation. Several state-of-the-art segmentation methods are evaluated on this dataset. And we also invited three experienced oncologists to revise the model predictions to measure the gap between the deep learning method and oncologists. Afterwards, we investigate the inference-efficient learning on the WORD, as the high-resolution image requires large GPU memory and a long inference time in the test stage. We further evaluate the scribble-based annotation-efficient learning on this dataset, as the pixel-wise manual annotation is time-consuming and expensive. The work provided a new benchmark for the abdominal multi-organ segmentation task, and these experiments can serve as the baseline for future research and clinical application development. △ Less

Submitted 12 February, 2023; v1 submitted 2 November, 2021; originally announced November 2021.

Comments: Accepted to Medical Image Analysis, dataset at: https://github.com/HiLab-git/WORD (we corrected the results or description in this version.)

arXiv:2107.13595 [pdf]

doi 10.1016/j.cpet.2021.06.005

Artificial Intelligence-Based Image Enhancement in PET Imaging: Noise Reduction and Resolution Enhancement

Authors: Juan Liu, Masoud Malekzadeh, Niloufar Mirian, Tzu-An Song, Chi Liu, Joyita Dutta

Abstract: High noise and low spatial resolution are two key confounding factors that limit the qualitative and quantitative accuracy of PET images. AI models for image denoising and deblurring are becoming increasingly popular for post-reconstruction enhancement of PET images. We present here a detailed review of recent efforts for AI-based PET image enhancement with a focus on network architectures, data t… ▽ More High noise and low spatial resolution are two key confounding factors that limit the qualitative and quantitative accuracy of PET images. AI models for image denoising and deblurring are becoming increasingly popular for post-reconstruction enhancement of PET images. We present here a detailed review of recent efforts for AI-based PET image enhancement with a focus on network architectures, data types, loss functions, and evaluation metrics. We also highlight emerging areas in this field that are quickly gaining popularity, identify barriers to large-scale adoption of AI models for PET image enhancement, and discuss future directions. △ Less

Submitted 28 July, 2021; originally announced July 2021.

arXiv:2009.10549 [pdf, other]

doi 10.1109/TMI.2020.3035253

CA-Net: Comprehensive Attention Convolutional Neural Networks for Explainable Medical Image Segmentation

Authors: Ran Gu, Guotai Wang, Tao Song, Rui Huang, Michael Aertsen, Jan Deprest, Sébastien Ourselin, Tom Vercauteren, Shaoting Zhang

Abstract: Accurate medical image segmentation is essential for diagnosis and treatment planning of diseases. Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance for automatic medical image segmentation. However, they are still challenged by complicated conditions where the segmentation target has large variations of position, shape and scale, and existing CNNs have a poor explain… ▽ More Accurate medical image segmentation is essential for diagnosis and treatment planning of diseases. Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance for automatic medical image segmentation. However, they are still challenged by complicated conditions where the segmentation target has large variations of position, shape and scale, and existing CNNs have a poor explainability that limits their application to clinical decisions. In this work, we make extensive use of multiple attentions in a CNN architecture and propose a comprehensive attention-based CNN (CA-Net) for more accurate and explainable medical image segmentation that is aware of the most important spatial positions, channels and scales at the same time. In particular, we first propose a joint spatial attention module to make the network focus more on the foreground region. Then, a novel channel attention module is proposed to adaptively recalibrate channel-wise feature responses and highlight the most relevant feature channels. Also, we propose a scale attention module implicitly emphasizing the most salient feature maps among multiple scales so that the CNN is adaptive to the size of an object. Extensive experiments on skin lesion segmentation from ISIC 2018 and multi-class segmentation of fetal MRI found that our proposed CA-Net significantly improved the average segmentation Dice score from 87.77% to 92.08% for skin lesion, 84.79% to 87.08% for the placenta and 93.20% to 95.88% for the fetal brain respectively compared with U-Net. It reduced the model size to around 15 times smaller with close or even better accuracy compared with state-of-the-art DeepLabv3+. In addition, it has a much higher explainability than existing networks by visualizing the attention weight maps. Our code is available at https://github.com/HiLab-git/CA-Net △ Less

Submitted 22 September, 2020; v1 submitted 22 September, 2020; originally announced September 2020.

arXiv:2007.03294 [pdf, other]

doi 10.1016/j.media.2020.101787

Automatic Ischemic Stroke Lesion Segmentation from Computed Tomography Perfusion Images by Image Synthesis and Attention-Based Deep Neural Networks

Authors: Guotai Wang, Tao Song, Qiang Dong, Mei Cui, Ning Huang, Shaoting Zhang

Abstract: Ischemic stroke lesion segmentation from Computed Tomography Perfusion (CTP) images is important for accurate diagnosis of stroke in acute care units. However, it is challenged by low image contrast and resolution of the perfusion parameter maps, in addition to the complex appearance of the lesion. To deal with this problem, we propose a novel framework based on synthesized pseudo Diffusion-Weight… ▽ More Ischemic stroke lesion segmentation from Computed Tomography Perfusion (CTP) images is important for accurate diagnosis of stroke in acute care units. However, it is challenged by low image contrast and resolution of the perfusion parameter maps, in addition to the complex appearance of the lesion. To deal with this problem, we propose a novel framework based on synthesized pseudo Diffusion-Weighted Imaging (DWI) from perfusion parameter maps to obtain better image quality for more accurate segmentation. Our framework consists of three components based on Convolutional Neural Networks (CNNs) and is trained end-to-end. First, a feature extractor is used to obtain both a low-level and high-level compact representation of the raw spatiotemporal Computed Tomography Angiography (CTA) images. Second, a pseudo DWI generator takes as input the concatenation of CTP perfusion parameter maps and our extracted features to obtain the synthesized pseudo DWI. To achieve better synthesis quality, we propose a hybrid loss function that pays more attention to lesion regions and encourages high-level contextual consistency. Finally, we segment the lesion region from the synthesized pseudo DWI, where the segmentation network is based on switchable normalization and channel calibration for better performance. Experimental results showed that our framework achieved the top performance on ISLES 2018 challenge and: 1) our method using synthesized pseudo DWI outperformed methods segmenting the lesion from perfusion parameter maps directly; 2) the feature extractor exploiting additional spatiotemporal CTA images led to better synthesized pseudo DWI quality and higher segmentation accuracy; and 3) the proposed loss functions and network structure improved the pseudo DWI synthesis and lesion segmentation performance. △ Less

Submitted 7 July, 2020; originally announced July 2020.

Comments: 14 pages, 10 figures

arXiv:2005.08701 [pdf, other]

doi 10.1007/s40042-021-00056-8

Machine learning for the diagnosis of early stage diabetes using temporal glucose profiles

Authors: Woo Seok Lee, Junghyo Jo, Taegeun Song

Abstract: Machine learning shows remarkable success for recognizing patterns in data. Here we apply the machine learning (ML) for the diagnosis of early stage diabetes, which is known as a challenging task in medicine. Blood glucose levels are tightly regulated by two counter-regulatory hormones, insulin and glucagon, and the failure of the glucose homeostasis leads to the common metabolic disease, diabetes… ▽ More Machine learning shows remarkable success for recognizing patterns in data. Here we apply the machine learning (ML) for the diagnosis of early stage diabetes, which is known as a challenging task in medicine. Blood glucose levels are tightly regulated by two counter-regulatory hormones, insulin and glucagon, and the failure of the glucose homeostasis leads to the common metabolic disease, diabetes mellitus. It is a chronic disease that has a long latent period the complicates detection of the disease at an early stage. The vast majority of diabetics result from that diminished effectiveness of insulin action. The insulin resistance must modify the temporal profile of blood glucose. Thus we propose to use ML to detect the subtle change in the temporal pattern of glucose concentration. Time series data of blood glucose with sufficient resolution is currently unavailable, so we confirm the proposal using synthetic data of glucose profiles produced by a biophysical model that considers the glucose regulation and hormone action. Multi-layered perceptrons, convolutional neural networks, and recurrent neural networks all identified the degree of insulin resistance with high accuracy above $85\%$. △ Less

Submitted 18 May, 2020; originally announced May 2020.

Comments: 4 pages, 2 figure

arXiv:2004.08063 [pdf, other]

Outage Analysis for Intelligent Reflecting Surface Assisted Vehicular Communication Networks

Authors: Jue Wang, Wence Zhang, Xu Bao, Tiecheng Song, Cunhua Pan

Abstract: Vehicular communication is an important application of the fifth generation of mobile communication systems (5G). Due to its low cost and energy efficiency, intelligent reflecting surface (IRS) has been envisioned as a promising technique that can enhance the coverage performance significantly by passive beamforming. In this paper, we analyze the outage probability performance in IRS-assisted vehi… ▽ More Vehicular communication is an important application of the fifth generation of mobile communication systems (5G). Due to its low cost and energy efficiency, intelligent reflecting surface (IRS) has been envisioned as a promising technique that can enhance the coverage performance significantly by passive beamforming. In this paper, we analyze the outage probability performance in IRS-assisted vehicular communication networks. We derive the expression of outage probability by utilizing series expansion and central limit theorem. Numerical results show that the IRS can significantly reduce the outage probability for vehicles in its vicinity. The outage probability is closely related to the vehicle density and the number of IRS elements, and better performance is achieved with more reflecting elements. △ Less

Submitted 20 April, 2020; v1 submitted 17 April, 2020; originally announced April 2020.

arXiv:2004.07031 [pdf, other]

SenseCare: A Research Platform for Medical Image Informatics and Interactive 3D Visualization

Authors: Qi Duan, Guotai Wang, Rui Wang, Chao Fu, Xinjun Li, Na Wang, Yechong Huang, Xiaodi Huang, Tao Song, Liang Zhao, Xinglong Liu, Qing Xia, Zhiqiang Hu, Yinan Chen, Shaoting Zhang

Abstract: Clinical research on smart health has an increasing demand for intelligent and clinic-oriented medical image computing algorithms and platforms that support various applications. To this end, we have developed SenseCare research platform, which is designed to facilitate translational research on intelligent diagnosis and treatment planning in various clinical scenarios. To enable clinical research… ▽ More Clinical research on smart health has an increasing demand for intelligent and clinic-oriented medical image computing algorithms and platforms that support various applications. To this end, we have developed SenseCare research platform, which is designed to facilitate translational research on intelligent diagnosis and treatment planning in various clinical scenarios. To enable clinical research with Artificial Intelligence (AI), SenseCare provides a range of AI toolkits for different tasks, including image segmentation, registration, lesion and landmark detection from various image modalities ranging from radiology to pathology. In addition, SenseCare is clinic-oriented and supports a wide range of clinical applications such as diagnosis and surgical planning for lung cancer, pelvic tumor, coronary artery disease, etc. SenseCare provides several appealing functions and features such as advanced 3D visualization, concurrent and efficient web-based access, fast data synchronization and high data security, multi-center deployment, support for collaborative research, etc. In this report, we present an overview of SenseCare as an efficient platform providing comprehensive toolkits and high extensibility for intelligent image analysis and clinical research in different application scenarios. We also summarize the research outcome through the collaboration with multiple hospitals. △ Less

Submitted 2 September, 2022; v1 submitted 2 April, 2020; originally announced April 2020.

Comments: 15 pages, 16 figures

arXiv:2003.02260 [pdf, other]

Spatiotemporal-Aware Augmented Reality: Redefining HCI in Image-Guided Therapy

Authors: Javad Fotouhi, Arian Mehrfard, Tianyu Song, Alex Johnson, Greg Osgood, Mathias Unberath, Mehran Armand, Nassir Navab

Abstract: Suboptimal interaction with patient data and challenges in mastering 3D anatomy based on ill-posed 2D interventional images are essential concerns in image-guided therapies. Augmented reality (AR) has been introduced in the operating rooms in the last decade; however, in image-guided interventions, it has often only been considered as a visualization device improving traditional workflows. As a co… ▽ More Suboptimal interaction with patient data and challenges in mastering 3D anatomy based on ill-posed 2D interventional images are essential concerns in image-guided therapies. Augmented reality (AR) has been introduced in the operating rooms in the last decade; however, in image-guided interventions, it has often only been considered as a visualization device improving traditional workflows. As a consequence, the technology is gaining minimum maturity that it requires to redefine new procedures, user interfaces, and interactions. The main contribution of this paper is to reveal how exemplary workflows are redefined by taking full advantage of head-mounted displays when entirely co-registered with the imaging system at all times. The proposed AR landscape is enabled by co-localizing the users and the imaging devices via the operating room environment and exploiting all involved frustums to move spatial information between different bodies. The awareness of the system from the geometric and physical characteristics of X-ray imaging allows the redefinition of different human-machine interfaces. We demonstrate that this AR paradigm is generic, and can benefit a wide variety of procedures. Our system achieved an error of $4.76\pm2.91$ mm for placing K-wire in a fracture management procedure, and yielded errors of $1.57\pm1.16^\circ$ and $1.46\pm1.00^\circ$ in the abduction and anteversion angles, respectively, for total hip arthroplasty. We hope that our holistic approach towards improving the interface of surgery not only augments the surgeon's capabilities but also augments the surgical team's experience in carrying out an effective intervention with reduced complications and provide novel approaches of documenting procedures for training purposes. △ Less

Submitted 4 March, 2020; originally announced March 2020.

arXiv:1906.03645 [pdf, other]

doi 10.1109/TCI.2020.2964229

Super-resolution PET imaging using convolutional neural networks

Authors: Tzu-An Song, Samadrita Roy Chowdhury, Fan Yang, Joyita Dutta

Abstract: Positron emission tomography (PET) suffers from severe resolution limitations which limit its quantitative accuracy. In this paper, we present a super-resolution (SR) imaging technique for PET based on convolutional neural networks (CNNs). To facilitate the resolution recovery process, we incorporate high-resolution (HR) anatomical information based on magnetic resonance (MR) imaging. We introduce… ▽ More Positron emission tomography (PET) suffers from severe resolution limitations which limit its quantitative accuracy. In this paper, we present a super-resolution (SR) imaging technique for PET based on convolutional neural networks (CNNs). To facilitate the resolution recovery process, we incorporate high-resolution (HR) anatomical information based on magnetic resonance (MR) imaging. We introduce the spatial location information of the input image patches as additional CNN inputs to accommodate the spatially-variant nature of the blur kernels in PET. We compared the performance of shallow (3-layer) and very deep (20-layer) CNNs with various combinations of the following inputs: low-resolution (LR) PET, radial locations, axial locations, and HR MR. To validate the CNN architectures, we performed both realistic simulation studies using the BrainWeb digital phantom and clinical neuroimaging data analysis. For both simulation and clinical studies, the LR PET images were based on the Siemens HR+ scanner. Two different scenarios were examined in simulation: one where the target HR image is the ground-truth phantom image and another where the target HR image is based on the Siemens HRRT scanner --- a high-resolution dedicated brain PET scanner. The latter scenario was also examined using clinical neuroimaging datasets. A number of factors affected relative performance of the different CNN designs examined, including network depth, target image quality, and the resemblance between the target and anatomical images. In general, however, all CNNs outperformed classical penalized deconvolution techniques by large margins both qualitatively (e.g., edge and contrast recovery) and quantitatively (as indicated by two metrics: peak signal-to-noise-ratio and structural similarity index). △ Less

Submitted 9 June, 2019; originally announced June 2019.

arXiv:1906.02392 [pdf]

Generative Model-Based Ischemic Stroke Lesion Segmentation

Authors: Tao Song

Abstract: CT perfusion (CTP) has been used to triage ischemic stroke patients in the early stage, because of its speed, availability, and lack of contraindications. Perfusion parameters including cerebral blood volume (CBV), cerebral blood flow (CBF), mean transit time (MTT) and time of peak (Tmax) could also be computed from CTP data. However, CTP data or the perfusion parameters, are ambiguous to locate t… ▽ More CT perfusion (CTP) has been used to triage ischemic stroke patients in the early stage, because of its speed, availability, and lack of contraindications. Perfusion parameters including cerebral blood volume (CBV), cerebral blood flow (CBF), mean transit time (MTT) and time of peak (Tmax) could also be computed from CTP data. However, CTP data or the perfusion parameters, are ambiguous to locate the infarct core or tissue at risk (penumbra), which is normally confirmed by the follow-up Diffusion Weighted Imaging (DWI) or perfusion diffusion mismatch. In this paper, we propose a novel generative modelbased segmentation framework composed of an extractor, a generator and a segmentor for ischemic stroke lesion segmentation. First, an extractor is used to directly extract the representative feature images from the CTP feature images. Second, a generator is used to generate the clinical relevant DWI images using the output from the extractor and perfusion parameters. Finally, the segmentor is used to precisely segment the ischemic stroke lesion using the generated DWI from the generator. Meanwhile, a novel pixel-region loss function, generalized dice combined with weighted cross entropy, is used to handle data unbalance problem which is commonly encountered in medical image segmentation. All networks are trained end-to-end from scratch using the 2018 Ischemic Stroke Lesion Segmentation Challenge (ISLES) dataset and our method won the first place in the 2018 ischemic stroke lesions segmentation challenge in the test stage. △ Less

Submitted 5 June, 2019; originally announced June 2019.

arXiv:1906.01704 [pdf, other]

A Novel Bi-hemispheric Discrepancy Model for EEG Emotion Recognition

Authors: Yang Li, Wenming Zheng, Lei Wang, Yuan Zong, Lei Qi, Zhen Cui, Tong Zhang, Tengfei Song

Abstract: The neuroscience study has revealed the discrepancy of emotion expression between left and right hemispheres of human brain. Inspired by this study, in this paper, we propose a novel bi-hemispheric discrepancy model (BiHDM) to learn the asymmetric differences between two hemispheres for electroencephalograph (EEG) emotion recognition. Concretely, we first employ four directed recurrent neural netw… ▽ More The neuroscience study has revealed the discrepancy of emotion expression between left and right hemispheres of human brain. Inspired by this study, in this paper, we propose a novel bi-hemispheric discrepancy model (BiHDM) to learn the asymmetric differences between two hemispheres for electroencephalograph (EEG) emotion recognition. Concretely, we first employ four directed recurrent neural networks (RNNs) based on two spatial orientations to traverse electrode signals on two separate brain regions, which enables the model to obtain the deep representations of all the EEG electrodes' signals while kee** the intrinsic spatial dependence. Then we design a pairwise subnetwork to capture the discrepancy information between two hemispheres and extract higher-level features for final classification. Besides, in order to reduce the domain shift between training and testing data, we use a domain discriminator that adversarially induces the overall feature learning module to generate emotion-related but domain-invariant feature, which can further promote EEG emotion recognition. We conduct experiments on three public EEG emotional datasets, and the experiments show that the new state-of-the-art results can be achieved. △ Less

Submitted 10 May, 2019; originally announced June 2019.

Showing 1–29 of 29 results for author: Song, T