
A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge

Authors: Xiaopeng Wang, Yi Lu, Xin Qi, Zhiyong Wang, Yuankun Xie, Shuchen Shi, Ruibo Fu

Abstract: This paper presents the development of a speech synthesis system for the LIMMITS'24 Challenge, focusing primarily on Track 2. The objective of the challenge is to establish a multi-speaker, multi-lingual Indic Text-to-Speech system with voice cloning capabilities, covering seven Indian languages with both male and female speakers. The system was trained using challenge data and fine-tuned for few-shot voice cloning on target speakers. Evaluation included both mono-lingual and cross-lingual synthesis across all seven languages, with subjective tests assessing naturalness and speaker similarity. Our system uses the VITS2 architecture, augmented with a multi-lingual ID and a BERT model to enhance contextual language comprehension. In Track 1, where no additional data usage was permitted, our model achieved a Speaker Similarity score of 4.02. In Track 2, which allowed the use of extra data, it attained a Speaker Similarity score of 4.17.

Submitted 22 June, 2024; originally announced June 2024.

arXiv:2406.10591 [pdf, other]

MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation

Authors: Ruibo Fu, Shuchen Shi, Hongming Guo, Tao Wang, Chunyu Qiang, Zhengqi Wen, Jianhua Tao, Xin Qi, Yi Lu, Xiaopeng Wang, Zhiyong Wang, Yukun Liu, Xuefei Liu, Shuai Zhang, Guanjun Li

Abstract: Foley audio, critical for enhancing the immersive experience in multimedia content, faces significant challenges in the AI-generated content (AIGC) landscape. Despite advancements in AIGC technologies for text and image generation, foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation. Current text-to-audio (TTA) technology, which relies on detailed and acoustically relevant textual descriptions, falls short in practical video dubbing applications. Existing datasets like AudioSet, AudioCaps, Clotho, Sound-of-Story, and WavCaps do not fully meet the requirements of real-world foley audio dubbing tasks. To address this, we introduce the Multi-modal Image and Narrative Text Dubbing Dataset (MINT), designed to enhance mainstream dubbing tasks such as literary story audiobook dubbing and image/silent video dubbing. In addition, to address the limitations of existing TTA technology in understanding and planning complex prompts, a Foley Audio Content Planning, Generation, and Alignment (CPGA) framework is proposed, which includes a content planning module leveraging large language models for comprehension of complex multi-modal prompts. Additionally, the training process is optimized using Proximal Policy Optimization based reinforcement learning, significantly improving the alignment and auditory realism of generated foley audio. Experimental results demonstrate that our approach significantly advances the field of foley audio dubbing, providing robust solutions for the challenges of multi-modal dubbing. Even when utilizing the relatively lightweight GPT-2 model, our framework outperforms open-source multimodal large models such as LLaVA, DeepSeek-VL, and Moondream2. The dataset is available at https://github.com/borisfrb/MINT .

Submitted 15 June, 2024; originally announced June 2024.

arXiv:2406.08112 [pdf, other]

Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio

Authors: Yi Lu, Yuankun Xie, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Zhiyong Wang, Xin Qi, Xuefei Liu, Yongwei Li, Yukun Liu, Xiaopeng Wang, Shuchen Shi

Abstract: With the proliferation of Large Language Model (LLM) based deepfake audio, there is an urgent need for effective detection methods. Previous deepfake audio generation methods typically involve a multi-step generation process, with the final step using a vocoder to predict the waveform from handcrafted features. However, LLM-based audio is directly generated from discrete neural codecs in an end-to-end generation process, skipping the final step of vocoder processing. This poses a significant challenge for current audio deepfake detection (ADD) models based on vocoder artifacts. To effectively detect LLM-based deepfake audio, we focus on the core of the generation process, the conversion from neural codec to waveform. We propose the Codecfake dataset, which is generated by seven representative neural codec methods. Experimental results show that codec-trained ADD models exhibit a 41.406% reduction in average equal error rate compared to vocoder-trained ADD models on the Codecfake test set.

Submitted 12 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024. arXiv admin note: substantial text overlap with arXiv:2405.04880

arXiv:2406.04683 [pdf, other]

PPPR: Portable Plug-in Prompt Refiner for Text to Audio Generation

Authors: Shuchen Shi, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Tao Wang, Chunyu Qiang, Yi Lu, Xin Qi, Xuefei Liu, Yukun Liu, Yongwei Li, Zhiyong Wang, Xiaopeng Wang

Abstract: Text-to-Audio (TTA) aims to generate audio that corresponds to the given text description, playing a crucial role in media production. The text descriptions in TTA datasets lack rich variations and diversity, resulting in a drop in TTA model performance when faced with complex text. To address this issue, we propose a method called Portable Plug-in Prompt Refiner, which utilizes rich knowledge about textual descriptions inherent in large language models to effectively enhance the robustness of TTA acoustic models without altering the acoustic training set. Furthermore, a Chain-of-Thought that mimics human verification is introduced to enhance the accuracy of audio descriptions, thereby improving the accuracy of generated content in practical applications. The experiments show that our method achieves a state-of-the-art Inception Score (IS) of 8.72, surpassing AudioGen, AudioLDM and Tango.

Submitted 7 June, 2024; originally announced June 2024.

Comments: accepted by INTERSPEECH2024

arXiv:2406.03247 [pdf, other]

Genuine-Focused Learning using Mask AutoEncoder for Generalized Fake Audio Detection

Authors: Xiaopeng Wang, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Yuankun Xie, Yukun Liu, Jianhua Tao, Xuefei Liu, Yongwei Li, Xin Qi, Yi Lu, Shuchen Shi

Abstract: The generalization of Fake Audio Detection (FAD) is critical due to the emergence of new spoofing techniques. Traditional FAD methods often focus solely on distinguishing between genuine and known spoofed audio. We propose a Genuine-Focused Learning (GFL) framework, called GFL-FAD, aiming for highly generalized FAD. This method incorporates a Counterfactual Reasoning Enhanced Representation (CRER) based on audio reconstruction using the Mask AutoEncoder (MAE) architecture to accurately model genuine audio features. To reduce the influence of spoofed audio during training, we introduce a genuine audio reconstruction loss, maintaining the focus on learning genuine data features. In addition, content-related bottleneck (BN) features are extracted from the MAE to supplement the knowledge of the original audio. These BN features are adaptively fused with CRER to further improve robustness. Our method achieves state-of-the-art performance with an EER of 0.25% on ASVspoof2019 LA.

Submitted 9 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2406.03237 [pdf, other]

Generalized Fake Audio Detection via Deep Stable Learning

Authors: Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Yuankun Xie, Yukun Liu, Xiaopeng Wang, Xuefei Liu, Yongwei Li, Jianhua Tao, Yi Lu, Xin Qi, Shuchen Shi

Abstract: Although current fake audio detection approaches have achieved remarkable success on specific datasets, they often fail when evaluated with datasets from different distributions. Previous studies typically address distribution shift by focusing on using extra data or applying extra loss restrictions during training. However, these methods either require a substantial amount of data or complicate the training process. In this work, we propose a stable learning-based training scheme that involves a Sample Weight Learning (SWL) module, addressing distribution shift by decorrelating all selected features via learning weights from training samples. The proposed portable plug-in-like SWL is easy to apply to multiple base models and generalizes them without using extra data during training. Experiments conducted on the ASVspoof datasets clearly demonstrate the effectiveness of SWL in generalizing different models across three evaluation datasets from different distributions.

Submitted 5 June, 2024; originally announced June 2024.

Comments: accepted by INTERSPEECH2024

arXiv:2405.04880 [pdf, other]

The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio

Authors: Yuankun Xie, Yi Lu, Ruibo Fu, Zhengqi Wen, Zhiyong Wang, Jianhua Tao, Xin Qi, Xiaopeng Wang, Yukun Liu, Haonan Cheng, Long Ye, Yi Sun

Abstract: With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for generalized detection methods. ALM-based deepfake audio is currently widespread, highly deceptive, and versatile in type, posing a significant challenge to current audio deepfake detection (ADD) models trained solely on vocoded data. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method, the conversion from neural codec to waveform. We first construct the Codecfake dataset, an open-source large-scale dataset that includes 2 languages, over 1M audio samples, and various test conditions, focusing on ALM-based audio detection. As a countermeasure, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original SAM, we propose the CSAM strategy to learn a domain-balanced and generalized minimum. In our experiments, we first demonstrate that ADD models trained with the Codecfake dataset can effectively detect ALM-based audio. Furthermore, our proposed generalization countermeasure yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models. The dataset and associated code are available online.

Submitted 15 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

arXiv:2404.11525 [pdf, other]

JointViT: Modeling Oxygen Saturation Levels with Joint Supervision on Long-Tailed OCTA

Authors: Zeyu Zhang, Xuyin Qi, Mingxi Chen, Guangxi Li, Ryan Pham, Ayub Qassim, Ella Berry, Zhibin Liao, Owen Siggs, Robert Mclaughlin, Jamie Craig, Minh-Son To

Abstract: The oxygen saturation level in the blood (SaO2) is crucial for health, particularly in relation to sleep-related breathing disorders. However, continuous monitoring of SaO2 is time-consuming and highly variable depending on patients' conditions. Recently, optical coherence tomography angiography (OCTA) has shown promising development in rapidly and effectively screening eye-related lesions, offering the potential for diagnosing sleep-related disorders. To bridge this gap, our paper presents three key contributions. Firstly, we propose JointViT, a novel model based on the Vision Transformer architecture, incorporating a joint loss function for supervision. Secondly, we introduce a balancing augmentation technique during data preprocessing to improve the model's performance, particularly on the long-tail distribution within the OCTA dataset. Lastly, through comprehensive experiments on the OCTA dataset, our proposed method significantly outperforms other state-of-the-art methods, achieving improvements of up to 12.28% in overall accuracy. This advancement lays the groundwork for the future utilization of OCTA in diagnosing sleep-related disorders. See project website https://steve-zeyu-zhang.github.io/JointViT

Submitted 18 April, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

arXiv:2312.12824 [pdf, other]

FedSODA: Federated Cross-assessment and Dynamic Aggregation for Histopathology Segmentation

Authors: Yuan Zhang, Yaolei Qi, Xiaoming Qi, Lotfi Senhadji, Yongyue Wei, Feng Chen, Guanyu Yang

Abstract: Federated learning (FL) for histopathology image segmentation involving multiple medical sites plays a crucial role in advancing the field of accurate disease diagnosis and treatment. However, it remains highly challenging due to the sample imbalance across clients and large data heterogeneity from disparate organs, variable segmentation tasks, and diverse distributions. Thus, we propose a novel FL approach for histopathology nuclei and tissue segmentation, FedSODA, via synthetic-driven cross-assessment operation (SO) and dynamic stratified-layer aggregation (DA). Our SO constructs a cross-assessment strategy to connect clients and mitigate the representation bias under sample imbalance. Our DA utilizes layer-wise interaction and dynamic aggregation to diminish heterogeneity and enhance generalization. The effectiveness of our FedSODA has been evaluated on the most extensive histopathology image segmentation dataset from 7 independent datasets. The code is available at https://github.com/yuanzhang7/FedSODA.

Submitted 20 December, 2023; originally announced December 2023.

Comments: Accepted by ICASSP2024

arXiv:2309.00223 [pdf, other]

The FruitShell French synthesis system at the Blizzard 2023 Challenge

Authors: Xin Qi, Xiaopeng Wang, Zhiyong Wang, Wang Liu, Mingming Ding, Shuchen Shi

Abstract: This paper presents a French text-to-speech synthesis system for the Blizzard Challenge 2023. The challenge consists of two tasks: generating high-quality speech from female speakers and generating speech that closely resembles specific individuals. Regarding the competition data, we conducted a screening process to remove missing or erroneous text data. We organized all symbols except for phonemes and eliminated symbols that had no pronunciation or zero duration. Additionally, we added word boundary and start/end symbols to the text, which we have found to improve speech quality based on our previous experience. For the Spoke task, we performed data augmentation according to the competition rules. We used an open-source G2P model to transcribe the French texts into phonemes. As the G2P model uses the International Phonetic Alphabet (IPA), we applied the same transcription process to the provided competition data for standardization. However, due to compiler limitations in recognizing special symbols from the IPA chart, we followed the rules to convert all phonemes into the phonetic scheme used in the competition data. Finally, we resampled all competition audio to a uniform sampling rate of 16 kHz. We employed a VITS-based acoustic model with the HiFi-GAN vocoder. For the Spoke task, we trained a multi-speaker model and incorporated speaker information into the duration predictor, vocoder, and flow layers of the model. The evaluation results of our system showed a quality MOS score of 3.6 for the Hub task and 3.4 for the Spoke task, placing our system at an average level among all participating teams.

Submitted 31 August, 2023; originally announced September 2023.

arXiv:2307.16620 [pdf, other]

Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics

Authors: Chen Liu, Peike Li, Xingqun Qi, Hu Zhang, Lincheng Li, Dadong Wang, Xin Yu

Abstract: The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to achieve sounding object masks. However, we observed that prior methods are prone to segmenting a certain salient object in a video regardless of the audio information. This is because sounding objects are often the most salient ones in the AVS dataset. Thus, current AVS methods might fail to localize genuine sounding objects due to the dataset bias. In this work, we present an audio-visual instance-aware segmentation approach to overcome the dataset bias. In a nutshell, our method first localizes potential sounding objects in a video by an object segmentation network, and then associates the sounding object candidates with the given audio. We notice that an object could be a sounding object in one video but a silent one in another video. This would bring ambiguity in training our object segmentation network as only sounding objects have corresponding segmentation masks. We thus propose a silent object-aware segmentation objective to alleviate the ambiguity. Moreover, since the category information of audio is unknown, especially for multiple sounding sources, we propose to explore the audio-visual semantic correlation and then associate audio with potential objects. Specifically, we attend predicted audio category scores to potential instance masks, and these scores will highlight corresponding sounding instances while suppressing inaudible ones. When we enforce the attended instance masks to resemble the ground-truth mask, we are able to establish audio-visual semantic correlation. Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased to salient objects.

Submitted 31 July, 2023; v1 submitted 31 July, 2023; originally announced July 2023.

Comments: This paper has been received by ACM MM 23

arXiv:2307.08388 [pdf, other]

Dynamic Snake Convolution based on Topological Geometric Constraints for Tubular Structure Segmentation

Authors: Yaolei Qi, Yuting He, Xiaoming Qi, Yuan Zhang, Guanyu Yang

Abstract: Accurate segmentation of topological tubular structures, such as blood vessels and roads, is crucial in various fields, ensuring accuracy and efficiency in downstream tasks. However, many factors complicate the task, including thin local structures and variable global morphologies. In this work, we note the specificity of tubular structures and use this knowledge to guide our DSCNet to simultaneously enhance perception in three stages: feature extraction, feature fusion, and loss constraint. First, we propose a dynamic snake convolution to accurately capture the features of tubular structures by adaptively focusing on slender and tortuous local structures. Subsequently, we propose a multi-view feature fusion strategy to complement the attention to features from multiple perspectives during feature fusion, ensuring the retention of important information from different global morphologies. Finally, a continuity constraint loss function, based on persistent homology, is proposed to better constrain the topological continuity of the segmentation. Experiments on 2D and 3D datasets show that our DSCNet provides better accuracy and continuity on the tubular structure segmentation task compared with several methods. Our codes will be publicly available.

Submitted 18 August, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

Comments: Accepted by ICCV 2023

arXiv:2306.07505 [pdf]

Deep learning radiomics for assessment of gastroesophageal varices in people with compensated advanced chronic liver disease

Authors: Lan Wang, Ruiling He, Lili Zhao, Jia Wang, Zhengzi Geng, Tao Ren, Guo Zhang, Peng Zhang, Kaiqiang Tang, Chaofei Gao, Fei Chen, Liting Zhang, Yonghe Zhou, Xin Li, Fanbin He, Hui Huan, Wenjuan Wang, Yunxiao Liang, Juan Tang, Fang Ai, Tingyu Wang, Liyun Zheng, Zhongwei Zhao, Jiansong Ji, Wei Liu, et al. (22 additional authors not shown)

Abstract: Objective: Bleeding from gastroesophageal varices (GEV) is a medical emergency associated with high mortality. We aim to construct an artificial intelligence-based model of two-dimensional shear wave elastography (2D-SWE) of the liver and spleen to precisely assess the risk of GEV and high-risk gastroesophageal varices (HRV). Design: A prospective multicenter study was conducted in patients with compensated advanced chronic liver disease. 305 patients were enrolled from 12 hospitals, and finally 265 patients were included, with 1136 liver stiffness measurement (LSM) images and 1042 spleen stiffness measurement (SSM) images generated by 2D-SWE. We leveraged deep learning methods to uncover associations between image features and patient risk, and thus constructed models to predict GEV and HRV. Results: A multi-modality Deep Learning Risk Prediction model (DLRP) was constructed to assess GEV and HRV, based on LSM and SSM images, and clinical information. Validation analysis revealed that the AUCs of DLRP were 0.91 for GEV (95% CI 0.90 to 0.93, p < 0.05) and 0.88 for HRV (95% CI 0.86 to 0.89, p < 0.01), which were significantly and robustly better than canonical risk indicators, including the value of LSM and SSM. Moreover, DLRP was better than the model using individual parameters, including LSM and SSM images. In HRV prediction, the 2D-SWE images of SSM outperformed those of LSM (p < 0.01). Conclusion: DLRP shows excellent performance in predicting GEV and HRV over canonical risk indicators LSM and SSM. Additionally, the 2D-SWE images of SSM provided more information for better accuracy in predicting HRV than those of LSM.

Submitted 12 June, 2023; originally announced June 2023.

arXiv:2305.13869 [pdf, other]

Trend-Based SAC Beam Control Method with Zero-Shot in Superconducting Linear Accelerator

Authors: Xiaolong Chen, Xin Qi, Chunguang Su, Yuan He, Zhijun Wang, Kunxiang Sun, Chao **, Weilong Chen, Shuhui Liu, Xiaoying Zhao, Duanyang Jia, Man Yi

Abstract: The superconducting linear accelerator is a highly flexible facility for modern scientific discoveries, necessitating weekly reconfiguration and tuning. Accordingly, minimizing setup time proves essential in affording users ample experimental time. We propose a trend-based soft actor-critic (TBSAC) beam control method with strong robustness, allowing the agents to be trained in a simulated environment and applied to the real accelerator directly with zero-shot. To validate the effectiveness of our method, two different typical beam control tasks were performed on the China Accelerator Facility for Superheavy Elements (CAFe II) and a light particle injector (LPI), respectively. The orbit correction tasks were performed in three cryomodules in CAFe II separately; the time required for tuning was reduced to one-tenth of that needed by human experts, and the RMS values of the corrected orbit were all less than 1 mm. The other task, transmission efficiency optimization, was conducted in the LPI, where our agent successfully optimized the transmission efficiency of the radio-frequency quadrupole (RFQ) to over 85% within 2 minutes. The outcomes of these two experiments offer substantiation that our proposed TBSAC approach can efficiently and effectively accomplish beam commissioning tasks while upholding the same standard as skilled human experts. As such, our method exhibits potential for future applications in other accelerator commissioning fields.

Submitted 25 May, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

arXiv:2304.14503 [pdf]

UHRNet: A Deep Learning-Based Method for Accurate 3D Reconstruction from a Single Fringe-Pattern

Authors: Yixiao Wang, Canlin Zhou, Xingyang Qi, Hui Li

Abstract: The quick and accurate retrieval of an object height from a single fringe pattern in Fringe Projection Profilometry has been a topic of ongoing research. While a single shot fringe to depth CNN based method can restore a height map directly from a single pattern, its accuracy is currently inferior to the traditional phase shifting technique. To improve this method's accuracy, we propose using a U shaped High resolution Network (UHRNet). The network uses the UNet encoding and decoding structure as its backbone, with a Multi-Level convolution Block and a High resolution Fusion Block applied to extract local features and global features. We also designed a compound loss function by combining the Structural Similarity Index Measure Loss (SSIMLoss) function and a chunked L2 loss function to improve 3D reconstruction details. We conducted several experiments to demonstrate the validity and robustness of our proposed method. The average RMSE of 3D reconstruction by our method is only 0.443 mm, which is 41.13% of that of the UNet method and 33.31% of that of the hNet method of Wang et al. Our experimental results show that our proposed method can increase the accuracy of 3D reconstruction from a single fringe pattern.

Submitted 23 April, 2023; originally announced April 2023.

arXiv:2304.12988 [pdf, other]

doi 10.59275/j.melba.2023-7e96

Multi-Scale Feature Fusion using Parallel-Attention Block for COVID-19 Chest X-ray Diagnosis

Authors: Xiao Qi, David J. Foran, John L. Nosher, Ilker Hacihaliloglu

Abstract: Under the global COVID-19 crisis, accurate diagnosis of COVID-19 from Chest X-ray (CXR) images is critical. To reduce intra- and inter-observer variability during the radiological assessment, computer-aided diagnostic tools have been utilized to supplement medical decision-making and subsequent disease management. Computational methods with high accuracy and robustness are required for rapid triaging of patients and aiding radiologists in the interpretation of the collected data. In this study, we propose a novel multi-feature fusion network using parallel attention blocks to fuse the original CXR images and local-phase feature-enhanced CXR images at multiple scales. We examine our model on various COVID-19 datasets acquired from different organizations to assess the generalization ability. Our experiments demonstrate that our method achieves state-of-the-art performance and has improved generalization capability, which is crucial for widespread deployment.

Submitted 25 April, 2023; originally announced April 2023.

Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2023:008

Journal ref: Machine.Learning.for.Biomedical.Imaging. 2 (2023)

arXiv:2303.00369 [pdf, other]

Indescribable Multi-modal Spatial Evaluator

Authors: Lingke Kong, X. Sharon Qi, Qi** Shen, Jiacheng Wang, **gyi Zhang, Yanle Hu, Qichao Zhou

Abstract: Multi-modal image registration spatially aligns two images with different distributions. One of its major challenges is that images acquired from different imaging machines have different imaging distributions, making it difficult to focus only on the spatial aspect of the images and ignore differences in distributions. In this study, we developed a self-supervised approach, Indescribable Multi-modal Spatial Evaluator (IMSE), to address multi-modal image registration. IMSE creates an accurate multi-modal spatial evaluator to measure spatial differences between two images, and then optimizes registration by minimizing the error predicted by the evaluator. To optimize IMSE performance, we also proposed a new style enhancement method called Shuffle Remap, which randomizes the image distribution into multiple segments and then randomly disorders and remaps these segments, so that the distribution of the original image is changed. Shuffle Remap can help IMSE to predict the difference in spatial location from unseen target distributions. Our results show that IMSE outperformed the existing methods for registration using T1-T2 and CT-MRI datasets. IMSE can also be easily integrated into the traditional registration process, and can provide a convenient way to evaluate and visualize registration results. IMSE also has the potential to be used as a new paradigm for image-to-image translation. Our code is available at https://github.com/Kid-Liet/IMSE.

Submitted 1 March, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

Comments: Accepted by CVPR2023

arXiv:2302.09621 [pdf]

Augmenting endometriosis analysis from ultrasound data with deep learning

Authors: Adrian Balica, Jennifer Dai, Kayla Piiwaa, Xiao Qi, Ashlee N. Green, Nancy Phillips, Susan Egan, Ilker Hacihaliloglu

Abstract: Endometriosis is a non-malignant disorder that affects 176 million women globally. Diagnostic delays result in severe dysmenorrhea, dyspareunia, chronic pelvic pain, and infertility. Therefore, there is a significant need to diagnose patients at an early stage. Our objective in this work is to investigate the potential of deep learning methods to classify endometriosis from ultrasound data. Retrospective data from 100 subjects were collected at the Rutgers Robert Wood Johnson University Hospital (New Brunswick, NJ, USA). Endometriosis was diagnosed via laparoscopy or laparotomy. We designed and trained five different deep learning methods (Xception, Inception-V4, ResNet50, DenseNet, and EfficientNetB2) for the classification of endometriosis from ultrasound data. Using a 5-fold cross-validation study, we achieved an average area under the receiver operating characteristic curve (AUC) of 0.85 and 0.90, respectively, for the two evaluation studies.

Submitted 19 February, 2023; originally announced February 2023.

Comments: Accepted to 2023 SPIE Medical Imaging Conference

arXiv:2211.00899 [pdf, other]

LightVessel: Exploring Lightweight Coronary Artery Vessel Segmentation via Similarity Knowledge Distillation

Authors: Hao Dang, Yuekai Zhang, Xingqun Qi, Wanting Zhou, Muyi Sun

Abstract: In recent years, deep convolutional neural networks (DCNNs) have shown great promise in coronary artery vessel segmentation. However, it is difficult to deploy complicated models in clinical scenarios since high-performance approaches have excessive parameters and high computation costs. To tackle this problem, we propose LightVessel, a Similarity Knowledge Distillation Framework, for lightweight coronary artery vessel segmentation. Primarily, we propose a Feature-wise Similarity Distillation (FSD) module for semantic-shift modeling. Specifically, we calculate the feature similarity between the symmetric layers from the encoder and decoder. Then the similarity is transferred as knowledge from a cumbersome teacher network to a non-trained lightweight student network. Meanwhile, to encourage the student model to learn more pixel-wise semantic information, we introduce the Adversarial Similarity Distillation (ASD) module. Concretely, the ASD module aims to construct the spatial adversarial correlation between the annotation and prediction from the teacher and student models, respectively. Through the ASD module, the student model obtains fine-grained, subtle edge segmentation results for the coronary artery vessel. Extensive experiments conducted on the Clinical Coronary Artery Vessel Dataset demonstrate that LightVessel outperforms various knowledge distillation counterparts.

Submitted 25 February, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

Comments: 5 pages, 7 figures, conference

arXiv:2208.01843 [pdf, other]

Multi-Feature Vision Transformer via Self-Supervised Representation Learning for Improvement of COVID-19 Diagnosis

Authors: Xiao Qi, David J. Foran, John L. Nosher, Ilker Hacihaliloglu

Abstract: The role of chest X-ray (CXR) imaging, due to being more cost-effective, widely available, and having a faster acquisition time compared to CT, has evolved during the COVID-19 pandemic. To improve the diagnostic performance of CXR imaging, a growing number of studies have investigated whether supervised deep learning methods can provide additional support. However, supervised methods rely on a large number of labeled radiology images, and labeling is a time-consuming and complex procedure requiring expert clinician input. Due to the relative scarcity of COVID-19 patient data and the costly labeling process, self-supervised learning methods have gained momentum and have been reported to achieve results comparable to fully supervised learning approaches. In this work, we study the effectiveness of self-supervised learning in the context of diagnosing COVID-19 disease from CXR images. We propose a multi-feature Vision Transformer (ViT) guided architecture where we deploy a cross-attention mechanism to learn information from both original CXR images and corresponding enhanced local phase CXR images. We demonstrate that the performance of the baseline self-supervised learning models can be further improved by leveraging the local phase-based enhanced CXR images. By using 10% labeled CXR scans, the proposed model achieves 91.10% and 96.21% overall accuracy tested on a total of 35,483 CXR images of healthy (8,851), regular pneumonia (6,045), and COVID-19 (18,159) scans and shows significant improvement over state-of-the-art techniques.
Code is available at https://github.com/endiqq/Multi-Feature-ViT

Submitted 3 August, 2022; originally announced August 2022.

Comments: Accepted to the 2022 MICCAI Workshop on Medical Image Learning with Limited and Noisy Data

arXiv:2205.14411 [pdf, other]

Feature Pyramid Attention based Residual Neural Network for Environmental Sound Classification

Authors: Liguang Zhou, Yuhongze Zhou, Xiaonan Qi, Junjie Hu, Tin Lun Lam, Yangsheng Xu

Abstract: Environmental sound classification (ESC) is a challenging problem due to the unstructured spatial-temporal relations that exist in the sound signals. Recently, many studies have focused on abstracting features from convolutional neural networks while the learning of semantically relevant frames of sound signals has been overlooked. To this end, we present an end-to-end framework, namely feature pyramid attention network (FPAM), focusing on abstracting the semantically relevant features for ESC. We first extract the feature maps of the preprocessed spectrogram of the sound waveform by a backbone network. Then, to build multi-scale hierarchical features of sound spectrograms, we construct a feature pyramid representation of the sound spectrograms by aggregating the feature maps from multi-scale layers, where the temporal frames and spatial locations of semantically relevant frames are localized by FPAM. Specifically, the multiple features are first processed by a dimension alignment module. Afterward, the pyramid spatial attention module (PSA) is attached to localize the important frequency regions spatially with a spatial attention module (SAM). Last, the processed feature maps are refined by a pyramid channel attention (PCA) to localize the important temporal frames. To justify the effectiveness of the proposed FPAM, visualizations of attention maps on the spectrograms are presented. The visualization results show that FPAM can focus more on the semantically relevant regions while neglecting the noise.
The effectiveness of the proposed methods is validated on two widely used ESC datasets: the ESC-50 and ESC-10 datasets. The experimental results show that the FPAM yields comparable performance to state-of-the-art methods. A substantial performance increase has been achieved by FPAM compared with the baseline methods.

Submitted 28 May, 2022; originally announced May 2022.

arXiv:2205.04846 [pdf, other]

MNet: Rethinking 2D/3D Networks for Anisotropic Medical Image Segmentation

Authors: Zhangfu Dong, Yuting He, Xiaoming Qi, Yang Chen, Huazhong Shu, Jean-Louis Coatrieux, Guanyu Yang, Shuo Li

Abstract: The nature of thick-slice scanning causes severe inter-slice discontinuities of 3D medical images, and the vanilla 2D/3D convolutional neural networks (CNNs) fail to represent sparse inter-slice information and dense intra-slice information in a balanced way, leading to severe underfitting to inter-slice features (for vanilla 2D CNNs) and overfitting to noise from long-range slices (for vanilla 3D CNNs). In this work, a novel mesh network (MNet) is proposed to balance the spatial representation inter axes via learning. 1) Our MNet latently fuses plenty of representation processes by embedding multi-dimensional convolutions deeply into basic modules, making the selections of representation processes flexible, thus balancing representation for sparse inter-slice information and dense intra-slice information adaptively. 2) Our MNet latently fuses multi-dimensional features inside each basic module, simultaneously taking advantage of 2D (high segmentation accuracy of the easily recognized regions in 2D view) and 3D (high smoothness of 3D organ contour) representations, thus obtaining more accurate modeling for target regions. Comprehensive experiments are performed on four public datasets (CT & MR); the results consistently demonstrate that the proposed MNet outperforms the other methods. The code and datasets are available at: https://github.com/zfdong-code/MNet

Submitted 10 May, 2022; originally announced May 2022.

Comments: Accepted by IJCAI 2022

arXiv:2205.00698 [pdf]

Unsupervised Denoising of Optical Coherence Tomography Images with Dual_Merged CycleWGAN

Authors: Jie Du, Xujian Yang, Kecheng **, Xuanzheng Qi, Hu Chen

Abstract: Noise is an important cause of low-quality optical coherence tomography (OCT) images. Neural network models based on convolutional neural networks (CNNs) have demonstrated excellent performance in image denoising. However, OCT image denoising still faces great challenges because many previous neural network algorithms required a large amount of labeled data, which can be time-consuming or expensive to obtain. Besides, these CNN-based algorithms need numerous parameters and good tuning techniques, which consumes hardware resources. To solve the above problems, we propose a new Cycle-Consistent Generative Adversarial Net called Dual-Merged Cycle-WGAN for retinal OCT image denoising, which has remarkable performance with less unlabeled training data. Our model consists of two Cycle-GAN networks with an improved generator, discriminator, and Wasserstein loss to achieve good training stability and better performance. Using an image merge technique between the two Cycle-GAN networks, our model can obtain more detailed information and hence a better training effect. The effectiveness and generality of our proposed network have been demonstrated via ablation experiments and comparative experiments. Compared with other state-of-the-art methods, our unsupervised method obtains the best subjective visual effect and higher objective evaluation indicators.

Submitted 2 May, 2022; originally announced May 2022.

Comments: Mr. Hu Chen is our corresponding author

arXiv:2204.06260 [pdf, other]

Self-critical Sequence Training for Automatic Speech Recognition

Authors: Chen Chen, Yuchen Hu, Nana Hou, Xiaofeng Qi, Heqing Zou, Eng Siong Chng

Abstract: Although the automatic speech recognition (ASR) task has achieved remarkable success with sequence-to-sequence models, there are two main mismatches between its training and testing that might lead to performance degradation: 1) The typically used cross-entropy criterion aims to maximize the log-likelihood of the training data, while the performance is evaluated by word error rate (WER), not log-likelihood; 2) The teacher-forcing method leads to dependence on the ground truth during training, which means that the model has never been exposed to its own predictions before testing. In this paper, we propose an optimization method called self-critical sequence training (SCST) to make the training procedure much closer to the testing phase. As a reinforcement learning (RL) based method, SCST utilizes a customized reward function to associate the training criterion with WER. Furthermore, it removes the reliance on teacher-forcing and harmonizes the model with respect to its inference procedure. We conducted experiments on both clean and noisy speech datasets, and the results show that the proposed SCST achieves 8.7% and 7.8% relative improvements over the baseline in terms of WER, respectively.
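The self-critical idea summarized in this abstract can be illustrated with a minimal sketch. It assumes the generic self-critical formulation, in which a sampled hypothesis is scored against the model's own greedy hypothesis, and it assumes the reward is simply the negative WER; the abstract only says the reward is "customized", so this is not the paper's actual reward function. The function names `wer` and `self_critical_advantage` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def self_critical_advantage(reference, sampled, greedy):
    """Advantage of a sampled hypothesis over the greedy baseline.

    Assumption: reward = -WER, so the advantage is
    (-WER(sampled)) - (-WER(greedy)) = WER(greedy) - WER(sampled).
    A positive value means the sample beat the model's own greedy output.
    """
    return wer(reference, greedy) - wer(reference, sampled)


if __name__ == "__main__":
    ref = "the cat sat"
    print(self_critical_advantage(ref, sampled="the cat sat", greedy="the bat sat"))
```

In a policy-gradient training loop, this advantage would typically weight the log-probability of the sampled sequence, so samples with lower WER than the greedy output are reinforced.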

Submitted 13 April, 2022; originally announced April 2022.

Comments: Accepted by ICASSP 2022

arXiv:2203.15526 [pdf, other]

Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning

Authors: Chen Chen, Nana Hou, Yuchen Hu, Heqing Zou, Xiaofeng Qi, Eng Siong Chng

Abstract: Automated Audio captioning (AAC) is a cross-modal task that generates natural language to describe the content of input audio. Most prior works usually extract single-modality acoustic features and are therefore sub-optimal for the cross-modal decoding task. In this work, we propose a novel AAC system called CLIP-AAC to learn interactive cross-modality representation with both acoustic and textual information. Specifically, the proposed CLIP-AAC introduces an audio-head and a text-head in the pre-trained encoder to extract audio-text information. Furthermore, we also apply contrastive learning to narrow the domain difference by learning the correspondence between the audio signal and its paired captions. Experimental results show that the proposed CLIP-AAC approach surpasses the best baseline by a significant margin on the Clotho dataset in terms of NLP evaluation metrics. The ablation study indicates that both the pre-trained model and contrastive learning contribute to the performance gain of the AAC model.

Submitted 12 April, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

Comments: Submitted to Interspeech 2022

arXiv:2201.11871 [pdf, other]

Infrastructure-Based Object Detection and Tracking for Cooperative Driving Automation: A Survey

Authors: Zhengwei Bai, Guoyuan Wu, Xuewei Qi, Yongkang Liu, Kentaro Oguchi, Matthew J. Barth

Abstract: Object detection plays a fundamental role in enabling Cooperative Driving Automation (CDA), which is regarded as the revolutionary solution to addressing safety, mobility, and sustainability issues of contemporary transportation systems. Although current computer vision technologies could provide satisfactory object detection results in occlusion-free scenarios, the perception performance of onboard sensors could be inevitably limited by the range and occlusion. Owing to flexible position and pose for sensor installation, infrastructure-based detection and tracking systems can enhance the perception capability for connected vehicles and thus quickly become one of the most popular research topics. In this paper, we review the research progress for infrastructure-based object detection and tracking systems. Architectures of roadside perception systems based on different types of sensors are reviewed to show a high-level description of the workflows for infrastructure-based perception systems. Roadside sensors and different perception methodologies are reviewed and analyzed with detailed literature to provide a low-level explanation for specific methods followed by Datasets and Simulators to draw an overall landscape of infrastructure-based object detection and tracking methods. Discussions are conducted to point out current opportunities, open problems, and anticipated future trends.

Submitted 19 March, 2022; v1 submitted 27 January, 2022; originally announced January 2022.

arXiv:2201.03313 [pdf, other]

Cross-Modal ASR Post-Processing System for Error Correction and Utterance Rejection

Authors: **g Du, Shiliang Pu, Qinbo Dong, Chao **, Xin Qi, Dian Gu, Ru Wu, Hongwei Zhou

Abstract: Although modern automatic speech recognition (ASR) systems can achieve high performance, they may produce errors that weaken readers' experience and do harm to downstream tasks. To improve the accuracy and reliability of ASR hypotheses, we propose a cross-modal post-processing system for speech recognizers, which 1) fuses acoustic features and textual features from different modalities, 2) jointly trains a confidence estimator and an error corrector in a multi-task learning fashion, and 3) unifies error correction and utterance rejection modules. Compared with single-modal or single-task models, our proposed system proves to be more effective and efficient. Experimental results show that our post-processing system leads to more than a 10% relative reduction in character error rate (CER) for both single-speaker and multi-speaker speech on our industrial ASR system, with about 1.7 ms latency for each token, which ensures that the extra latency introduced by post-processing is acceptable in streaming speech recognition.

Submitted 10 January, 2022; originally announced January 2022.

Comments: submit to ICASSP2022, 5 pages, 3 figures

arXiv:2106.04130 [pdf, other]

EnMcGAN: Adversarial Ensemble Learning for 3D Complete Renal Structures Segmentation

Authors: Yuting He, Rongjun Ge, Xiaoming Qi, Guanyu Yang, Yang Chen, Youyong Kong, Huazhong Shu, Jean-Louis Coatrieux, Shuo Li

Abstract: 3D complete renal structures (CRS) segmentation targets segmenting the kidneys, tumors, renal arteries and veins in one inference. Once successful, it will provide preoperative plans and intraoperative guidance for laparoscopic partial nephrectomy (LPN), playing a key role in renal cancer treatment. However, no success has been reported in 3D CRS segmentation due to the complex shapes of renal structures, low contrast and large anatomical variation. In this study, we utilize adversarial ensemble learning and propose Ensemble Multi-condition GAN (EnMcGAN) for 3D CRS segmentation for the first time. Its contribution is three-fold. 1) Inspired by windowing, we propose the multi-windowing committee, which divides a CTA image into multiple narrow windows with different window centers and widths, enhancing the contrast for salient boundaries and soft tissues. It then builds an ensemble segmentation model on these narrow windows to fuse the segmentation superiorities and improve the overall segmentation quality. 2) We propose the multi-condition GAN, which equips the segmentation model with multiple discriminators to encourage the segmented structures to meet their real shape conditions, thus improving the shape feature extraction ability. 3) We propose the adversarial weighted ensemble module, which uses the trained discriminators to evaluate the quality of segmented structures, and normalizes these evaluation scores for the ensemble weights directed at the input image, thus enhancing the ensemble results.
122 patients are enrolled in this study and the mean Dice coefficient of the renal structures achieves 84.6%. Extensive experiments with promising results on renal structures reveal powerful segmentation accuracy and great clinical significance in renal cancer treatment.
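The windowing operation that the multi-windowing committee builds on can be sketched as follows. This is a minimal illustration of standard intensity windowing (clip to a range defined by a center and a width, then rescale to [0, 1]), applied with several windows at once; it is not the authors' implementation, and the function names and the example window values are assumptions chosen for illustration only.

```python
import numpy as np


def apply_window(volume_hu, center, width):
    """Clip intensities to [center - width/2, center + width/2] and rescale to [0, 1]."""
    if width <= 0:
        raise ValueError("width must be positive")
    low = center - width / 2.0
    high = center + width / 2.0
    clipped = np.clip(volume_hu, low, high)
    return (clipped - low) / (high - low)


def multi_window(volume_hu, windows):
    """Stack one windowed copy of the volume per (center, width) pair.

    The result has a new leading axis of length len(windows), so each
    narrow window could be fed to its own segmentation model.
    """
    return np.stack([apply_window(volume_hu, c, w) for c, w in windows], axis=0)


if __name__ == "__main__":
    # Toy 1D "volume" of intensities; the window settings below are hypothetical.
    volume = np.array([-1000.0, 0.0, 100.0, 1000.0])
    stacked = multi_window(volume, [(100.0, 200.0), (0.0, 2000.0)])
    print(stacked.shape)
```

A narrow window spends the full output range on a small band of intensities, which is why it raises contrast for structures inside that band at the cost of saturating everything outside it.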

Submitted 8 June, 2021; originally announced June 2021.

Journal ref: Information Processing in Medical Imaging (IPMI) 2021

arXiv:2104.01617 [pdf, other]

doi 10.1007/978-3-030-87589-3_16

Multi-Feature Semi-Supervised Learning for COVID-19 Diagnosis from Chest X-ray Images

Authors: Xiao Qi, John L. Nosher, David J. Foran, Ilker Hacihaliloglu

Abstract: Computed tomography (CT) and chest X-ray (CXR) have been the two dominant imaging modalities deployed for improved management of Coronavirus disease 2019 (COVID-19). Due to faster imaging, less radiation exposure, and being cost-effective, CXR is preferred over CT. However, the interpretation of CXR images, compared to CT, is more challenging due to low image resolution and COVID-19 image features being similar to regular pneumonia. Computer-aided diagnosis via deep learning has been investigated to help mitigate these problems and help clinicians during the decision-making process. The requirement for a large amount of labeled data is one of the major problems of deep learning methods when deployed in the medical domain. To provide a solution to this, in this work, we propose a semi-supervised learning (SSL) approach using minimal data for training. We integrate local-phase CXR image features into a multi-feature convolutional neural network architecture where the SSL method is trained with a teacher/student paradigm. Quantitative evaluation is performed on 8,851 normal (healthy), 6,045 pneumonia, and 3,795 COVID-19 CXR scans. By only using 7.06% labeled and 16.48% unlabeled data for training, and 5.53% for validation, our method achieves 93.61% mean accuracy on large-scale (70.93%) test data. We provide comparison results against fully supervised and SSL methods. Code: https://github.com/endiqq/Multi-Feature-Semi-Supervised-Learning-for-COVID-19-CXR-Images

Submitted 14 April, 2021; v1 submitted 4 April, 2021; originally announced April 2021.

arXiv:2011.03585 [pdf, other]

Chest X-ray Image Phase Features for Improved Diagnosis of COVID-19 Using Convolutional Neural Network

Authors: Xiao Qi, Lloyd Brown, David J. Foran, Ilker Hacihaliloglu

Abstract: Recently, the outbreak of the novel Coronavirus disease 2019 (COVID-19) pandemic has seriously endangered human health and life. Due to the limited availability of test kits, the need for an auxiliary diagnostic approach has increased. Recent research has shown that radiography of COVID-19 patients, such as CT and X-ray, contains salient information about the COVID-19 virus and could be used as an alternative diagnosis method. Chest X-ray (CXR), due to its faster imaging time, wide availability, low cost and portability, has gained much attention and is very promising. Computational methods with high accuracy and robustness are required for rapid triaging of patients and aiding radiologists in the interpretation of the collected data. In this study, we design a novel multi-feature convolutional neural network (CNN) architecture for improved multi-class classification of COVID-19 from CXR images. CXR images are enhanced using a local phase-based image enhancement method. The enhanced images, together with the original CXR data, are used as an input to our proposed CNN architecture. Using ablation studies, we show the effectiveness of the enhanced images in improving the diagnostic accuracy. We provide quantitative evaluation on two datasets and qualitative results for visual inspection. Quantitative evaluation is performed on data consisting of 8,851 normal (healthy), 6,045 pneumonia, and 3,323 COVID-19 CXR scans. In Dataset-1, our model achieves 95.57% average accuracy for a three-class classification, and 99% precision, recall, and F1-scores for COVID-19 cases.
For Dataset-2, we have obtained 94.44% average accuracy, and 95% precision, recall, and F1-scores for detection of COVID-19. Our proposed multi-feature guided CNN achieves improved results compared to a single-feature CNN, demonstrating the importance of the local phase-based CXR image enhancement (https://github.com/endiqq/Fus-CNNs_COVID-19).

Submitted 14 April, 2021; v1 submitted 6 November, 2020; originally announced November 2020.

Comments: 16 pages, 9 figures

Journal ref: International Journal of Computer Assisted Radiology and Surgery, 2021

arXiv:2010.07408 [pdf, other]

Reconfigurable Intelligent Surface: Design the Channel -- a New Opportunity for Future Wireless Networks

Authors: Miguel Dajer, Zhengxiang Ma, Leonard Piazzi, Narayan Prasad, Xiao-Feng Qi, Baoling Sheen, ** Yang, Guosen Yue

Abstract: In this paper, we survey state-of-the-art research outcomes in the burgeoning field of reconfigurable intelligent surface (RIS) in view of its potential for significant performance enhancement for next generation wireless communication networks by means of adapting the propagation environment. Emphasis has been placed on several aspects gating the commercial viability of a future network deployment. Comprehensive summaries are provided for practical hardware design considerations and broad implications of artificial intelligence techniques, as are in-depth outlooks on salient aspects of system models, use cases, and physical layer optimization techniques.

Submitted 14 October, 2020; originally announced October 2020.

Comments: 22 pages, 18 figures

arXiv:2005.04901 [pdf]

doi 10.1002/mp.14800

A novel 3D multi-path DenseNet for improving automatic segmentation of glioblastoma on pre-operative multi-modal MR images

Authors: Jie Fu, Kamal Singhrao, X. Sharon Qi, Yingli Yang, Dan Ruan, John H. Lewis

Abstract: Convolutional neural networks have achieved excellent results in automatic medical image segmentation. In this study, we proposed a novel 3D multi-path DenseNet for generating the accurate glioblastoma (GBM) tumor contour from four multi-modal pre-operative MR images. We hypothesized that the multi-path architecture could achieve more accurate segmentation than a single-path architecture. 258 GBM patients were included in this study. Each patient had four MR images (T1-weighted, contrast-enhanced T1-weighted, T2-weighted, and FLAIR) and the manually segmented tumor contour. We built a 3D multi-path DenseNet that could be trained to generate the corresponding GBM tumor contour from the four MR images. A 3D single-path DenseNet was also built for comparison. Both DenseNets were based on the encoder-decoder architecture. All four images were concatenated and fed into a single encoder path in the single-path DenseNet, while each input image had its own encoder path in the multi-path DenseNet. The patient cohort was randomly split into a training set of 180 patients, a validation set of 39 patients, and a testing set of 39 patients. Model performance was evaluated using the Dice similarity coefficient (DSC), average surface distance (ASD), and 95% Hausdorff distance (HD95%). Wilcoxon signed-rank tests were conducted to examine the model differences.
The single-path DenseNet achieved a DSC of 0.911$\pm$0.060, ASD of 1.3$\pm$0.7 mm, and HD95% of 5.2$\pm$7.1 mm, while the multi-path DenseNet achieved a DSC of 0.922$\pm$0.041, ASD of 1.1$\pm$0.5 mm, and HD95% of 3.9$\pm$3.3 mm. The p-values of all Wilcoxon signed-rank tests were less than 0.05. Both 3D DenseNets generated GBM tumor contours in good agreement with the manually segmented contours from multi-modal MR images. The multi-path DenseNet achieved more accurate tumor segmentation than the single-path DenseNet.

Submitted 11 May, 2020; originally announced May 2020.

Comments: 15 pages, 6 figures, review in progress

Journal ref: 2021 Medical Physics

arXiv:2003.13898 [pdf, other]

Edge Guided GANs with Contrastive Learning for Semantic Image Synthesis

Authors: Hao Tang, Xiaojuan Qi, Guolei Sun, Dan Xu, Nicu Sebe, Radu Timofte, Luc Van Gool

Abstract: We propose a novel ECGAN for the challenging semantic image synthesis task. Although considerable improvement has been achieved, the quality of synthesized images is far from satisfactory due to three largely unresolved challenges. 1) The semantic labels do not provide detailed structural information, making it difficult to synthesize local details and structures. 2) The widely adopted CNN operations such as convolution, down-sampling, and normalization usually cause spatial resolution loss and thus cannot fully preserve the original semantic information, leading to semantically inconsistent results. 3) Existing semantic image synthesis methods focus on modeling local semantic information from a single input semantic layout. However, they ignore global semantic information of multiple input semantic layouts, i.e., semantic cross-relations between pixels across different input layouts. To tackle 1), we propose to use edge as an intermediate representation which is further adopted to guide image generation via a proposed attention guided edge transfer module. Edge information is produced by a convolutional generator and introduces detailed structure information. To tackle 2), we design an effective module to selectively highlight class-dependent feature maps according to the original semantic layout to preserve the semantic information.
To tackle 3), inspired by current methods in contrastive learning, we propose a novel contrastive learning method, which aims to enforce pixel embeddings belonging to the same semantic class to generate more similar image content than those from different classes. Doing so can capture more semantic relations by explicitly exploring the structures of labeled pixels from multiple input semantic layouts. Experiments on three challenging datasets show that our ECGAN achieves significantly better results than state-of-the-art methods.

Submitted 27 March, 2023; v1 submitted 30 March, 2020; originally announced March 2020.

arXiv:2001.03698 [pdf, other]

AE-OT-GAN: Training GANs from data specific latent distribution

Authors: Dongsheng An, Yang Guo, Min Zhang, Xin Qi, Na Lei, Shing-Tung Yau, Xianfeng Gu

Abstract: Though generative adversarial networks (GANs) are prominent models to generate realistic and crisp images, they often encounter the mode collapse problems and are hard to train, which comes from approximating the intrinsic discontinuous distribution transform map with continuous DNNs. The recently proposed AE-OT model addresses this problem by explicitly computing the discontinuous distribution transform map through solving a semi-discrete optimal transport (OT) map in the latent space of the autoencoder. However, the generated images are blurry. In this paper, we propose the AE-OT-GAN model to utilize the advantages of both models: generate high quality images and at the same time overcome the mode collapse/mixture problems. Specifically, we first faithfully embed the low dimensional image manifold into the latent space by training an autoencoder (AE). Then we compute the optimal transport (OT) map that pushes forward the uniform distribution to the latent distribution supported on the latent manifold. Finally, our GAN model is trained to generate high quality images from the latent distribution, the distribution transform map from which to the empirical data distribution will be continuous. The paired data between the latent code and the real images gives us further constriction about the generator. Experiments on simple MNIST dataset and complex datasets like Cifar-10 and CelebA show the efficacy and efficiency of our proposed method.

Submitted 27 January, 2020; v1 submitted 10 January, 2020; originally announced January 2020.

arXiv:1909.04012 [pdf]

doi 10.1088/1361-6560/ab7970

Deep Learning-based Radiomic Features for Improving Neoadjuvant Chemoradiation Response Prediction in Locally Advanced Rectal Cancer

Authors: Jie Fu, Xinran Zhong, Ning Li, Ritchell Van Dams, John Lewis, Kyunghyun Sung, Ann C. Raldow, Jing Jin, X. Sharon Qi

Abstract: Radiomic features achieve promising results in cancer diagnosis, treatment response prediction, and survival prediction. Our goal is to compare the handcrafted (explicitly designed) and deep learning (DL)-based radiomic features extracted from pre-treatment diffusion-weighted magnetic resonance images (DWIs) for predicting neoadjuvant chemoradiation treatment (nCRT) response in patients with locally advanced rectal cancer (LARC). 43 patients receiving nCRT were included. All patients underwent DWIs before nCRT and total mesorectal excision surgery 6-12 weeks after completion of nCRT. Gross tumor volume (GTV) contours were drawn by an experienced radiation oncologist on DWIs. The patient-cohort was split into the responder group (n=22) and the non-responder group (n=21) based on the post-nCRT response assessed by postoperative pathology, MRI or colonoscopy. Handcrafted and DL-based features were extracted from the apparent diffusion coefficient (ADC) map of the DWI using conventional computer-aided diagnosis methods and a pre-trained convolution neural network, respectively. Least absolute shrinkage and selection operator (LASSO)-logistic regression models were constructed using extracted features for predicting treatment response. The model performance was evaluated with repeated 20 times stratified 4-fold cross-validation using receiver operating characteristic (ROC) curves and compared using the corrected resampled t-test.
The model built with handcrafted features achieved the mean area under the ROC curve (AUC) of 0.64, while the one built with DL-based features yielded the mean AUC of 0.73. The corrected resampled t-test on AUC showed P-value < 0.05. DL-based features extracted from pre-treatment DWIs achieved significantly better classification performance compared with handcrafted features for predicting nCRT response in patients with LARC.

Submitted 9 September, 2019; originally announced September 2019.

Comments: Review in progress

Journal ref: 2020 Phys. Med. Biol

arXiv:1907.00482 [pdf, other]

Base Station Antenna Selection for Low-Resolution ADC Systems

Authors: Jinseok Choi, Junmo Sung, Narayan Prasad, Xiao-Feng Qi, Brian L. Evans, Alan Gatherer

Abstract: This paper investigates antenna selection at a base station with large antenna arrays and low-resolution analog-to-digital converters. For downlink transmit antenna selection for narrowband channels, we show (1) a selection criterion that maximizes sum rate with zero-forcing precoding equivalent to that of a perfect quantization system; (2) maximum sum rate increases with number of selected antennas; (3) derivation of the sum rate loss function from using a subset of antennas; and (4) unlike high-resolution converter systems, sum rate loss reaches a maximum at a point of total transmit power and decreases beyond that point to converge to zero. For wideband orthogonal-frequency-division-multiplexing (OFDM) systems, our results hold when entire subcarriers share a common subset of antennas. For uplink receive antenna selection for narrowband channels, we (1) generalize a greedy antenna selection criterion to capture tradeoffs between channel gain and quantization error; (2) propose a quantization-aware fast antenna selection algorithm using the criterion; and (3) derive a lower bound on sum rate achieved by the proposed algorithm based on submodular functions. For wideband OFDM systems, we extend our algorithm and derive a lower bound on its sum rate. Simulation results validate theoretical analyses and show increases in sum rate over conventional algorithms.

Submitted 30 June, 2019; originally announced July 2019.

Comments: Submitted to IEEE Transactions on Communications

arXiv:1904.09316 [pdf]

A Low Complexity Near-Maximum Likelihood MIMO Receiver with Low Resolution Analog-to-Digital Converters

Authors: Arkady Molev-Shteiman, Xiao-Feng Qi, Laurence Mailaender

Abstract: Based on a new equivalent model of quantizer with noisy input recently presented in [23], we propose a new low complexity receiver that takes into account the nonlinear distortion (NLD) generated by Analog to Digital converter (ADC) with insufficient resolution. The strength of new model is that it presents the NLD as a function of only the desired part of input signal (without noise). Therefore it can easily be used in a variety of NLD mitigation techniques. Here, as an illustration of this, we use a pseudo-ML approach to detect the original QAM modulation based on the equivalent transfer function and exhaustive search. Simulation results for a single user QAM under flat fading show performance equivalent to a true ML receiver, but with much lower computational complexity. The excellent performance of our receiver is an independent validation of the model [23].

Submitted 19 April, 2019; originally announced April 2019.

arXiv:1904.09312 [pdf]

Low Resolution Digital-to-Analog Converter with Digital Dithering for MIMO Transmitter

Authors: Arkady Molev-Shteiman, Xiao-Feng Qi, Laurence Mailaender

Abstract: Based on an equivalent model for quantizers with noisy inputs recently presented in [35], we propose a method of digital dithering at the transmitter that may significantly reduce the resolution requirements of MIMO downlink Digital to Analog Converters (DAC). We use this equivalent model to analyze the effect of the dither Probability Density Function (PDF), and show that the uniform PDF produces an optimal (linear) result. Relative to other methods of DAC quantization error reduction, our approach has the benefits of low computational complexity, compatibility with all existing standards, and blindness (no need for channel state information).

Submitted 19 April, 2019; originally announced April 2019.

arXiv:1904.08519 [pdf]

New equivalent model of quantizer with noisy input and its application for ADC resolution determination in an uplink MIMO receiver

Authors: Arkady Molev-Shteiman, Xiao-Feng Qi, Laurence Mailaender, Narayan Prasad, Bertrand Hochwald

Abstract: When a quantizer input signal is the sum of the desired signal and input white noise, the quantization error is a function of the total input signal. Our new equivalent model splits the quantization error into two components: a non-linear distortion (NLD) that is a function of only the desired part of the input signal (without noise), and an equivalent output white noise. This separation is important because these two terms affect MIMO system performance differently. This paper introduces our model, and applies it to determine the minimal Analog-to-Digital Converter (ADC) resolution necessary to operate a conventional MIMO receiver with negligible performance degradation. We also provide numerical simulations to confirm the theory. Broad ramifications of our model are further demonstrated in two companion papers presenting low-complexity suppression of the NLD arising from insufficient ADC resolution, and a digital dithering that significantly reduces the MIMO transmitter Digital-to-Analog Converters (DAC) resolution requirement.

Submitted 17 April, 2019; originally announced April 2019.

arXiv:1811.11102 [pdf]

Maximal Entropy Reduction Algorithm for SAR ADC Clock Compression

Authors: Arkady Molev-Shteiman, Xiao-Feng Qi

Abstract: Reduction of comparison cycles leads to power savings in successive-approximation-register (SAR) analog-to-digital converters (ADCs). We establish that the lowest average number of comparison cycles of a SAR ADC approaches the entropy of the ADC output, and propose a simple adaptive algorithm that approaches this lower bound. Today's SAR ADCs use binary search, which consumes more power than necessary for the non-uniform input distributions commonly found in practice. We refer to a SAR ADC employing such an algorithm as the maximal entropy reduction (MER) ADC.

Submitted 7 November, 2018; originally announced November 2018.

arXiv:1810.07522 [pdf, other]

Optimizing Beams and Bits: A Novel Approach for Massive MIMO Base-Station Design

Authors: Narayan Prasad, Xiao-Feng Qi, Alan Gatherer

Abstract: We consider the problem of jointly optimizing ADC bit resolution and analog beamforming over a frequency-selective massive MIMO uplink. We build upon a popular model to incorporate the impact of low bit resolution ADCs, that hitherto has mostly been employed over flat-fading systems. We adopt weighted sum rate (WSR) as our objective and show that WSR maximization under finite buffer limits and important practical constraints on choices of beams and ADC bit resolutions can equivalently be posed as constrained submodular set function maximization. This enables us to design a constant-factor approximation algorithm. Upon incorporating further enhancements we obtain an efficient algorithm that significantly outperforms state-of-the-art ones.

Submitted 26 February, 2019; v1 submitted 17 October, 2018; originally announced October 2018.

Comments: Tech. Report. Appeared in part in IEEE ICNC 2019. Added few more comments and corrected minor typos

arXiv:1312.2632 [pdf, other]

SEED: Public Energy and Environment Dataset for Optimizing HVAC Operation in Subway Stations

Authors: Yongcai Wang, Haoran Feng, Xiao Qi

Abstract: For sustainability and energy saving, the problem of optimizing the control of heating, ventilating, and air-conditioning (HVAC) systems has attracted great attention, but analyzing the signatures of thermal environments and HVAC systems and evaluating optimization policies has been inefficient and inconvenient due to the lack of a public dataset. In this paper, we present the Subway station Energy and Environment Dataset (SEED), which was collected from a line of Beijing subway stations, providing minute-resolution data regarding the environment dynamics (temperature, humidity, CO2, etc.), working states and energy consumptions of the HVAC systems (ventilators, refrigerators, pumps), and hour-resolution data of passenger flows. We describe the sensor deployments and the HVAC systems for data collection and for environment control, and also present an initial investigation of the energy disaggregation of the HVAC system and the signatures of the thermal load, cooling supply, and passenger flow using the dataset.

Submitted 9 December, 2013; originally announced December 2013.

Comments: 5 pages, 14 figures

Showing 1–42 of 42 results for author: Qi, X