Search | arXiv e-print repository

A Multi-Resolution Mutual Learning Network for Multi-Label ECG Classification

Authors: Wei Huang, Ning Wang, Panpan Feng, Haiyan Wang, Zongmin Wang, Bing Zhou

Abstract: Electrocardiograms (ECG), which record the electrophysiological activity of the heart, have become a crucial tool for diagnosing these diseases. In recent years, the application of deep learning techniques has significantly improved the performance of ECG signal classification. Multi-resolution feature analysis, which captures and processes information at different time scales, can extract subtle… ▽ More Electrocardiograms (ECG), which record the electrophysiological activity of the heart, have become a crucial tool for diagnosing these diseases. In recent years, the application of deep learning techniques has significantly improved the performance of ECG signal classification. Multi-resolution feature analysis, which captures and processes information at different time scales, can extract subtle changes and overall trends in ECG signals, showing unique advantages. However, common multi-resolution analysis methods based on simple feature addition or concatenation may lead to the neglect of low-resolution features, affecting model performance. To address this issue, this paper proposes the Multi-Resolution Mutual Learning Network (MRM-Net). MRM-Net includes a dual-resolution attention architecture and a feature complementary mechanism. The dual-resolution attention architecture processes high-resolution and low-resolution features in parallel. Through the attention mechanism, the high-resolution and low-resolution branches can focus on subtle waveform changes and overall rhythm patterns, enhancing the ability to capture critical features in ECG signals. Meanwhile, the feature complementary mechanism introduces mutual feature learning after each layer of the feature extractor. This allows features at different resolutions to reinforce each other, thereby reducing information loss and improving model performance and robustness. Experiments on the PTB-XL and CPSC2018 datasets demonstrate that MRM-Net significantly outperforms existing methods in multi-label ECG classification performance. The code for our framework will be publicly available at https://github.com/wxhdf/MRM. △ Less

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.01802 [pdf, other]

Motion Planning for Hybrid Dynamical Systems: Framework, Algorithm Template, and a Sampling-based Approach

Authors: Nan Wang, Ricardo G. Sanfelice

Abstract: This paper focuses on the motion planning problem for the systems exhibiting both continuous and discrete behaviors, which we refer to as hybrid dynamical systems. Firstly, the motion planning problem for hybrid systems is formulated using the hybrid equation framework, which is general to capture most hybrid systems. Secondly, a propagation algorithm template is proposed that describes a general… ▽ More This paper focuses on the motion planning problem for the systems exhibiting both continuous and discrete behaviors, which we refer to as hybrid dynamical systems. Firstly, the motion planning problem for hybrid systems is formulated using the hybrid equation framework, which is general to capture most hybrid systems. Secondly, a propagation algorithm template is proposed that describes a general framework to solve the motion planning problem for hybrid systems. Thirdly, a rapidly-exploring random trees (RRT) implementation of the proposed algorithm template is designed to solve the motion planning problem for hybrid systems. At each iteration, the proposed algorithm, called HyRRT, randomly picks a state sample and extends the search tree by flow or jump, which is also chosen randomly when both regimes are possible. Through a definition of concatenation of functions defined on hybrid time domains, we show that HyRRT is probabilistically complete, namely, the probability of failing to find a motion plan approaches zero as the number of iterations of the algorithm increases. This property is guaranteed under mild conditions on the data defining the motion plan, which include a relaxation of the usual positive clearance assumption imposed in the literature of classical systems. The motion plan is computed through the solution of two optimization problems, one associated with the flow and the other with the jumps of the system. The proposed algorithm is applied to an actuated bouncing ball system and a walking robot system so as to highlight its generality and computational features. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: 33 pages, 8 figures. Submitted to IJRR. arXiv admin note: text overlap with arXiv:2210.15082

arXiv:2404.16484 [pdf, other]

Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, **shan Pan, Jiangxin Dong, **hui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi **, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu , et al. (50 additional authors not shown)

Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod… ▽ More This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF codec, instead of JPEG. All the proposed methods improve PSNR fidelity over Lanczos interpolation, and process images under 10ms. Out of the 160 participants, 25 teams submitted their code and models. The solutions present novel designs tailored for memory-efficiency and runtime on edge devices. This survey describes the best solutions for real-time SR of compressed high-resolution images. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: CVPR 2024, AI for Streaming (AIS) Workshop

arXiv:2404.13388 [pdf]

Diagnosis of Multiple Fundus Disorders Amidst a Scarcity of Medical Experts Via Self-supervised Machine Learning

Authors: Yong Liu, Mengtian Kang, Shuo Gao, Chi Zhang, Ying Liu, Shiming Li, Yue Qi, Arokia Nathan, Wenjun Xu, Chenyu Tang, Edoardo Occhipinti, Mayinuer Yusufu, Ningli Wang, Weiling Bai, Luigi Occhipinti

Abstract: Fundus diseases are major causes of visual impairment and blindness worldwide, especially in underdeveloped regions, where the shortage of ophthalmologists hinders timely diagnosis. AI-assisted fundus image analysis has several advantages, such as high accuracy, reduced workload, and improved accessibility, but it requires a large amount of expert-annotated data to build reliable models. To addres… ▽ More Fundus diseases are major causes of visual impairment and blindness worldwide, especially in underdeveloped regions, where the shortage of ophthalmologists hinders timely diagnosis. AI-assisted fundus image analysis has several advantages, such as high accuracy, reduced workload, and improved accessibility, but it requires a large amount of expert-annotated data to build reliable models. To address this dilemma, we propose a general self-supervised machine learning framework that can handle diverse fundus diseases from unlabeled fundus images. Our method's AUC surpasses existing supervised approaches by 15.7%, and even exceeds performance of a single human expert. Furthermore, our model adapts well to various datasets from different regions, races, and heterogeneous image sources or qualities from multiple cameras or devices. Our method offers a label-free general framework to diagnose fundus diseases, which could potentially benefit telehealth programs for early screening of people at risk of vision loss. △ Less

Submitted 23 April, 2024; v1 submitted 20 April, 2024; originally announced April 2024.

arXiv:2404.13386 [pdf]

SSVT: Self-Supervised Vision Transformer For Eye Disease Diagnosis Based On Fundus Images

Authors: Jiaqi Wang, Mengtian Kang, Yong Liu, Chi Zhang, Ying Liu, Shiming Li, Yue Qi, Wenjun Xu, Chenyu Tang, Edoardo Occhipinti, Mayinuer Yusufu, Ningli Wang, Weiling Bai, Shuo Gao, Luigi G. Occhipinti

Abstract: Machine learning-based fundus image diagnosis technologies trigger worldwide interest owing to their benefits such as reducing medical resource power and providing objective evaluation results. However, current methods are commonly based on supervised methods, bringing in a heavy workload to biomedical staff and hence suffering in expanding effective databases. To address this issue, in this artic… ▽ More Machine learning-based fundus image diagnosis technologies trigger worldwide interest owing to their benefits such as reducing medical resource power and providing objective evaluation results. However, current methods are commonly based on supervised methods, bringing in a heavy workload to biomedical staff and hence suffering in expanding effective databases. To address this issue, in this article, we established a label-free method, name 'SSVT',which can automatically analyze un-labeled fundus images and generate high evaluation accuracy of 97.0% of four main eye diseases based on six public datasets and two datasets collected by Bei**g Tongren Hospital. The promising results showcased the effectiveness of the proposed unsupervised learning method, and the strong application potential in biomedical resource shortage regions to improve global eye health. △ Less

Submitted 20 April, 2024; originally announced April 2024.

Comments: ISBI 2024

arXiv:2403.18413 [pdf, ps, other]

HyRRT-Connect: A Bidirectional Rapidly-Exploring Random Trees Motion Planning Algorithm for Hybrid Systems

Authors: Nan Wang, Ricardo G. Sanfelice

Abstract: This paper proposes a bidirectional rapidly-exploring random trees (RRT) algorithm to solve the motion planning problem for hybrid systems. The proposed algorithm, called HyRRT-Connect, propagates in both forward and backward directions in hybrid time until an overlap between the forward and backward propagation results is detected. Then, HyRRT-Connect constructs a motion plan through the reversal… ▽ More This paper proposes a bidirectional rapidly-exploring random trees (RRT) algorithm to solve the motion planning problem for hybrid systems. The proposed algorithm, called HyRRT-Connect, propagates in both forward and backward directions in hybrid time until an overlap between the forward and backward propagation results is detected. Then, HyRRT-Connect constructs a motion plan through the reversal and concatenation of functions defined on hybrid time domains, ensuring the motion plan thoroughly satisfies the given hybrid dynamics. To address the potential discontinuity along the flow caused by tolerating some distance between the forward and backward partial motion plans, we reconstruct the backward partial motion plan by a forward-in-hybrid-time simulation from the final state of the forward partial motion plan. By applying the reversed input of the backward partial motion plan, the reconstruction process effectively eliminates the discontinuity and ensures that as the tolerance distance decreases to zero, the distance between the endpoint of the reconstructed motion plan and the final state set approaches zero. The proposed algorithm is applied to an actuated bouncing ball example and a walking robot example so as to highlight its generality and computational improvement. △ Less

Submitted 29 March, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

Comments: Accepted by the 8th IFAC International Conference on Analysis and Design of Hybrid Systems (ADHS 2024)

arXiv:2402.04171 [pdf, other]

3D Volumetric Super-Resolution in Radiology Using 3D RRDB-GAN

Authors: Juhyung Ha, Nian Wang, Surendra Maharjan, Xuhong Zhang

Abstract: This study introduces the 3D Residual-in-Residual Dense Block GAN (3D RRDB-GAN) for 3D super-resolution for radiology imagery. A key aspect of 3D RRDB-GAN is the integration of a 2.5D perceptual loss function, which contributes to improved volumetric image quality and realism. The effectiveness of our model was evaluated through 4x super-resolution experiments across diverse datasets, including Mi… ▽ More This study introduces the 3D Residual-in-Residual Dense Block GAN (3D RRDB-GAN) for 3D super-resolution for radiology imagery. A key aspect of 3D RRDB-GAN is the integration of a 2.5D perceptual loss function, which contributes to improved volumetric image quality and realism. The effectiveness of our model was evaluated through 4x super-resolution experiments across diverse datasets, including Mice Brain MRH, OASIS, HCP1200, and MSD-Task-6. These evaluations, encompassing both quantitative metrics like LPIPS and FID and qualitative assessments through sample visualizations, demonstrate the models effectiveness in detailed image analysis. The 3D RRDB-GAN offers a significant contribution to medical imaging, particularly by enriching the depth, clarity, and volumetric detail of medical images. Its application shows promise in enhancing the interpretation and analysis of complex medical imagery from a comprehensive 3D perspective. △ Less

Submitted 6 February, 2024; originally announced February 2024.

arXiv:2402.02699 [pdf, other]

Adversarial Data Augmentation for Robust Speaker Verification

Authors: Zhenyu Zhou, Junhui Chen, Namin Wang, Lantian Li, Dong Wang

Abstract: Data augmentation (DA) has gained widespread popularity in deep speaker models due to its ease of implementation and significant effectiveness. It enriches training data by simulating real-life acoustic variations, enabling deep neural networks to learn speaker-related representations while disregarding irrelevant acoustic variations, thereby improving robustness and generalization. However, a pot… ▽ More Data augmentation (DA) has gained widespread popularity in deep speaker models due to its ease of implementation and significant effectiveness. It enriches training data by simulating real-life acoustic variations, enabling deep neural networks to learn speaker-related representations while disregarding irrelevant acoustic variations, thereby improving robustness and generalization. However, a potential issue with the vanilla DA is augmentation residual, i.e., unwanted distortion caused by different types of augmentation. To address this problem, this paper proposes a novel approach called adversarial data augmentation (A-DA) which combines DA with adversarial learning. Specifically, it involves an additional augmentation classifier to categorize various augmentation types used in data augmentation. This adversarial learning empowers the network to generate speaker embeddings that can deceive the augmentation classifier, making the learned speaker embeddings more robust in the face of augmentation variations. Experiments conducted on VoxCeleb and CN-Celeb datasets demonstrate that our proposed A-DA outperforms standard DA in both augmentation matched and mismatched test conditions, showcasing its superior robustness and generalization against acoustic variations. △ Less

Submitted 4 February, 2024; originally announced February 2024.

arXiv:2401.03623 [pdf]

A Video Coding Method Based on Neural Network for CLIC2024

Authors: Zhengang Li, **gchi Zhang, Yonghua Wang, Xing Zeng, Zhen Zhang, Yunlin Long, Menghu Jia, Ning Wang

Abstract: This paper presents a video coding scheme that combines traditional optimization methods with deep learning methods based on the Enhanced Compression Model (ECM). In this paper, the traditional optimization methods adaptively adjust the quantization parameter (QP). The key frame QP offset is set according to the video content characteristics, and the coding tree unit (CTU) level QP of all frames i… ▽ More This paper presents a video coding scheme that combines traditional optimization methods with deep learning methods based on the Enhanced Compression Model (ECM). In this paper, the traditional optimization methods adaptively adjust the quantization parameter (QP). The key frame QP offset is set according to the video content characteristics, and the coding tree unit (CTU) level QP of all frames is also adjusted according to the spatial-temporal perception information. Block importance map** technology (BIM) is also introduced, which adjusts the QP according to the block importance. Meanwhile, the deep learning methods propose a convolutional neural network-based loop filter (CNNLF), which is turned on/off based on the rate-distortion optimization at the CTU and frame level. Besides, intra-prediction using neural networks (NN-intra) is proposed to further improve compression quality, where 8 neural networks are used for predicting blocks of different sizes. The experimental results show that compared with ECM-3.0, the proposed traditional methods and adding deep learning methods improve the PSNR by 0.54 dB and 1 dB at 0.05Mbps, respectively; 0.38 dB and 0.71dB at 0.5 Mbps, respectively, which proves the superiority of our method. △ Less

Submitted 7 January, 2024; originally announced January 2024.

arXiv:2401.00153 [pdf, other]

USFM: A Universal Ultrasound Foundation Model Generalized to Tasks and Organs towards Label Efficient Image Analysis

Authors: **g Jiao, ** Zhou, Xiaokang Li, Menghua Xia, Yi Huang, Lihong Huang, Na Wang, Xiaofan Zhang, Shichong Zhou, Yuanyuan Wang, Yi Guo

Abstract: Inadequate generality across different organs and tasks constrains the application of ultrasound (US) image analysis methods in smart healthcare. Building a universal US foundation model holds the potential to address these issues. Nevertheless, the development of such foundational models encounters intrinsic challenges in US analysis, i.e., insufficient databases, low quality, and ineffective fea… ▽ More Inadequate generality across different organs and tasks constrains the application of ultrasound (US) image analysis methods in smart healthcare. Building a universal US foundation model holds the potential to address these issues. Nevertheless, the development of such foundational models encounters intrinsic challenges in US analysis, i.e., insufficient databases, low quality, and ineffective features. In this paper, we present a universal US foundation model, named USFM, generalized to diverse tasks and organs towards label efficient US image analysis. First, a large-scale Multi-organ, Multi-center, and Multi-device US database was built, comprehensively containing over two million US images. Organ-balanced sampling was employed for unbiased learning. Then, USFM is self-supervised pre-trained on the sufficient US database. To extract the effective features from low-quality US images, we proposed a spatial-frequency dual masked image modeling method. A productive spatial noise addition-recovery approach was designed to learn meaningful US information robustly, while a novel frequency band-stop masking learning approach was also employed to extract complex, implicit grayscale distribution and textural variations. Extensive experiments were conducted on the various tasks of segmentation, classification, and image enhancement from diverse organs and diseases. Comparisons with representative US image analysis models illustrate the universality and effectiveness of USFM. The label efficiency experiments suggest the USFM obtains robust performance with only 20% annotation, laying the groundwork for the rapid development of US models in clinical practices. △ Less

Submitted 2 January, 2024; v1 submitted 30 December, 2023; originally announced January 2024.

Comments: Submit to MedIA, 17 pages, 11 figures

arXiv:2312.13523 [pdf]

doi 10.1002/mrm.29990

High-resolution myelin-water fraction and quantitative relaxation map** using 3D ViSTa-MR fingerprinting

Authors: Congyu Liao, Xiaozhi Cao, Siddharth Srinivasan Iyer, Sophie Schauman, Zihan Zhou, Xiaoqian Yan, Quan Chen, Zhitao Li, Nan Wang, Ting Gong, Zhe Wu, Hongjian He, Jianhui Zhong, Yang Yang, Adam Kerr, Kalanit Grill-Spector, Kawin Setsompop

Abstract: Purpose: This study aims to develop a high-resolution whole-brain multi-parametric quantitative MRI approach for simultaneous map** of myelin-water fraction (MWF), T1, T2, and proton-density (PD), all within a clinically feasible scan time. Methods: We developed 3D ViSTa-MRF, which combined Visualization of Short Transverse relaxation time component (ViSTa) technique with MR Fingerprinting (MR… ▽ More Purpose: This study aims to develop a high-resolution whole-brain multi-parametric quantitative MRI approach for simultaneous map** of myelin-water fraction (MWF), T1, T2, and proton-density (PD), all within a clinically feasible scan time. Methods: We developed 3D ViSTa-MRF, which combined Visualization of Short Transverse relaxation time component (ViSTa) technique with MR Fingerprinting (MRF), to achieve high-fidelity whole-brain MWF and T1/T2/PD map** on a clinical 3T scanner. To achieve fast acquisition and memory-efficient reconstruction, the ViSTa-MRF sequence leverages an optimized 3D tiny-golden-angle-shuffling spiral-projection acquisition and joint spatial-temporal subspace reconstruction with optimized preconditioning algorithm. With the proposed ViSTa-MRF approach, high-fidelity direct MWF map** was achieved without a need for multi-compartment fitting that could introduce bias and/or noise from additional assumptions or priors. Results: The in-vivo results demonstrate the effectiveness of the proposed acquisition and reconstruction framework to provide fast multi-parametric map** with high SNR and good quality. The in-vivo results of 1mm- and 0.66mm-iso datasets indicate that the MWF values measured by the proposed method are consistent with standard ViSTa results that are 30x slower with lower SNR. Furthermore, we applied the proposed method to enable 5-minute whole-brain 1mm-iso assessment of MWF and T1/T2/PD map**s for infant brain development and for post-mortem brain samples. Conclusions: In this work, we have developed a 3D ViSTa-MRF technique that enables the acquisition of whole-brain MWF, quantitative T1, T2, and PD maps at 1mm and 0.66mm isotropic resolution in 5 and 15 minutes, respectively. This advancement allows for quantitative investigations of myelination changes in the brain. △ Less

Submitted 20 December, 2023; originally announced December 2023.

Comments: 38 pages, 12 figures and 1 table

Journal ref: Magnetic Resonance in Medicine 2023

arXiv:2311.18167 [pdf, other]

doi 10.1109/JIOT.2023.3337131

Throughput Maximization for Intelligent Refracting Surface Assisted mmWave High-Speed Train Communications

Authors: **g Li, Yong Niu, Hao Wu, Bo Ai, Ruisi He, Ning Wang, Sheng Chen

Abstract: With the increasing demands from passengers for data-intensive services, millimeter-wave (mmWave) communication is considered as an effective technique to release the transmission pressure on high speed train (HST) networks. However, mmWave signals ncounter severe losses when passing through the carriage, which decreases the quality of services on board. In this paper, we investigate an intelligen… ▽ More With the increasing demands from passengers for data-intensive services, millimeter-wave (mmWave) communication is considered as an effective technique to release the transmission pressure on high speed train (HST) networks. However, mmWave signals ncounter severe losses when passing through the carriage, which decreases the quality of services on board. In this paper, we investigate an intelligent refracting surface (IRS)-assisted HST communication system. Herein, an IRS is deployed on the train window to dynamically reconfigure the propagation environment, and a hybrid time division multiple access-nonorthogonal multiple access scheme is leveraged for interference mitigation. We aim to maximize the overall throughput while taking into account the constraints imposed by base station beamforming, IRS discrete phase shifts and transmit power. To obtain a practical solution, we employ an alternating optimization method and propose a two-stage algorithm. In the first stage, the successive convex approximation method and branch and bound algorithm are leveraged for IRS phase shift design. In the second stage, the Lagrangian multiplier method is utilized for power allocation. Simulation results demonstrate the benefits of IRS adoption and power allocation for throughput improvement in mmWave HST networks. △ Less

Submitted 29 November, 2023; originally announced November 2023.

Comments: 13 pages, 7 figures, IEEE Internet of Things Journal

arXiv:2311.07128 [pdf, other]

doi 10.1109/TVT.2023.3331707

Sum Rate Maximization under AoI Constraints for RIS-Assisted mmWave Communications

Authors: Ziqi Guo, Yong Niu, Shiwen Mao, Changming Zhang, Ning Wang, Zhangdui Zhong, Bo Ai

Abstract: The concept of age of information (AoI) has been proposed to quantify information freshness, which is crucial for time-sensitive applications. However, in millimeter wave (mmWave) communication systems, the link blockage caused by obstacles and the severe path loss greatly impair the freshness of information received by the user equipments (UEs). In this paper, we focus on reconfigurable intellige… ▽ More The concept of age of information (AoI) has been proposed to quantify information freshness, which is crucial for time-sensitive applications. However, in millimeter wave (mmWave) communication systems, the link blockage caused by obstacles and the severe path loss greatly impair the freshness of information received by the user equipments (UEs). In this paper, we focus on reconfigurable intelligent surface (RIS)-assisted mmWave communications, where beamforming is performed at transceivers to provide directional beam gain and a RIS is deployed to combat link blockage. We aim to maximize the system sum rate while satisfying the information freshness requirements of UEs by jointly optimizing the beamforming at transceivers, the discrete RIS reflection coefficients, and the UE scheduling strategy. To facilitate a practical solution, we decompose the problem into two subproblems. For the first per-UE data rate maximization problem, we further decompose it into a beamforming optimization subproblem and a RIS reflection coefficient optimization subproblem. Considering the difficulty of channel estimation, we utilize the hierarchical search method for the former and the local search method for the latter, and then adopt the block coordinate descent (BCD) method to alternately solve them. For the second scheduling strategy design problem, a low-complexity heuristic scheduling algorithm is designed. Simulation results show that the proposed algorithm can effectively improve the system sum rate while satisfying the information freshness requirements of all UEs. △ Less

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2310.04992 [pdf, other]

VisionFM: a Multi-Modal Multi-Task Vision Foundation Model for Generalist Ophthalmic Artificial Intelligence

Authors: Jianing Qiu, Jian Wu, Hao Wei, Peilun Shi, Minqing Zhang, Yunyun Sun, Lin Li, Hanruo Liu, Hongyi Liu, Simeng Hou, Yuyang Zhao, Xuehui Shi, Junfang Xian, Xiaoxia Qu, Sirui Zhu, Lijie Pan, Xiaoniao Chen, Xiaojia Zhang, Shuai Jiang, Kebing Wang, Chenlong Yang, Mingqiang Chen, Sujie Fan, Jianhua Hu, Aiguo Lv , et al. (17 additional authors not shown)

Abstract: We present VisionFM, a foundation model pre-trained with 3.4 million ophthalmic images from 560,457 individuals, covering a broad range of ophthalmic diseases, modalities, imaging devices, and demography. After pre-training, VisionFM provides a foundation to foster multiple ophthalmic artificial intelligence (AI) applications, such as disease screening and diagnosis, disease prognosis, subclassifi… ▽ More We present VisionFM, a foundation model pre-trained with 3.4 million ophthalmic images from 560,457 individuals, covering a broad range of ophthalmic diseases, modalities, imaging devices, and demography. After pre-training, VisionFM provides a foundation to foster multiple ophthalmic artificial intelligence (AI) applications, such as disease screening and diagnosis, disease prognosis, subclassification of disease phenotype, and systemic biomarker and disease prediction, with each application enhanced with expert-level intelligence and accuracy. The generalist intelligence of VisionFM outperformed ophthalmologists with basic and intermediate levels in jointly diagnosing 12 common ophthalmic diseases. Evaluated on a new large-scale ophthalmic disease diagnosis benchmark database, as well as a new large-scale segmentation and detection benchmark database, VisionFM outperformed strong baseline deep neural networks. The ophthalmic image representations learned by VisionFM exhibited noteworthy explainability, and demonstrated strong generalizability to new ophthalmic modalities, disease spectrum, and imaging devices. As a foundation model, VisionFM has a large capacity to learn from diverse ophthalmic imaging data and disparate datasets. To be commensurate with this capacity, in addition to the real data used for pre-training, we also generated and leveraged synthetic ophthalmic imaging data. Experimental results revealed that synthetic data that passed visual Turing tests, can also enhance the representation learning capability of VisionFM, leading to substantial performance gains on downstream ophthalmic AI tasks. Beyond the ophthalmic AI applications developed, validated, and demonstrated in this work, substantial further applications can be achieved in an efficient and cost-effective manner using VisionFM as the foundation. △ Less

Submitted 7 October, 2023; originally announced October 2023.

arXiv:2309.14158 [pdf, other]

An Investigation of Distribution Alignment in Multi-Genre Speaker Recognition

Authors: Zhenyu Zhou, Junhui Chen, Namin Wang, Lantian Li, Dong Wang

Abstract: Multi-genre speaker recognition is becoming increasingly popular due to its ability to better represent the complexities of real-world applications. However, a major challenge is the significant shift in the distribution of speaker vectors across different genres. While distribution alignment is a common approach to address this challenge, previous studies have mainly focused on aligning a source… ▽ More Multi-genre speaker recognition is becoming increasingly popular due to its ability to better represent the complexities of real-world applications. However, a major challenge is the significant shift in the distribution of speaker vectors across different genres. While distribution alignment is a common approach to address this challenge, previous studies have mainly focused on aligning a source domain with a target domain, and the performance of multi-genre data is unknown. This paper presents a comprehensive study of mainstream distribution alignment methods on multi-genre data, where multiple distributions need to be aligned. We analyze various methods both qualitatively and quantitatively. Our experiments on the CN-Celeb dataset show that within-between distribution alignment (WBDA) performs relatively better. However, we also found that none of the investigated methods consistently improved performance in all test cases. This suggests that solely aligning the distributions of speaker vectors may not fully address the challenges posed by multi-genre speaker recognition. Further investigation is necessary to develop a more comprehensive solution. △ Less

Submitted 25 September, 2023; originally announced September 2023.

Comments: submitted to ICASSP 2024

arXiv:2308.09929 [pdf, other]

RIS-assisted High-Speed Railway Integrated Sensing and Communication System

Authors: Panpan Li, Yong Niu, Hao Wu, Zhu Han, Guiqi Sun, Ning Wang, Zhangdui Zhong, Bo Ai

Abstract: One technology that has the potential to improve wireless communications in years to come is integrated sensing and communication (ISAC). In this study, we take advantage of reconfigurable intelligent surface's (RIS) potential advantages to achieve ISAC while using the same frequency and resources. Specifically, by using the reflecting elements, the RIS dynamically modifies the radio waves' streng… ▽ More One technology that has the potential to improve wireless communications in years to come is integrated sensing and communication (ISAC). In this study, we take advantage of reconfigurable intelligent surface's (RIS) potential advantages to achieve ISAC while using the same frequency and resources. Specifically, by using the reflecting elements, the RIS dynamically modifies the radio waves' strength or phase in order to change the environment for radio transmission and increase the ISAC systems' transmission rate. We investigate a single cell downlink communication situation with RIS assistance. Combining the ISAC base station's (BS) beamforming with RIS's discrete phase shift optimization, while guaranteeing the sensing signal, The aim of optimizing the sum rate is specified. We take advantage of alternating maximization to find practical solutions with dividing the challenge into two minor issues. The first power allocation subproblem is non-convex that CVX solves by converting it to convex. A local search strategy is used to solve the second subproblem of phase shift optimization. According to the results of the simulation, using RIS with adjusted phase shifts can significantly enhance the ISAC system's performance. △ Less

Submitted 19 August, 2023; originally announced August 2023.

Comments: 12 pages

arXiv:2307.12845 [pdf, other]

Multi-View Vertebra Localization and Identification from CT Images

Authors: Han Wu, Jiadong Zhang, Yu Fang, Zhentao Liu, Nizhuan Wang, Zhiming Cui, Dinggang Shen

Abstract: Accurately localizing and identifying vertebrae from CT images is crucial for various clinical applications. However, most existing efforts are performed on 3D with crop** patch operation, suffering from the large computation costs and limited global information. In this paper, we propose a multi-view vertebra localization and identification from CT images, converting the 3D problem into a 2D lo… ▽ More Accurately localizing and identifying vertebrae from CT images is crucial for various clinical applications. However, most existing efforts are performed on 3D with crop** patch operation, suffering from the large computation costs and limited global information. In this paper, we propose a multi-view vertebra localization and identification from CT images, converting the 3D problem into a 2D localization and identification task on different views. Without the limitation of the 3D cropped patch, our method can learn the multi-view global information naturally. Moreover, to better capture the anatomical structure information from different view perspectives, a multi-view contrastive learning strategy is developed to pre-train the backbone. Additionally, we further propose a Sequence Loss to maintain the sequential structure embedded along the vertebrae. Evaluation results demonstrate that, with only two 2D networks, our method can localize and identify vertebrae in CT images accurately, and outperforms the state-of-the-art methods consistently. Our code is available at https://github.com/ShanghaiTech-IMPACT/Multi-View-Vertebra-Localization-and-Identification-from-CT-Images. △ Less

Submitted 24 July, 2023; originally announced July 2023.

Comments: MICCAI 2023

arXiv:2307.12634 [pdf, other]

Automatic lobe segmentation using attentive cross entropy and end-to-end fissure generation

Authors: Qi Su, Na Wang, Jiawen Xie, Yinan Chen, Xiaofan Zhang

Abstract: The automatic lung lobe segmentation algorithm is of great significance for the diagnosis and treatment of lung diseases, however, which has great challenges due to the incompleteness of pulmonary fissures in lung CT images and the large variability of pathological features. Therefore, we propose a new automatic lung lobe segmentation framework, in which we urge the model to pay attention to the a… ▽ More The automatic lung lobe segmentation algorithm is of great significance for the diagnosis and treatment of lung diseases, however, which has great challenges due to the incompleteness of pulmonary fissures in lung CT images and the large variability of pathological features. Therefore, we propose a new automatic lung lobe segmentation framework, in which we urge the model to pay attention to the area around the pulmonary fissure during the training process, which is realized by a task-specific loss function. In addition, we introduce an end-to-end pulmonary fissure generation method in the auxiliary pulmonary fissure segmentation task, without any additional network branch. Finally, we propose a registration-based loss function to alleviate the convergence difficulty of the Dice loss supervised pulmonary fissure segmentation task. We achieve 97.83% and 94.75% dice scores on our private dataset STLB and public LUNA16 dataset respectively. △ Less

Submitted 24 July, 2023; originally announced July 2023.

Comments: 5 pages, 3 figures, published to 'IEEE International Symposium on Biomedical Imaging (ISBI) 2023'

arXiv:2306.10548 [pdf, other]

MARBLE: Music Audio Representation Benchmark for Universal Evaluation

Authors: Ruibin Yuan, Yinghao Ma, Yizhi Li, Ge Zhang, Xingran Chen, Hanzhi Yin, Le Zhuo, Yiqi Liu, Jiawen Huang, Zeyue Tian, Binyue Deng, Ningzhi Wang, Chenghua Lin, Emmanouil Benetos, Anton Ragni, Norbert Gyenge, Roger Dannenberg, Wenhu Chen, Gus Xia, Wei Xue, Si Liu, Shi Wang, Ruibo Liu, Yike Guo, Jie Fu

Abstract: In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue… ▽ More In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks on 8 public-available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines. Besides, MARBLE offers an easy-to-use, extendable, and reproducible suite for the community, with a clear statement on copyright issues on datasets. Results suggest recently proposed large-scale pre-trained musical language models perform the best in most tasks, with room for further improvement. The leaderboard and toolkit repository are published at https://marble-bm.shef.ac.uk to promote future music AI research. △ Less

Submitted 23 November, 2023; v1 submitted 18 June, 2023; originally announced June 2023.

Comments: camera-ready version for NeurIPS 2023

arXiv:2306.02894 [pdf, ps, other]

Recyclable Semi-supervised Method Based on Multi-model Ensemble for Video Scene Parsing

Authors: Biao Wu, Shaoli Liu, Diankai Zhang, Chengjian Zheng, Si Gao, Xiaofeng Zhang, Ning Wang

Abstract: Pixel-level Scene Understanding is one of the fundamental problems in computer vision, which aims at recognizing object classes, masks and semantics of each pixel in the given image. Since the real-world is actually video-based rather than a static state, learning to perform video semantic segmentation is more reasonable and practical for realistic applications. In this paper, we adopt Mask2Former… ▽ More Pixel-level Scene Understanding is one of the fundamental problems in computer vision, which aims at recognizing object classes, masks and semantics of each pixel in the given image. Since the real-world is actually video-based rather than a static state, learning to perform video semantic segmentation is more reasonable and practical for realistic applications. In this paper, we adopt Mask2Former as architecture and ViT-Adapter as backbone. Then, we propose a recyclable semi-supervised training method based on multi-model ensemble. Our method achieves the mIoU scores of 62.97% and 65.83% on Development test and final test respectively. Finally, we obtain the 2nd place in the Video Scene Parsing in the Wild Challenge at CVPR 2023. △ Less

Submitted 5 June, 2023; originally announced June 2023.

arXiv:2305.18649 [pdf, ps, other]

HySST: A Stable Sparse Rapidly-Exploring Random Trees Optimal Motion Planning Algorithm for Hybrid Dynamical Systems

Authors: Nan Wang, Ricardo G. Sanfelice

Abstract: This paper proposes a stable sparse rapidly-exploring random trees (SST) algorithm to solve the optimal motion planning problem for hybrid systems. At each iteration, the proposed algorithm, called HySST, selects a vertex with the lowest cost among all the vertices within the neighborhood of a randomly selected sample and then extends the search tree by flow or jump, which is also chosen randomly… ▽ More This paper proposes a stable sparse rapidly-exploring random trees (SST) algorithm to solve the optimal motion planning problem for hybrid systems. At each iteration, the proposed algorithm, called HySST, selects a vertex with the lowest cost among all the vertices within the neighborhood of a randomly selected sample and then extends the search tree by flow or jump, which is also chosen randomly when both regimes are possible. In addition, HySST maintains a static set of witness points such that all the vertices within the neighborhood of each witness are pruned except the vertex with the lowest cost. Through a definition of concatenation of functions defined on hybrid time domains, we show that HySST is asymptotically near optimal, namely, the probability of failing to find a motion plan such that its cost is close to the optimal cost approaches zero as the number of iterations of the algorithm increases to infinity. This property is guaranteed under mild conditions on the data defining the motion plan, which include a relaxation of the usual positive clearance assumption imposed in the literature of classical systems. The proposed algorithm is applied to an actuated bouncing ball system and a collision-resilient tensegrity multicopter system so as to highlight its generality and computational features. △ Less

Submitted 29 May, 2023; originally announced May 2023.

Comments: This paper has been submitted to the 2023 Conference of Decision and Control (CDC). arXiv admin note: substantial text overlap with arXiv:2210.15082

ACM Class: I.2.9

arXiv:2305.16043 [pdf, other]

Ordered and Binary Speaker Embedding

Authors: Jiaying Wang, Xianglong Wang, Namin Wang, Lantian Li, Dong Wang

Abstract: Modern speaker recognition systems represent utterances by embedding vectors. Conventional embedding vectors are dense and non-structural. In this paper, we propose an ordered binary embedding approach that sorts the dimensions of the embedding vector via a nested dropout and converts the sorted vectors to binary codes via Bernoulli sampling. The resultant ordered binary codes offer some important… ▽ More Modern speaker recognition systems represent utterances by embedding vectors. Conventional embedding vectors are dense and non-structural. In this paper, we propose an ordered binary embedding approach that sorts the dimensions of the embedding vector via a nested dropout and converts the sorted vectors to binary codes via Bernoulli sampling. The resultant ordered binary codes offer some important merits such as hierarchical clustering, reduced memory usage, and fast retrieval. These merits were empirically verified by comprehensive experiments on a speaker identification task with the VoxCeleb and CN-Celeb datasets. △ Less

Submitted 25 May, 2023; originally announced May 2023.

Comments: to be published in INTERSPEECH 2023

arXiv:2304.03708 [pdf, other]

Efficient automatic segmentation for multi-level pulmonary arteries: The PARSE challenge

Authors: Gongning Luo, Kuanquan Wang, Jun Liu, Shuo Li, Xinjie Liang, Xiangyu Li, Shaowei Gan, Wei Wang, Suyu Dong, Wenyi Wang, Pengxin Yu, Enyou Liu, Hongrong Wei, Na Wang, Jia Guo, Huiqi Li, Zhao Zhang, Ziwei Zhao, Na Gao, Nan An, Ashkan Pakzad, Bojidar Rangelov, Jiaqi Dou, Song Tian, Zeyu Liu , et al. (5 additional authors not shown)

Abstract: Efficient automatic segmentation of multi-level (i.e. main and branch) pulmonary arteries (PA) in CTPA images plays a significant role in clinical applications. However, most existing methods concentrate only on main PA or branch PA segmentation separately and ignore segmentation efficiency. Besides, there is no public large-scale dataset focused on PA segmentation, which makes it highly challengi… ▽ More Efficient automatic segmentation of multi-level (i.e. main and branch) pulmonary arteries (PA) in CTPA images plays a significant role in clinical applications. However, most existing methods concentrate only on main PA or branch PA segmentation separately and ignore segmentation efficiency. Besides, there is no public large-scale dataset focused on PA segmentation, which makes it highly challenging to compare the different methods. To benchmark multi-level PA segmentation algorithms, we organized the first \textbf{P}ulmonary \textbf{AR}tery \textbf{SE}gmentation (PARSE) challenge. On the one hand, we focus on both the main PA and the branch PA segmentation. On the other hand, for better clinical application, we assign the same score weight to segmentation efficiency (mainly running time and GPU memory consumption during inference) while ensuring PA segmentation accuracy. We present a summary of the top algorithms and offer some suggestions for efficient and accurate multi-level PA automatic segmentation. We provide the PARSE challenge as open-access for the community to benchmark future algorithm developments at \url{https://parse2022.grand-challenge.org/Parse2022/}. △ Less

Submitted 7 April, 2023; originally announced April 2023.

arXiv:2303.13631 [pdf, other]

In-depth analysis of music structure as a text network

Authors: **-Rui Tsai, Yen-Ting Chou, Nathan-Christopher Wang, Hui-Ling Chen, Hong-Yue Huang, Zih-Jia Luo, Tzay-Ming Hong

Abstract: Music, enchanting and poetic, permeates every corner of human civilization. Although music is not unfamiliar to people, our understanding of its essence remains limited, and there is still no universally accepted scientific description. This is primarily due to music being regarded as a product of both reason and emotion, making it difficult to define. In this article, we focus on the fundamental… ▽ More Music, enchanting and poetic, permeates every corner of human civilization. Although music is not unfamiliar to people, our understanding of its essence remains limited, and there is still no universally accepted scientific description. This is primarily due to music being regarded as a product of both reason and emotion, making it difficult to define. In this article, we focus on the fundamental elements of music and construct an evolutionary network from the perspective of music as a natural language, aligning with the statistical characteristics of texts. Through this approach, we aim to comprehend the structural differences in music across different periods, enabling a more scientific exploration of music. Relying on the advantages of structuralism, we can concentrate on the relationships and order between the physical elements of music, rather than getting entangled in the blurred boundaries of science and philosophy. The scientific framework we present not only conforms to past conclusions in music, but also serves as a bridge that connects music to natural language processing and knowledge graphs. △ Less

Submitted 2 January, 2024; v1 submitted 21 March, 2023; originally announced March 2023.

Comments: 7 pages, 8 figures

arXiv:2303.09232 [pdf]

Generative Adversarial Network for Personalized Art Therapy in Melanoma Disease Management

Authors: Lennart Jütte, Ning Wang, Bernhard Roth

Abstract: Melanoma is the most lethal type of skin cancer. Patients are vulnerable to mental health illnesses which can reduce the effectiveness of the cancer treatment and the patients adherence to drug plans. It is crucial to preserve the mental health of patients while they are receiving treatment. However, current art therapy approaches are not personal and unique to the patient. We aim to provide a wel… ▽ More Melanoma is the most lethal type of skin cancer. Patients are vulnerable to mental health illnesses which can reduce the effectiveness of the cancer treatment and the patients adherence to drug plans. It is crucial to preserve the mental health of patients while they are receiving treatment. However, current art therapy approaches are not personal and unique to the patient. We aim to provide a well-trained image style transfer model that can quickly generate unique art from personal dermoscopic melanoma images as an additional tool for art therapy in disease management of melanoma. Visual art appreciation as a common form of art therapy in disease management that measurably reduces the degree of psychological distress. We developed a network based on the cycle-consistent generative adversarial network for style transfer that generates personalized and unique artworks from dermoscopic melanoma images. We developed a model that converts melanoma images into unique flower-themed artworks that relate to the shape of the lesion and are therefore personal to the patient. Further, we altered the initial framework and made comparisons and evaluations of the results. With this, we increased the options in the toolbox for art therapy in disease management of melanoma. The development of an easy-to-use user interface ensures the availability of the approach to stakeholders. The transformation of melanoma into flower-themed artworks is achieved by the proposed model and the graphical user interface. This contribution opens a new field of GANs in art therapy and could lead to more personalized disease management. △ Less

Submitted 20 March, 2023; v1 submitted 16 March, 2023; originally announced March 2023.

arXiv:2211.16694 [pdf, ps, other]

MSV Challenge 2022: NPU-HC Speaker Verification System for Low-resource Indian Languages

Authors: Yue Li, Li Zhang, Namin Wang, Jie Liu, Lei Xie

Abstract: This report describes the NPU-HC speaker verification system submitted to the O-COCOSDA Multi-lingual Speaker Verification (MSV) Challenge 2022, which focuses on develo** speaker verification systems for low-resource Asian languages. We participate in the I-MSV track, which aims to develop speaker verification systems for various Indian languages. In this challenge, we first explore different ne… ▽ More This report describes the NPU-HC speaker verification system submitted to the O-COCOSDA Multi-lingual Speaker Verification (MSV) Challenge 2022, which focuses on develo** speaker verification systems for low-resource Asian languages. We participate in the I-MSV track, which aims to develop speaker verification systems for various Indian languages. In this challenge, we first explore different neural network frameworks for low-resource speaker verification. Then we leverage vanilla fine-tuning and weight transfer fine-tuning to transfer the out-domain pre-trained models to the in-domain Indian dataset. Specifically, the weight transfer fine-tuning aims to constrain the distance of the weights between the pre-trained model and the fine-tuned model, which takes advantage of the previously acquired discriminative ability from the large-scale out-domain datasets and avoids catastrophic forgetting and overfitting at the same time. Finally, score fusion is adopted to further improve performance. Together with the above contributions, we obtain 0.223% EER on the public evaluation set, ranking 2nd place on the leaderboard. On the private evaluation set, the EER of our submitted system is 2.123% and 0.630% for the constrained and unconstrained sub-tasks of the I-MSV track, leading to the 1st and 3rd place in the ranking, respectively. △ Less

Submitted 1 December, 2022; v1 submitted 29 November, 2022; originally announced November 2022.

Comments: 6pages, submitted to the 9th International Workshop on Vietnamese Language and Speech Processing

arXiv:2211.05256 [pdf, other]

Power Efficient Video Super-Resolution on Mobile NPUs with Deep Learning, Mobile AI & AIM 2022 challenge: Report

Authors: Andrey Ignatov, Radu Timofte, Cheng-Ming Chiang, Hsien-Kai Kuo, Yu-Syuan Xu, Man-Yu Lee, Allen Lu, Chia-Ming Cheng, Chih-Cheng Chen, Jia-Ying Yong, Hong-Han Shuai, Wen-Huang Cheng, Zhuang Jia, Tianyu Xu, Yijian Zhang, Long Bao, Heng Sun, Diankai Zhang, Si Gao, Shaoli Liu, Biao Wu, Xiaofeng Zhang, Chengjian Zheng, Kaidi Lu, Ning Wang , et al. (29 additional authors not shown)

Abstract: Video super-resolution is one of the most popular tasks on mobile devices, being widely used for an automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices. In this Mobile AI challenge, we address this prob… ▽ More Video super-resolution is one of the most popular tasks on mobile devices, being widely used for an automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices. In this Mobile AI challenge, we address this problem and propose the participants to design an end-to-end real-time video super-resolution solution for mobile NPUs optimized for low energy consumption. The participants were provided with the REDS training dataset containing video sequences for a 4X video upscaling task. The runtime and power efficiency of all models was evaluated on the powerful MediaTek Dimensity 9000 platform with a dedicated AI processing unit capable of accelerating floating-point and quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 500 FPS rate and 0.2 [Watt / 30 FPS] power consumption. A detailed description of all models developed in the challenge is provided in this paper. △ Less

Submitted 7 November, 2022; originally announced November 2022.

Comments: arXiv admin note: text overlap with arXiv:2105.08826, arXiv:2105.07809, arXiv:2211.04470, arXiv:2211.03885

arXiv:2211.03038 [pdf, other]

Distinguishable Speaker Anonymization based on Formant and Fundamental Frequency Scaling

Authors: Jixun Yao, Qing Wang, Yi Lei, Pengcheng Guo, Lei Xie, Namin Wang, Jie Liu

Abstract: Speech data on the Internet are proliferating exponentially because of the emergence of social media, and the sharing of such personal data raises obvious security and privacy concerns. One solution to mitigate these concerns involves concealing speaker identities before sharing speech data, also referred to as speaker anonymization. In our previous work, we have developed an automatic speaker ver… ▽ More Speech data on the Internet are proliferating exponentially because of the emergence of social media, and the sharing of such personal data raises obvious security and privacy concerns. One solution to mitigate these concerns involves concealing speaker identities before sharing speech data, also referred to as speaker anonymization. In our previous work, we have developed an automatic speaker verification (ASV)-model-free anonymization framework to protect speaker privacy while preserving speech intelligibility. Although the framework ranked first place in VoicePrivacy 2022 challenge, the anonymization was imperfect, since the speaker distinguishability of the anonymized speech was deteriorated. To address this issue, in this paper, we directly model the formant distribution and fundamental frequency (F0) to represent speaker identity and anonymize the source speech by the uniformly scaling formant and F0. By directly scaling the formant and F0, the speaker distinguishability degradation of the anonymized speech caused by the introduction of other speakers is prevented. The experimental results demonstrate that our proposed framework can improve the speaker distinguishability and significantly outperforms our previous framework in voice distinctiveness. Furthermore, our proposed method also can trade off the privacy-utility by using different scaling factors. △ Less

Submitted 6 November, 2022; originally announced November 2022.

Comments: Submitted to ICASSP 2023

arXiv:2208.03953 [pdf, ps, other]

Intelligent MIMO Detection Using Meta Learning

Authors: Haomiao Huo, **dan Xu, Gege Su, Wei Xu, Ning Wang

Abstract: In a K-best detector for multiple-input-multiple-output(MIMO) systems, the value of K needs to be sufficiently large to achieve near-maximum-likelihood (ML) performance. By treating K as a variable that can be adjusted according to a fitting function of some learnable coefficients, an intelligent MIMO detection network based on deep neural networks (DNN) is proposed to reduce complexity of the det… ▽ More In a K-best detector for multiple-input-multiple-output(MIMO) systems, the value of K needs to be sufficiently large to achieve near-maximum-likelihood (ML) performance. By treating K as a variable that can be adjusted according to a fitting function of some learnable coefficients, an intelligent MIMO detection network based on deep neural networks (DNN) is proposed to reduce complexity of the detection algorithm with little performance degradation. In particular, the proposed intelligent detection algorithm uses meta learning to learn the coefficients of the fitting function for K to circumvent the problem of learning K directly. The idea of network fusion is used to combine the learning results of the meta learning component networks. Simulation results show that the proposed scheme achieves near-ML detection performance while its complexity is close to that of linear detectors. Besides, it also exhibits strong ability of fast training. △ Less

Submitted 8 August, 2022; originally announced August 2022.

arXiv:2207.03430 [pdf, ps, other]

A Novel Unified Conditional Score-based Generative Framework for Multi-modal Medical Image Completion

Authors: Xiangxi Meng, Yuning Gu, Yongsheng Pan, Nizhuan Wang, Peng Xue, Mengkang Lu, Xuming He, Yiqiang Zhan, Dinggang Shen

Abstract: Multi-modal medical image completion has been extensively applied to alleviate the missing modality issue in a wealth of multi-modal diagnostic tasks. However, for most existing synthesis methods, their inferences of missing modalities can collapse into a deterministic map** from the available ones, ignoring the uncertainties inherent in the cross-modal relationships. Here, we propose the Unifie… ▽ More Multi-modal medical image completion has been extensively applied to alleviate the missing modality issue in a wealth of multi-modal diagnostic tasks. However, for most existing synthesis methods, their inferences of missing modalities can collapse into a deterministic map** from the available ones, ignoring the uncertainties inherent in the cross-modal relationships. Here, we propose the Unified Multi-Modal Conditional Score-based Generative Model (UMM-CSGM) to take advantage of Score-based Generative Model (SGM) in modeling and stochastically sampling a target probability distribution, and further extend SGM to cross-modal conditional synthesis for various missing-modality configurations in a unified framework. Specifically, UMM-CSGM employs a novel multi-in multi-out Conditional Score Network (mm-CSN) to learn a comprehensive set of cross-modal conditional distributions via conditional diffusion and reverse generation in the complete modality space. In this way, the generation process can be accurately conditioned by all available information, and can fit all possible configurations of missing modalities in a single network. Experiments on BraTS19 dataset show that the UMM-CSGM can more reliably synthesize the heterogeneous enhancement and irregular area in tumor-induced lesions for any missing modalities. △ Less

Submitted 7 July, 2022; originally announced July 2022.

arXiv:2205.13133 [pdf, other]

Coverage Probability Analysis of RIS-Assisted High-Speed Train Communications

Authors: Changzhu Liu, Ruisi He, Yong Niu, Bo Ai, Zhu Han, Zhangfeng Ma, Meilin Gao, Zhangdui Zhong, Ning Wang

Abstract: Reconfigurable intelligent surface (RIS) has received increasing attention due to its capability of extending cell coverage by reflecting signals toward receivers. This paper considers a RIS-assisted high-speed train (HST) communication system to improve the coverage probability. We derive the closed-form expression of coverage probability. Moreover, we analyze impacts of some key system parameter… ▽ More Reconfigurable intelligent surface (RIS) has received increasing attention due to its capability of extending cell coverage by reflecting signals toward receivers. This paper considers a RIS-assisted high-speed train (HST) communication system to improve the coverage probability. We derive the closed-form expression of coverage probability. Moreover, we analyze impacts of some key system parameters, including transmission power, signal-to-noise ratio threshold, and horizontal distance between base station and RIS. Simulation results verify the efficiency of RIS-assisted HST communications in terms of coverage probability. △ Less

Submitted 25 May, 2022; originally announced May 2022.

Comments: 6 pages, 6 figures,submmited to GlobeCom 2022

arXiv:2205.02524 [pdf, other]

M2R2: Missing-Modality Robust emotion Recognition framework with iterative data augmentation

Authors: Ning Wang

Abstract: This paper deals with the utterance-level modalities missing problem with uncertain patterns on emotion recognition in conversation (ERC) task. Present models generally predict the speaker's emotions by its current utterance and context, which is degraded by modality missing considerably. Our work proposes a framework Missing-Modality Robust emotion Recognition (M2R2), which trains emotion recogni… ▽ More This paper deals with the utterance-level modalities missing problem with uncertain patterns on emotion recognition in conversation (ERC) task. Present models generally predict the speaker's emotions by its current utterance and context, which is degraded by modality missing considerably. Our work proposes a framework Missing-Modality Robust emotion Recognition (M2R2), which trains emotion recognition model with iterative data augmentation by learned common representation. Firstly, a network called Party Attentive Network (PANet) is designed to classify emotions, which tracks all the speakers' states and context. Attention mechanism between speaker with other participants and dialogue topic is used to decentralize dependence on multi-time and multi-party utterances instead of the possible incomplete one. Moreover, the Common Representation Learning (CRL) problem is defined for modality-missing problem. Data imputation methods improved by the adversarial strategy are used here to construct extra features to augment data. Extensive experiments and case studies validate the effectiveness of our methods over baselines for modality-missing emotion recognition on two different datasets. △ Less

Submitted 5 May, 2022; originally announced May 2022.

arXiv:2205.02426 [pdf, other]

doi 10.1109/JSTSP.2022.3173749

Cooperative Reflection and Synchronization Design for Distributed Multiple-RIS Communications

Authors: Yaqiong Zhao, Wei Xu, Xiaohu You, Ning Wang, Huan Sun

Abstract: To reap the promised gain achieved by distributed reconfigurable intelligent surfaces (RISs)-enhanced communications in a wireless network, timing synchronization among these metasurfaces is an essential prerequisite in practice. This paper proposes a unified framework for the joint estimation of the unknown timing offsets and the RIS channel parameters, as well as the design of cooperative reflec… ▽ More To reap the promised gain achieved by distributed reconfigurable intelligent surfaces (RISs)-enhanced communications in a wireless network, timing synchronization among these metasurfaces is an essential prerequisite in practice. This paper proposes a unified framework for the joint estimation of the unknown timing offsets and the RIS channel parameters, as well as the design of cooperative reflection and synchronization algorithm for the distributed multiple-RIS communication. Considering that RIS is usually a passive device with limited capability of signal processing, the individual timing offset and channel gains of each hop of the RIS links cannot be directly estimated. To make the estimation tractable, we propose to estimate the cascaded channels and timing offsets jointly by deriving a maximum likelihood estimator. Furthermore, we theoretically characterize the Cramer-Rao lower bound (CRLB) to evaluate the accuracy of this estimator. By using the proposed estimator and the derived CRLBs, an efficient resynchronization algorithm is devised jointly at the RISs and the destination to compensate the multiple timing offsets. Based on the majorization-minimization framework, the proposed algorithm admits semi-closed and closed form solutions for the RIS reflection matrices and the timing offset equalizer, respectively. Simulation results verify that our theoretical analysis well matches the numerical tests and validate the effectiveness of the proposed resynchronization algorithm. △ Less

Submitted 5 May, 2022; originally announced May 2022.

arXiv:2204.10836 [pdf, other]

doi 10.1038/s41467-022-33407-5

Federated Learning Enables Big Data for Rare Cancer Boundary Detection

Authors: Sarthak Pati, Ujjwal Baid, Brandon Edwards, Micah Sheller, Shih-Han Wang, G Anthony Reina, Patrick Foley, Alexey Gruzdev, Deepthi Karkada, Christos Davatzikos, Chiharu Sako, Satyam Ghodasara, Michel Bilello, Suyash Mohan, Philipp Vollmuth, Gianluca Brugnara, Chandrakanth J Preetha, Felix Sahm, Klaus Maier-Hein, Maximilian Zenk, Martin Bendszus, Wolfgang Wick, Evan Calabrese, Jeffrey Rudie, Javier Villanueva-Meyer , et al. (254 additional authors not shown)

Abstract: Although machine learning (ML) has shown promise in numerous domains, there are concerns about generalizability to out-of-sample data. This is currently addressed by centrally sharing ample, and importantly diverse, data from multiple sites. However, such centralization is challenging to scale (or even not feasible) due to various limitations. Federated ML (FL) provides an alternative to train acc… ▽ More Although machine learning (ML) has shown promise in numerous domains, there are concerns about generalizability to out-of-sample data. This is currently addressed by centrally sharing ample, and importantly diverse, data from multiple sites. However, such centralization is challenging to scale (or even not feasible) due to various limitations. Federated ML (FL) provides an alternative to train accurate and generalizable ML models, by only sharing numerical model updates. Here we present findings from the largest FL study to-date, involving data from 71 healthcare institutions across 6 continents, to generate an automatic tumor boundary detector for the rare disease of glioblastoma, utilizing the largest dataset of such patients ever used in the literature (25,256 MRI scans from 6,314 patients). We demonstrate a 33% improvement over a publicly trained model to delineate the surgically targetable tumor, and 23% improvement over the tumor's entire extent. We anticipate our study to: 1) enable more studies in healthcare informed by large and diverse data, ensuring meaningful results for rare diseases and underrepresented populations, 2) facilitate further quantitative analyses for glioblastoma via performance optimization of our consensus model for eventual public release, and 3) demonstrate the effectiveness of FL at such scale and task complexity as a paradigm shift for multi-site collaborations, alleviating the need for data sharing. △ Less

Submitted 25 April, 2022; v1 submitted 22 April, 2022; originally announced April 2022.

Comments: federated learning, deep learning, convolutional neural network, segmentation, brain tumor, glioma, glioblastoma, FeTS, BraTS

arXiv:2204.08987 [pdf]

Deep learning based closed-loop optimization of geothermal reservoir production

Authors: Nanzhe Wang, Haibin Chang, Xiangzhao Kong, Martin O. Saar, Dongxiao Zhang

Abstract: To maximize the economic benefits of geothermal energy production, it is essential to optimize geothermal reservoir management strategies, in which geologic uncertainty should be considered. In this work, we propose a closed-loop optimization framework, based on deep learning surrogates, for the well control optimization of geothermal reservoirs. In this framework, we construct a hybrid convolutio… ▽ More To maximize the economic benefits of geothermal energy production, it is essential to optimize geothermal reservoir management strategies, in which geologic uncertainty should be considered. In this work, we propose a closed-loop optimization framework, based on deep learning surrogates, for the well control optimization of geothermal reservoirs. In this framework, we construct a hybrid convolution-recurrent neural network surrogate, which combines the convolution neural network (CNN) and long short-term memory (LSTM) recurrent network. The convolution structure can extract spatial information of geologic parameter fields and the recurrent structure can approximate sequence-to-sequence map**. The trained model can predict time-varying production responses (rate, temperature, etc.) for cases with different permeability fields and well control sequences. In the closed-loop optimization framework, production optimization based on the differential evolution (DE) algorithm, and data assimilation based on the iterative ensemble smoother (IES), are performed alternately to achieve real-time well control optimization and geologic parameter estimation as the production proceeds. In addition, the averaged objective function over the ensemble of geologic parameter estimations is adopted to consider geologic uncertainty in the optimization process. Several geothermal reservoir development cases are designed to test the performance of the proposed production optimization framework. The results show that the proposed framework can achieve efficient and effective real-time optimization and data assimilation in the geothermal reservoir production process. △ Less

Submitted 15 April, 2022; originally announced April 2022.

Comments: 37 pages, 24 figures

arXiv:2204.08171 [pdf, other]

Distributed Neural Precoding for Hybrid mmWave MIMO Communications with Limited Feedback

Authors: Kai Wei, **dan Xu, Wei Xu, Ning Wang, Dong Chen

Abstract: Hybrid precoding is a cost-efficient technique for millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) communications. This paper proposes a deep learning approach by using a distributed neural network for hybrid analog-and-digital precoding design with limited feedback. The proposed distributed neural precoding network, called DNet, is committed to achieving two objectives. Fir… ▽ More Hybrid precoding is a cost-efficient technique for millimeter wave (mmWave) massive multiple-input multiple-output (MIMO) communications. This paper proposes a deep learning approach by using a distributed neural network for hybrid analog-and-digital precoding design with limited feedback. The proposed distributed neural precoding network, called DNet, is committed to achieving two objectives. First, the DNet realizes channel state information (CSI) compression with a distributed architecture of neural networks, which enables practical deployment on multiple users. Specifically, this neural network is composed of multiple independent sub-networks with the same structure and parameters, which reduces both the number of training parameters and network complexity. Secondly, DNet learns the calculation of hybrid precoding from reconstructed CSI from limited feedback. Different from existing black-box neural network design, the DNet is specifically designed according to the data form of the matrix calculation of hybrid precoding. Simulation results show that the proposed DNet significantly improves the performance up to nearly 50% compared to traditional limited feedback precoding methods under the tests with various CSI compression ratios. △ Less

Submitted 18 April, 2022; originally announced April 2022.

Comments: 13 pages, 4 figures

arXiv:2204.03889 [pdf, other]

Adding Connectionist Temporal Summarization into Conformer to Improve Its Decoder Efficiency For Speech Recognition

Authors: Nick J. C. Wang, Zongfeng Quan, Shaojun Wang, **g Xiao

Abstract: The Conformer model is an excellent architecture for speech recognition modeling that effectively utilizes the hybrid losses of connectionist temporal classification (CTC) and attention to train model parameters. To improve the decoding efficiency of Conformer, we propose a novel connectionist temporal summarization (CTS) method that reduces the number of frames required for the attention decoder… ▽ More The Conformer model is an excellent architecture for speech recognition modeling that effectively utilizes the hybrid losses of connectionist temporal classification (CTC) and attention to train model parameters. To improve the decoding efficiency of Conformer, we propose a novel connectionist temporal summarization (CTS) method that reduces the number of frames required for the attention decoder fed from the acoustic sequences generated by the encoder, thus reducing operations. However, to achieve such decoding improvements, we must fine-tune model parameters, as cross-attention observations are changed and thus require corresponding refinements. Our final experiments show that, with a beamwidth of 4, the LibriSpeech's decoding budget can be reduced by up to 20% and for FluentSpeech data it can be reduced by 11%, without losing ASR accuracy. An improvement in accuracy is even found for the LibriSpeech "test-other" set. The word error rate (WER) is reduced by 6\% relative at the beam width of 1 and by 3% relative at the beam width of 4. △ Less

Submitted 8 April, 2022; originally announced April 2022.

Comments: Submitted to INTERSPEECH 2022 (5 pages, 2 figures)

arXiv:2204.03879 [pdf, other]

A Study of Different Ways to Use The Conformer Model For Spoken Language Understanding

Authors: Nick J. C. Wang, Shaojun Wang, **g Xiao

Abstract: SLU combines ASR and NLU capabilities to accomplish speech-to-intent understanding. In this paper, we compare different ways to combine ASR and NLU, in particular using a single Conformer model with different ways to use its components, to better understand the strengths and weaknesses of each approach. We find that it is not necessarily a choice between two-stage decoding and end-to-end systems w… ▽ More SLU combines ASR and NLU capabilities to accomplish speech-to-intent understanding. In this paper, we compare different ways to combine ASR and NLU, in particular using a single Conformer model with different ways to use its components, to better understand the strengths and weaknesses of each approach. We find that it is not necessarily a choice between two-stage decoding and end-to-end systems which determines the best system for research or application. System optimization still entails carefully improving the performance of each component. It is difficult to prove that one direction is conclusively better than the other. In this paper, we also propose a novel connectionist temporal summarization (CTS) method to reduce the length of acoustic encoding sequences while improving the accuracy and processing speed of end-to-end models. This method achieves the same intent accuracy as the best two-stage SLU recognition with complicated and time-consuming decoding but does so at lower computational cost. This stacked end-to-end SLU system yields an intent accuracy of 93.97% for the SmartLights far-field set, 95.18% for the close-field set, and 99.71% for FluentSpeech. △ Less

Submitted 8 April, 2022; originally announced April 2022.

Comments: Submitted to INTERSPEECH 2022. (5 pages, 1 figure.)

arXiv:2204.03315 [pdf, ps, other]

Three-Module Modeling For End-to-End Spoken Language Understanding Using Pre-trained DNN-HMM-Based Acoustic-Phonetic Model

Authors: Nick J. C. Wang, Lu Wang, Yandan Sun, Haimei Kang, Dejun Zhang

Abstract: In spoken language understanding (SLU), what the user says is converted to his/her intent. Recent work on end-to-end SLU has shown that accuracy can be improved via pre-training approaches. We revisit ideas presented by Lugosch et al. using speech pre-training and three-module modeling; however, to ease construction of the end-to-end SLU model, we use as our phoneme module an open-source acoustic-… ▽ More In spoken language understanding (SLU), what the user says is converted to his/her intent. Recent work on end-to-end SLU has shown that accuracy can be improved via pre-training approaches. We revisit ideas presented by Lugosch et al. using speech pre-training and three-module modeling; however, to ease construction of the end-to-end SLU model, we use as our phoneme module an open-source acoustic-phonetic model from a DNN-HMM hybrid automatic speech recognition (ASR) system instead of training one from scratch. Hence we fine-tune on speech only for the word module, and we apply multi-target learning (MTL) on the word and intent modules to jointly optimize SLU performance. MTL yields a relative reduction of 40% in intent-classification error rates (from 1.0% to 0.6%). Note that our three-module model is a streaming method. The final outcome of the proposed three-module modeling approach yields an intent accuracy of 99.4% on FluentSpeech, an intent error rate reduction of 50% compared to that of Lugosch et al. Although we focus on real-time streaming methods, we also list non-streaming methods for comparison. △ Less

Submitted 7 April, 2022; originally announced April 2022.

Comments: Published in INTERSPEECH 2021

arXiv:2203.16005 [pdf, other]

Deep Joint Source-Channel Coding for CSI Feedback: An End-to-End Approach

Authors: Jialong Xu, Bo Ai, Ning Wang, Wei Chen

Abstract: The increased throughput brought by MIMO technology relies on the knowledge of channel state information (CSI) acquired in the base station (BS). To make the CSI feedback overhead affordable for the evolution of MIMO technology (e.g., massive MIMO and ultra-massive MIMO), deep learning (DL) is introduced to deal with the CSI compression task. Based on the separation principle in existing communica… ▽ More The increased throughput brought by MIMO technology relies on the knowledge of channel state information (CSI) acquired in the base station (BS). To make the CSI feedback overhead affordable for the evolution of MIMO technology (e.g., massive MIMO and ultra-massive MIMO), deep learning (DL) is introduced to deal with the CSI compression task. Based on the separation principle in existing communication systems, DL based CSI compression is used as source coding. However, this separate source-channel coding (SSCC) scheme is inferior to the joint source-channel coding (JSCC) scheme in the finite blocklength regime. In this paper, we propose a deep joint source-channel coding (DJSCC) based framework for the CSI feedback task. In particular, the proposed method can simultaneously learn from the CSI source and the wireless channel. Instead of truncating CSI via Fourier transform in the delay domain in existing methods, we apply non-linear transform networks to compress the CSI. Furthermore, we adopt an SNR adaption mechanism to deal with the wireless channel variations. The extensive experiments demonstrate the validity, adaptability, and generality of the proposed framework. △ Less

Submitted 6 April, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

Comments: 12 pages, 11 figure

arXiv:2203.09034 [pdf, other]

doi 10.1109/TMI.2022.3201974

GATE: Graph CCA for Temporal SElf-supervised Learning for Label-efficient fMRI Analysis

Authors: Liang Peng, Nan Wang, Jie Xu, Xiaofeng Zhu, Xiaoxiao Li

Abstract: In this work, we focus on the challenging task, neuro-disease classification, using functional magnetic resonance imaging (fMRI). In population graph-based disease analysis, graph convolutional neural networks (GCNs) have achieved remarkable success. However, these achievements are inseparable from abundant labeled data and sensitive to spurious signals. To improve fMRI representation learning and… ▽ More In this work, we focus on the challenging task, neuro-disease classification, using functional magnetic resonance imaging (fMRI). In population graph-based disease analysis, graph convolutional neural networks (GCNs) have achieved remarkable success. However, these achievements are inseparable from abundant labeled data and sensitive to spurious signals. To improve fMRI representation learning and classification under a label-efficient setting, we propose a novel and theory-driven self-supervised learning (SSL) framework on GCNs, namely Graph CCA for Temporal self-supervised learning on fMRI analysis GATE. Concretely, it is demanding to design a suitable and effective SSL strategy to extract formation and robust features for fMRI. To this end, we investigate several new graph augmentation strategies from fMRI dynamic functional connectives (FC) for SSL training. Further, we leverage canonical-correlation analysis (CCA) on different temporal embeddings and present the theoretical implications. Consequently, this yields a novel two-step GCN learning procedure comprised of (i) SSL on an unlabeled fMRI population graph and (ii) fine-tuning on a small labeled fMRI dataset for a classification task. Our method is tested on two independent fMRI datasets, demonstrating superior performance on autism and dementia diagnosis. △ Less

Submitted 27 August, 2022; v1 submitted 16 March, 2022; originally announced March 2022.

Journal ref: IEEE Transactions on Medical Imaging 2022

arXiv:2203.00400 [pdf, other]

RIS-Assisted Quasi-Static Broad Coverage for Wideband mmWave Massive MIMO Systems

Authors: Muxin He, **dan Xu, Wei Xu, Hong Shen, Ning Wang, Chunming Zhao

Abstract: Reconfigurable intelligent surfaces (RISs) can establish favorable wireless environments to combat the severe attenuation and blockages in millimeter-wave (mmWave) bands. However, to achieve the optimal enhancement of performance, the instantaneous channel state information (CSI) needs to be estimated at the cost of a large overhead that scales with the number of RIS elements and the number of use… ▽ More Reconfigurable intelligent surfaces (RISs) can establish favorable wireless environments to combat the severe attenuation and blockages in millimeter-wave (mmWave) bands. However, to achieve the optimal enhancement of performance, the instantaneous channel state information (CSI) needs to be estimated at the cost of a large overhead that scales with the number of RIS elements and the number of users. In this paper, we design a quasi-static broad coverage at the RIS with the reduced overhead based on the statistical CSI. We propose a design framework to synthesize the power pattern reflected by the RIS that meets the customized requirements of broad coverage. For the communication of broadcast channels, we generalize the broad coverage of the single transmit stream to the scenario of multiple streams. Moreover, we employ the quasi-static broad coverage for a multiuser orthogonal frequency division multiplexing access (OFDMA) system, and derive the analytical expression of the downlink rate, which is proved to increase logarithmically with the power gain reflected by the RIS. By taking into account the overhead of channel estimation, the proposed quasi-static broad coverage even outperforms the design method that optimizes the RIS phases using the instantaneous CSI. Numerical simulations are conducted to verify these observations. △ Less

Submitted 1 March, 2022; originally announced March 2022.

arXiv:2111.10773 [pdf, other]

One-shot Weakly-Supervised Segmentation in Medical Images

Authors: Wenhui Lei, Qi Su, Ran Gu, Na Wang, Xinglong Liu, Guotai Wang, Xiaofan Zhang, Shaoting Zhang

Abstract: Deep neural networks usually require accurate and a large number of annotations to achieve outstanding performance in medical image segmentation. One-shot segmentation and weakly-supervised learning are promising research directions that lower labeling effort by learning a new class from only one annotated image and utilizing coarse labels instead, respectively. Previous works usually fail to leve… ▽ More Deep neural networks usually require accurate and a large number of annotations to achieve outstanding performance in medical image segmentation. One-shot segmentation and weakly-supervised learning are promising research directions that lower labeling effort by learning a new class from only one annotated image and utilizing coarse labels instead, respectively. Previous works usually fail to leverage the anatomical structure and suffer from class imbalance and low contrast problems. Hence, we present an innovative framework for 3D medical image segmentation with one-shot and weakly-supervised settings. Firstly a propagation-reconstruction network is proposed to project scribbles from annotated volume to unlabeled 3D images based on the assumption that anatomical patterns in different human bodies are similar. Then a dual-level feature denoising module is designed to refine the scribbles based on anatomical- and pixel-level features. After expanding the scribbles to pseudo masks, we could train a segmentation model for the new class with the noisy label training strategy. Experiments on one abdomen and one head-and-neck CT dataset show the proposed method obtains significant improvement over the state-of-the-art methods and performs robustly even under severe class imbalance and low contrast. △ Less

Submitted 21 November, 2021; originally announced November 2021.

arXiv:2110.08622 [pdf, other]

Self-Learned Kernel Low Rank Approach TO Accelerated High Resolution 3D Diffusion MRI

Authors: Abhijit Baul, Nian Wang, Choyi Zhang, Leslie Ying, Yuchou Chang, Ukash Nakarmi

Abstract: Diffusion Magnetic Resonance Imaging (dMRI) is a promising method to analyze the subtle changes in the tissue structure. However, the lengthy acquisition time is a major limitation in the clinical application of dMRI. Different image acquisition techniques such as parallel imaging, compressed sensing, has shortened the prolonged acquisition time but creating high-resolution 3D dMRI slices still re… ▽ More Diffusion Magnetic Resonance Imaging (dMRI) is a promising method to analyze the subtle changes in the tissue structure. However, the lengthy acquisition time is a major limitation in the clinical application of dMRI. Different image acquisition techniques such as parallel imaging, compressed sensing, has shortened the prolonged acquisition time but creating high-resolution 3D dMRI slices still requires a significant amount of time. In this study, we have shown that high-resolution 3D dMRI can be reconstructed from the highly undersampled k-space and q-space data using a Kernel LowRank method. Our proposed method has outperformed the conventional CS methods in terms of both image quality and diffusion maps constructed from the diffusion-weighted images △ Less

Submitted 23 November, 2021; v1 submitted 16 October, 2021; originally announced October 2021.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2108.12074 [pdf, other]

4-bit Quantization of LSTM-based Speech Recognition Models

Authors: Andrea Fasoli, Chia-Yu Chen, Mauricio Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei Zhang, Zoltán Tüske, Kailash Gopalakrishnan

Abstract: We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts). Using a 4-bit integer representation, a naïve quantization approach applied to the LSTM port… ▽ More We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts). Using a 4-bit integer representation, a naïve quantization approach applied to the LSTM portion of these models results in significant Word Error Rate (WER) degradation. On the other hand, we show that minimal accuracy loss is achievable with an appropriate choice of quantizers and initializations. In particular, we customize quantization schemes depending on the local properties of the network, improving recognition performance while limiting computational time. We demonstrate our solution on the Switchboard (SWB) and CallHome (CH) test sets of the NIST Hub5-2000 evaluation. DBLSTM-HMMs trained with 300 or 2000 hours of SWB data achieves $<$0.5% and $<$1% average WER degradation, respectively. On the more challenging RNN-T models, our quantization strategy limits degradation in 4-bit inference to 1.3%. △ Less

Submitted 26 August, 2021; originally announced August 2021.

Comments: 5 pages, 3 figures, Andrea Fasoli and Chia-Yu Chen equally contributed to this work. Paper accepted to Interspeech 2021

ACM Class: I.2.6

arXiv:2107.04986 [pdf, other]

Theoretical Performance Limit for Radar Parameter Estimation

Authors: Dazhuan Xu, Han Zhang, Nan Wang

Abstract: In this paper, we employ the thoughts and methodologies of Shannon's information theory to solve the problem of the optimal radar parameter estimation. Based on a general radar system model, the \textit{a posteriori} probability density function of targets' parameters is derived. Range information (RI) and entropy error (EE) are defined to evaluate the performance. It is proved that acquiring 1 bi… ▽ More In this paper, we employ the thoughts and methodologies of Shannon's information theory to solve the problem of the optimal radar parameter estimation. Based on a general radar system model, the \textit{a posteriori} probability density function of targets' parameters is derived. Range information (RI) and entropy error (EE) are defined to evaluate the performance. It is proved that acquiring 1 bit of the range information is equivalent to reducing estimation deviation by half. The closed-form approximation for the EE is deduced in all signal-to-noise ratio (SNR) regions, which demonstrates that the EE degenerates to the mean square error (MSE) when the SNR is tending to infinity. Parameter estimation theorem is then proved, which claims that the theoretical RI is achievable. The converse claims that there exists no unbiased estimator whose empirical RI is larger than the theoretical RI. Simulation result demonstrates that the theoretical EE is tighter than the commonly used Cramér-Rao bound and the ZivZakai bound. △ Less

Submitted 6 February, 2023; v1 submitted 11 July, 2021; originally announced July 2021.

arXiv:2105.10283 [pdf, other]

A Lightweight Deep Network for Efficient CSI Feedback in Massive MIMO Systems

Authors: Yuyao Sun, Wei Xu, Le Liang, Ning Wang, Geoffery Ye Li, Xiaohu You

Abstract: To fully exploit the advantages of massive multiple-input multiple-output (m-MIMO), accurate channel state information (CSI) is required at the transmitter. However, excessive CSI feedback for large antenna arrays is inefficient and thus undesirable in practical applications. By exploiting the inherent correlation characteristics of complex-valued channel responses in the angular-delay domain, we… ▽ More To fully exploit the advantages of massive multiple-input multiple-output (m-MIMO), accurate channel state information (CSI) is required at the transmitter. However, excessive CSI feedback for large antenna arrays is inefficient and thus undesirable in practical applications. By exploiting the inherent correlation characteristics of complex-valued channel responses in the angular-delay domain, we propose a novel neural network (NN) architecture, namely ENet, for CSI compression and feedback in m-MIMO. Even if the ENet processes the real and imaginary parts of the CSI values separately, its special structure enables the network trained for the real part only to be reused for the imaginary part. The proposed ENet shows enhanced performance with the network size reduced by nearly an order of magnitude compared to the existing NN-based solutions. Experimental results verify the effectiveness of the proposed ENet. △ Less

Submitted 21 May, 2021; originally announced May 2021.

arXiv:2104.05376 [pdf, other]

Drafting and Revision: Laplacian Pyramid Network for Fast High-Quality Artistic Style Transfer

Authors: Tianwei Lin, Zhuoqi Ma, Fu Li, Dongliang He, Xin Li, Errui Ding, Nannan Wang, Jie Li, Xinbo Gao

Abstract: Artistic style transfer aims at migrating the style from an example image to a content image. Currently, optimization-based methods have achieved great stylization quality, but expensive time cost restricts their practical applications. Meanwhile, feed-forward methods still fail to synthesize complex style, especially when holistic global and local patterns exist. Inspired by the common painting p… ▽ More Artistic style transfer aims at migrating the style from an example image to a content image. Currently, optimization-based methods have achieved great stylization quality, but expensive time cost restricts their practical applications. Meanwhile, feed-forward methods still fail to synthesize complex style, especially when holistic global and local patterns exist. Inspired by the common painting process of drawing a draft and revising the details, we introduce a novel feed-forward method named Laplacian Pyramid Network (LapStyle). LapStyle first transfers global style patterns in low-resolution via a Drafting Network. It then revises the local details in high-resolution via a Revision Network, which hallucinates a residual image according to the draft and the image textures extracted by Laplacian filtering. Higher resolution details can be easily generated by stacking Revision Networks with multiple Laplacian pyramid levels. The final stylized image is obtained by aggregating outputs of all pyramid levels. %We also introduce a patch discriminator to better learn local patterns adversarially. Experiments demonstrate that our method can synthesize high quality stylized images in real time, where holistic style patterns are properly transferred. △ Less

Submitted 17 April, 2021; v1 submitted 12 April, 2021; originally announced April 2021.

Comments: Accepted by CVPR 2021. Codes will be released soon on https://github.com/PaddlePaddle/PaddleGAN/

arXiv:2104.00199 [pdf, ps, other]

doi 10.1109/SMC42975.2020.9283038

Data-Driven Optimized Tracking Control Heuristic for MIMO Structures: A Balance System Case Study

Authors: Ning Wang, Mohammed Abouheaf, Wail Gueaieb

Abstract: A data-driven computational heuristic is proposed to control MIMO systems without prior knowledge of their dynamics. The heuristic is illustrated on a two-input two-output balance system. It integrates a self-adjusting nonlinear threshold accepting heuristic with a neural network to compromise between the desired transient and steady state characteristics of the system while optimizing a dynamic c… ▽ More A data-driven computational heuristic is proposed to control MIMO systems without prior knowledge of their dynamics. The heuristic is illustrated on a two-input two-output balance system. It integrates a self-adjusting nonlinear threshold accepting heuristic with a neural network to compromise between the desired transient and steady state characteristics of the system while optimizing a dynamic cost function. The heuristic decides on the control gains of multiple interacting PID control loops. The neural network is trained upon optimizing a weighted-derivative like objective cost function. The performance of the developed mechanism is compared with another controller that employs a combined PID-Riccati approach. One of the salient features of the proposed control schemes is that they do not require prior knowledge of the system dynamics. However, they depend on a known region of stability for the control gains to be used as a search space by the optimization algorithm. The control mechanism is validated using different optimization criteria which address different design requirements. △ Less

Submitted 31 March, 2021; originally announced April 2021.

Journal ref: IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, ON, Canada, 2020, pp. 2365-2370

arXiv:2103.11130 [pdf, other]

doi 10.1109/TSP.2021.3133698

The Level Set Kalman Filter for State Estimation of Continuous-discrete Systems

Authors: Ningyuan Wang, Daniel B. Forger

Abstract: We propose a new extension of Kalman filtering for continuous-discrete systems with nonlinear state-space models that we name as the level set Kalman filter (LSKF). The LSKF assumes the probability distribution can be approximated as a Gaussian, and updates the Gaussian distribution through a time-update step and a measurement-update step. The LSKF improves the time-update step when compared to ex… ▽ More We propose a new extension of Kalman filtering for continuous-discrete systems with nonlinear state-space models that we name as the level set Kalman filter (LSKF). The LSKF assumes the probability distribution can be approximated as a Gaussian, and updates the Gaussian distribution through a time-update step and a measurement-update step. The LSKF improves the time-update step when compared to existing methods, such as the continuous-discrete cubature Kalman filter (CD-CKF) by reformulating the underlying Fokker-Planck equation as an ordinary differential equation for the Gaussian, thereby avoiding expansion in time. Together with a carefully picked measurement-update method, numerical experiments show that the LSKF has a consistent performance improvement over CD-CKF for a range of parameters, while also simplifies the implementation, as no user-defined timestep subdivision between measurements is required, and the spatial derivatives of the drift function are not explicitly needed. △ Less

Submitted 13 December, 2021; v1 submitted 20 March, 2021; originally announced March 2021.

Showing 1–50 of 66 results for author: Wang, N