Search | arXiv e-print repository

Spatial-temporal Hierarchical Reinforcement Learning for Interpretable Pathology Image Super-Resolution

Authors: Wenting Chen, Jie Liu, Tommy W. S. Chow, Yixuan Yuan

Abstract: Pathology images are essential for accurately interpreting lesion cells in cytopathology screening, but acquiring high-resolution digital slides requires specialized equipment and long scanning times. Though super-resolution (SR) techniques can alleviate this problem, existing deep learning models recover pathology images in a black-box manner, which can lead to untruthful biological details and misdiagnosis. Additionally, current methods allocate the same computational resources to recover each pixel of a pathology image, leading to sub-optimal recovery because pathology images vary widely. In this paper, we propose the first hierarchical reinforcement learning framework, named Spatial-Temporal hierARchical Reinforcement Learning (STAR-RL), mainly to address the aforementioned issues in the pathology image super-resolution problem. We reformulate the SR problem as a Markov decision process of interpretable operations and adopt a hierarchical recovery mechanism at the patch level to avoid sub-optimal recovery. Specifically, the higher-level spatial manager is proposed to pick out the most corrupted patch for the lower-level patch worker. Moreover, the higher-level temporal manager is advanced to evaluate the selected patch and determine whether the optimization should be stopped earlier, thereby avoiding the over-processing problem. Under the guidance of the spatial-temporal managers, the lower-level patch worker processes the selected patch with pixel-wise interpretable actions at each time step. Experimental results on medical images degraded by different kernels show the effectiveness of STAR-RL. Furthermore, STAR-RL improves tumor diagnosis by a large margin and shows generalizability under various degradations. The source code is available at https://github.com/CUHK-AIM-Group/STAR-RL.

Submitted 26 June, 2024; originally announced June 2024.

Comments: Accepted to IEEE TRANSACTIONS ON MEDICAL IMAGING (TMI)

arXiv:2406.12270 [pdf, other]

Sparse MIMO for ISAC: New Opportunities and Challenges

Authors: Xinrui Li, Hongqi Min, Yong Zeng, Shi Jin, Linglong Dai, Yifei Yuan, Rui Zhang

Abstract: Multiple-input multiple-output (MIMO) has been a key technology of wireless communications for decades. A typical MIMO system employs antenna arrays with the inter-antenna spacing being half of the signal wavelength, which we term compact MIMO. Looking forward towards the future sixth-generation (6G) mobile communication networks, MIMO systems will achieve even finer spatial resolution to not only enhance the spectral efficiency of wireless communications, but also enable more accurate wireless sensing. To this end, by removing the restriction of half-wavelength antenna spacing, sparse MIMO has been proposed as a new architecture that is able to significantly enlarge the array aperture as compared to conventional compact MIMO with the same number of array elements. In addition, sparse MIMO leads to a new form of virtual MIMO systems for sensing with their virtual apertures considerably larger than physical apertures. As sparse MIMO is expected to be a viable technology for 6G, we provide in this article a comprehensive overview of it, especially focusing on its appealing advantages for integrated sensing and communication (ISAC) towards 6G. Specifically, assorted sparse MIMO architectures are first introduced, followed by their new benefits as well as challenges. We then discuss the main design issues of sparse MIMO, including beam pattern synthesis, signal processing, grating lobe suppression, beam codebook design, and array geometry optimization. Last, we provide numerical results to evaluate the performance of sparse MIMO for ISAC and point out promising directions for future research.

Submitted 18 June, 2024; originally announced June 2024.
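
The aperture and grating-lobe trade-off mentioned in this abstract can be illustrated with a minimal sketch. It assumes a uniform linear array steered to broadside, which is only one of the sparse geometries such a survey would cover; the function names and the 8-element, 2-wavelength example are hypothetical and not taken from the paper.

```python
import math


def aperture_in_wavelengths(num_elements, spacing_in_wavelengths):
    """Physical aperture of a uniform linear array, in units of the wavelength."""
    return (num_elements - 1) * spacing_in_wavelengths


def grating_lobe_angles_deg(spacing_in_wavelengths):
    """Grating-lobe directions (degrees) for a broadside-steered uniform array.

    Extra main-lobe copies appear where sin(theta) = m / (d / lambda)
    for nonzero integers m, as long as the right-hand side is within [-1, 1].
    """
    angles = []
    m = 1
    while m / spacing_in_wavelengths <= 1.0 + 1e-12:
        s = min(1.0, m / spacing_in_wavelengths)
        theta = math.degrees(math.asin(s))
        angles.extend([-theta, theta])
        m += 1
    return sorted(angles)


# Same number of elements, different spacing (illustrative values only).
compact = aperture_in_wavelengths(8, 0.5)  # half-wavelength "compact" array
sparse = aperture_in_wavelengths(8, 2.0)   # hypothetical sparse array

print(compact, sparse)
print(grating_lobe_angles_deg(0.5))
print([round(a, 6) for a in grating_lobe_angles_deg(2.0)])
```

With the same element count the wider spacing enlarges the aperture fourfold, but it also introduces grating lobes that the half-wavelength array does not have, which is why grating lobe suppression appears among the design issues listed above.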

arXiv:2406.02918 [pdf, other]

U-KAN Makes Strong Backbone for Medical Image Segmentation and Generation

Authors: Chenxin Li, Xinyu Liu, Wuyang Li, Cheng Wang, Hengyu Liu, Yixuan Yuan

Abstract: U-Net has become a cornerstone in various visual applications such as image segmentation and diffusion probability models. While numerous innovative designs and improvements have been introduced by incorporating transformers or MLPs, the networks are still limited by linear pattern modeling and deficient interpretability. To address these challenges, our intuition is inspired by the impressive results of the Kolmogorov-Arnold Networks (KANs) in terms of accuracy and interpretability, which reshape neural network learning via a stack of non-linear learnable activation functions derived from the Kolmogorov-Arnold representation theorem. Specifically, in this paper, we explore the untapped potential of KANs in improving backbones for vision tasks. We investigate, modify and re-design the established U-Net pipeline by integrating the dedicated KAN layers on the tokenized intermediate representation, termed U-KAN. Rigorous medical image segmentation benchmarks verify the superiority of U-KAN, which achieves higher accuracy even with less computation cost. We further delve into the potential of U-KAN as an alternative U-Net noise predictor in diffusion models, demonstrating its applicability in generating task-oriented model architectures. These endeavours unveil valuable insights and shed light on the prospect that U-KAN can serve as a strong backbone for medical image segmentation and generation. Project page: https://yes-ukan.github.io/

Submitted 6 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

arXiv:2405.18356 [pdf, other]

Universal and Extensible Language-Vision Models for Organ Segmentation and Tumor Detection from Abdominal Computed Tomography

Authors: Jie Liu, Yixiao Zhang, Kang Wang, Mehmet Can Yavuz, Xiaoxi Chen, Yixuan Yuan, Haoliang Li, Yang Yang, Alan Yuille, Yucheng Tang, Zongwei Zhou

Abstract: The advancement of artificial intelligence (AI) for organ segmentation and tumor detection is propelled by the growing availability of computed tomography (CT) datasets with detailed, per-voxel annotations. However, these AI models often struggle with flexibility for partially annotated datasets and extensibility for new classes due to limitations in the one-hot encoding, architectural design, and learning scheme. To overcome these limitations, we propose a universal, extensible framework enabling a single model, termed Universal Model, to deal with multiple public datasets and adapt to new classes (e.g., organs/tumors). Firstly, we introduce a novel language-driven parameter generator that leverages language embeddings from large language models, enriching semantic encoding compared with one-hot encoding. Secondly, the conventional output layers are replaced with lightweight, class-specific heads, allowing Universal Model to simultaneously segment 25 organs and six types of tumors and ease the addition of new classes. We train our Universal Model on 3,410 CT volumes assembled from 14 publicly available datasets and then test it on 6,173 CT volumes from four external datasets. Universal Model achieves first place on six CT tasks in the Medical Segmentation Decathlon (MSD) public leaderboard and leading performance on the Beyond The Cranial Vault (BTCV) dataset. In summary, Universal Model exhibits remarkable computational efficiency (6x faster than other dataset-specific models), demonstrates strong generalization across different hospitals, transfers well to numerous downstream tasks, and more importantly, facilitates the extensibility to new classes while alleviating the catastrophic forgetting of previously learned classes. Codes, models, and datasets are available at https://github.com/ljwztc/CLIP-Driven-Universal-Model

Submitted 28 May, 2024; originally announced May 2024.

Comments: Accepted to Medical Image Analysis

arXiv:2405.10825 [pdf, other]

Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities

Authors: Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu

Abstract: Large language models (LLMs) have received considerable attention recently due to their outstanding comprehension and reasoning capabilities, leading to great progress in many fields. The advancement of LLM techniques also offers promising opportunities to automate many tasks in the telecommunication (telecom) field. After pre-training and fine-tuning, LLMs can perform diverse downstream tasks based on human instructions, paving the way to artificial general intelligence (AGI)-enabled 6G. Given the great potential of LLM technologies, this work aims to provide a comprehensive overview of LLM-enabled telecom networks. In particular, we first present LLM fundamentals, including model architecture, pre-training, fine-tuning, inference and utilization, model evaluation, and telecom deployment. Then, we introduce LLM-enabled key techniques and telecom applications in terms of generation, classification, optimization, and prediction problems. Specifically, the LLM-enabled generation applications include telecom domain knowledge, code, and network configuration generation. After that, the LLM-based classification applications involve network security, text, image, and traffic classification problems. Moreover, multiple LLM-enabled optimization techniques are introduced, such as automated reward function design for reinforcement learning and verbal reinforcement learning. Furthermore, for LLM-aided prediction problems, we discuss time-series prediction models and multi-modality prediction problems for telecom. Finally, we highlight the challenges and identify the future directions of LLM-enabled telecom networks.

Submitted 17 May, 2024; originally announced May 2024.

arXiv:2405.00233 [pdf, other]

SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

Authors: Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, Mark D. Plumbley

Abstract: Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data. However, traditional codecs often operate at high bitrates or within narrow domains such as speech and lack the semantic clues required for efficient language modelling. Addressing these challenges, we introduce SemantiCodec, a novel codec designed to compress audio into fewer than a hundred tokens per second across diverse audio types, including speech, general audio, and music, without compromising quality. SemantiCodec features a dual-encoder architecture: a semantic encoder using a self-supervised AudioMAE, discretized using k-means clustering on extensive audio data, and an acoustic encoder to capture the remaining details. The semantic and acoustic encoder outputs are used to reconstruct audio via a diffusion-model-based decoder. SemantiCodec is presented in three variants with token rates of 25, 50, and 100 per second, supporting a range of ultra-low bit rates between 0.31 kbps and 1.43 kbps. Experimental results demonstrate that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality. Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated audio codecs, even at significantly lower bitrates. Our code and demos are available at https://haoheliu.github.io/SemantiCodec/.

Submitted 30 April, 2024; originally announced May 2024.

Comments: Demo and code: https://haoheliu.github.io/SemantiCodec/

arXiv:2404.17806 [pdf, other]

T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining

Authors: Yi Yuan, Zhuo Chen, Xubo Liu, Haohe Liu, Xuenan Xu, Dongya Jia, Yuanzhe Chen, Mark D. Plumbley, Wenwu Wang

Abstract: Contrastive language-audio pretraining (CLAP) has been developed to align the representations of audio and language, achieving remarkable performance in retrieval and classification tasks. However, current CLAP struggles to capture temporal information within audio and text features, presenting substantial limitations for tasks such as audio retrieval and generation. To address this gap, we introduce T-CLAP, a temporal-enhanced CLAP model. We use Large Language Models (LLMs) and mixed-up strategies to generate temporal-contrastive captions for audio clips from extensive audio-text datasets. Subsequently, a new temporal-focused contrastive loss is designed to fine-tune the CLAP model by incorporating these synthetic data. We conduct comprehensive experiments and analysis in multiple downstream tasks. T-CLAP shows improved capability in capturing the temporal relationship of sound events and outperforms state-of-the-art models by a significant margin.

Submitted 27 April, 2024; originally announced April 2024.

Comments: Preprint submitted to IEEE MLSP 2024

arXiv:2404.03068 [pdf, other]

Multiple UAV-Assisted Cooperative DF Relaying in Multi-User Massive MIMO IoT Systems

Authors: Mobeen Mahmood, Yicheng Yuan, Tho Le-Ngoc

Abstract: This work considers a multi-user massive multiple-input multiple-output (MU-mMIMO) Internet-of-Things (IoT) system, where multiple unmanned aerial vehicles (UAVs) operating as decode-and-forward (DF) relays connect the base station (BS) to a large number of IoT devices. To maximize the total achievable rate, we propose a novel joint optimization problem of hybrid beamforming (HBF), multiple UAV relay positioning, and power allocation (PA) to multiple IoT users. The study adopts a geometry-based millimeter-wave (mmWave) channel model for both links and utilizes sequential optimization based on K-means UAV-user association. The radio frequency (RF) stages are designed based on the slow time-varying angular information, while the baseband (BB) stages are designed utilizing the reduced-dimension effective channel matrices. The illustrative results show that multiple UAV-assisted cooperative relaying systems outperform a single UAV system in practical user distributions. Moreover, compared to fixed positions and equal PA of UAVs and BS, the joint optimization of UAV location and PA substantially enhances the total achievable rate.

Submitted 3 April, 2024; originally announced April 2024.

Comments: This paper has been accepted for publication in IEEE ICC 2024. arXiv admin note: text overlap with arXiv:2309.11748

arXiv:2404.01611 [pdf]

Audio Simulation for Sound Source Localization in Virtual Environment

Authors: Yi Di Yuan, Swee Liang Wong, Jonathan Pan

Abstract: Non-line-of-sight localization in signal-deprived environments is a challenging yet pertinent problem. Acoustic methods in such predominantly indoor scenarios encounter difficulty due to the reverberant nature of these environments. In this study, we aim to locate sound sources to specific locations within a virtual environment by leveraging physically grounded sound propagation simulations and machine learning methods. This process attempts to overcome the issue of data insufficiency to localize sound sources to their location of occurrence, especially in post-event localization. We achieve an F1-score of 0.786 ± 0.0136 using an audio transformer spectrogram approach.

Submitted 1 April, 2024; originally announced April 2024.

Comments: 2024 IEEE World Forum on Public Safety Technology

arXiv:2403.07390 [pdf, other]

Learning Correction Errors via Frequency-Self Attention for Blind Image Super-Resolution

Authors: Haochen Sun, Yan Yuan, Lijuan Su, Haotian Shao

Abstract: Previous approaches for blind image super-resolution (SR) have relied on degradation estimation to restore high-resolution (HR) images from their low-resolution (LR) counterparts. However, accurate degradation estimation poses significant challenges. The SR model's incompatibility with degradation estimation methods, particularly the Correction Filter, may significantly impair performance as a result of correction errors. In this paper, we introduce a novel blind SR approach that focuses on Learning Correction Errors (LCE). Our method employs a lightweight Corrector to obtain a corrected low-resolution (CLR) image. Subsequently, within an SR network, we jointly optimize SR performance by utilizing both the original LR image and the frequency learning of the CLR image. Additionally, we propose a new Frequency-Self Attention block (FSAB) that enhances the global information utilization ability of Transformer. This block integrates both self-attention and frequency spatial attention mechanisms. Extensive ablation and comparison experiments conducted across various settings demonstrate the superiority of our method in terms of visual quality and accuracy. Our approach effectively addresses the challenges associated with degradation estimation and correction errors, paving the way for more accurate blind image SR.

Submitted 12 March, 2024; originally announced March 2024.

Comments: 16 pages

arXiv:2403.00605 [pdf, other]

Channel Measurements and Modeling for Dynamic Vehicular ISAC Scenarios at 28 GHz

Authors: Zhengyu Zhang, Ruisi He, Bo Ai, Mi Yang, Xuejian Zhang, Ziyi Qi, Yuan Yuan

Abstract: Integrated sensing and communication (ISAC) is a promising technology for 6G, with the goal of providing end-to-end information processing and inherent perception capabilities for future communication systems. Among emerging ISAC application scenarios, vehicular ISAC technologies have the potential to enhance traffic efficiency and safety through integration of communication and synchronized perception abilities. To establish foundational theoretical support for vehicular ISAC system design and standardization, it is necessary to conduct channel measurements and modeling to obtain a deep understanding of the radio propagation. In this paper, a dynamic statistical channel model is proposed for vehicular ISAC scenarios, incorporating Sensing Multipath Components (S-MPCs) and Clutter Multipath Components (C-MPCs), which are identified by the proposed tracking algorithm. Based on actual vehicular ISAC channel measurements at 28 GHz, time-varying sensing characteristics in front, left, and right directions are investigated. To model the dynamic evolution process of the channel, the number of new S-MPCs, lifetimes, initial power and delay positions, dynamic variations within their lifetimes, clustering, power decay, and fading of C-MPCs are statistically characterized. Finally, the paper provides an implementation of the dynamic vehicular ISAC model and validates it by comparing key statistics between measurements and simulations.

Submitted 1 March, 2024; originally announced March 2024.

arXiv:2403.00569 [pdf, other]

Characterization of Wireless Channel Semantics: A New Paradigm

Authors: Zhengyu Zhang, Ruisi He, Mi Yang, Xuejian Zhang, Ziyi Qi, Yuan Yuan, Bo Ai

Abstract: Recently, deep learning enabled semantic communications have been developed to understand transmission content at the semantic level, which realizes effective and accurate information transfer. Toward the vision of sixth generation (6G) networks, wireless devices are expected to have native perception and intelligent capabilities, which associate the wireless channel with surrounding environments from the physical propagation dimension to the semantic information dimension. Inspired by these, we aim to provide a new paradigm on the wireless channel at the semantic level. A channel semantic model and its characterization framework are proposed in this paper. Specifically, a channel semantic model is composed of status semantics, behavior semantics and event semantics. Based on actual channel measurement at 28 GHz, as well as multi-mode data, example results of channel semantic characterization are provided and analyzed, which exhibit reasonable and interpretable semantic information.

Submitted 1 March, 2024; originally announced March 2024.

arXiv:2402.16663 [pdf, other]

UN-SAM: Universal Prompt-Free Segmentation for Generalized Nuclei Images

Authors: Zhen Chen, Qing Xu, Xinyu Liu, Yixuan Yuan

Abstract: In digital pathology, precise nuclei segmentation is pivotal yet challenged by the diversity of tissue types, staining protocols, and imaging conditions. Recently, the segment anything model (SAM) revealed overwhelming performance in natural scenarios and impressive adaptation to medical imaging. Despite these advantages, the reliance on labor-intensive manual annotation as segmentation prompts severely hinders their clinical applicability, especially for nuclei image analysis containing massive cells where dense manual prompts are impractical. To overcome the limitations of current SAM methods while retaining the advantages, we propose the Universal prompt-free SAM framework for Nuclei segmentation (UN-SAM), by providing a fully automated solution with remarkable generalization capabilities. Specifically, to eliminate the labor-intensive requirement of per-nuclei annotations for prompts, we devise a multi-scale Self-Prompt Generation (SPGen) module to revolutionize clinical workflow by automatically generating high-quality mask hints to guide the segmentation tasks. Moreover, to unleash the generalization capability of SAM across a variety of nuclei images, we devise a Domain-adaptive Tuning Encoder (DT-Encoder) to seamlessly harmonize visual features with domain-common and domain-specific knowledge, and further devise a Domain Query-enhanced Decoder (DQ-Decoder) by leveraging learnable domain queries for segmentation decoding in different nuclei domains. Extensive experiments prove that UN-SAM, with exceptional performance, surpasses state-of-the-art methods in nuclei instance and semantic segmentation, especially in generalization capability in zero-shot scenarios. The source code is available at https://github.com/CUHK-AIM-Group/UN-SAM.

Submitted 26 February, 2024; originally announced February 2024.

Comments: 10 pages, 6 figures

arXiv:2401.15434 [pdf]

Decentralized Gossip Mutual Learning (GML) for brain tumor segmentation on multi-parametric MRI

Authors: Jingyun Chen, Yading Yuan

Abstract: Federated Learning (FL) enables collaborative model training among medical centers without sharing private data. However, traditional FL risks server failures and suboptimal performance on local data due to the nature of centralized model aggregation. To address these issues, we present Gossip Mutual Learning (GML), a decentralized framework that uses Gossip Protocol for direct peer-to-peer communication. In addition, GML encourages each site to optimize its local model through mutual learning to account for data variations among different sites. For the task of tumor segmentation using 146 cases from four clinical sites in the BraTS 2021 dataset, we demonstrated that GML outperformed local models and achieved similar performance to FedAvg with only 25% communication overhead.

Submitted 27 January, 2024; originally announced January 2024.

Comments: 3 pages, 1 figure, accepted to IEEE EMBS 2023. arXiv admin note: text overlap with arXiv:2401.06180
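
As a rough illustration of what direct peer-to-peer communication under a gossip protocol looks like, the sketch below runs classic gossip averaging on one scalar per site, with no central server. It is not the GML update from the paper, which relies on mutual learning rather than parameter averaging; the site values, round count, and function names are all hypothetical.

```python
import random


def gossip_round(values, rng):
    """One gossip round: each node averages its value with one random peer.

    There is no central server; every exchange involves exactly two nodes.
    """
    n = len(values)
    for i in range(n):
        j = rng.choice([k for k in range(n) if k != i])
        avg = (values[i] + values[j]) / 2.0
        values[i] = values[j] = avg
    return values


def spread(values):
    """Gap between the largest and smallest value held by any node."""
    return max(values) - min(values)


rng = random.Random(0)
# Hypothetical scalar "model parameter" held by each of four sites.
site_params = [0.0, 4.0, 8.0, 12.0]
initial_mean = sum(site_params) / len(site_params)

for _ in range(20):
    gossip_round(site_params, rng)

print(site_params, spread(site_params))
```

Pairwise averaging preserves the network-wide mean while shrinking the disagreement between sites, which is the property that lets gossip schemes replace a central aggregation step.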

arXiv:2401.07012 [pdf]

An ADRC-Incorporated Stochastic Gradient Descent Algorithm for Latent Factor Analysis

Authors: Jinli Li, Ye Yuan

Abstract: A high-dimensional and incomplete (HDI) matrix contains many complex interactions between numerous nodes. A stochastic gradient descent (SGD)-based latent factor analysis (LFA) model is remarkably effective in extracting valuable information from an HDI matrix. However, such a model commonly encounters the problem of slow convergence because a standard SGD algorithm only considers the current learning error to compute the stochastic gradient without considering the historical and future state of the learning error. To address this critical issue, this paper proposes an ADRC-incorporated SGD (ADS) algorithm that refines the instance learning error by considering its historical and future state, following the principle of an ADRC controller. With it, an ADS-based LFA model is further achieved for fast and accurate latent factor analysis on an HDI matrix. Empirical studies on two HDI datasets demonstrate that the proposed model outperforms the state-of-the-art LFA models in terms of computational efficiency and accuracy for predicting the missing data of an HDI matrix.

Submitted 13 January, 2024; originally announced January 2024.

arXiv:2401.06180 [pdf]

Decentralized Gossip Mutual Learning (GML) for automatic head and neck tumor segmentation

Authors: Jingyun Chen, Yading Yuan

Abstract: Federated learning (FL) has emerged as a promising strategy for collaboratively training complicated machine learning models from different medical centers without the need of data sharing. However, the traditional FL relies on a central server to orchestrate the global model training among clients. This makes it vulnerable to the failure of the model server. Meanwhile, the model trained based on the global data property may not yield the best performance on the local data of a particular site due to the variations of data characteristics among them. To address these limitations, we proposed Gossip Mutual Learning (GML), a decentralized collaborative learning framework that employs Gossip Protocol for direct peer-to-peer communication and encourages each site to optimize its local model by leveraging useful information from peers through mutual learning. On the task of tumor segmentation on PET/CT images using the HECKTOR21 dataset with 223 cases from five clinical sites, we demonstrated that GML could improve tumor segmentation performance in terms of Dice Similarity Coefficient (DSC) by 3.2%, 4.6% and 10.4% on site-specific testing cases as compared to three baseline methods: pooled training, FedAvg and individual training, respectively. We also showed that GML has comparable generalization performance to pooled training and FedAvg when applying them on 78 cases from two out-of-sample sites where no case was used for model training. In our experimental setup, GML showcased a sixfold decrease in communication overhead compared to FedAvg, requiring only 16.67% of the total communication overhead.

Submitted 10 January, 2024; originally announced January 2024.

Comments: 6 pages, 1 figure, accepted to SPIE Medical Imaging 2024
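
Note on the metric: the Dice Similarity Coefficient (DSC) quoted in the abstract above is conventionally defined as 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. The following is a minimal sketch of that standard definition on flat binary masks. The function name `dice_coefficient`, the list-based mask layout, and the convention of returning 1.0 for two empty masks are assumptions made for illustration, not details taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice Similarity Coefficient for two flat binary masks (illustrative helper)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked as tumor in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Toy masks: one overlapping voxel, two foreground voxels in each mask.
predicted = [1, 1, 0, 0]
reference = [1, 0, 1, 0]
dsc = dice_coefficient(predicted, reference)
```

With these toy masks the overlap is one voxel out of four foreground voxels in total, so the score is 0.5.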

arXiv:2312.05256 [pdf, other]

Holistic Evaluation of GPT-4V for Biomedical Imaging

Authors: Zhengliang Liu, Hanqi Jiang, Tianyang Zhong, Zihao Wu, Chong Ma, Yiwei Li, Xiaowei Yu, Yutong Zhang, Yi Pan, Peng Shu, Yanjun Lyu, Lu Zhang, Junjie Yao, Peixin Dong, Chao Cao, Zhenxiang Xiao, Jiaqi Wang, Huan Zhao, Shaochen Xu, Yaonai Wei, Jingyuan Chen, Haixing Dai, Peilong Wang, Hao He, Zewei Wang, et al. (25 additional authors not shown)

Abstract: In this paper, we present a large-scale evaluation probing GPT-4V's capabilities and limitations for biomedical image analysis. GPT-4V represents a breakthrough in artificial general intelligence (AGI) for computer vision, with applications in the biomedical domain. We assess GPT-4V's performance across 16 medical imaging categories, including radiology, oncology, ophthalmology, pathology, and more. Tasks include modality recognition, anatomy localization, disease diagnosis, report generation, and lesion detection. The extensive experiments provide insights into GPT-4V's strengths and weaknesses. Results show GPT-4V's proficiency in modality and anatomy recognition but difficulty with disease diagnosis and localization. GPT-4V excels at diagnostic report generation, indicating strong image captioning skills. While promising for biomedical imaging AI, GPT-4V requires further enhancement and validation before clinical deployment. We emphasize responsible development and testing for trustworthy integration of biomedical AGI. This rigorous evaluation of GPT-4V on diverse medical images advances understanding of multimodal large language models (LLMs) and guides future work toward impactful healthcare applications.

Submitted 10 November, 2023; originally announced December 2023.

arXiv:2312.00550 [pdf, ps, other]

Novel 3D Geometry-Based Stochastic Models for Non-Isotropic MIMO Vehicle-to-Vehicle Channels

Authors: Yi Yuan, Cheng-Xiang Wang, Xiang Cheng, Bo Ai, David I. Laurenson

Abstract: This paper proposes a novel three-dimensional (3D) theoretical regular-shaped geometry-based stochastic model (RS-GBSM) and the corresponding sum-of-sinusoids (SoS) simulation model for non-isotropic multiple-input multiple-output (MIMO) vehicle-to-vehicle (V2V) Ricean fading channels. The proposed RS-GBSM, combining line-of-sight (LoS) components, a two-sphere model, and an elliptic-cylinder model, has the ability to study the impact of the vehicular traffic density (VTD) on channel statistics, and jointly considers the azimuth and elevation angles by using the von Mises Fisher distribution. Moreover, a novel parameter computation method is proposed for jointly calculating the azimuth and elevation angles in the SoS channel simulator. Based on the proposed 3D theoretical RS-GBSM and its SoS simulation model, statistical properties are derived and thoroughly investigated. The impact of the elevation angle in the 3D model on key statistical properties is investigated by comparing with those of the corresponding two-dimensional (2D) model. It is demonstrated that the 3D model is more accurate to characterize real V2V channels, in particular for pico cell scenarios. Finally, close agreement is achieved between the theoretical model, SoS simulation model, and simulation results, demonstrating the utility of the proposed models.

Submitted 1 December, 2023; originally announced December 2023.

arXiv:2312.00535 [pdf, other]

RIS-Based On-the-Air Semantic Communications -- a Diffractional Deep Neural Network Approach

Authors: Shuyi Chen, Yingzhe Hui, Yifan Qin, Yueyi Yuan, Weixiao Meng, Xuewen Luo, Hsiao-Hwa Chen

Abstract: Semantic communication has gained significant attention recently due to its advantages in achieving higher transmission efficiency by focusing on semantic information instead of bit-level information. However, current AI-based semantic communication methods require digital hardware for implementation. With the rapid advancement on reconfigurable intelligence surfaces (RISs), a new approach called on-the-air diffractional deep neural networks (D$^2$NN) can be utilized to enable semantic communications on the wave domain. This paper proposes a new paradigm of RIS-based on-the-air semantic communications, where the computational process occurs inherently as wireless signals pass through RISs. We present the system model and discuss the data and control flows of this scheme, followed by a performance analysis using image transmission as an example. In comparison to traditional hardware-based approaches, RIS-based semantic communications offer appealing features, such as light-speed computation, low computational power requirements, and the ability to handle multiple tasks simultaneously.

Submitted 1 December, 2023; originally announced December 2023.

Comments: 17 pages, 5 figures, accepted by IEEE WCM

arXiv:2311.07630 [pdf, other]

Cross-modal Generative Model for Visual-Guided Binaural Stereo Generation

Authors: Zhaojian Li, Bin Zhao, Yuan Yuan

Abstract: Binaural stereo audio is recorded by imitating the way the human ear receives sound, which provides people with an immersive listening experience. Existing approaches leverage autoencoders and directly exploit visual spatial information to synthesize binaural stereo, resulting in a limited representation of visual guidance. For the first time, we propose a visually guided generative adversarial approach for generating binaural stereo audio from mono audio. Specifically, we develop a Stereo Audio Generation Model (SAGM), which utilizes shared spatio-temporal visual information to guide the generator and the discriminator to work separately. The shared visual information is updated alternately in the generative adversarial stage, allowing the generator and discriminator to deliver their respective guided knowledge while visually sharing. The proposed method learns bidirectional complementary visual information, which facilitates the expression of visual guidance in generation. In addition, spatial perception is a crucial attribute of binaural stereo audio, and thus the evaluation of stereo spatial perception is essential. However, previous metrics failed to measure the spatial perception of audio. To this end, a metric to measure the spatial perception of audio is proposed for the first time. The proposed metric is capable of measuring the magnitude and direction of spatial perception in the temporal dimension. Further, considering its function, it is feasible to utilize it instead of demanding user studies to some extent. The proposed method achieves state-of-the-art performance on 2 datasets and 5 evaluation metrics. Qualitative experiments and user studies demonstrate that the method generates space-realistic stereo audio.

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2309.11642 [pdf]

High-content stimulated Raman histology of human breast cancer

Authors: Hongli Ni, Chinmayee Prabhu Dessai, Haonan Lin, Wei Wang, Shaoxiong Chen, Yuhao Yuan, Xiaowei Ge, Jianpeng Ao, Nolan Vild, Ji-Xin Cheng

Abstract: Histological examination is crucial for cancer diagnosis, including hematoxylin and eosin (H&E) staining for mapping morphology and immunohistochemistry (IHC) staining for revealing chemical information. Recently developed two-color stimulated Raman histology could bypass the complex tissue processing to mimic H&E-like morphology. Yet, the underlying chemical features are not revealed, compromising the effectiveness of prognostic stratification. Here, we present a high-content stimulated Raman histology (HC-SRH) platform that provides both morphological and chemical information for cancer diagnosis based on un-stained breast tissues. Through spectral unmixing in the C-H vibration window, HC-SRH can map unsaturated lipids, cellular protein, extracellular matrix, saturated lipid, and water in breast tissue. In this way, HC-SRH provides excellent contrast for various tissue components. Considering rapidness is important in clinical trials, we implemented spectral selective sampling to boost the speed of HC-SRH by one order. We also successfully demonstrated the HC-SRH in a clinical-compatible fiber laser-based SRS microscopy. With the widely rapid tuning capability of the advanced fiber laser, a clear chemical contrast of nucleic acid and solid-state ester is shown in the fingerprint result.

Submitted 20 September, 2023; originally announced September 2023.

Comments: 6 figures

arXiv:2309.08051 [pdf, other]

Retrieval-Augmented Text-to-Audio Generation

Authors: Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

Abstract: Despite recent progress in text-to-audio (TTA) generation, we show that the state-of-the-art models, such as AudioLDM, trained on datasets with an imbalanced class distribution, such as AudioCaps, are biased in their generation performance. Specifically, they excel in generating common audio classes while underperforming in the rare ones, thus degrading the overall generation performance. We refer to this problem as long-tailed text-to-audio generation. To address this issue, we propose a simple retrieval-augmented approach for TTA models. Specifically, given an input text prompt, we first leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs. The features of the retrieved audio-text data are then used as additional conditions to guide the learning of TTA models. We enhance AudioLDM with our proposed approach and denote the resulting augmented system as Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a state-of-the-art Frechet Audio Distance (FAD) of 1.37, outperforming the existing approaches by a large margin. Furthermore, we show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types, indicating its potential in TTA tasks.

Submitted 5 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: Accepted by ICASSP 2024

arXiv:2308.14117 [pdf, other]

Cross-Entropy-Based Approach to Multi-Objective Electric Vehicle Charging Infrastructure Planning

Authors: Jinhao Li, Yu Hui Yuan, Qiushi Cui, Hao Wang

Abstract: Pure electric vehicles (PEVs) are increasingly adopted to decarbonize the transport sector and mitigate global warming. However, the inadequate PEV charging infrastructure may hinder the further adoption of PEVs in the large-scale traffic network, which calls for effective planning solutions for the charging station (CS) placement. The deployment of charging infrastructure inevitably increases the load on the associated power distribution network. Therefore, we are motivated to develop a comprehensive multi-objective framework for optimal CS placement in a traffic network overlaid by a distribution network, considering multiple stakeholders' interested factors, such as traffic flow, PEV charging time cost, PEV travel distance, and the reliability of the distribution network. We leverage a cross-entropy-based method to solve the optimal CS placement and evaluate our method in a real-world 183-node traffic network in Chengdu, China, overlaid by a 26-region distribution network. It is demonstrated that our work provides various viable planning options favoring different objectives for the stakeholders' decision-making in practice.

Submitted 27 August, 2023; originally announced August 2023.

Comments: IEEE I&CPS Asia 2023 (2023 IEEE IAS Industrial and Commercial Power System Asia Conference)

arXiv:2308.05734 [pdf, other]

AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

Authors: Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley

Abstract: Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate state-of-the-art or competitive performance against previous approaches. Our code, pretrained model, and demo are available at https://audioldm.github.io/audioldm2.

Submitted 11 May, 2024; v1 submitted 10 August, 2023; originally announced August 2023.

Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing. Project page is https://audioldm.github.io/audioldm2

arXiv:2308.05037 [pdf, other]

Separate Anything You Describe

Authors: Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang

Abstract: Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. For reproducibility of this work, we will release the source code, evaluation benchmark and pre-trained model at: https://github.com/Audio-AGI/AudioSep.

Submitted 27 October, 2023; v1 submitted 9 August, 2023; originally announced August 2023.

Comments: Code, benchmark and pre-trained models: https://github.com/Audio-AGI/AudioSep

arXiv:2307.14335 [pdf, other]

WavJourney: Compositional Audio Creation with Large Language Models

Authors: Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang

Abstract: Despite breakthroughs in audio generation models, their capabilities are often confined to domain-specific conditions such as speech transcriptions and audio captions. However, real-world audio creation aims to generate harmonious audio containing various elements such as speech, music, and sound effects with controllable conditions, which is challenging to address using existing audio generation systems. We present WavJourney, a novel framework that leverages Large Language Models (LLMs) to connect various audio models for audio creation. WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions. Specifically, given a text instruction, WavJourney first prompts LLMs to generate an audio script that serves as a structured semantic representation of audio elements. The audio script is then converted into a computer program, where each line of the program calls a task-specific audio generation model or computational operation function. The computer program is then executed to obtain a compositional and interpretable solution for audio creation. Experimental results suggest that WavJourney is capable of synthesizing realistic audio aligned with textually-described semantic, spatial and temporal conditions, achieving state-of-the-art results on text-to-audio generation benchmarks. Additionally, we introduce a new multi-genre story benchmark. Subjective evaluations demonstrate the potential of WavJourney in crafting engaging storytelling audio content from text. We further demonstrate that WavJourney can facilitate human-machine co-creation in multi-round dialogues. To foster future research, the code and synthesized audio are available at: https://audio-agi.github.io/WavJourney_demopage/.

Submitted 26 November, 2023; v1 submitted 26 July, 2023; originally announced July 2023.

Comments: GitHub: https://github.com/Audio-AGI/WavJourney

arXiv:2306.10359 [pdf, other]

Text-Driven Foley Sound Generation With Latent Diffusion Model

Authors: Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Peipei Wu, Mark D. Plumbley, Wenwu Wang

Abstract: Foley sound generation aims to synthesise the background sound for multimedia content. Previous models usually employ a large development set with labels as input (e.g., single numbers or one-hot vector). In this work, we propose a diffusion model based system for Foley sound generation with text conditions. To alleviate the data scarcity issue, our model is initially pre-trained with large-scale datasets and fine-tuned to this task via transfer learning using the contrastive language-audio pretraining (CLAP) technique. We have observed that the feature embedding extracted by the text encoder can significantly affect the performance of the generation model. Hence, we introduce a trainable layer after the encoder to improve the text embedding produced by the encoder. In addition, we further refine the generated waveform by generating multiple candidate audio clips simultaneously and selecting the best one, which is determined in terms of the similarity score between the embedding of the candidate clips and the embedding of the target text label. Using the proposed method, our system ranks 1st among the systems submitted to DCASE Challenge 2023 Task 7. The results of the ablation studies illustrate that the proposed techniques significantly improve sound generation performance. The codes for implementing the proposed system are available online.

Submitted 18 September, 2023; v1 submitted 17 June, 2023; originally announced June 2023.

Comments: Submit to DCASE-workshop 2023, an extension and supersedes the previous technical report arXiv:2305.15905
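
Note on the candidate-selection step: the abstract above describes generating several candidate clips and keeping the one whose embedding scores highest against the target text embedding. The following is a minimal sketch of that selection logic only. It assumes cosine similarity as the score (the abstract says only "similarity score") and uses hand-written toy vectors in place of real CLAP embeddings. The names `cosine_similarity` and `select_best_candidate` are hypothetical and do not come from the authors' code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_candidate(candidate_embeddings, text_embedding):
    """Return (index, score) of the candidate most similar to the text embedding."""
    scores = [cosine_similarity(c, text_embedding) for c in candidate_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]


# Toy 2-D stand-ins for audio and text embeddings (not real model outputs).
candidates = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
target = [0.0, 2.0]
best_index, best_score = select_best_candidate(candidates, target)
```

In this toy case the third candidate points in the same direction as the target, so it is selected. In a real pipeline the vectors would come from the audio and text encoders, and the selected index would identify which generated clip to keep.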

arXiv:2306.10275 [pdf, other]

Multi-Scale Simulation of Complex Systems: A Perspective of Integrating Knowledge and Data

Authors: Huandong Wang, Huan Yan, Can Rong, Yuan Yuan, Fenyu Jiang, Zhenyu Han, Hongjie Sui, Depeng Jin, Yong Li

Abstract: Complex system simulation has been playing an irreplaceable role in understanding, predicting, and controlling diverse complex systems. In the past few decades, the multi-scale simulation technique has drawn increasing attention for its remarkable ability to overcome the challenges of complex system simulation with unknown mechanisms and expensive computational costs. In this survey, we will systematically review the literature on multi-scale simulation of complex systems from the perspective of knowledge and data. Firstly, we will present background knowledge about complex system simulation and the scales in complex systems. Then, we divide the main objectives of multi-scale modeling and simulation into five categories by considering scenarios with clear scale and scenarios with unclear scale, respectively. After summarizing the general methods for multi-scale simulation based on the clues of knowledge and data, we introduce the adopted methods to achieve different objectives. Finally, we introduce the applications of multi-scale simulation in typical matter systems and social systems.

Submitted 17 June, 2023; originally announced June 2023.

arXiv:2305.15905 [pdf, other]

Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7

Authors: Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Mark D. Plumbley, Wenwu Wang

Abstract: Foley sound presents the background sound for multimedia content and the generation of Foley sound involves computationally modelling sound effects with specialized techniques. In this work, we proposed a system for DCASE 2023 challenge task 7: Foley Sound Synthesis. The proposed system is based on AudioLDM, which is a diffusion-based text-to-audio generation model. To alleviate the data-hungry problem, the system is first trained with large-scale datasets and then downstreamed into this DCASE task via transfer learning. Through experiments, we found out that the feature extracted by the encoder can significantly affect the performance of the generation model. Hence, we improve the results by leveraging the input label with related text embedding features obtained by a significant language model, i.e., contrastive language-audio pretraining (CLAP). In addition, we utilize a filtering strategy to further refine the output, i.e. by selecting the best results from the candidate clips generated in terms of the similarity score between the sound and target labels. The overall system achieves a Frechet audio distance (FAD) score of 4.765 on average among all seven different classes, substantially outperforming the baseline system which performs a FAD score of 9.7.

Submitted 15 September, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: DCASE 2023 task 7 technical report, ranked 1st in the challenge

arXiv:2303.03857 [pdf, other]

Leveraging Pre-trained AudioLDM for Text to Sound Generation: A Benchmark Study

Authors: Yi Yuan, Haohe Liu, Jinhua Liang, Xubo Liu, Mark D. Plumbley, Wenwu Wang

Abstract: Deep neural networks have recently achieved breakthroughs in sound generation with text prompts. Despite their promising performance, current text-to-sound generation models face issues on small-scale datasets (e.g., overfitting), significantly limiting their performance. In this paper, we investigate the use of pre-trained AudioLDM, the state-of-the-art model for text-to-audio generation, as the backbone for sound generation. Our study demonstrates the advantages of using pre-trained models for text-to-sound generation, especially in data-scarcity scenarios. In addition, experiments show that different training strategies (e.g., training conditions) may affect the performance of AudioLDM on datasets of different scales. To facilitate future studies, we also evaluate various text-to-sound generation systems on several frequently used datasets under the same evaluation protocols, which allow fair comparisons and benchmarking of these methods on the common ground.

Submitted 11 March, 2023; v1 submitted 7 March, 2023; originally announced March 2023.

Comments: EUSIPCO 2023

arXiv:2303.01927 [pdf, other]

A Generalized Nyquist-Shannon Sampling Theorem Using the Koopman Operator

Authors: Zhexuan Zeng, Ye Yuan

Abstract: The sampling theorem plays a fundamental role for the recovery of continuous-time signals from discrete-time samples in the field of signal processing. The sampling theorem of non-band-limited signals has evolved into one of the most challenging problems. In this work, a generalized sampling theorem -- which builds on the Koopman operator -- is proved for signals in generator-bounded space (Theorem 1). It naturally extends the Nyquist-Shannon sampling theorem in that, 1) for band-limited signals, the lower bounds of sampling frequency given by these two theorems are exactly the same; 2) the Koopman operator-based sampling theorem can also provide a finite bound of sampling frequency for certain types of non-band-limited signals, which cannot be addressed by the Nyquist-Shannon sampling theorem. These types of non-band-limited signals include, but are not limited to, inverse Laplace transforms with a limited imaginary interval of integration, and linear combinations of complex exponential functions. Moreover, the Koopman operator-based reconstruction algorithm is provided with a theoretical convergence result. By this algorithm, the sampling theorem is effectively illustrated on several signals related to sine, exponential and polynomial signals.

Submitted 6 March, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

arXiv:2301.12503 [pdf, other]

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

Authors: Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley

Abstract: Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at https://audioldm.github.io.

Submitted 9 September, 2023; v1 submitted 29 January, 2023; originally announced January 2023.

Comments: Accepted by ICML 2023. Demo and implementation at https://audioldm.github.io. Evaluation toolbox at https://github.com/haoheliu/audioldm_eval

arXiv:2301.00785 [pdf, other]

doi 10.1109/ICCV51070.2023.01934

CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection

Authors: Jie Liu, Yixiao Zhang, Jie-Neng Chen, Junfei Xiao, Yongyi Lu, Bennett A. Landman, Yixuan Yuan, Alan Yuille, Yucheng Tang, Zongwei Zhou

Abstract: An increasing number of public datasets have shown a marked impact on automated organ segmentation and tumor detection. However, due to the small size and partially labeled problem of each dataset, as well as a limited investigation of diverse types of tumors, the resulting models are often limited to segmenting specific organs/tumors and ignore the semantics of anatomical structures, nor can they be extended to novel domains. To address these issues, we propose the CLIP-Driven Universal Model, which incorporates text embedding learned from Contrastive Language-Image Pre-training (CLIP) into segmentation models. This CLIP-based label encoding captures anatomical relationships, enabling the model to learn a structured feature embedding and segment 25 organs and 6 types of tumors. The proposed model is developed from an assembly of 14 datasets, using a total of 3,410 CT scans for training and then evaluated on 6,162 external CT scans from 3 additional datasets. We rank first on the Medical Segmentation Decathlon (MSD) public leaderboard and achieve state-of-the-art results on Beyond The Cranial Vault (BTCV). Additionally, the Universal Model is computationally more efficient (6x faster) compared with dataset-specific models, generalizes better to CT scans from varying sites, and shows stronger transfer learning performance on novel tasks.

Submitted 17 August, 2023; v1 submitted 2 January, 2023; originally announced January 2023.

Comments: ICCV-2023; Rank first in Medical Segmentation Decathlon (MSD) Competition

arXiv:2212.07960 [pdf, other]

Beyond the Metaverse: XV (eXtended meta/uni/Verse)

Authors: Steve Mann, Yu Yuan, Tom Furness, Joseph Paradiso, Thomas Coughlin

Abstract: We propose the term and concept XV (eXtended meta/omni/uni/Verse) as an alternative to, and generalization of, the shared/social virtual reality widely known as "metaverse". XV is shared/social XR. We, and many others, use XR (eXtended Reality) as a broad umbrella term and concept to encompass all the other realities, where X is an "anything" variable, like in mathematics, to denote any reality, X $\in$ \{physical, virtual, augmented, \ldots \} reality. Therefore XV inherits this generality from XR. We begin with a very simple organized taxonomy of all these realities in terms of two simple building blocks: (1) physical reality (PR) as made of "atoms", and (2) virtual reality (VR) as made of "bits". Next we introduce XV as combining all these realities with extended society as a three-dimensional space and taxonomy of (1) "atoms" (physical reality), (2) "bits" (virtuality), and (3) "genes" (sociality). Thus those working in the liminal space between Virtual Reality (VR), Augmented Reality (AR), metaverse, and their various extensions, can describe their work and research as existing in the new field of XV. XV includes the metaverse along with extensions of reality itself like shared seeing in the infrared, ultraviolet, and shared seeing of electromagnetic radio waves, sound waves, and electric currents in motors. For example, workers in a mechanical room can look at a pump and see a superimposed time-varying waveform of the actual rotating magnetic field inside its motor, in real time, while sharing this vision across multiple sites.
Presented at IEEE Standards Association, Behind and Beyond the Metaverse: XV (eXtended meta/uni/Verse), Thurs. Dec. 8, 2022, 2:15-3:30pm, EST.

Submitted 15 December, 2022; originally announced December 2022.

Comments: 9 pages, 10 figures, presented at the IEEE Standards Association panel entitled "Behind and Beyond the Metaverse", Thurs. Dec. 8th, 2022. This work is entitled "Beyond the Metaverse: XV (eXtended meta/uni/Verse)" and was presented that day from 2:15pm to 3:30pm EST (Eastern Standard Time)

arXiv:2212.05808 [pdf]

Z-SSMNet: A Zonal-aware Self-Supervised Mesh Network for Prostate Cancer Detection and Diagnosis in bpMRI

Authors: Yuan Yuan, Euijoon Ahn, Dagan Feng, Mohamad Khadra, Jinman Kim

Abstract: Prostate cancer (PCa) is one of the most prevalent cancers in men and many people around the world die from clinically significant PCa (csPCa). Early diagnosis of csPCa in bi-parametric MRI (bpMRI), which is non-invasive, cost-effective, and more efficient compared to multiparametric MRI (mpMRI), can contribute to precision care for PCa. The rapid rise in artificial intelligence (AI) algorithms is enabling unprecedented improvements in providing decision support systems that can aid in csPCa diagnosis and understanding. However, existing state-of-the-art AI algorithms, which are based on deep learning technology, are often limited to 2D images and so fail to capture inter-slice correlations in 3D volumetric images. The use of 3D convolutional neural networks (CNNs) partly overcomes this limitation, but it does not adapt to the anisotropy of images, resulting in sub-optimal semantic representation and poor generalization. Furthermore, due to the limited amount of labelled bpMRI data and the difficulty of labelling, existing CNNs are built on relatively small datasets, leading to poor performance. To address the limitations identified above, we propose a new Zonal-aware Self-supervised Mesh Network (Z-SSMNet) that adaptively fuses multiple 2D, 2.5D and 3D CNNs to effectively balance representation for sparse inter-slice information and dense intra-slice information in bpMRI. A self-supervised learning (SSL) technique is further introduced to pre-train our network using unlabelled data to learn generalizable image features.
Furthermore, we constrained our network to understand the zonal specific domain knowledge to improve the diagnosis precision of csPCa. Experiments on the PI-CAI Challenge dataset demonstrate that our proposed method achieves better performance for csPCa detection and diagnosis in bpMRI.

Submitted 12 December, 2022; originally announced December 2022.

Comments: 8 pages, 1 figure, PI-CAI challenge

arXiv:2212.03357 [pdf, other]

Contactless Oxygen Monitoring with Gated Transformer

Authors: Hao He, Yuan Yuan, Ying-Cong Chen, Peng Cao, Dina Katabi

Abstract: With the increasing popularity of telehealth, it becomes critical to ensure that basic physiological signals can be monitored accurately at home, with minimal patient overhead. In this paper, we propose a contactless approach for monitoring patients' blood oxygen at home, simply by analyzing the radio signals in the room, without any wearable devices. We extract the patients' respiration from the radio signals that bounce off their bodies and devise a novel neural network that infers a patient's oxygen estimates from their breathing signal. Our model, called Gated BERT-UNet, is designed to adapt to the patient's medical indices (e.g., gender, sleep stages). It has multiple predictive heads and selects the most suitable head via a gate controlled by the person's physiological indices. Extensive empirical results show that our model achieves high accuracy on both medical and radio datasets.

Submitted 6 December, 2022; originally announced December 2022.

Comments: 19 pages, Workshop on Learning from Time Series for Health, NeurIPS 2022

arXiv:2212.00595 [pdf, other]

Ghost-free High Dynamic Range Imaging via Hybrid CNN-Transformer and Structure Tensor

Authors: Yu Yuan, Jiaqi Wu, Zhongliang Jing, Henry Leung, Han Pan

Abstract: Eliminating ghosting artifacts due to moving objects is a challenging problem in high dynamic range (HDR) imaging. In this letter, we present a hybrid model consisting of a convolutional encoder and a Transformer decoder to generate ghost-free HDR images. In the encoder, a context aggregation network and non-local attention block are adopted to optimize multi-scale features and capture both global and local dependencies of multiple low dynamic range (LDR) images. The decoder based on Swin Transformer is utilized to improve the reconstruction capability of the proposed model. Motivated by the phenomenal difference between the presence and absence of artifacts under the field of structure tensor (ST), we integrate the ST information of LDR images as auxiliary inputs of the network and use ST loss to further constrain artifacts. Different from previous approaches, our network is capable of processing an arbitrary number of input LDR images. Qualitative and quantitative experiments demonstrate the effectiveness of the proposed method by comparing it with existing state-of-the-art HDR deghosting models. Codes are available at https://github.com/pandayuanyu/HSTHdr.

Submitted 1 December, 2022; originally announced December 2022.

arXiv:2211.09206 [pdf, other]

Learning to Kindle the Starlight

Authors: Yu Yuan, Jiaqi Wu, Lindong Wang, Zhongliang Jing, Henry Leung, Shuyuan Zhu, Han Pan

Abstract: Capturing highly appreciated star field images is extremely challenging due to light pollution, the requirements of specialized hardware, and the high level of photographic skills needed. Deep learning-based techniques have achieved remarkable results in low-light image enhancement (LLIE) but have not been widely applied to star field image enhancement due to the lack of training data. To address this problem, we construct the first Star Field Image Enhancement Benchmark (SFIEB) that contains 355 real-shot and 854 semi-synthetic star field images, all having the corresponding reference images. Using the presented dataset, we propose the first star field image enhancement approach, namely StarDiffusion, based on conditional denoising diffusion probabilistic models (DDPM). We introduce dynamic stochastic corruptions to the inputs of conditional DDPM to improve the performance and generalization of the network on our small-scale dataset. Experiments show promising results of our method, which outperforms state-of-the-art low-light image enhancement algorithms. The dataset and codes will be open-sourced.

Submitted 16 November, 2022; originally announced November 2022.

arXiv:2211.05309 [pdf]

Generic Cryo-CMOS Device Modeling and EDA-Compatible Platform for Reliable Cryogenic IC Design

Authors: Zhidong Tang, Zewei Wang, Yumeng Yuan, Chang He, Xin Luo, Ao Guo, Renhe Chen, Yongqi Hu, Longfei Yang, Chengwei Cao, Linlin Liu, Liujiang Yu, Ganbing Shang, Yongfeng Cao, Shoumian Chen, Yuhang Zhao, Shaojian Hu, Xufeng Kou

Abstract: This paper outlines the establishment of a generic cryogenic CMOS database in which key electrical parameters and transfer characteristics of the MOSFETs are quantified as functions of device size, temperature/frequency responses. Meanwhile, comprehensive device statistical study is conducted to evaluate the influence of variation and mismatch effects at low temperatures. Furthermore, by incorporating the Cryo-CMOS compact model into the process design kit (PDK), the cryogenic 4 Kb SRAM, 5-bit flash ADC and 8-bit current steering DAC are designed, and their performance is readily investigated and optimized on the EDA-compatible platform, hence laying a solid foundation for large-scale cryogenic IC design.

Submitted 9 February, 2024; v1 submitted 9 November, 2022; originally announced November 2022.

arXiv:2210.04630 [pdf, other]

doi 10.1016/j.optlaseng.2023.107615

Non-invasive color imaging through scattering medium under broadband illumination

Authors: Yunong Sun, Jianbin Liu, Hui Chen, Zhuoran Xi, Yu Zhou, Yuchen He, Huaibin Zheng, Zhuo Xu, Yuan Yuan

Abstract: Due to the complexity of the mixed spectral point spread function within the memory effect range, it is unreliable and slow to use speckle correlation technology for non-invasive imaging through a scattering medium under broadband illumination. The contrast of the speckles will drastically drop as the light source's spectrum width increases. Here, we propose a method for producing the optical transfer function with several speckle frames within the memory effect range to image under broadband illumination. The method can be applied to image amplitude and color objects under white LED illumination. Compared to other approaches of imaging under broadband illumination, such as deep learning and modified phase retrieval, our method can provide more stable results with faster convergence speed, which can be applied in high-speed scattering imaging under natural light illumination.

Submitted 10 October, 2022; originally announced October 2022.

arXiv:2209.12406 [pdf, other]

A heterogeneous group CNN for image super-resolution

Authors: Chunwei Tian, Yanning Zhang, Wangmeng Zuo, Chia-Wen Lin, David Zhang, Yixuan Yuan

Abstract: Convolutional neural networks (CNNs) have obtained remarkable performance via deep architectures. However, these CNNs often achieve poor robustness for image super-resolution (SR) under complex scenes. In this paper, we present a heterogeneous group SR CNN (HGSRCNN) via leveraging structure information of different types to obtain a high-quality image. Specifically, each heterogeneous group block (HGB) of HGSRCNN uses a heterogeneous architecture containing a symmetric group convolutional block and a complementary convolutional block in a parallel way to enhance internal and external relations of different channels for facilitating richer low-frequency structure information of different types. To prevent appearance of obtained redundant features, a refinement block with signal enhancements in a serial way is designed to filter useless information. To prevent loss of original information, a multi-level enhancement mechanism guides a CNN to achieve a symmetric architecture for promoting expressive ability of HGSRCNN. Besides, a parallel up-sampling mechanism is developed to train a blind SR model. Extensive experiments illustrate that the proposed HGSRCNN has obtained excellent SR performance in terms of both quantitative and qualitative analysis. Codes can be accessed at https://github.com/hellloxiaotian/HGSRCNN.

Submitted 26 September, 2022; originally announced September 2022.

arXiv:2209.11451 [pdf, other]

FIAT: Fine-grained Information Audit for Trustless Transborder Data Flow

Authors: Shuhao Zheng, Yanxi Lin, Yang Yu, Ye Yuan, Yongzheng Jia, Xue Liu

Abstract: Auditing the information leakage of latent sensitive features during the transborder data flow has attracted sufficient attention from global digital regulators. However, a technical approach for the audit practice is missing due to two technical challenges. Firstly, there is a lack of theory and tools for measuring the information of sensitive latent features in a dataset. Secondly, the transborder data flow involves multi-stakeholders with diverse interests, which means the audit must be trustless. Despite the tremendous efforts in protecting data privacy, an important issue that has long been neglected is that the transmitted data in data flows can leak other regulated information that is not explicitly contained in the data, leading to unaware information leakage risks. To unveil such risks trustfully before the actual data transfer, we propose FIAT, a Fine-grained Information Audit system for Trustless transborder data flow. In FIAT, we use a learning approach to quantify the amount of information leakage, while the technologies of zero-knowledge proof and smart contracts are applied to provide trustworthy and privacy-preserving auditing results. Experiments show that large information leakage can boost the predictability of uninvolved information using simple machine-learning models, revealing the importance of information auditing. Further performance benchmarking also validates the efficiency and scalability of the FIAT auditing system.

Submitted 10 February, 2023; v1 submitted 23 September, 2022; originally announced September 2022.

Comments: 10 pages, 6 figures, 1 table

arXiv:2209.04093 [pdf, other]

Learning Audio-Visual embedding for Person Verification in the Wild

Authors: Peiwen Sun, Shanshan Zhang, Zishan Liu, Yougen Yuan, Taotao Zhang, Honggang Zhang, Pengfei Hu

Abstract: It has already been observed that audio-visual embedding is more robust than uni-modality embedding for person verification. Here, we propose a novel audio-visual strategy that considers aggregators from a fusion perspective. First, we introduce weight-enhanced attentive statistics pooling for the first time in face verification. We find that a strong correlation exists between modalities during pooling, so joint attentive pooling is proposed, which contains cycle consistency to learn the implicit inter-frame weight. Finally, each modality is fused with a gated attention mechanism to gain robust audio-visual embedding. All the proposed models are trained on the VoxCeleb2 dev dataset and the best system obtains 0.18%, 0.27%, and 0.49% EER on three official trial lists of VoxCeleb1 respectively, which are, to our knowledge, the best published results for person verification.

Submitted 26 October, 2022; v1 submitted 8 September, 2022; originally announced September 2022.

arXiv:2206.14777 [pdf, other]

System-level Simulation of Reconfigurable Intelligent Surface assisted Wireless Communications System

Authors: Qi Gu, Dan Wu, Xin Su, Hanning Wang, Jingyuan Cui, Yifei Yuan

Abstract: Reconfigurable intelligent surface (RIS) is an emerging technique employing metasurface to reflect the signal from the source node to the destination node. By smartly reconfiguring the electromagnetic (EM) properties of the metasurface and adjusting the EM parameters of the reflected radio waves, RIS can turn the uncontrollable propagation environment into an artificially reconfigurable space, and thus, can significantly increase the communications capacity and improve the coverage of the system. In this paper, we investigate the far field channel in which the line-of-sight (LOS) propagation is dominant. We propose an antenna model that can characterize the radiation patterns of realistic RIS elements, and consider the signal power received from the two-hop path through RIS. System-level simulations of network performance under various scenarios and parameter.

Submitted 29 June, 2022; originally announced June 2022.

arXiv:2206.01741 [pdf, other]

Patcher: Patch Transformers with Mixture of Experts for Precise Medical Image Segmentation

Authors: Yanglan Ou, Ye Yuan, Xiaolei Huang, Stephen T. C. Wong, John Volpi, James Z. Wang, Kelvin Wong

Abstract: We present a new encoder-decoder Vision Transformer architecture, Patcher, for medical image segmentation. Unlike standard Vision Transformers, it employs Patcher blocks that segment an image into large patches, each of which is further divided into small patches. Transformers are applied to the small patches within a large patch, which constrains the receptive field of each pixel. We intentionally make the large patches overlap to enhance intra-patch communication. The encoder employs a cascade of Patcher blocks with increasing receptive fields to extract features from local to global levels. This design allows Patcher to benefit from both the coarse-to-fine feature extraction common in CNNs and the superior spatial relationship modeling of Transformers. We also propose a new mixture-of-experts (MoE) based decoder, which treats the feature maps from the encoder as experts and selects a suitable set of expert features to predict the label for each pixel. The use of MoE enables better specializations of the expert features and reduces interference between them during inference. Extensive experiments demonstrate that Patcher outperforms state-of-the-art Transformer- and CNN-based approaches significantly on stroke lesion segmentation and polyp segmentation. Code for Patcher is released with publication to facilitate future research.

Submitted 29 May, 2023; v1 submitted 3 June, 2022; originally announced June 2022.

Comments: MICCAI 2022

arXiv:2205.14548 [pdf, other]

Image Super-resolution with An Enhanced Group Convolutional Neural Network

Authors: Chunwei Tian, Yixuan Yuan, Shichao Zhang, Chia-Wen Lin, Wangmeng Zuo, David Zhang

Abstract: CNNs with strong learning abilities are widely chosen to resolve super-resolution problem. However, CNNs depend on deeper network architectures to improve performance of image super-resolution, which may increase computational cost in general. In this paper, we present an enhanced super-resolution group CNN (ESRGCNN) with a shallow architecture by fully fusing deep and wide channel features to extract more accurate low-frequency information in terms of correlations of different channels in single image super-resolution (SISR). Also, a signal enhancement operation in the ESRGCNN is useful to inherit more long-distance contextual information for resolving long-term dependency. An adaptive up-sampling operation is gathered into a CNN to obtain an image super-resolution model with low-resolution images of different sizes. Extensive experiments report that our ESRGCNN surpasses the state-of-the-arts in terms of SISR performance, complexity, execution speed, image quality evaluation and visual effect in SISR. Code is found at https://github.com/hellloxiaotian/ESRGCNN.

Submitted 31 July, 2022; v1 submitted 28 May, 2022; originally announced May 2022.

arXiv:2204.14021 [pdf, ps, other]

A Sampling Theorem for Exact Identification of Continuous-time Nonlinear Dynamical Systems

Authors: Zhexuan Zeng, Zuogong Yue, Alexandre Mauroy, Jorge Goncalves, Ye Yuan

Abstract: Low sampling frequency challenges the exact identification of the continuous-time (CT) dynamical system from sampled data, even when its model is identifiable. The necessary and sufficient condition is proposed -- which is built from Koopman operator -- to the exact identification of the CT system from sampled data. The condition gives a Nyquist-Shannon-like critical frequency for exact identification of CT nonlinear dynamical systems with Koopman invariant subspaces: 1) it establishes a sufficient condition for a sampling frequency that permits a discretized sequence of samples to discover the underlying system and 2) it also establishes a necessary condition for a sampling frequency that leads to system aliasing that the underlying system is indistinguishable; and 3) the original CT signal does not have to be band-limited as required in the Nyquist-Shannon Theorem. The theoretical criterion has been demonstrated on a number of simulated examples, including linear systems, nonlinear systems with equilibria, and limit cycles.

Submitted 29 April, 2022; originally announced April 2022.

arXiv:2204.10836 [pdf, other]

doi 10.1038/s41467-022-33407-5

Federated Learning Enables Big Data for Rare Cancer Boundary Detection

Authors: Sarthak Pati, Ujjwal Baid, Brandon Edwards, Micah Sheller, Shih-Han Wang, G Anthony Reina, Patrick Foley, Alexey Gruzdev, Deepthi Karkada, Christos Davatzikos, Chiharu Sako, Satyam Ghodasara, Michel Bilello, Suyash Mohan, Philipp Vollmuth, Gianluca Brugnara, Chandrakanth J Preetha, Felix Sahm, Klaus Maier-Hein, Maximilian Zenk, Martin Bendszus, Wolfgang Wick, Evan Calabrese, Jeffrey Rudie, Javier Villanueva-Meyer, et al. (254 additional authors not shown)

Abstract: Although machine learning (ML) has shown promise in numerous domains, there are concerns about generalizability to out-of-sample data. This is currently addressed by centrally sharing ample, and importantly diverse, data from multiple sites. However, such centralization is challenging to scale (or even not feasible) due to various limitations. Federated ML (FL) provides an alternative to train accurate and generalizable ML models, by only sharing numerical model updates. Here we present findings from the largest FL study to-date, involving data from 71 healthcare institutions across 6 continents, to generate an automatic tumor boundary detector for the rare disease of glioblastoma, utilizing the largest dataset of such patients ever used in the literature (25,256 MRI scans from 6,314 patients). We demonstrate a 33% improvement over a publicly trained model to delineate the surgically targetable tumor, and 23% improvement over the tumor's entire extent. We anticipate our study to: 1) enable more studies in healthcare informed by large and diverse data, ensuring meaningful results for rare diseases and underrepresented populations, 2) facilitate further quantitative analyses for glioblastoma via performance optimization of our consensus model for eventual public release, and 3) demonstrate the effectiveness of FL at such scale and task complexity as a paradigm shift for multi-site collaborations, alleviating the need for data sharing.

Submitted 25 April, 2022; v1 submitted 22 April, 2022; originally announced April 2022.

Comments: federated learning, deep learning, convolutional neural network, segmentation, brain tumor, glioma, glioblastoma, FeTS, BraTS

arXiv:2204.08127 [pdf]

Parallel Network with Channel Attention and Post-Processing for Carotid Arteries Vulnerable Plaque Segmentation in Ultrasound Images

Authors: Yanchao Yuan, Cancheng Li, Lu Xu, Ke Zhang, Yang Hua, Jicong Zhang

Abstract: Vulnerable plaques in the carotid arteries are a crucial factor in ultrasound screening for atherosclerosis. However, the plaques are contaminated by various kinds of noise, such as artifacts and speckle noise, and manual segmentation may be time-consuming. This paper proposes an automatic convolutional neural network (CNN) method for plaque segmentation in carotid ultrasound images using a small dataset. First, a parallel network with three independent scale decoders is used as the base segmentation network, and pyramid dilation convolutions are used to enlarge the receptive fields in the three segmentation sub-networks. Second, the three decoders are merged and rectified channel-wise by SENet. Third, at the test stage, the initially segmented plaque is refined by max-contour morphology post-processing to obtain the final plaque. Moreover, three loss functions (Dice loss, SSIM loss, and cross-entropy loss) are compared for plaque segmentation. Test results show that the proposed method with the Dice loss function yields a Dice value of 0.820, an IoU of 0.701, an Acc of 0.969, and a modified Hausdorff distance (MHD) of 1.43 for 30 vulnerable plaque cases, outperforming some conventional CNN-based methods on these metrics. Additionally, an ablation experiment shows the validity of each proposed module. Our study provides a reference for similar research and may be useful in practical applications of plaque segmentation in carotid ultrasound images.

Submitted 17 April, 2022; originally announced April 2022.

Comments: 16 pages, 6 figures

arXiv:2204.04387 [pdf, other]

doi 10.1109/TIP.2022.3221287

Dual-Stage Approach Toward Hyperspectral Image Super-Resolution

Authors: Qiang Li, Yuan Yuan, ** Jia, Qi Wang

Abstract: Hyperspectral images achieve high spectral resolution at the expense of spatial resolution. Improving the spatial resolution without reducing the spectral resolution is a very challenging problem. Motivated by the observation that hyperspectral images exhibit high similarity between adjacent bands over a large spectral range, we explore a new structure for hyperspectral image super-resolution (DualSR), leading to a dual-stage design, i.e., a coarse stage and a fine stage. In the coarse stage, five bands with high similarity in a certain spectral range are divided into three groups, and the current band is guided to study the potential knowledge. Under an alternative spectral fusion mechanism, the coarse SR image is super-resolved band by band. To build the model from a global perspective, an enhanced back-projection method with a spectral angle constraint is developed in the fine stage to learn spatial-spectral consistency, dramatically improving the performance gain. Extensive experiments demonstrate the effectiveness of the proposed coarse and fine stages. In addition, our network produces state-of-the-art results against existing works in terms of spatial reconstruction and spectral fidelity.

Submitted 9 April, 2022; originally announced April 2022.
