Search | arXiv e-print repository

SelfReg-UNet: Self-Regularized UNet for Medical Image Segmentation

Authors: Wenhui Zhu, Xiwen Chen, Peijie Qiu, Mohammad Farazi, Aristeidis Sotiras, Abolfazl Razi, Yalin Wang

Abstract: Since its introduction, UNet has been leading a variety of medical image segmentation tasks. Although numerous follow-up studies have also been dedicated to improving the performance of standard UNet, few have conducted in-depth analyses of the underlying interest pattern of UNet in medical image segmentation. In this paper, we explore the patterns learned in a UNet and observe two important factors that potentially affect its performance: (i) irrelevant features learned due to asymmetric supervision; (ii) feature redundancy in the feature map. To this end, we propose to balance the supervision between encoder and decoder and reduce the redundant information in the UNet. Specifically, we use the feature map that contains the most semantic information (i.e., the last layer of the decoder) to provide additional supervision to other blocks and reduce feature redundancy by leveraging feature distillation. The proposed method can be easily integrated into existing UNet architecture in a plug-and-play fashion with negligible computational cost. The experimental results suggest that the proposed method consistently improves the performance of standard UNets on four medical image segmentation datasets. The code is available at https://github.com/ChongQingNoSubway/SelfReg-UNet

Submitted 21 June, 2024; originally announced June 2024.

Comments: Accepted as a conference paper to 2024 MICCAI

arXiv:2406.10856 [pdf, other]

LEO Satellite Networks Assisted Geo-distributed Data Processing

Authors: Zhiyuan Zhao, Zhe Chen, Zheng Lin, Wenjun Zhu, Kun Qiu, Chaoqun You, Yue Gao

Abstract: Nowadays, the increasing deployment of edge clouds globally provides users with low-latency services. However, connecting an edge cloud to a core cloud via optic cables in terrestrial networks poses significant barriers due to the prohibitively expensive building cost of optic cables. Fortunately, emerging Low Earth Orbit (LEO) satellite networks (e.g., Starlink) offer a more cost-effective solution for the growing number of edge clouds, and hence large volumes of data in edge clouds can be transferred to a core cloud via those networks for time-sensitive big data processing tasks, such as attack detection. However, our measurements show that the state-of-the-art satellite selection algorithms perform poorly for such processing. Therefore, we propose a novel data volume aware satellite selection algorithm, named DVA, to support such big data processing tasks. DVA takes into account both the data size in edge clouds and satellite capacity to finalize the selection, thereby preventing congestion in the access network and reducing transmission duration. Extensive simulations validate that DVA has a significantly lower average access network duration than the state-of-the-art satellite selection algorithms in a LEO satellite emulation platform.

Submitted 16 June, 2024; originally announced June 2024.

Comments: 6 pages, 5 figures

arXiv:2405.19665 [pdf]

A novel fault localization with data refinement for hydroelectric units

Authors: Jialong Huang, Junlin Song, Penglong Lian, Mengjie Gan, Zhiheng Su, Benhao Wang, Wenji Zhu, Xiaomin Pu, Jianxiao Zou, Shicai Fan

Abstract: Due to the scarcity of fault samples and the complexity of non-linear and non-smooth characteristics data in hydroelectric units, most traditional hydroelectric unit fault localization methods struggle to localize faults accurately. To address these problems, a sparse autoencoder (SAE)-generative adversarial network (GAN)-wavelet noise reduction (WNR)-manifold-boosted deep learning (SG-WMBDL) based fault localization method for hydroelectric units is proposed. To overcome the data scarcity, an SAE is embedded into the GAN to generate more high-quality samples in the data generation module. Considering that the signals involve non-linear and non-smooth characteristics, the improved WNR, which combines both soft and hard thresholding, and local linear embedding (LLE) are utilized in the data preprocessing module in order to reduce the noise and effectively capture the local features. In addition, to seek higher performance, a novel Adaptive Boost (AdaBoost) combined with multiple deep learning models is proposed to achieve accurate fault localization. The experimental results show that SG-WMBDL can locate faults for hydroelectric units under a small number of fault samples with non-linear and non-smooth characteristics with higher precision and accuracy compared to other frontier methods, which verifies the effectiveness and practicality of the proposed method.

Submitted 29 May, 2024; originally announced May 2024.

Comments: 6 pages, 4 figures, Conference on Decision and Control (CDC)
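
The abstract of this entry mentions wavelet noise reduction that combines soft and hard thresholding but does not say how the two are combined. As a point of reference only, the following minimal sketch shows the two standard textbook thresholding rules applied to a list of wavelet coefficients. The function names and the threshold parameter `t` are illustrative assumptions, and this is not the paper's improved WNR.

```python
import math

def hard_threshold(coeffs, t):
    """Standard hard thresholding: keep a coefficient only if |c| > t."""
    return [c if abs(c) > t else 0.0 for c in coeffs]

def soft_threshold(coeffs, t):
    """Standard soft thresholding: zero small coefficients, shrink the rest by t."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

# Illustrative coefficients (made up for this sketch, not taken from the paper).
coeffs = [3.0, -0.5, -2.0]
print(hard_threshold(coeffs, 1.0))
print(soft_threshold(coeffs, 1.0))
```

Hard thresholding leaves the surviving coefficients unchanged, while soft thresholding also shrinks them toward zero, which is one common motivation for blending the two rules.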

arXiv:2403.12425 [pdf, other]

Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation

Authors: Jun Yu, Gongpeng Zhao, Yongqi Wang, Zhihong Wei, Yang Zheng, Zerui Zhang, Zhongpeng Cai, Guochen Xie, Jichao Zhu, Wangyuan Zhu

Abstract: This paper presents our approach for the VA (Valence-Arousal) estimation task in the ABAW6 competition. We devised a comprehensive model by preprocessing video frames and audio segments to extract visual and audio features. Through the utilization of Temporal Convolutional Network (TCN) modules, we effectively captured the temporal and spatial correlations between these features. Subsequently, we employed a Transformer encoder structure to learn long-range dependencies, thereby enhancing the model's performance and generalization ability. Our method leverages a multimodal data fusion approach, integrating pre-trained audio and video backbones for feature extraction, followed by TCN-based spatiotemporal encoding and Transformer-based temporal information capture. Experimental results demonstrate the effectiveness of our approach, achieving competitive performance in VA estimation on the AffWild2 dataset.

Submitted 20 March, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

Comments: 8 pages, 3 figures

arXiv:2403.11757 [pdf, other]

Efficient Feature Extraction and Late Fusion Strategy for Audiovisual Emotional Mimicry Intensity Estimation

Authors: Jun Yu, Wangyuan Zhu, Jichao Zhu

Abstract: In this paper, we present the solution to the Emotional Mimicry Intensity (EMI) Estimation challenge, which is part of the 6th Affective Behavior Analysis in-the-wild (ABAW) Competition. The EMI Estimation challenge task aims to evaluate the emotional intensity of seed videos by assessing them from a set of predefined emotion categories (i.e., "Admiration", "Amusement", "Determination", "Empathic Pain", "Excitement" and "Joy"). To tackle this challenge, we extracted rich dual-channel visual features based on ResNet18 and AUs for the video modality and effective single-channel features based on Wav2Vec2.0 for the audio modality. This allowed us to obtain comprehensive emotional features for the audiovisual modality. Additionally, leveraging a late fusion strategy, we averaged the predictions of the visual and acoustic models, resulting in a more accurate estimation of audiovisual emotional mimicry intensity. Experimental results validate the effectiveness of our approach, with the average Pearson's correlation coefficient ($ρ$) across the 6 emotion dimensions on the validation set reaching 0.3288.

Submitted 19 March, 2024; v1 submitted 18 March, 2024; originally announced March 2024.
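
The abstract of this entry describes late fusion by averaging the predictions of the visual and acoustic models, scored by the average Pearson correlation across emotion dimensions. The sketch below illustrates those two steps in plain Python under stated assumptions: predictions are lists of floats aligned by sample, and the function names and the dictionary layout keyed by emotion are hypothetical choices for illustration, not taken from the authors' code.

```python
from math import sqrt

def late_fusion(visual_preds, audio_preds):
    """Average the per-sample predictions of two unimodal models."""
    if len(visual_preds) != len(audio_preds):
        raise ValueError("prediction lists must have the same length")
    return [(v + a) / 2.0 for v, a in zip(visual_preds, audio_preds)]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)

def mean_pearson(preds_by_emotion, labels_by_emotion):
    """Average the correlation over emotion dimensions (dicts: name -> list)."""
    scores = [pearson(preds_by_emotion[k], labels_by_emotion[k])
              for k in labels_by_emotion]
    return sum(scores) / len(scores)
```

Averaging is the simplest late fusion rule: it needs no extra training and treats both modalities as equally reliable, which is an assumption a weighted scheme would relax.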

arXiv:2402.11834 [pdf, ps, other]

Terahertz User-Centric Clustering in the Presence of Beam Misalignment

Authors: Khaled Humadi, Imene Trigui, Wei-Ping Zhu, Wessam Ajib

Abstract: Beam misalignment is one of the main challenges for the design of reliable wireless systems in terahertz (THz) bands. This paper investigates how to apply user-centric base station (BS) clustering as a valuable add-on in THz networks. In particular, to reduce the impact of beam misalignment, a user-centric BS clustering design that provides multi-connectivity via BS cooperation is investigated. The coverage probability is derived by leveraging an accurate approximation of the aggregate interference distribution that captures the effect of beam misalignment and THz fading. The numerical results reveal the impact of beam misalignment with respect to crucial link parameters, such as the transmitter's beam width and the serving cluster size, demonstrating that user-centric BS clustering is a promising enabler of THz networks.

Submitted 18 February, 2024; originally announced February 2024.

arXiv:2402.10388 [pdf]

Improvising Age Verification Technologies in Canada: Technical, Regulatory and Social Dynamics

Authors: Azfar Adib, Wei-Ping Zhu, M. Omair Ahmad

Abstract: Age verification, which is a mandatory legal requirement for delivering certain age-appropriate services or products, has recently been emphasized around the globe to ensure online safety for children. The rapid advancement of artificial intelligence has facilitated the recent development of some cutting-edge age-verification technologies, particularly using biometrics. However, successful deployment and mass acceptance of these technologies are significantly dependent on the corresponding socio-economic and regulatory context. This paper reviews such key dynamics for improvising age-verification technologies in Canada. It is particularly essential for such technologies to be inclusive, transparent, adaptable, privacy-preserving, and secure. Effective collaboration between academia, government, and industry entities can help to meet the growing demands for age-verification services in Canada while maintaining a user-centric approach.

Submitted 15 February, 2024; originally announced February 2024.

Comments: Presented and accepted for publication in the 2023 IEEE International Humanitarian Technologies Conference (IEEE IHTC 2023), November 1 to 3, 2023, Cartagena, Colombia

arXiv:2401.04154 [pdf]

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

Authors: Wentao Zhu

Abstract: Audio and video are the two most common modalities in the mainstream media platforms, e.g., YouTube. To learn from multimodal videos effectively, in this work, we propose a novel audio-video recognition approach termed audio video Transformer, AVT, leveraging the effective spatio-temporal representation by the video Transformer to improve action recognition accuracy. For multimodal fusion, simply concatenating multimodal tokens in a cross-modal Transformer requires large computational and memory resources; instead, we reduce the cross-modality complexity through an audio-video bottleneck Transformer. To improve the learning efficiency of multimodal Transformer, we integrate self-supervised objectives, i.e., audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on Kinetics-Sounds by 8%. AVT also surpasses one of the previous state-of-the-art video Transformers [25] by 10% on VGGSound by leveraging the audio signal. Compared to one of the previous state-of-the-art multimodal methods, MBT [32], AVT is 1.3% more efficient in terms of FLOPs and improves the accuracy by 3.8% on Epic-Kitchens-100.

Submitted 8 January, 2024; originally announced January 2024.

Comments: Accepted by WACV 2024; well-formatted PDF is in https://drive.google.com/file/d/1qvW52lamsvNGMCqPS7q8g8L4NaR_LlbR/view?usp=sharing. arXiv admin note: text overlap with arXiv:2401.04023

arXiv:2401.04023 [pdf]

Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification

Authors: Wentao Zhu

Abstract: In recent years, researchers combine both audio and video signals to deal with challenges where actions are not well represented or captured by visual cues. However, how to effectively leverage the two modalities is still under development. In this work, we develop a multiscale multimodal Transformer (MMT) that leverages hierarchical representation learning. Particularly, MMT is composed of a novel multiscale audio Transformer (MAT) and a multiscale video Transformer [43]. To learn a discriminative cross-modality fusion, we further design multimodal supervised contrastive objectives called audio-video contrastive loss (AVC) and intra-modal contrastive loss (IMC) that robustly align the two modalities. MMT surpasses previous state-of-the-art approaches by 7.3% and 2.1% on Kinetics-Sounds and VGGSound in terms of the top-1 accuracy without external training data. Moreover, the proposed MAT significantly outperforms AST [28] by 22.2%, 4.4% and 4.7% on three public benchmark datasets, and is about 3% more efficient based on the number of FLOPs and 9.8% more efficient based on GPU memory usage.

Submitted 8 January, 2024; originally announced January 2024.

Comments: Accepted by WACV 2024; well-formatted PDF is in https://drive.google.com/file/d/10Zo_ydJZFAm7YsxHDgTjhyc4dEJbW_dk/view?usp=sharing

arXiv:2312.16228 [pdf, other]

Deformable Audio Transformer for Audio Event Detection

Authors: Wentao Zhu

Abstract: Transformers have achieved promising results on a variety of tasks. However, the quadratic complexity in self-attention computation has limited the applications, especially in low-resource settings and mobile or edge devices. Existing works have proposed to exploit hand-crafted attention patterns to reduce computation complexity. However, such hand-crafted patterns are data-agnostic and may not be optimal. Hence, it is likely that relevant keys or values are being reduced, while less important ones are still preserved. Based on this key insight, we propose a novel deformable audio Transformer for audio recognition, named DATAR, where a deformable attention equipped with a pyramid transformer backbone is constructed and learnable. Such an architecture has been proven effective in prediction tasks, e.g., event classification. Moreover, we identify that the deformable attention map computation may over-simplify the input feature, which can be further enhanced. Hence, we introduce a learnable input adaptor to alleviate this issue, and DATAR achieves state-of-the-art performance.

Submitted 7 January, 2024; v1 submitted 24 December, 2023; originally announced December 2023.

Comments: ICASSP 2024. arXiv admin note: substantial text overlap with arXiv:2201.00520 by other authors

arXiv:2312.05786 [pdf, other]

Deep Learning for Joint Design of Pilot, Channel Feedback, and Hybrid Beamforming in FDD Massive MIMO-OFDM Systems

Authors: Junyi Yang, Weifeng Zhu, Shu Sun, Xiaofeng Li, Xingqin Lin, Meixia Tao

Abstract: This letter considers the transceiver design in frequency division duplex (FDD) massive multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) systems for high-quality data transmission. We propose a novel deep learning based framework where the procedures of pilot design, channel feedback, and hybrid beamforming are realized by carefully crafted deep neural networks. All the considered modules are jointly learned in an end-to-end manner, and a graph neural network is adopted to effectively capture interactions between beamformers based on the built graphical representation. Numerical results validate the effectiveness of our method.

Submitted 10 December, 2023; originally announced December 2023.

Comments: 5 pages, 4 figures, accepted by IEEE Communications Letters

arXiv:2312.05557 [pdf, ps, other]

Long-Term Rate-Fairness-Aware Beamforming Based Massive MIMO Systems

Authors: W. Zhu, H. D. Tuan, E. Dutkiewicz, Y. Fang, H. V. Poor, L. Hanzo

Abstract: This is the first treatise on multi-user (MU) beamforming designed for achieving long-term rate-fairness in full-dimensional MU massive multi-input multi-output (m-MIMO) systems. Explicitly, based on the channel covariances, which can be assumed to be known beforehand, we address this problem by optimizing the following objective functions: the users' signal-to-leakage-noise ratios (SLNRs) using SLNR max-min optimization, geometric mean of SLNRs (GM-SLNR) based optimization, and SLNR soft max-min optimization. We develop a convex-solver based algorithm, which invokes a convex subproblem of cubic time-complexity at each iteration for solving the SLNR max-min problem. We then develop closed-form expression based algorithms of scalable complexity for the solution of the GM-SLNR and of the SLNR soft max-min problem. The simulations provided confirm the users' improved-fairness ergodic rate distributions.

Submitted 9 December, 2023; originally announced December 2023.

arXiv:2311.14264 [pdf, ps, other]

An ADMM-Based Geometric Configuration Optimization in RSSD-Based Source Localization By UAVs with Spread Angle Constraint

Authors: Xin Cheng, Weiqiang Zhu, Feng Shu, Jiangzhou Wang

Abstract: Deploying multiple unmanned aerial vehicles (UAVs) to locate a signal-emitting source covers a wide range of military and civilian applications like rescue and target tracking. It is well known that the UAVs-source (sensors-target) geometry, namely geometric configuration, significantly affects the final localization accuracy. This paper focuses on the geometric configuration optimization for received signal strength difference (RSSD)-based passive source localization by drone swarm. Different from prior works, this paper considers a general measuring condition where the spread angle of drone swarm centered on the source is constrained. Subject to this constraint, a geometric configuration optimization problem with the aim of maximizing the determinant of Fisher information matrix (FIM) is formulated. After transforming this problem using matrix theory, an alternating direction method of multipliers (ADMM)-based optimization framework is proposed. To solve the subproblems in this framework, two global optimal solutions based on the Von Neumann matrix trace inequality theorem and majorize-minimize (MM) algorithm are proposed respectively. Finally, the effectiveness as well as the practicality of the proposed ADMM-based optimization algorithm are demonstrated by extensive simulations.

Submitted 23 November, 2023; originally announced November 2023.

arXiv:2310.17155 [pdf, ps, other]

Max-min Rate Optimization of Low-Complexity Hybrid Multi-User Beamforming Maintaining Rate-Fairness

Authors: W. Zhu, H. D. Tuan, E. Dutkiewicz, H. V. Poor, L. Hanzo

Abstract: A wireless network serving multiple users in the millimeter-wave or the sub-terahertz band by a base station is considered. High-throughput multi-user hybrid-transmit beamforming is conceived by maximizing the minimum rate of the users. For the sake of energy-efficient signal transmission, the array-of-subarrays structure is used for analog beamforming relying on low-resolution phase shifters. We develop a convex-solver based algorithm, which iteratively invokes a convex problem of the same beamformer size for its solution. We then introduce the soft max-min rate objective function and develop a scalable algorithm for its optimization. Our simulation results demonstrate the striking fact that soft max-min rate optimization not only approaches the minimum user rate obtained by max-min rate optimization but it also achieves a sum rate similar to that of sum-rate maximization. Thus, the soft max-min rate optimization based beamforming design conceived offers a new technique of simultaneously achieving a high individual quality-of-service for all users and a high total network throughput.

Submitted 26 October, 2023; originally announced October 2023.

arXiv:2310.10095 [pdf, other]

A Multi-Scale Spatial Transformer U-Net for Simultaneously Automatic Reorientation and Segmentation of 3D Nuclear Cardiac Images

Authors: Yangfan Ni, Duo Zhang, Gege Ma, Lijun Lu, Zhongke Huang, Wentao Zhu

Abstract: Accurate reorientation and segmentation of the left ventricle (LV) is essential for the quantitative analysis of myocardial perfusion imaging (MPI), in which one critical step is to reorient the reconstructed transaxial nuclear cardiac images into standard short-axis slices for subsequent image processing. Small-scale LV myocardium (LV-MY) region detection and the diverse cardiac structures of individual patients pose challenges to the LV segmentation operation. To mitigate these issues, we propose an end-to-end model, named multi-scale spatial transformer UNet (MS-ST-UNet), that involves the multi-scale spatial transformer network (MSSTN) and multi-scale UNet (MSUNet) modules to perform simultaneous reorientation and segmentation of the LV region from nuclear cardiac images. The proposed method is trained and tested using two different nuclear cardiac image modalities: 13N-ammonia PET and 99mTc-sestamibi SPECT. We use a multi-scale strategy to generate and extract image features with different scales. Our experimental results demonstrate that the proposed method significantly improves the reorientation and segmentation performance. This joint learning framework promotes mutual enhancement between reorientation and segmentation tasks, leading to cutting-edge performance and an efficient image processing workflow. The proposed end-to-end deep network has the potential to reduce the burden of manual delineation for cardiac images, thereby providing multimodal quantitative analysis assistance for physicists.

Submitted 16 October, 2023; originally announced October 2023.

Comments: 17 pages, 7 figures

arXiv:2308.12198 [pdf, other]

Hierarchical Beam Alignment for Millimeter-Wave Communication Systems: A Deep Learning Approach

Authors: Junyi Yang, Weifeng Zhu, Meixia Tao, Shu Sun

Abstract: Fast and precise beam alignment is crucial for high-quality data transmission in millimeter-wave (mmWave) communication systems, where large-scale antenna arrays are utilized to overcome the severe propagation loss. To tackle the challenging problem, we propose a novel deep learning-based hierarchical beam alignment method for both multiple-input single-output (MISO) and multiple-input multiple-output (MIMO) systems, which learns two tiers of probing codebooks (PCs) and uses their measurements to predict the optimal beam in a coarse-to-fine search manner. Specifically, a hierarchical beam alignment network (HBAN) is developed for MISO systems, which first performs coarse channel measurement using a tier-1 PC, then selects a tier-2 PC for fine channel measurement, and finally predicts the optimal beam based on both coarse and fine measurements. The proposed HBAN is trained in two steps: the tier-1 PC and the tier-2 PC selector are first trained jointly, followed by the joint training of all the tier-2 PCs and beam predictors. Furthermore, an HBAN for MIMO systems is proposed to directly predict the optimal beam pair without performing beam alignment individually at the transmitter and receiver. Numerical results demonstrate that the proposed HBANs are superior to the state-of-the-art methods in both alignment accuracy and signaling overhead reduction.

Submitted 23 August, 2023; originally announced August 2023.

Comments: 15 pages, 16 figures, to appear in Transactions on Wireless Communications. arXiv admin note: text overlap with arXiv:2209.03643

arXiv:2308.04663 [pdf, other]

doi 10.1016/j.media.2024.103199

Classification of lung cancer subtypes on CT images with synthetic pathological priors

Authors: Wentao Zhu, Yuan Jin, Gege Ma, Geng Chen, Jan Egger, Shaoting Zhang, Dimitris N. Metaxas

Abstract: The accurate diagnosis of pathological subtypes for lung cancer is of significant importance for the follow-up treatments and prognosis management. In this paper, we propose a self-generating hybrid feature network (SGHF-Net) for accurately classifying lung cancer subtypes on computed tomography (CT) images. Inspired by studies stating that cross-scale associations exist in the image patterns between the same case's CT images and its pathological images, we developed a pathological feature synthetic module (PFSM), which quantitatively maps cross-modality associations through deep neural networks, to derive the "gold standard" information contained in the corresponding pathological images from CT images. Additionally, we designed a radiological feature extraction module (RFEM) to directly acquire CT image information and integrated it with the pathological priors under an effective feature fusion framework, enabling the entire classification model to generate more indicative and specific pathologically related features and eventually output more accurate predictions. The superiority of the proposed model lies in its ability to self-generate hybrid features that contain multi-modality image information based on a single-modality input. To evaluate the effectiveness, adaptability, and generalization ability of our model, we performed extensive experiments on a large-scale multi-center dataset (i.e., 829 cases from three hospitals) to compare our model and a series of state-of-the-art (SOTA) classification models. The experimental results demonstrated the superiority of our model for lung cancer subtype classification, with significant improvements in terms of accuracy (ACC), area under the curve (AUC), and F1 score.

Submitted 8 August, 2023; originally announced August 2023.

Comments: 16 pages, 7 figures

Journal ref: Medical Image Analysis 95, July 2024, 103199

arXiv:2306.15942 [pdf, other]

Enhanced Neural Beamformer with Spatial Information for Target Speech Extraction

Authors: Aoqi Guo, Junnan Wu, Peng Gao, Wenbo Zhu, Qinwen Guo, Dazhi Gao, Yujun Wang

Abstract: Recently, deep learning-based beamforming algorithms have shown promising performance in target speech extraction tasks. However, most systems do not fully utilize spatial information. In this paper, we propose a target speech extraction network that utilizes spatial information to enhance the performance of the neural beamformer. To achieve this, we first use the UNet-TCN structure to model input features and improve the estimation accuracy of the speech pre-separation module by avoiding information loss caused by direct dimensionality reduction in other models. Furthermore, we introduce a multi-head cross-attention mechanism that enhances the neural beamformer's perception of spatial information by making full use of the spatial information received by the array. Experimental results demonstrate that our approach, which incorporates a more reasonable target mask estimation network and a spatial information-based cross-attention mechanism into the neural beamformer, effectively improves speech separation performance.

Submitted 28 June, 2023; originally announced June 2023.

arXiv:2306.11958 [pdf, other]

doi 10.1088/1361-6560/ad00fc

PDS-MAR: a fine-grained Projection-Domain Segmentation-based Metal Artifact Reduction method for intraoperative CBCT images with guidewires

Authors: Tianling Lyu, Zhan Wu, Gege Ma, Chen Jiang, Xinyun Zhong, Yan Xi, Yang Chen, Wentao Zhu

Abstract: Since the invention of modern CT systems, metal artifacts have been a persistent problem. Due to increased scattering, amplified noise, and insufficient data collection, it is more difficult to suppress metal artifacts in cone-beam CT, limiting its use in human- and robot-assisted spine surgeries where metallic guidewires and screws are commonly used. In this paper, we demonstrate that conventional image-domain segmentation-based MAR methods are unable to eliminate metal artifacts for intraoperative CBCT images with guidewires. To solve this problem, we present a fine-grained projection-domain segmentation-based MAR method termed PDS-MAR, in which metal traces are augmented and segmented in the projection domain before being inpainted using triangular interpolation. In addition, a metal reconstruction phase is proposed to restore metal areas in the image domain. The digital phantom study and real CBCT data study demonstrate that the proposed algorithm achieves significantly better artifact suppression than the compared methods and has the potential to advance the use of intraoperative CBCT imaging in clinical spine surgeries.

Submitted 20 June, 2023; originally announced June 2023.

Comments: 19 Pages

Journal ref: Phys. Med. Biol. 68 215007 (2023)

arXiv:2306.01289 [pdf, other]

nnMobileNet: Rethinking CNN for Retinopathy Research

Authors: Wenhui Zhu, Peijie Qiu, Xiwen Chen, Xin Li, Natasha Lepore, Oana M. Dumitrascu, Yalin Wang

Abstract: Over the past few decades, convolutional neural networks (CNNs) have been at the forefront of the detection and tracking of various retinal diseases (RD). Despite their success, the emergence of vision transformers (ViT) in the 2020s has shifted the trajectory of RD model development. The leading-edge performance of ViT-based models in RD can be largely credited to their scalability: their ability to improve as more parameters are added. As a result, ViT-based models tend to outshine traditional CNNs in RD applications, albeit at the cost of increased data and computational demands. ViTs also differ from CNNs in their approach to processing images, working with patches rather than local regions, which can complicate the precise localization of small, variably presented lesions in RD. In our study, we revisited and updated the architecture of a CNN model, specifically MobileNet, to enhance its utility in RD diagnostics. We found that an optimized MobileNet, through selective modifications, can surpass ViT-based models in various RD benchmarks, including diabetic retinopathy grading, detection of multiple fundus diseases, and classification of diabetic macular edema. The code is available at https://github.com/Retinal-Research/NN-MOBILENET

Submitted 15 April, 2024; v1 submitted 2 June, 2023; originally announced June 2023.

Comments: Accepted as a conference paper to 2024 CVPRW

arXiv:2305.08014 [pdf]

Surface EMG-Based Inter-Session/Inter-Subject Gesture Recognition by Leveraging Lightweight All-ConvNet and Transfer Learning

Authors: Md. Rabiul Islam, Daniel Massicotte, Philippe Y. Massicotte, Wei-** Zhu

Abstract: Gesture recognition using low-resolution instantaneous HD-sEMG images opens up new avenues for the development of more fluid and natural muscle-computer interfaces. However, the data variability between inter-session and inter-subject scenarios presents a great challenge. The existing approaches employ very large and complex deep ConvNet or 2SRNN-based domain adaptation methods to approximate the distribution shift caused by this inter-session and inter-subject data variability. Hence, these methods also require learning over millions of training parameters and a large pre-trained and target domain dataset in both the pre-training and adaptation stages. As a result, they demand high-end resources and are computationally very expensive to deploy in real-time applications. To overcome this problem, we propose a lightweight All-ConvNet+TL model that leverages lightweight All-ConvNet and transfer learning (TL) for the enhancement of inter-session and inter-subject gesture recognition performance. The All-ConvNet+TL model consists solely of convolutional layers, a simple yet efficient framework for learning invariant and discriminative representations to address the distribution shifts caused by inter-session and inter-subject data variability. Experiments on four datasets demonstrate that our proposed methods outperform the most complex existing approaches by a large margin, achieve state-of-the-art results on inter-session and inter-subject scenarios, and perform on par or competitively on intra-session gesture recognition. These performance gaps increase even more when a tiny amount (e.g., a single trial) of data is available on the target domain for adaptation. These experimental results provide evidence that the current state-of-the-art models may be overparameterized for sEMG-based inter-session and inter-subject gesture recognition tasks.

Submitted 19 February, 2024; v1 submitted 13 May, 2023; originally announced May 2023.

arXiv:2304.09727 [pdf, other]

Cooperative Multi-Cell Massive Access with Temporally Correlated Activity

Authors: Weifeng Zhu, Meixia Tao, Xiaojun Yuan, Fan Xu, Yunfeng Guan

Abstract: This paper investigates the problem of activity detection and channel estimation in cooperative multi-cell massive access systems with temporally correlated activity, where all access points (APs) are connected to a central unit via fronthaul links. We propose to perform user-centric AP cooperation for computation burden alleviation and introduce a generalized sliding-window detection strategy for fully exploiting the temporal correlation in activity. By establishing the probabilistic model associated with the factor graph representation, we propose a scalable Dynamic Compressed Sensing-based Multiple Measurement Vector Generalized Approximate Message Passing (DCS-MMV-GAMP) algorithm from the perspective of Bayesian inference. Therein, the activity likelihood is refined by performing standard message passing among the activities in the spatial-temporal domain and GAMP is employed for efficient channel estimation. Furthermore, we develop two schemes of quantize-and-forward (QF) and detect-and-forward (DF) based on DCS-MMV-GAMP for the finite-fronthaul-capacity scenario, which are extensively evaluated under various system limits. Numerical results verify the significant superiority of the proposed approach over the benchmarks. Moreover, it is revealed that QF can usually realize superior performance when the antenna number is small, whereas DF shifts to be preferable with limited fronthaul capacity if the large-scale antenna arrays are equipped.

Submitted 19 April, 2023; originally announced April 2023.

Comments: 16 pages, 17 figures, minor revision

arXiv:2303.10757 [pdf, other]

Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

Authors: Wentao Zhu, Mohamed Omar

Abstract: Audio events have a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification. Specifically, MAST employs one-dimensional (and two-dimensional) pooling operators along the time (and frequency) domains in different stages, and progressively reduces the number of tokens and increases the feature dimensions. MAST significantly outperforms AST (Gong et al., 2021) by 22.2%, 4.4% and 4.7% on Kinetics-Sounds, Epic-Kitchens-100 and VGGSound in terms of top-1 accuracy without external training data. On the downloaded AudioSet dataset, which has over 20% missing audios, MAST also achieves slightly better accuracy than AST. In addition, MAST is 5x more efficient in terms of multiply-accumulates (MACs), with a 42% reduction in the number of parameters compared to AST. Through clustering metrics and visualizations, we demonstrate that the proposed MAST can learn semantically more separable feature representations from audio signals.

Submitted 19 March, 2023; originally announced March 2023.

Comments: ICASSP 2023

arXiv:2303.07704 [pdf, other]

TEA-PSE 3.0: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System For ICASSP 2023 DNS Challenge

Authors: Yukai Ju, Jun Chen, Shimin Zhang, Shulin He, Wei Rao, Weixin Zhu, Yannan Wang, Tao Yu, Shidong Shang

Abstract: This paper introduces the Unbeatable Team's submission to the ICASSP 2023 Deep Noise Suppression (DNS) Challenge. We expand our previous work, TEA-PSE, to its upgraded version, TEA-PSE 3.0. Specifically, TEA-PSE 3.0 incorporates a residual LSTM after the squeezed temporal convolution network (S-TCN) to enhance sequence modeling capabilities. Additionally, the local-global representation (LGR) structure is introduced to boost speaker information extraction, and multi-STFT resolution loss is used to effectively capture the time-frequency characteristics of the speech signals. Moreover, retraining methods are employed based on the freeze training strategy to fine-tune the system. According to the official results, TEA-PSE 3.0 ranks 1st in both ICASSP 2023 DNS-Challenge track 1 and track 2.

Submitted 14 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2303.03737 [pdf, other]

Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning

Authors: Zhaoxi Mu, Xinyu Yang, Wenjing Zhu

Abstract: Transformer has shown advanced performance in speech separation, benefiting from its ability to capture global features. However, capturing local features and channel information of audio sequences in speech separation is equally important. In this paper, we present a novel approach named Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation. Specifically, we design a new network SE-Conformer that can model audio sequences in multiple dimensions and scales, and apply it to the dual-path speech separation framework. Furthermore, we propose Multi-Block Feature Aggregation to improve the separation effect by selectively utilizing information from the intermediate blocks of the separation network. Meanwhile, we propose a speaker similarity discriminative loss to optimize the speech separation model to address the problem of poor performance when speakers have similar voices. Experimental results on the benchmark datasets WSJ0-2mix and WHAM! show that ISCIT can achieve state-of-the-art results.

Submitted 7 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2303.03732 [pdf, other]

A Multi-Stage Triple-Path Method for Speech Separation in Noisy and Reverberant Environments

Authors: Zhaoxi Mu, Xinyu Yang, Xiangyuan Yang, Wenjing Zhu

Abstract: In noisy and reverberant environments, the performance of deep learning-based speech separation methods drops dramatically because previous methods are not designed and optimized for such situations. To address this issue, we propose a multi-stage end-to-end learning method that decouples the difficult speech separation problem in noisy and reverberant environments into three sub-problems: speech denoising, separation, and de-reverberation. The probability and speed of searching for the optimal solution of the speech separation model are improved by reducing the solution space. Moreover, since the channel information of the audio sequence in the time domain is crucial for speech separation, we propose a triple-path structure capable of modeling the channel dimension of audio sequences. Experimental results show that the proposed multi-stage triple-path method can improve the performance of speech separation models at the cost of little model parameter increment.

Submitted 7 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2302.09274 [pdf, ps, other]

doi 10.1109/TVT.2023.3245539

Low-Complexity Pareto-Optimal 3D Beamforming for the Full-Dimensional Multi-User Massive MIMO Downlink

Authors: W. Zhu, H. D. Tuan, E. Dutkiewicz, Y. Fang, L. Hanzo

Abstract: Full-dimensional (FD) multi-user massive multiple input multiple output (m-MIMO) systems employ large two-dimensional (2D) rectangular antenna arrays to control both the azimuth and elevation angles of signal transmission. We introduce the sum of two outer products of the azimuth and elevation beamforming vectors having moderate dimensions as a new class of FD beamforming. We show that this low-complexity class is capable of outperforming 2D beamforming relying on the single outer product of the azimuth and elevation beamforming vectors. It is also capable of performing close to its FD counterpart of massive dimensions in terms of either the users' minimum rate, their geometric mean rate (GM-rate), or sum rate (SR). Furthermore, we also show that even FD beamforming may be outperformed by our outer product-based improper Gaussian signaling solution. Explicitly, our design is based on low-complexity algorithms relying on convex problems of moderate dimensions for max-min rate optimization or on closed-form expressions for GM-rate and SR maximization.

Submitted 18 February, 2023; originally announced February 2023.

arXiv:2302.03003 [pdf, other]

OTRE: Where Optimal Transport Guided Unpaired Image-to-Image Translation Meets Regularization by Enhancing

Authors: Wenhui Zhu, Peijie Qiu, Oana M. Dumitrascu, Jacob M. Sobczak, Mohammad Farazi, Zhangsihao Yang, Keshav Nandakumar, Yalin Wang

Abstract: Non-mydriatic retinal color fundus photography (CFP) is widely available due to the advantage of not requiring pupillary dilation, but it is prone to poor quality due to operators, systemic imperfections, or patient-related causes. Optimal retinal image quality is mandated for accurate medical diagnoses and automated analyses. Herein, we leveraged the Optimal Transport (OT) theory to propose an unpaired image-to-image translation scheme for mapping low-quality retinal CFPs to high-quality counterparts. Furthermore, to improve the flexibility, robustness, and applicability of our image enhancement pipeline in clinical practice, we generalized a state-of-the-art model-based image reconstruction method, regularization by denoising, by plugging in priors learned by our OT-guided image-to-image translation network. We named it regularization by enhancing (RE). We validated the integrated framework, OTRE, on three publicly available retinal image datasets by assessing the quality after enhancement and their performance on various downstream tasks, including diabetic retinopathy grading, vessel segmentation, and diabetic lesion segmentation. The experimental results demonstrated the superiority of our proposed framework over some state-of-the-art unsupervised competitors and a state-of-the-art supervised method.

Submitted 8 April, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

Comments: Accepted as a conference paper to The 28th biennial international conference on Information Processing in Medical Imaging (IPMI 2023)

arXiv:2302.02991 [pdf, other]

Optimal Transport Guided Unsupervised Learning for Enhancing low-quality Retinal Images

Authors: Wenhui Zhu, Peijie Qiu, Mohammad Farazi, Keshav Nandakumar, Oana M. Dumitrascu, Yalin Wang

Abstract: Real-world non-mydriatic retinal fundus photography is prone to artifacts, imperfections and low quality when certain ocular or systemic co-morbidities exist. Artifacts may result in inaccuracy or ambiguity in clinical diagnoses. In this paper, we proposed a simple but effective end-to-end framework for enhancing poor-quality retinal fundus images. Leveraging the optimal transport theory, we proposed an unpaired image-to-image translation scheme for transporting low-quality images to their high-quality counterparts. We theoretically proved that a Generative Adversarial Network (GAN) model with a generator and discriminator is sufficient for this task. Furthermore, to mitigate the inconsistency of information between the low-quality images and their enhancements, an information consistency mechanism was proposed to maximally maintain structural consistency (optical discs, blood vessels, lesions) between the source and enhanced domains. Extensive experiments were conducted on the EyeQ dataset to demonstrate the superiority of our proposed method perceptually and quantitatively.

Submitted 6 February, 2023; originally announced February 2023.

Comments: Accepted as a conference paper to 20th IEEE International Symposium on Biomedical Imaging(ISBI 2023)

arXiv:2301.00554 [pdf]

In-situ monitoring additive manufacturing process with AI edge computing

Authors: Wenkang Zhu, Hui Li, Yikai Zhang, Yuqing Hou, Liwei Chen

Abstract: In-situ monitoring systems can be used to monitor the quality of additive manufacturing (AM) processes. In digital image correlation (DIC) based in-situ monitoring systems, high-speed cameras are used to capture high-resolution images. This paper proposes a novel in-situ monitoring system that accelerates the processing of digital images using an artificial intelligence (AI) edge computing board. It builds a visual transformer based video super resolution (ViTSR) network to reconstruct high resolution (HR) video frames. A fully convolutional network (FCN) is used to simultaneously extract the geometric characteristics of the molten pool and plasma arc during the AM processes. Compared with 6 state-of-the-art super resolution methods, ViTSR ranks first in terms of peak signal to noise ratio (PSNR). The PSNR of ViTSR for 4x super resolution reached 38.16 dB on test data with an input size of 75 pixels x 75 pixels. The inference times of ViTSR and FCN were optimized to 50.97 ms and 67.86 ms on the AI edge board after operator fusion and model pruning. The total inference time of the proposed system was 118.83 ms, which meets the requirement of real-time quality monitoring with low-cost in-situ monitoring equipment during AM processes. The proposed system achieved an accuracy of 96.34% on the multi-object extraction task and can be applied to different AM processes.

Submitted 2 January, 2023; originally announced January 2023.

arXiv:2211.06041 [pdf, other]

An Adapter based Multi-label Pre-training for Speech Separation and Enhancement

Authors: Tianrui Wang, Xie Chen, Zhuo Chen, Shu Yu, Weibin Zhu

Abstract: In recent years, self-supervised learning (SSL) has achieved tremendous success in various speech tasks due to its power to extract representations from massive unlabeled data. However, compared with tasks such as speech recognition (ASR), the improvements from SSL representation in speech separation (SS) and enhancement (SE) are considerably smaller. Based on HuBERT, this work investigates improving the SSL model for SS and SE. We first update HuBERT's masked speech prediction (MSP) objective by integrating the separation and denoising terms, resulting in a multiple pseudo label pre-training scheme, which significantly improves HuBERT's performance on SS and SE but degrades the performance on ASR. To maintain its performance gain on ASR, we further propose an adapter-based architecture for HuBERT's Transformer encoder, where only a few parameters of each layer are adjusted to the multiple pseudo label MSP while other parameters remain frozen as default HuBERT. Experimental results show that our proposed adapter-based multiple pseudo label HuBERT yields consistent and significant performance improvements on SE, SS, and ASR tasks, with a faster pre-training speed and only a marginal increase in parameters.

Submitted 11 November, 2022; originally announced November 2022.

Comments: 5 pages

arXiv:2211.00002 [pdf, other]

A Self-Supervised Approach to Reconstruction in Sparse X-Ray Computed Tomography

Authors: Rey Mendoza, Minh Nguyen, Judith Weng Zhu, Vincent Dumont, Talita Perciano, Juliane Mueller, Vidya Ganapati

Abstract: Computed tomography has propelled scientific advances in fields from biology to materials science. This technology allows for the elucidation of 3-dimensional internal structure by the attenuation of x-rays through an object at different rotations relative to the beam. By imaging 2-dimensional projections, a 3-dimensional object can be reconstructed through a computational algorithm. Imaging at a greater number of rotation angles allows for improved reconstruction. However, taking more measurements increases the x-ray dose and may cause sample damage. Deep neural networks have been used to transform sparse 2-D projection measurements to a 3-D reconstruction by training on a dataset of known similar objects. However, obtaining high-quality object reconstructions for the training dataset requires high x-ray dose measurements that can destroy or alter the specimen before imaging is complete. This becomes a chicken-and-egg problem: high-quality reconstructions cannot be generated without deep learning, and the deep neural network cannot be learned without the reconstructions. This work develops and validates a self-supervised probabilistic deep learning technique, the physics-informed variational autoencoder, to solve this problem. A dataset consisting solely of sparse projection measurements from each object is used to jointly reconstruct all objects of the set. This approach has the potential to allow visualization of fragile samples with x-ray computed tomography. We release our code for reproducing our results at: https://github.com/vganapati/CT_PVAE.

Submitted 29 October, 2022; originally announced November 2022.

Comments: NeurIPS 2022 Machine Learning and the Physical Sciences Workshop. arXiv admin note: text overlap with arXiv:2210.16709

arXiv:2210.12954 [pdf, other]

Message Passing-Based Joint User Activity Detection and Channel Estimation for Temporally-Correlated Massive Access

Authors: Weifeng Zhu, Meixia Tao, Xiaojun Yuan, Yunfeng Guan

Abstract: This paper studies the user activity detection and channel estimation problem in a temporally-correlated massive access system where a very large number of users communicate with a base station sporadically and each user, once activated, can transmit with a large probability over multiple consecutive frames. We formulate the problem as a dynamic compressed sensing (DCS) problem to exploit both the sparsity and the temporal correlation of user activity. By leveraging the hybrid generalized approximate message passing (HyGAMP) framework, we design a computationally efficient algorithm, HyGAMP-DCS, to solve this problem. Rather than exploiting only the historical estimates, the proposed algorithm performs bidirectional message passing between the neighboring frames for activity likelihood update to fully exploit the temporally-correlated user activities. Furthermore, we develop an expectation maximization HyGAMP-DCS (EM-HyGAMP-DCS) algorithm to adaptively learn the hyperparameters during the estimation procedure when the system statistics are unknown. In particular, we propose to utilize the analysis tool of state evolution to find the appropriate hyperparameter initialization of EM-HyGAMP-DCS. Simulation results demonstrate that our proposed algorithms can significantly improve the user activity detection accuracy and reduce the channel estimation error.

Submitted 26 January, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

Comments: 31 pages, 14 figures, minor revision

arXiv:2210.11089 [pdf, other]

Speech Dereverberation with a Reverberation Time Shortening Target

Authors: Rui Zhou, Wenye Zhu, Xiaofei Li

Abstract: This work proposes a new learning target based on reverberation time shortening (RTS) for speech dereverberation. The learning target for dereverberation is usually set as the direct-path speech, optionally with some early reflections. This type of target abruptly truncates the reverberation, and thus it may not be suitable for network training. The proposed RTS target suppresses reverberation while maintaining the exponential decaying property of reverberation, which eases the network training and thus reduces the signal distortion caused by the prediction error. Moreover, this work experimentally studies adapting our previously proposed FullSubNet speech denoising network to speech dereverberation. Experiments show that RTS is a more suitable learning target than direct-path speech and early reflections, in terms of better suppressing reverberation and signal distortion. FullSubNet is able to achieve outstanding dereverberation performance.

Submitted 5 June, 2023; v1 submitted 20 October, 2022; originally announced October 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2204.08765

arXiv:2210.08802 [pdf, other]

Spatial-DCCRN: DCCRN Equipped with Frame-Level Angle Feature and Hybrid Filtering for Multi-Channel Speech Enhancement

Authors: Shubo Lv, Yihui Fu, Yukai Jv, Lei Xie, Weixin Zhu, Wei Rao, Yannan Wang

Abstract: Recently, multi-channel speech enhancement has drawn much interest due to the use of spatial information to distinguish target speech from interfering signal. To make full use of spatial information and neural network based masking estimation, we propose a multi-channel denoising neural network -- Spatial DCCRN. Firstly, we extend S-DCCRN to the multi-channel scenario, aiming at performing a cascaded sub-channel and full-channel processing strategy, which can model different channels separately. Moreover, instead of only adopting the multi-channel spectrum or concatenating the first channel's magnitude and IPD as the model's inputs, we apply an angle feature extraction module (AFE) to extract frame-level angle feature embeddings, which can help the model to apparently perceive spatial information. Finally, since the phenomenon of residual noise will be more serious when the noise and speech exist in the same time frequency (TF) bin, we particularly design a masking and mapping filtering method to substitute the traditional filter-and-sum operation, with the purpose of cascading coarse denoising, dereverberation and residual noise suppression. The proposed model, Spatial-DCCRN, has surpassed EaBNet, FasNet as well as several competitive models on the L3DAS22 Challenge dataset. Beyond the 3D scenario, Spatial-DCCRN also outperforms the state-of-the-art (SOTA) model MIMO-UNet by a large margin in multiple evaluation metrics on the multi-channel ConferencingSpeech2021 Challenge dataset. Ablation studies also demonstrate the effectiveness of different contributions.

Submitted 17 October, 2022; originally announced October 2022.

arXiv:2210.05946 [pdf, other]

Self-Supervised Equivariant Regularization Reconciles Multiple Instance Learning: Joint Referable Diabetic Retinopathy Classification and Lesion Segmentation

Authors: Wenhui Zhu, Peijie Qiu, Natasha Lepore, Oana M. Dumitrascu, Yalin Wang

Abstract: Lesion appearance is a crucial clue for medical providers to distinguish referable diabetic retinopathy (rDR) from non-referable DR. Most existing large-scale DR datasets contain only image-level labels rather than pixel-based annotations. This motivates us to develop algorithms to classify rDR and segment lesions via image-level labels. This paper leverages self-supervised equivariant learning and attention-based multi-instance learning (MIL) to tackle this problem. MIL is an effective strategy to differentiate positive and negative instances, helping us discard background regions (negative instances) while localizing lesion regions (positive ones). However, MIL only provides coarse lesion localization and cannot distinguish lesions located across adjacent patches. Conversely, a self-supervised equivariant attention mechanism (SEAM) generates a segmentation-level class activation map (CAM) that can guide patch extraction of lesions more accurately. Our work aims at integrating both methods to improve rDR classification accuracy. We conduct extensive validation experiments on the Eyepacs dataset, achieving an area under the receiver operating characteristic curve (AU ROC) of 0.958, outperforming current state-of-the-art algorithms.

Submitted 12 October, 2022; originally announced October 2022.

Comments: 7 pages, 2 tables, 3 figures. 18th International Symposium on Medical Information Processing and Analysis

arXiv:2209.03643 [pdf, ps, other]

Deep Learning for Hierarchical Beam Alignment in mmWave Communication Systems

Authors: Junyi Yang, Weifeng Zhu, Meixia Tao

Abstract: Fast and precise beam alignment is crucial to support high-quality data transmission in millimeter wave (mmWave) communication systems. In this work, we propose a novel deep learning based hierarchical beam alignment method that learns two tiers of probing codebooks (PCs) and uses their measurements to predict the optimal beam in a coarse-to-fine searching manner. Specifically, the proposed method first performs coarse channel measurement using the tier-1 PC, then selects a tier-2 PC for fine channel measurement, and finally predicts the optimal beam based on both coarse and fine measurements. The proposed deep neural network (DNN) architecture is trained in two steps. First, the tier-1 PC and the tier-2 PC selector are trained jointly. After that, all the tier-2 PCs together with the optimal beam predictors are trained jointly. The learned hierarchical PCs can capture the features of the propagation environment. Numerical results based on realistic ray-tracing datasets demonstrate that the proposed method is superior to the state-of-the-art beam alignment methods in both alignment accuracy and sweeping overhead.

Submitted 8 September, 2022; originally announced September 2022.

Comments: 6 pages, 6 figures, accepted by GLOBECOM 2022

arXiv:2206.04289 [pdf, other]

A No-Reference Deep Learning Quality Assessment Method for Super-resolution Images Based on Frequency Maps

Authors: Zicheng Zhang, Wei Sun, Xiongkuo Min, Wenhan Zhu, Tao Wang, Wei Lu, Guangtao Zhai

Abstract: To support the application scenarios where high-resolution (HR) images are urgently needed, various single image super-resolution (SISR) algorithms are developed. However, SISR is an ill-posed inverse problem, which may bring artifacts like texture shift, blur, etc. to the reconstructed images, thus it is necessary to evaluate the quality of super-resolution images (SRIs). Note that most existing image quality assessment (IQA) methods were developed for synthetically distorted images, which may not work for SRIs since their distortions are more diverse and complicated. Therefore, in this paper, we propose a no-reference deep-learning image quality assessment method based on frequency maps because the artifacts caused by SISR algorithms are quite sensitive to frequency information. Specifically, we first obtain the high-frequency map (HM) and low-frequency map (LM) of SRI by using Sobel operator and piecewise smooth image approximation. Then, a two-stream network is employed to extract the quality-aware features of both frequency maps. Finally, the features are regressed into a single quality value using fully connected layers. The experimental results show that our method outperforms all compared IQA models on the selected three super-resolution quality assessment (SRQA) databases.

Submitted 9 June, 2022; originally announced June 2022.
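
Illustrative note on the entry above: the abstract says the high-frequency map is obtained with the Sobel operator. The sketch below shows one common way to compute a Sobel gradient-magnitude map with NumPy. It is not the authors' code; the function name `sobel_high_frequency_map`, the edge-replicating border handling, and the use of the gradient magnitude as the map are all assumptions made for illustration. The low-frequency map (piecewise smooth image approximation) is not covered.

```python
import numpy as np


def sobel_high_frequency_map(image):
    """Return the Sobel gradient magnitude of a 2-D grayscale image.

    Hypothetical helper for illustration only; the paper's exact
    preprocessing is not specified in the abstract.
    """
    img = np.asarray(image, dtype=np.float64)
    if img.ndim != 2:
        raise ValueError("expected a 2-D grayscale image")

    # Replicate border pixels so the output has the same shape as the input
    # (an assumed border policy).
    p = np.pad(img, 1, mode="edge")

    # Horizontal derivative: kernel [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],
    # written as differences of shifted views of the padded image.
    gx = (p[:-2, 2:] + 2.0 * p[1:-1, 2:] + p[2:, 2:]) - (
        p[:-2, :-2] + 2.0 * p[1:-1, :-2] + p[2:, :-2]
    )

    # Vertical derivative: kernel [[-1, -2, -1], [0, 0, 0], [1, 2, 1]].
    gy = (p[2:, :-2] + 2.0 * p[2:, 1:-1] + p[2:, 2:]) - (
        p[:-2, :-2] + 2.0 * p[:-2, 1:-1] + p[:-2, 2:]
    )

    # Gradient magnitude: large where intensity changes quickly (edges, texture).
    return np.hypot(gx, gy)


# A flat image has no high-frequency content; a vertical step edge
# responds only in the two columns adjacent to the step.
flat = np.full((4, 6), 7.0)
step = np.zeros((4, 6))
step[:, 3:] = 1.0
flat_map = sobel_high_frequency_map(flat)
step_map = sobel_high_frequency_map(step)
```

With this border policy, the map is zero everywhere on the flat image and nonzero only in the two columns next to the step, which matches the intuition that the map highlights edges and fine texture.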

arXiv:2205.07494 [pdf, other]

Double-Sided Information Aided Temporal-Correlated Massive Access

Authors: Weifeng Zhu, Meixia Tao, Yunfeng Guan

Abstract: This letter considers temporal-correlated massive access, where each device, once activated, is likely to transmit continuously over several consecutive frames. Motivated by the fact that the device activity at each frame is correlated with not only its previous frame but also its next frame, we propose a double-sided information (DSI) aided joint activity detection and channel estimation algorithm based on the approximate message passing (AMP) framework. The DSI is extracted from the estimation results in a sliding window that contains the target detection frame and its previous and next frames. The proposed algorithm demonstrates superior performance over the state-of-the-art methods.

Submitted 16 May, 2022; originally announced May 2022.

Comments: 6 pages, 5 figures

arXiv:2204.08765 [pdf, other]

Speech Dereverberation with A Reverberation Time Shortening Target

Authors: Rui Zhou, Wenye Zhu, Xiaofei Li

Abstract: This work proposes a new learning target based on reverberation time shortening (RTS) for speech dereverberation. The learning target for dereverberation is usually set as the direct-path speech, optionally with some early reflections. This type of target abruptly truncates the reverberation, and thus it may not be suitable for network training. The proposed RTS target suppresses reverberation while maintaining the exponential decaying property of reverberation, which eases the network training and thus reduces the signal distortion caused by the prediction error. Moreover, this work experimentally studies adapting our previously proposed FullSubNet speech denoising network to speech dereverberation. Experiments show that RTS is a more suitable learning target than direct-path speech and early reflections, in terms of better suppressing reverberation and signal distortion. FullSubNet is able to achieve outstanding dereverberation performance.

Submitted 20 November, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

Comments: Submitted to ICASSP 2023

arXiv:2204.05571 [pdf, other]

Speech Emotion Recognition with Global-Aware Fusion on Multi-scale Feature Representation

Authors: Wen**g Zhu, Xiang Li

Abstract: Speech Emotion Recognition (SER) is a fundamental task to predict the emotion label from speech data. Recent works mostly focus on using convolutional neural networks (CNNs) to learn local attention maps on fixed-scale feature representations by viewing time-varied spectral features as images. However, rich emotional features at different scales and important global information cannot be well captured due to the limits of existing CNNs for SER. In this paper, we propose a novel GLobal-Aware Multi-scale (GLAM) neural network (the code is available at https://github.com/lixiangucas01/GLAM) to learn multi-scale feature representations with a global-aware fusion module to attend to emotional information. Specifically, GLAM iteratively utilizes multiple convolutional kernels with different scales to learn multiple feature representations. Then, instead of using attention-based methods, a simple but effective global-aware fusion module is applied to grab the most important emotional information globally. Experiments on the benchmark corpus IEMOCAP over four emotions demonstrate the superiority of our proposed model, with 2.5% to 4.5% improvements on four common metrics compared to previous state-of-the-art approaches.

Submitted 12 April, 2022; originally announced April 2022.

Comments: 6 pages, 3 figures, ICASSP 2022

arXiv:2204.00226 [pdf, other]

Multiple Confidence Gates For Joint Training Of SE And ASR

Authors: Tianrui Wang, Weibin Zhu, Yingying Gao, Junlan Feng, Shilei Zhang

Abstract: Joint training of a speech enhancement model (SE) and a speech recognition model (ASR) is a common solution for robust ASR in noisy environments. SE focuses on improving the auditory quality of speech, but the enhanced feature distribution is changed, which is uncertain and detrimental to the ASR. To tackle this challenge, an approach with multiple confidence gates for joint training of SE and ASR is proposed. A speech confidence gates prediction module is designed to replace the former SE module in joint training. The noisy speech is filtered by gates to obtain features that are easier for the ASR network to fit. The experimental results show that the proposed method has better performance than the traditional robust speech recognition system on test sets of clean speech, synthesized noisy speech, and real noisy speech.

Submitted 1 April, 2022; originally announced April 2022.

Comments: 5 pages

arXiv:2203.04780 [pdf]

Intelligent resonance tracking of a microwave plasmonic resonator for compact wireless sensors

Authors: Xuanru Zhang, Jia Wen Zhu, Tie Jun Cui

Abstract: Plasmonic sensing has been in the spotlight for decades, the concept and applications of which have been generalized to spoof surface plasmons (SSPs) in the microwave band. Here, we report a compact and wireless sensor within a printed circuit board size of 18 mm × 12 mm, tracking the resonance frequency shift of a microwave plasmonic resonator via a software-defined scheme. The microwave plasmonic resonator yields a deep-subwavelength size, enhanced sensitivity, and a good electromagnetic compatibility performance. The software-defined resonance tracking scheme minimizes the hardware circuit and the consumed spectrum resources, and makes the detection intelligently adaptive to the target resonance, with a signal-to-noise ratio of 69 dB and a data rate of 2272 measuring points per second. The sensor has been validated via acetone vapor concentration sensing, while its applications can be widely extended by replacing the transducer materials. This approach provides compact, sensitive, accurate and intelligent solutions for resonant sensors in the Internet of things (IoT).

Submitted 2 March, 2022; originally announced March 2022.

arXiv:2203.00926 [pdf, other]

Parameterized Image Quality Score Distribution Prediction

Authors: Yixuan Gao, Xiongkuo Min, Wenhan Zhu, Xiao-** Zhang, Guangtao Zhai

Abstract: Recently, image quality has been generally described by a mean opinion score (MOS). However, we observe that the quality scores of an image given by a group of subjects are very subjective and diverse. Thus it is not enough to use a MOS to describe the image quality. In this paper, we propose to describe image quality using a parameterized distribution rather than a MOS, and an objective method is also proposed to predict the image quality score distribution (IQSD). At first, the LIVE database is re-recorded. Specifically, we have invited a large group of subjects to evaluate the quality of all images in the LIVE database, and each image is evaluated by a large number of subjects (187 valid subjects), whose scores can form a reliable IQSD. By analyzing the obtained subjective quality scores, we find that the IQSD can be well modeled by an alpha stable model, and it can reflect much more information than a single MOS, such as the skewness of opinion score, the subject diversity and the maximum probability score for an image. Therefore, we propose to model the IQSD using the alpha stable model. Moreover, we propose a framework and an algorithm to predict the alpha stable model based IQSD, where quality features are extracted from each image based on structural information and statistical information, and support vector regressors are trained to predict the alpha stable model parameters. Experimental results verify the feasibility of using alpha stable model to describe the IQSD, and prove the effectiveness of objective alpha stable model based IQSD prediction method.

Submitted 2 March, 2022; originally announced March 2022.

arXiv:2203.00917 [pdf, other]

Machine Learning Methods for Inferring the Number of UAV Emitters via Massive MIMO Receive Array

Authors: Yifan Li, Feng Shu, **song Hu, Shihao Yan, Haiwei Song, Weiqiang Zhu, Da Tian, Yaoliang Song, Jiangzhou Wang

Abstract: To provide important prior knowledge for the DOA estimation of UAV emitters in future wireless networks, we present a complete DOA preprocessing system for inferring the number of emitters via massive MIMO receive array. Firstly, in order to eliminate the noise signals, two high-precision signal detectors, square root of maximum eigenvalue times minimum eigenvalue (SR-MME) and geometric mean (GM), are proposed. Compared to other detectors, SR-MME and GM can achieve a high detection probability while maintaining extremely low false alarm probability. Secondly, if the existence of emitters is determined by detectors, we need to further confirm their number. Therefore, we perform feature extraction on the eigenvalue sequence of the sample covariance matrix to construct a feature vector and innovatively propose a multi-layer neural network (ML-NN). Additionally, the support vector machine (SVM) and naive Bayesian classifier (NBC) are also designed. The simulation results show that the machine learning-based methods can achieve good results in signal classification, especially neural networks, which can always maintain the classification accuracy above 70% with massive MIMO receive array. Finally, we analyze the classical signal classification methods, Akaike (AIC) and Minimum description length (MDL). It is concluded that the two methods are not suitable for scenarios with massive MIMO arrays, and they also have much worse performance than machine learning-based classifiers.

Submitted 10 March, 2023; v1 submitted 2 March, 2022; originally announced March 2022.

arXiv:2203.00613 [pdf]

Towards a Common Speech Analysis Engine

Authors: Hagai Aronowitz, Itai Gat, Edmilson Morais, Weizhong Zhu, Ron Hoory

Abstract: Recent innovations in self-supervised representation learning have led to remarkable advances in natural language processing. That said, in the speech processing domain, self-supervised representation learning-based systems are not yet considered state-of-the-art. We propose leveraging recent advances in self-supervised-based speech processing to create a common speech analysis engine. Such an engine should be able to handle multiple speech processing tasks, using a single architecture, to obtain state-of-the-art accuracy. The engine must also enable support for new tasks with small training datasets. Beyond that, a common engine should be capable of supporting distributed training with client in-house private data. We present the architecture for a common speech analysis engine based on the HuBERT self-supervised speech representation. Based on experiments, we report our results for language identification and emotion recognition on the standard evaluations NIST-LRE 07 and IEMOCAP. Our results surpass the state-of-the-art performance reported so far on these tasks. We also analyzed our engine on the emotion recognition task using reduced amounts of training data and show how to achieve improved results.

Submitted 1 March, 2022; originally announced March 2022.

Comments: ICASSP 2022

arXiv:2202.12643 [pdf, other]

Harmonic gated compensation network plus for ICASSP 2022 DNS CHALLENGE

Authors: Tianrui Wang, Weibin Zhu, Yingying Gao, Yanan Chen, Junlan Feng, Shilei Zhang

Abstract: The harmonic structure of speech is resistant to noise, but the harmonics may still be partially masked by noise. Therefore, we previously proposed a harmonic gated compensation network (HGCN) to predict the full harmonic locations based on the unmasked harmonics and process the result of a coarse enhancement module to recover the masked harmonics. In addition, the auditory loudness loss function is used to train the network. For the DNS Challenge, we update HGCN in the following aspects, resulting in HGCN+. First, a high-band module is employed to help the model handle full-band signals. Second, cosine is used to model the harmonic structure more accurately. Then, the dual-path encoder and dual-path RNN (DPRNN) are introduced to take full advantage of the features. Finally, a gated residual linear structure replaces the gated convolution in the compensation module to increase the receptive field of frequency. The experimental results show that each updated module brings performance improvement to the model. HGCN+ also outperforms the referenced models on both wide-band and full-band test sets.

Submitted 25 February, 2022; originally announced February 2022.

Comments: 5 pages

arXiv:2202.03896 [pdf]

Speech Emotion Recognition using Self-Supervised Features

Authors: Edmilson Morais, Ron Hoory, Weizhong Zhu, Itai Gat, Matheus Damasceno, Hagai Aronowitz

Abstract: Self-supervised pre-trained features have consistently delivered state-of-the-art results in the field of natural language processing (NLP); however, their merits in the field of speech emotion recognition (SER) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SER system based on an Upstream + Downstream architecture paradigm, which allows easy use/integration of a large variety of self-supervised features. Several SER experiments for predicting categorical emotion classes from the IEMOCAP dataset are performed. These experiments investigate interactions among fine-tuning of self-supervised feature models, aggregation of frame-level features into utterance-level features and back-end classification networks. The proposed monomodal speech-only based system not only achieves SOTA results, but also brings light to the possibility of powerful and well fine-tuned self-supervised acoustic features that reach results similar to the results achieved by SOTA multimodal systems using both Speech and Text modalities.

Submitted 6 February, 2022; originally announced February 2022.

Comments: 5 pages, 4 figures, 2 tables, ICASSP 2022

arXiv:2201.12755 [pdf, other]

HGCN: Harmonic gated compensation network for speech enhancement

Authors: Tianrui Wang, Weibin Zhu, Yingying Gao, Junlan Feng, Shilei Zhang

Abstract: Mask processing in the time-frequency (T-F) domain through the neural network has been one of the mainstreams for single-channel speech enhancement. However, it is hard for most models to handle the situation when harmonics are partially masked by noise. To tackle this challenge, we propose a harmonic gated compensation network (HGCN). We design a high-resolution harmonic integral spectrum to improve the accuracy of harmonic location prediction. Then we add voice activity detection (VAD) and voiced region detection (VRD) to the convolutional recurrent network (CRN) to filter harmonic locations. Finally, the harmonic gating mechanism is used to guide the compensation model to adjust the coarse results from CRN to obtain the refined enhanced results. Our experiments show HGCN achieves substantial gain over a number of advanced approaches in the community.

Submitted 16 March, 2022; v1 submitted 30 January, 2022; originally announced January 2022.

Comments: 5 pages

arXiv:2201.09124 [pdf, ps, other]

Copula-Based Modeling of RIS-Assisted Communications: Outage Probability Analysis

Authors: Imène Trigui, Damoon Shahbaztabar, Wessam Ajib, Wei-** Zhu

Abstract: Statistical characterization of the signal-to-noise ratio (SNR) of reconfigurable intelligent surface (RIS)-assisted communications in the presence of phase noise is an important open issue. In this letter, we exploit the concept of copula modeling to capture the non-standard dependence features that appear due to the presence of discrete phase noise. In particular, we consider the outage probability of RIS systems in Rayleigh fading channels and provide joint distributions to characterize the dependencies due to the use of finite resolution phase shifters at the RIS. Numerical assessments confirm the validity of closed-form expressions of the outage probability and motivate the use of bivariate copula for further RIS studies.

Submitted 22 January, 2022; originally announced January 2022.

Showing 1–50 of 92 results for author: Zhu, W