arXiv:2406.19706 [pdf, other]

SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR

Authors: Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

Abstract: Mixture-of-experts (MoE) models have achieved excellent results in many tasks. However, conventional MoE models are often very large, making them challenging to deploy on resource-constrained edge devices. In this paper, we propose a novel speaker adaptive mixture of LoRA experts (SAML) approach, which uses low-rank adaptation (LoRA) modules as experts to reduce the number of trainable parameters in MoE. Specifically, SAML is applied to the quantised and personalised end-to-end automatic speech recognition models, which combines test-time speaker adaptation to improve the performance of heavily compressed models in speaker-specific scenarios. Experiments have been performed on the LibriSpeech and the TED-LIUM 3 corpora. Remarkably, with a 7x reduction in model size, 29.1% and 31.1% relative word error rate reductions were achieved on the quantised Whisper model and Conformer-based attention-based encoder-decoder ASR model respectively, compared to the original full precision models.

Submitted 28 June, 2024; originally announced June 2024.

Comments: 5 pages, accepted by Interspeech 2024. arXiv admin note: substantial text overlap with arXiv:2309.09136

arXiv:2406.18361 [pdf, other]

Stable Diffusion Segmentation for Biomedical Images with Single-step Reverse Process

Authors: Tianyu Lin, Zhiguang Chen, Zhonghao Yan, Weijiang Yu, Fudan Zheng

Abstract: Diffusion models have demonstrated their effectiveness across various generative tasks. However, when applied to medical image segmentation, these models encounter several challenges, including significant resource and time requirements. They also necessitate a multi-step reverse process and multiple samples to produce reliable predictions. To address these challenges, we introduce the first latent diffusion segmentation model, named SDSeg, built upon stable diffusion (SD). SDSeg incorporates a straightforward latent estimation strategy to facilitate a single-step reverse process and utilizes latent fusion concatenation to remove the necessity for multiple samples. Extensive experiments indicate that SDSeg surpasses existing state-of-the-art methods on five benchmark datasets featuring diverse imaging modalities. Remarkably, SDSeg is capable of generating stable predictions with a solitary reverse step and sample, epitomizing the model's stability as implied by its name. The code is available at https://github.com/lin-tianyu/Stable-Diffusion-Seg

Submitted 27 June, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

Comments: Accepted at MICCAI 2024. Code and citation info see https://github.com/lin-tianyu/Stable-Diffusion-Seg

arXiv:2404.03179 [pdf, other]

UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization

Authors: Tiantian Geng, Teng Wang, Yanfu Zhang, Jinming Duan, Weili Guan, Feng Zheng

Abstract: Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods over-specialize on each task, overlooking the fact that these instances often occur in the same video to form the complete video content. In this work, we present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of TAL, SED and AVEL tasks for the first time. UniAV can leverage diverse data available in task-specific datasets, allowing the model to learn and share mutually beneficial knowledge across tasks and modalities. To tackle the challenges posed by substantial variations in datasets (size/domain/duration) and distinct task characteristics, we propose to uniformly encode visual and audio modalities of all videos to derive generic representations, while also designing task-specific experts to capture unique knowledge for each task. Besides, we develop a unified language-aware classifier by utilizing a pre-trained text encoder, enabling the model to flexibly detect various types of instances and previously unseen ones by simply changing prompts during inference. UniAV outperforms its single-task counterparts by a large margin with fewer parameters, achieving on-par or superior performances compared to state-of-the-art task-specific methods across ActivityNet 1.3, DESED and UnAV-100 benchmarks.

Submitted 3 April, 2024; originally announced April 2024.

arXiv:2312.09464 [pdf, other]

Enhanced Eye Diagram Estimation Method for Nonlinear Systems With Input Jitter

Authors: Hanqing Zhang, Feijun Zheng

Abstract: An enhanced multiple-edge response (MER) based eye diagram estimation method is proposed to evaluate the performance of nonlinear systems with input jitter. Compared with existing MER-based methods which only took into account the bit effect, the proposed method first determines both orders of bit effect and jitter effect. These decided orders can affirm the necessary MERs. Subsequently, the proposed method figures out the minimal number of sampling points so that the necessary MERs can be recovered quickly based on the Nyquist theory and can be used to create eye diagrams. Lastly, the eye diagrams and their parameters are compared with those generated by traditional transient simulation and an existing MER-based method which introduces input jitter through a convolution process. The result indicates that this enhanced method is more accurate than the existing MER-based method.

Submitted 10 November, 2023; originally announced December 2023.

Comments: The article was accepted but not published by EMC+SIPI 2023 because of failure to attend the conference for personal emergency reason. The information is attached at the end of article

arXiv:2311.03887 [pdf, other]

Toward ground-truth optical coherence tomography via three-dimensional unsupervised deep learning processing and data

Authors: Renxiong Wu, Fei Zheng, Meixuan Li, Shaoyan Huang, Xin Ge, Linbo Liu, Yong Liu, Guangming Ni

Abstract: Optical coherence tomography (OCT) can perform non-invasive high-resolution three-dimensional (3D) imaging and has been widely used in biomedical fields, while it is inevitably affected by coherence speckle noise which degrades OCT imaging performance and restricts its applications. Here we present a novel speckle-free OCT imaging strategy, named toward-ground-truth OCT (tGT-OCT), that utilizes unsupervised 3D deep-learning processing and leverages OCT 3D imaging features to achieve speckle-free OCT imaging. Specifically, our proposed tGT-OCT utilizes an unsupervised 3D-convolution deep-learning network trained using random 3D volumetric data to distinguish and separate speckle from real structures in 3D imaging volumetric space; moreover, tGT-OCT effectively further reduces speckle noise and reveals structures that would otherwise be obscured by speckle noise while preserving spatial resolution. Results derived from different samples demonstrated the high-quality speckle-free 3D imaging performance of tGT-OCT and its advancement beyond the previous state-of-the-art.

Submitted 7 November, 2023; originally announced November 2023.

arXiv:2309.09136 [pdf, other]

Enhancing Quantised End-to-End ASR Models via Personalisation

Authors: Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

Abstract: Recent end-to-end automatic speech recognition (ASR) models have become increasingly large, making them particularly challenging to deploy on resource-constrained devices. Model quantisation is an effective solution that sometimes causes the word error rate (WER) to increase. In this paper, a novel strategy of personalisation for a quantised model (PQM) is proposed, which combines speaker adaptive training (SAT) with model quantisation to improve the performance of heavily compressed models. Specifically, PQM uses a 4-bit NormalFloat Quantisation (NF4) approach for model quantisation and low-rank adaptation (LoRA) for SAT. Experiments have been performed on the LibriSpeech and the TED-LIUM 3 corpora. Remarkably, with a 7x reduction in model size and 1% additional speaker-specific parameters, 15.1% and 23.3% relative WER reductions were achieved on quantised Whisper and Conformer-based attention-based encoder-decoder ASR models respectively, compared to the original full precision models.

Submitted 16 September, 2023; originally announced September 2023.

Comments: 5 pages, submitted to ICASSP 2024

arXiv:2307.04827 [pdf, other]

LaunchpadGPT: Language Model as Music Visualization Designer on Launchpad

Authors: Siting Xu, Yunlong Tang, Feng Zheng

Abstract: Launchpad is a musical instrument that allows users to create and perform music by pressing illuminated buttons. To assist and inspire the design of the Launchpad light effect, and provide a more accessible approach for beginners to create music visualization with this instrument, we proposed the LaunchpadGPT model to generate music visualization designs on Launchpad automatically. Based on the language model with excellent generation ability, our proposed LaunchpadGPT takes an audio piece of music as input and outputs the lighting effects of Launchpad-playing in the form of a video (Launchpad-playing video). We collect Launchpad-playing videos and process them to obtain music and corresponding video frame of Launchpad-playing as prompt-completion pairs, to train the language model. The experiment result shows the proposed method can create better music visualization than random generation methods and hold the potential for a broader range of music visualization applications. Our code is available at https://github.com/yunlong10/LaunchpadGPT/.

Submitted 23 July, 2023; v1 submitted 7 July, 2023; originally announced July 2023.

Comments: Accepted by International Computer Music Conference (ICMC) 2023

arXiv:2307.04296 [pdf, other]

K-Space-Aware Cross-Modality Score for Synthesized Neuroimage Quality Assessment

Authors: Guoyang Xie, Jinbao Wang, Yawen Huang, Jiayi Lyu, Feng Zheng, Yefeng Zheng, Yaochu Jin

Abstract: The problem of how to assess cross-modality medical image synthesis has been largely unexplored. The most used measures like PSNR and SSIM focus on analyzing the structural features but neglect the crucial lesion location and fundamental k-space speciality of medical images. To overcome this problem, we propose a new metric K-CROSS to spur progress on this challenging problem. Specifically, K-CROSS uses a pre-trained multi-modality segmentation network to predict the lesion location, together with a tumor encoder for representing features, such as texture details and brightness intensities. To further reflect the frequency-specific information from the magnetic resonance imaging principles, both k-space features and vision features are obtained and employed in our comprehensive encoders with a frequency reconstruction penalty. The structure-shared encoders are designed and constrained with a similarity loss to capture the intrinsic common structural information for both modalities. As a consequence, the features learned from lesion regions, k-space, and anatomical structures are all captured, which serve as our quality evaluators. We evaluate the performance by constructing a large-scale cross-modality neuroimaging perceptual similarity (NIRPS) dataset with 6,000 radiologist judgments. Extensive experiments demonstrate that the proposed method outperforms other metrics, especially in comparison with the radiologists on NIRPS.

Submitted 9 February, 2024; v1 submitted 9 July, 2023; originally announced July 2023.

arXiv:2306.01458 [pdf, ps, other]

Extremely Large-scale Array Systems: Near-Field Codebook Design and Performance Analysis

Authors: Feng Zheng, Hongkang Yu, Chenchen Wang, Luyang Sun, Qingqing Wu, Yijian Chen

Abstract: Extremely Large-scale Array (ELAA) promises to deliver ultra-high data rates with increased antenna elements. However, increasing antenna elements leads to a wider realm of near-field, which challenges the traditional design of codebooks. In this paper, we propose novel near-field codebook schemes based on the fitting formula of codewords' quantization performance. First, we analyze the quantization performance properties of uniform linear array (ULA) and uniform planar array (UPA) codewords. Our findings reveal an intriguing property: the correlation formula for ULA codewords can be represented by the elliptic formula, while the correlation formula for UPA codewords can be approximated using the ellipsoid formula. Building on this insight, we propose a ULA uniform codebook that maximizes the minimum correlation based on the derived formula. Moreover, we introduce a ULA dislocation codebook to further reduce quantization overhead. Continuing our exploration, we propose UPA uniform and dislocation codebook schemes. Our investigation demonstrates that oversampling in the angular domain offers distinct advantages, achieving heightened accuracy while minimizing overhead in quantifying near-field channels. Numerical results demonstrate the appealing advantages of the proposed codebook over existing methods in decreasing quantization overhead and increasing quantization accuracy.

Submitted 24 August, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

arXiv:2305.16105 [pdf, ps, other]

Joint Uplink and Downlink Resource Allocation Towards Energy-efficient Transmission for URLLC

Authors: Kang Li, Pengcheng Zhu, Yan Wang, Fu-Chun Zheng, Xiaohu You

Abstract: Ultra-reliable and low-latency communications (URLLC) was first proposed in 5G networks and is expected to support applications with the most stringent quality-of-service (QoS). However, since the wireless channels vary dynamically, the transmit power for ensuring the QoS requirements of URLLC may be very high, which conflicts with the power limitation of a real system. To fulfill the successful URLLC transmission with finite transmit power, we propose an energy-efficient packet delivery mechanism incorporated with frequency-hopping and proactive dropping in this paper. To reduce uplink outage probability, frequency-hopping provides more chances for transmission so that the failure hardly occurs. To avoid downlink outage from queue clearing, proactive dropping controls overall reliability by introducing an extra error component. With the proposed packet delivery mechanism, we jointly optimize bandwidth allocation and power control of uplink and downlink, antenna configuration, and subchannel assignment to minimize the average total power under the constraint of URLLC transmission requirements. Via theoretical analysis (e.g., the convexity with respect to bandwidth, the independence of bandwidth allocation, the convexity of antenna configuration with inactive constraints), the simplification of finding the global optimal solution for resource allocation is addressed. A three-step method is then proposed to find the optimal solution for resource allocation. Simulation results validate the analysis and show the performance gain by optimizing resource allocation with the proposed packet delivery mechanism.

Submitted 25 May, 2023; originally announced May 2023.

Comments: 16 pages, 11 figures

arXiv:2303.12930 [pdf, other]

Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline

Authors: Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, Feng Zheng

Abstract: Existing audio-visual event localization (AVE) handles manually trimmed videos with only a single instance in each of them. However, this setting is unrealistic as natural videos often contain numerous audio-visual events with different categories. To better adapt to real-life applications, in this paper we focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. The problem is challenging as it requires fine-grained audio-visual scene and context understanding. To tackle this problem, we introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains 10K untrimmed videos with over 30K audio-visual events. Each video has 2.8 audio-visual events on average, and the events are usually related to each other and might co-occur as in real-life scenes. Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events with various lengths and capture dependencies between them in a single pass. Extensive experiments demonstrate the effectiveness of our method as well as the significance of multi-scale cross-modal perception and dependency modeling for this task.

Submitted 24 March, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

Comments: Accepted by CVPR2023

arXiv:2211.07143 [pdf]

WSC-Trans: A 3D network model for automatic multi-structural segmentation of temporal bone CT

Authors: Xin Hua, Zhijiang Du, Hongjian Yu, Jixin Ma, Fanjun Zheng, Cheng Zhang, Qiaohui Lu, Hui Zhao

Abstract: Cochlear implantation is currently the most effective treatment for patients with severe deafness, but mastering cochlear implantation is extremely challenging because the temporal bone has extremely complex and small three-dimensional anatomical structures, and it is important to avoid damaging the corresponding structures when performing surgery. The spatial location of the relevant anatomical tissues within the target area needs to be determined using CT prior to the procedure. Considering that the target structures are too small and complex, the time required for manual segmentation is too long, and it is extremely challenging to segment the temporal bone and its nearby anatomical structures quickly and accurately. To overcome this difficulty, we propose a deep learning-based algorithm, a 3D network model for automatic segmentation of multi-structural targets in temporal bone CT that can automatically segment the cochlea, facial nerve, auditory tubercle, vestibule and semicircular canal. The algorithm combines CNN and Transformer for feature extraction and takes advantage of spatial attention and channel attention mechanisms to further improve the segmentation effect. Experimental comparisons with various existing segmentation algorithms show that the Dice similarity scores and Jaccard coefficients of all target anatomical structures are significantly higher while HD95 and ASSD scores are lower, effectively demonstrating that our method outperforms other advanced methods.

Submitted 14 November, 2022; originally announced November 2022.

Comments: 10 pages,7 figures

arXiv:2207.09647 [pdf, other]

Deep Learning Based Automatic Modulation Recognition: Models, Datasets, and Challenges

Authors: Fuxin Zhang, Chunbo Luo, Jialang Xu, Yang Luo, FuChun Zheng

Abstract: Automatic modulation recognition (AMR) detects the modulation scheme of the received signals for further signal processing without needing prior information, and provides the essential function when such information is missing. Recent breakthroughs in deep learning (DL) have laid the foundation for developing high-performance DL-AMR approaches for communications systems. Comparing with traditional modulation detection methods, DL-AMR approaches have achieved promising performance including high recognition accuracy and low false alarms due to the strong feature extraction and classification abilities of deep neural networks. Despite the promising potential, DL-AMR approaches also bring concerns to complexity and explainability, which affect the practical deployment in wireless communications systems. This paper aims to present a review of the current DL-AMR research, with a focus on appropriate DL models and benchmark datasets. We further provide comprehensive experiments to compare the state of the art models for single-input-single-output (SISO) systems from both accuracy and complexity perspectives, and propose to apply DL-AMR in the new multiple-input-multiple-output (MIMO) scenario with precoding. Finally, existing challenges and possible future research directions are discussed.

Submitted 20 July, 2022; originally announced July 2022.

arXiv:2207.06918 [pdf, ps, other]

Interference-Limited Ultra-Reliable and Low-Latency Communications: Graph Neural Networks or Stochastic Geometry?

Authors: Yuhong Liu, Changyang She, Yi Zhong, Wibowo Hardjawana, Fu-Chun Zheng, Branka Vucetic

Abstract: In this paper, we aim to improve the Quality-of-Service (QoS) of Ultra-Reliability and Low-Latency Communications (URLLC) in interference-limited wireless networks. To obtain time diversity within the channel coherence time, we first put forward a random repetition scheme that randomizes the interference power. Then, we optimize the number of reserved slots and the number of repetitions for each packet to minimize the QoS violation probability, defined as the percentage of users that cannot achieve URLLC. We build a cascaded Random Edge Graph Neural Network (REGNN) to represent the repetition scheme and develop a model-free unsupervised learning method to train it. We analyze the QoS violation probability using stochastic geometry in a symmetric scenario and apply a model-based Exhaustive Search (ES) method to find the optimal solution. Simulation results show that in the symmetric scenario, the QoS violation probabilities achieved by the model-free learning method and the model-based ES method are nearly the same. In more general scenarios, the cascaded REGNN generalizes very well in wireless networks with different scales, network topologies, cell densities, and frequency reuse factors. It outperforms the model-based ES method in the presence of the model mismatch.

Submitted 18 July, 2022; v1 submitted 11 July, 2022; originally announced July 2022.

Comments: Submitted to IEEE journal for possible publication

arXiv:2207.06057 [pdf, other]

Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion

Authors: Jian Ma, Zhedong Zheng, Hao Fei, Feng Zheng, Tat-seng Chua, Yi Yang

Abstract: Voice conversion is to generate a new speech with the source content and a target voice style. In this paper, we focus on one general setting, i.e., non-parallel many-to-many voice conversion, which is close to the real-world scenario. As the name implies, non-parallel many-to-many voice conversion does not require the paired source and reference speeches and can be applied to arbitrary voice transfer. In recent years, Generative Adversarial Networks (GANs) and other techniques such as Conditional Variational Autoencoders (CVAEs) have made considerable progress in this field. However, due to the sophistication of voice conversion, the style similarity of the converted speech is still unsatisfactory. Inspired by the inherent structure of mel-spectrogram, we propose a new voice conversion framework, i.e., Subband-based Generative Adversarial Network for Voice Conversion (SGAN-VC). SGAN-VC converts each subband content of the source speech separately by explicitly utilizing the spatial characteristics between different subbands. SGAN-VC contains one style encoder, one content encoder, and one decoder. In particular, the style encoder network is designed to learn style codes for different subbands of the target speaker. The content encoder network can capture the content information on the source speech. Finally, the decoder generates particular subband content. In addition, we propose a pitch-shift module to fine-tune the pitch of the source speaker, making the converted tone more accurate and explainable. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on VCTK Corpus and AISHELL3 datasets both qualitatively and quantitatively, whether on seen or unseen data. Furthermore, the content intelligibility of SGAN-VC on unseen data even exceeds that of StarGANv2-VC with ASR network assistance.

Submitted 27 July, 2022; v1 submitted 13 July, 2022; originally announced July 2022.

arXiv:2206.13741 [pdf, other]

Social-aware Cooperative Caching in Fog Radio Access Networks

Authors: Baotian Fan, Yanxiang Jiang, Fu-Chun Zheng, Mehdi Bennis, Xiaohu You

Abstract: In this paper, the cooperative caching problem in fog radio access networks (F-RANs) is investigated to jointly optimize the transmission delay and energy consumption. Exploiting the potential social relationships among fog access points (F-APs), we firstly propose a clustering scheme based on hedonic coalition game (HCG) to improve the potential cooperation gain. Then, considering that the optimization problem is non-deterministic polynomial hard (NP-hard), we further propose an improved firefly algorithm (FA) based cooperative caching scheme, which utilizes a mutation strategy based on local content popularity to avoid pre-mature convergence. Simulation results show that our proposed scheme can effectively reduce the content transmission delay and energy consumption in comparison with the baselines.

Submitted 27 June, 2022; originally announced June 2022.

Comments: 6 pages, 5 figures. This paper has been accepted by IEEE ICC 2022

arXiv:2206.11556 [pdf, other]

A Federated Reinforcement Learning Method with Quantization for Cooperative Edge Caching in Fog Radio Access Networks

Authors: Yanxiang Jiang, Min Zhang, Fu-Chun Zheng, Yan Chen, Mehdi Bennis, Xiaohu You

Abstract: In this paper, the cooperative edge caching problem is studied in fog radio access networks (F-RANs). Given the non-deterministic polynomial hard (NP-hard) property of the problem, a dueling deep Q network (Dueling DQN) based caching update algorithm is proposed to make an optimal caching decision by learning the dynamic network environment. In order to protect user data privacy and solve the problem of slow convergence of the single deep reinforcement learning (DRL) model training, we propose a federated reinforcement learning method with quantization (FRLQ) to implement cooperative training of models from multiple fog access points (F-APs) in F-RANs. To address the excessive consumption of communications resources caused by model transmission, we prune and quantize the shared DRL models to reduce the number of model transfer parameters. The communications interval is increased and the communications rounds are reduced by periodical model global aggregation. We analyze the global convergence and computational complexity of our policy. Simulation results verify that our policy has better performance in reducing user request delay and improving cache hit rate compared to benchmark schemes. The proposed policy is also shown to have faster training speed and higher communications efficiency with minimal loss of model accuracy.
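
The quantize-then-aggregate idea in the abstract above can be illustrated with a minimal sketch. This is not the FRLQ algorithm from the paper: it assumes plain uniform quantization and an unweighted average, and the names `quantize` and `aggregate` are hypothetical.

```python
def quantize(weights, num_bits=8):
    """Uniformly quantize a flat list of weights to 2**num_bits levels (assumed scheme)."""
    lo, hi = min(weights), max(weights)
    levels = (1 << num_bits) - 1
    if hi == lo:
        # All weights are identical, so there is nothing to quantize.
        return list(weights)
    step = (hi - lo) / levels
    # Snap each weight to the nearest point on the uniform grid [lo, hi].
    return [lo + round((w - lo) / step) * step for w in weights]


def aggregate(local_models):
    """Unweighted average of equally sized local weight vectors (assumed aggregation rule)."""
    n = len(local_models)
    return [sum(ws) / n for ws in zip(*local_models)]


# Each F-AP would quantize its local model before upload; the server then averages.
local_a = quantize([0.0, 0.2, 0.9, 1.0], num_bits=1)
local_b = quantize([0.0, 0.8, 0.1, 1.0], num_bits=1)
global_model = aggregate([local_a, local_b])
```

Fewer bits shrink the upload at the cost of coarser weights, which is the communication versus accuracy trade-off the abstract refers to.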

Submitted 23 June, 2022; originally announced June 2022.

Comments: 14 pages, 12 figures

arXiv:2203.10897 [pdf, other]

Unified Multivariate Gaussian Mixture for Efficient Neural Image Compression

Authors: Xiaosu Zhu, Jingkuan Song, Lianli Gao, Feng Zheng, Heng Tao Shen

Abstract: Modeling latent variables with priors and hyperpriors is an essential problem in variational image compression. Formally, trade-off between rate and distortion is handled well if priors and hyperpriors precisely describe latent variables. Current practices only adopt univariate priors and process each variable individually. However, we find inter-correlations and intra-correlations exist when observing latent variables in a vectorized perspective. These findings reveal visual redundancies to improve rate-distortion performance and parallel processing ability to speed up compression. This encourages us to propose a novel vectorized prior. Specifically, a multivariate Gaussian mixture is proposed with means and covariances to be estimated. Then, a novel probabilistic vector quantization is utilized to effectively approximate means, and remaining covariances are further induced to a unified mixture and solved by cascaded estimation without context models involved. Furthermore, codebooks involved in quantization are extended to multi-codebooks for complexity reduction, which formulates an efficient compression procedure. Extensive experiments on benchmark datasets against state-of-the-art indicate our model has better rate-distortion performance and an impressive $3.18\times$ compression speed up, giving us the ability to perform real-time, high-quality variational image compression in practice. Our source code is publicly available at https://github.com/xiaosu-zhu/McQuic.

Submitted 21 March, 2022; originally announced March 2022.

Comments: Accepted to CVPR 2022

Journal ref: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022

arXiv:2202.06997 [pdf, other]

Cross-Modality Neuroimage Synthesis: A Survey

Authors: Guoyang Xie, Yawen Huang, Jinbao Wang, Jiayi Lyu, Feng Zheng, Yefeng Zheng, Yaochu Jin

Abstract: Multi-modality imaging improves disease diagnosis and reveals distinct deviations in tissues with anatomical properties. The existence of completely aligned and paired multi-modality neuroimaging data has proved its effectiveness in brain research. However, collecting fully aligned and paired data is expensive or even impractical, since it faces many difficulties, including high cost, long acquisition time, image corruption, and privacy issues. An alternative solution is to explore unsupervised or weakly supervised learning methods to synthesize the absent neuroimaging data. In this paper, we provide a comprehensive review of cross-modality synthesis for neuroimages, from the perspectives of weakly supervised and unsupervised settings, loss functions, evaluation metrics, imaging modalities, datasets, and downstream applications based on synthesis. We begin by highlighting several opening challenges for cross-modality neuroimage synthesis. Then, we discuss representative architectures of cross-modality synthesis methods under different supervisions. This is followed by a stepwise in-depth analysis to evaluate how cross-modality neuroimage synthesis improves the performance of its downstream tasks. Finally, we summarize the existing research findings and point out future research directions. All resources are available at https://github.com/M-3LAB/awesome-multimodal-brain-image-systhesis

Submitted 21 September, 2023; v1 submitted 14 February, 2022; originally announced February 2022.

arXiv:2201.12589 [pdf, other]

FedMed-ATL: Misaligned Unpaired Brain Image Synthesis via Affine Transform Loss

Authors: Jinbao Wang, Guoyang Xie, Yawen Huang, Yefeng Zheng, Yaochu Jin, Feng Zheng

Abstract: The existence of completely aligned and paired multi-modal neuroimaging data has proved its effectiveness in the diagnosis of brain diseases. However, collecting the full set of well-aligned and paired data is impractical, since the practical difficulties may include high cost, long acquisition time, image corruption, and privacy issues. Previously, misaligned unpaired neuroimaging data (termed MUD) were generally treated as noisy labels. However, such noisy-label-based methods fail to perform well when the misaligned data are severely distorted, for example when the angle of rotation differs. In this paper, we propose a novel federated self-supervised learning approach (FedMed) for brain image synthesis. An affine transform loss (ATL) is formulated to make use of severely distorted images without violating privacy legislation for the hospital. We then introduce a new data augmentation procedure for self-supervised training and feed it into three auxiliary heads, namely auxiliary rotation, auxiliary translation and auxiliary scaling heads. The proposed method demonstrates advanced performance in the quality of synthesized results under a severely misaligned and unpaired data setting, and better stability than other GAN-based algorithms. The proposed method also reduces the demand for deformable registration while encouraging the use of misaligned and unpaired data. Experimental results verify the outstanding performance of our learning paradigm compared to other state-of-the-art approaches.

Submitted 16 July, 2022; v1 submitted 29 January, 2022; originally announced January 2022.

Comments: arXiv admin note: text overlap with arXiv:2201.08953

arXiv:2111.12324 [pdf, other]

How Speech is Recognized to Be Emotional - A Study Based on Information Decomposition

Authors: Haoran Sun, Lantian Li, Thomas Fang Zheng, Dong Wang

Abstract: The way that humans encode their emotion into speech signals is complex. For instance, an angry man may increase his pitch and speaking rate, and use impolite words. In this paper, we present a preliminary study on various emotional factors and investigate how each of them impacts modern emotion recognition systems. The key tool of our study is the SpeechFlow model presented recently, by which we are able to decompose speech signals into separate information factors (content, pitch, rhythm). Based on this decomposition, we carefully studied the performance of each information component and their combinations. We conducted the study on three different speech emotion corpora and chose an attention-based convolutional RNN as the emotion classifier. Our results show that rhythm is the most important component for emotional expression. Moreover, the cross-corpus results are very bad (even worse than guess), demonstrating that the present speech emotion recognition model is rather weak. Interestingly, by removing one or several unimportant components, the cross-corpus results can be improved. This demonstrates the potential of the decomposition approach towards a generalizable emotion recognition.

Submitted 24 November, 2021; originally announced November 2021.

arXiv:2110.05087 [pdf]

A Multi-Resolution Front-End for End-to-End Speech Anti-Spoofing

Authors: Wei Liu, Meng Sun, Xiongwei Zhang, Hugo Van hamme, Thomas Fang Zheng

Abstract: The choice of an optimal time-frequency resolution is usually a difficult but important step in tasks involving speech signal classification, e.g., speech anti-spoofing. The variations of the performance with different choices of time-frequency resolutions can be as large as those with different model architectures, which makes it difficult to judge what the improvement actually comes from when a new network architecture is invented and introduced as the classifier. In this paper, we propose a multi-resolution front-end for feature extraction in an end-to-end classification framework. Optimal weighted combinations of multiple time-frequency resolutions will be learned automatically given the objective of a classification task. Features extracted with different time-frequency resolutions are weighted and concatenated as inputs to the successive networks, where the weights are predicted by a learnable neural network inspired by the weighting block in squeeze-and-excitation networks (SENet). Furthermore, the refinement of the chosen time-frequency resolutions is investigated by pruning the ones with relatively low importance, which reduces the complexity and size of the model. The proposed method is evaluated on the tasks of speech anti-spoofing in ASVSpoof 2019 and its superiority has been justified by comparing with similar baselines.
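
To make the "weighted and concatenated" step in the abstract above concrete, here is a rough sketch of squeeze-and-excitation style gating over per-resolution features. It is a simplification rather than the paper's network: a single linear layer with a sigmoid stands in for the learnable weighting block, and `se_weighted_concat`, `gate_w` and `gate_b` are hypothetical names.

```python
import math


def se_weighted_concat(features, gate_w, gate_b):
    """Weight each resolution's feature vector by a learned gate, then concatenate.

    features: list of R feature vectors, one per time-frequency resolution.
    gate_w:   R x R matrix of (assumed) learned gate weights.
    gate_b:   R (assumed) learned gate biases.
    """
    # Squeeze: summarise each resolution by its mean activation.
    squeezed = [sum(f) / len(f) for f in features]
    # Excite: linear layer plus sigmoid gives one weight in (0, 1) per resolution.
    weights = []
    for row, b in zip(gate_w, gate_b):
        z = sum(w * s for w, s in zip(row, squeezed)) + b
        weights.append(1.0 / (1.0 + math.exp(-z)))
    # Scale each resolution's features by its weight and concatenate.
    fused = [w * x for w, f in zip(weights, features) for x in f]
    return weights, fused


# Two resolutions with untrained (all-zero) gate parameters: both get weight 0.5.
weights, fused = se_weighted_concat(
    [[1.0, 3.0], [2.0, 2.0]], gate_w=[[0.0, 0.0], [0.0, 0.0]], gate_b=[0.0, 0.0]
)
```

Under this sketch, the pruning mentioned in the abstract would correspond to dropping resolutions whose learned weight stays low.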

Submitted 11 October, 2021; originally announced October 2021.

Comments: submitted to ICASSP 2022

arXiv:2105.09022 [pdf, other]

doi 10.1109/ICASSP39728.2021.9413467

Attack on practical speaker verification system using universal adversarial perturbations

Authors: Weiyi Zhang, Shuning Zhao, Le Liu, Jianmin Li, Xingliang Cheng, Thomas Fang Zheng, Xiaolin Hu

Abstract: In authentication scenarios, applications of practical speaker verification systems usually require a person to read a dynamic authentication text. Previous studies played an audio adversarial example as a digital signal to perform physical attacks, which would be easily rejected by audio replay detection modules. This work shows that by playing our crafted adversarial perturbation as a separate source when the adversary is speaking, the practical speaker verification system will misjudge the adversary as a target speaker. A two-step algorithm is proposed to optimize the universal adversarial perturbation to be text-independent and has little effect on the authentication text recognition. We also estimated room impulse response (RIR) in the algorithm which allowed the perturbation to be effective after being played over the air. In the physical experiment, we achieved targeted attacks with success rate of 100%, while the word error rate (WER) on speech recognition was only increased by 3.55%. And recorded audios could pass replay detection for the live person speaking.

Submitted 19 May, 2021; originally announced May 2021.

Comments: 6 pages, 2 figures

arXiv:2103.11587 [pdf, other]

Brain Image Synthesis with Unsupervised Multivariate Canonical CSC$\ell_4$Net

Authors: Yawen Huang, Feng Zheng, Danyang Wang, Weilin Huang, Matthew R. Scott, Ling Shao

Abstract: Recent advances in neuroscience have highlighted the effectiveness of multi-modal medical data for investigating certain pathologies and understanding human cognition. However, obtaining full sets of different modalities is limited by various factors, such as long acquisition times, high examination costs and artifact suppression. In addition, the complexity, high dimensionality and heterogeneity of neuroimaging data remain another key challenge in leveraging existing randomized scans effectively, as data of the same modality is often measured differently by different machines. There is a clear need to go beyond the traditional imaging-dependent process and synthesize anatomically specific target-modality data from a source input. In this paper, we propose to learn dedicated features that cross both inter- and intra-modal variations using a novel CSC$\ell_4$Net. Through an initial unification of intra-modal data in the feature maps and multivariate canonical adaptation, CSC$\ell_4$Net facilitates feature-level mutual transformation. The positive definite Riemannian manifold-penalized data fidelity term further enables CSC$\ell_4$Net to reconstruct missing measurements according to transformed features. Finally, the maximization $\ell_4$-norm boils down to a computationally efficient optimization problem. Extensive experiments validate the ability and robustness of our CSC$\ell_4$Net compared to the state-of-the-art methods on multiple datasets.

Submitted 22 March, 2021; originally announced March 2021.

Comments: 10 pages, 5 figures, CVPR 2021 oral

arXiv:2012.12468 [pdf, other]

CN-Celeb: multi-genre speaker recognition

Authors: Lantian Li, Ruiqi Liu, Jiawen Kang, Yue Fan, Hao Cui, Yunqi Cai, Ravichander Vipperla, Thomas Fang Zheng, Dong Wang

Abstract: Research on speaker recognition is extending to address the vulnerability in the wild conditions, among which genre mismatch is perhaps the most challenging, for instance, enrollment with reading speech while testing with conversational or singing audio. This mismatch leads to complex and composite inter-session variations, both intrinsic (i.e., speaking style, physiological status) and extrinsic (i.e., recording device, background noise). Unfortunately, the few existing multi-genre corpora are not only limited in size but are also recorded under controlled conditions, which cannot support conclusive research on the multi-genre problem. In this work, we firstly publish CN-Celeb, a large-scale multi-genre corpus that includes in-the-wild speech utterances of 3,000 speakers in 11 different genres. Secondly, using this dataset, we conduct a comprehensive study on the multi-genre phenomenon, in particular the impact of the multi-genre challenge on speaker recognition and the performance gain when the new dataset is used to conduct multi-genre training.

Submitted 24 November, 2021; v1 submitted 22 December, 2020; originally announced December 2020.

Comments: submitted to Speech Communication

arXiv:2010.14243 [pdf, ps, other]

Squeezing value of cross-domain labels: a decoupled scoring approach for speaker verification

Authors: Lantian Li, Yang Zhang, Jiawen Kang, Thomas Fang Zheng, Dong Wang

Abstract: Domain mismatch often occurs in real applications and causes serious performance reduction on speaker verification systems. The common wisdom is to collect cross-domain data and train a multi-domain PLDA model, with the hope to learn a domain-independent speaker subspace. In this paper, we firstly present an empirical study to show that simply adding cross-domain data does not help performance in conditions with enrollment-test mismatch. Careful analysis shows that this striking result is caused by the incoherent statistics between the enrollment and test conditions. Based on this analysis, we present a decoupled scoring approach that can maximally squeeze the value of cross-domain labels and obtain optimal verification scores when the enrollment and test are mismatched. When the statistics are coherent, the new formulation falls back to the conventional PLDA. Experimental results on cross-channel test show that the proposed approach is highly effective and is a principled solution to domain mismatch.

Submitted 27 October, 2020; originally announced October 2020.

Comments: Submitted to ICASSP 2021

arXiv:2010.14242 [pdf, other]

Deep generative factorization for speech signal

Authors: Haoran Sun, Lantian Li, Yunqi Cai, Yang Zhang, Thomas Fang Zheng, Dong Wang

Abstract: Various information factors are blended in speech signals, which forms the primary difficulty for most speech information processing tasks. An intuitive idea is to factorize speech signal into individual information factors (e.g., phonetic content and speaker trait), though it turns out to be highly challenging. This paper presents a speech factorization approach based on a novel factorial discriminative normalization flow model (factorial DNF). Experiments conducted on a two-factor case that involves phonetic content and speaker trait demonstrate that the proposed factorial DNF has powerful capability to factorize speech signals and outperforms several comparative models in terms of information representation and manipulation.

Submitted 27 October, 2020; originally announced October 2020.

Comments: Submitted to ICASSP 2021

arXiv:2009.06863 [pdf]

When Automatic Voice Disguise Meets Automatic Speaker Verification

Authors: Linlin Zheng, Jiakang Li, Meng Sun, Xiongwei Zhang, Thomas Fang Zheng

Abstract: The technique of transforming voices in order to hide the real identity of a speaker is called voice disguise, among which automatic voice disguise (AVD) by modifying the spectral and temporal characteristics of voices with miscellaneous algorithms is easily conducted with software accessible to the public. AVD has posed great threat to both human listening and automatic speaker verification (ASV). In this paper, we have found that ASV is not only a victim of AVD but could be a tool to beat some simple types of AVD. Firstly, three types of AVD, pitch scaling, vocal tract length normalization (VTLN) and voice conversion (VC), are introduced as representative methods. State-of-the-art ASV methods are subsequently utilized to objectively evaluate the impact of AVD on ASV by equal error rates (EER). Moreover, an approach to restore disguised voice to its original version is proposed by minimizing a function of ASV scores w.r.t. restoration parameters. Experiments are then conducted on disguised voices from Voxceleb, a dataset recorded in real-world noisy scenario. The results have shown that, for the voice disguise by pitch scaling, the proposed approach obtains an EER around 7% compared to the 30% EER of a recently proposed baseline using the ratio of fundamental frequencies. The proposed approach generalizes well to restore the disguise with nonlinear frequency warping in VTLN by reducing its EER from 34.3% to 18.5%. However, it is difficult to restore the source speakers in VC by our approach, where more complex forms of restoration functions or other paralinguistic cues might be necessary to restore the nonlinear transform in VC. Finally, contrastive visualization on ASV features with and without restoration illustrates the role of the proposed approach in an intuitive way.
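
The abstract above reports its results as equal error rates. As a reminder of what that metric measures, a minimal sketch follows; it is not the paper's evaluation code. It assumes higher scores mean "same speaker" and approximates the EER by trying every observed score as a threshold. `equal_error_rate` is a hypothetical helper name.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER: the operating point where false acceptance equals false rejection."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection rate: genuine trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: impostor trials scoring at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trial list: one genuine score (0.4) falls below one impostor score (0.6).
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

With finite trial lists the two error curves rarely cross exactly, so this sketch reports the midpoint at the closest threshold; evaluation toolkits typically interpolate instead.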

Submitted 15 September, 2020; originally announced September 2020.

Comments: accepted for publication

Journal ref: IEEE Transactions on Information Forensics and Security, 2020

arXiv:2005.11905 [pdf, other]

Neural Discriminant Analysis for Deep Speaker Embedding

Authors: Lantian Li, Dong Wang, Thomas Fang Zheng

Abstract: Probabilistic Linear Discriminant Analysis (PLDA) is a popular tool in open-set classification/verification tasks. However, the Gaussian assumption underlying PLDA prevents it from being applied to situations where the data is clearly non-Gaussian. In this paper, we present a novel nonlinear version of PLDA named as Neural Discriminant Analysis (NDA). This model employs an invertible deep neural network to transform a complex distribution to a simple Gaussian, so that the linear Gaussian model can be readily established in the transformed space. We tested this NDA model on a speaker recognition task where the deep speaker vectors (x-vectors) are presumably non-Gaussian. Experimental results on two datasets demonstrate that NDA consistently outperforms PLDA, by handling the non-Gaussian distributions of the x-vectors.

Submitted 24 May, 2020; originally announced May 2020.

Comments: submitted to INTERSPEECH 2020

arXiv:2005.11902 [pdf, other]

ASR-Free Pronunciation Assessment

Authors: Sitong Cheng, Zhixin Liu, Lantian Li, Zhiyuan Tang, Dong Wang, Thomas Fang Zheng

Abstract: Most of the pronunciation assessment methods are based on local features derived from automatic speech recognition (ASR), e.g., the Goodness of Pronunciation (GOP) score. In this paper, we investigate an ASR-free scoring approach that is derived from the marginal distribution of raw speech signals. The hypothesis is that even if we have no knowledge of the language (so cannot recognize the phones/words), we can still tell how good a pronunciation is, by comparatively listening to some speech data from the target language. Our analysis shows that this new scoring approach provides an interesting correction for the phone-competition problem of GOP. Experimental results on the ERJ dataset demonstrated that combining the ASR-free score and GOP can achieve better performance than the GOP baseline.

Submitted 24 May, 2020; originally announced May 2020.

Comments: submitted to INTERSPEECH 2020

arXiv:2005.11900 [pdf, other]

Domain-Invariant Speaker Vector Projection by Model-Agnostic Meta-Learning

Authors: Jiawen Kang, Ruiqi Liu, Lantian Li, Yunqi Cai, Dong Wang, Thomas Fang Zheng

Abstract: Domain generalization remains a critical problem for speaker recognition, even with the state-of-the-art architectures based on deep neural nets. For example, a model trained on reading speech may largely fail when applied to scenarios of singing or movie. In this paper, we propose a domain-invariant projection to improve the generalizability of speaker vectors. This projection is a simple neural net and is trained following the Model-Agnostic Meta-Learning (MAML) principle, for which the objective is to classify speakers in one domain if it had been updated with speech data in another domain. We tested the proposed method on CNCeleb, a new dataset consisting of single-speaker multi-condition (SSMC) data. The results demonstrated that the MAML-based domain-invariant projection can produce more generalizable speaker vectors, and effectively improve the performance in unseen domains.

Submitted 24 May, 2020; originally announced May 2020.

Comments: submitted to INTERSPEECH 2020

arXiv:2005.02627 [pdf, other]

Joint Optimal Software Caching, Computation Offloading and Communications Resource Allocation for Mobile Edge Computing

Authors: Wanli Wen, Ying Cui, Tony Q. S. Quek, Fu-Chun Zheng, Shi Jin

Abstract: As software may be used by multiple users, caching popular software at the wireless edge has been considered to save computation and communications resources for mobile edge computing (MEC). However, fetching uncached software from the core network and multicasting popular software to users have so far been ignored. Thus, existing design is incomplete and less practical. In this paper, we propose a joint caching, computation and communications mechanism which involves software fetching, caching and multicasting, as well as task input data uploading, task executing (with non-negligible time duration) and computation result downloading, and mathematically characterize it. Then, we optimize the joint caching, offloading and time allocation policy to minimize the weighted sum energy consumption subject to the caching and deadline constraints. The problem is a challenging two-timescale mixed integer nonlinear programming (MINLP) problem, and is NP-hard in general. We convert it into an equivalent convex MINLP problem by using some appropriate transformations and propose two low-complexity algorithms to obtain suboptimal solutions of the original non-convex MINLP problem. Specifically, the first suboptimal solution is obtained by solving a relaxed convex problem using the consensus alternating direction method of multipliers (ADMM), and then rounding its optimal solution properly. The second suboptimal solution is proposed by obtaining a stationary point of an equivalent difference of convex (DC) problem using the penalty convex-concave procedure (Penalty-CCP) and ADMM. Finally, by numerical results, we show that the proposed solutions outperform existing schemes and reveal their advantages in efficiently utilizing storage, computation and communications resources.

Submitted 6 May, 2020; originally announced May 2020.

Comments: To appear in IEEE Trans. Veh. Technol., 2020

arXiv:2002.00136 [pdf, ps, other]

A Novel Massive MIMO Beam Domain Channel Model

Authors: Fan Lai, Cheng-Xiang Wang, Jie Huang, Xiqi Gao, Fu-Chun Zheng

Abstract: A novel beam domain channel model (BDCM) for massive multiple-input multiple-output (MIMO) communication systems has been proposed in this paper. The near-field effect and spherical wavefront are firstly assumed in the proposed model, which is different from the conventional BDCM for MIMO based on the far-field effect and plane wavefront assumption. The proposed novel BDCM is the transformation of an existing geometry-based stochastic model (GBSM) from the antenna domain into beam domain. The space-time non-stationarity is also modeled in the novel BDCM. Moreover, the comparison of computational complexity for both models is studied. Based on the numerical analysis, comparison of cluster-level statistical properties between the proposed BDCM and existing GBSM has shown that there exists little difference in the space, time, and frequency correlation properties for two models. Also, based on the simulation, coherence bandwidths of the two models in different scenarios are almost the same. The computational complexity of the novel BDCM is much lower than the existing GBSM. It can be observed that the proposed novel BDCM has similar statistical properties to the existing GBSM at the cluster level. The proposed BDCM has less complexity and is therefore more convenient for information theory and signal processing research than the conventional GBSMs.

Submitted 31 January, 2020; originally announced February 2020.

arXiv:1912.01300 [pdf, other]

Viewpoint-Aware Loss with Angular Regularization for Person Re-Identification

Authors: Zhihui Zhu, Xinyang Jiang, Feng Zheng, Xiaowei Guo, Feiyue Huang, Weishi Zheng, Xing Sun

Abstract: Although great progress in supervised person re-identification (Re-ID) has been made recently, due to the viewpoint variation of a person, Re-ID remains a massive visual challenge. Most existing viewpoint-based person Re-ID methods project images from each viewpoint into separated and unrelated sub-feature spaces. They only model the identity-level distribution inside an individual viewpoint but ignore the underlying relationship between different viewpoints. To address this problem, we propose a novel approach, called Viewpoint-Aware Loss with Angular Regularization (VA-reID). Instead of one subspace for each viewpoint, our method projects the features from different viewpoints into a unified hypersphere and effectively models the feature distribution at both the identity level and the viewpoint level. In addition, rather than modeling different viewpoints as hard labels used for conventional viewpoint classification, we introduce viewpoint-aware adaptive label smoothing regularization (VALSR), which assigns an adaptive soft label to the feature representation. VALSR can effectively resolve the ambiguity of the viewpoint cluster label assignment. Extensive experiments on the Market1501 and DukeMTMC-reID datasets demonstrate that our method outperforms the state-of-the-art supervised Re-ID methods.

Submitted 3 December, 2019; originally announced December 2019.

arXiv:1911.12512 [pdf, other]

Rethinking Temporal Fusion for Video-based Person Re-identification on Semantic and Time Aspect

Authors: Xinyang Jiang, Yifei Gong, Xiaowei Guo, Qize Yang, Feiyue Huang, Weishi Zheng, Feng Zheng, Xing Sun

Abstract: Recently, research interest in person re-identification (ReID) has gradually turned to video-based methods, which acquire a person representation by aggregating frame features of an entire video. However, existing video-based ReID methods do not consider the semantic difference brought by the outputs of different network stages, which potentially compromises the information richness of the person features. Furthermore, traditional methods ignore important relationships among frames, which causes information redundancy in fusion along the time axis. To address these issues, we propose a novel general temporal fusion framework to aggregate frame features on both the semantic aspect and the time aspect. For the semantic aspect, a multi-stage fusion network is explored to fuse richer frame features at multiple semantic levels, which can effectively reduce the information loss caused by the traditional single-stage fusion. For the time aspect, the existing intra-frame attention method is improved by adding a novel inter-frame attention module, which effectively reduces the information redundancy in temporal fusion by taking the relationship among frames into consideration. The experimental results show that our approach can effectively improve the video-based re-identification accuracy, achieving state-of-the-art performance.

Submitted 27 November, 2019; originally announced November 2019.

arXiv:1806.02692 [pdf, other]

doi 10.1016/j.trb.2018.07.004

Traffic state estimation using stochastic Lagrangian dynamics

Authors: Fangfang Zheng, Saif Eddin Jabari, Henry X. Liu, DianChao Lin

Abstract: This paper proposes a new stochastic model of traffic dynamics in Lagrangian coordinates. The source of uncertainty is heterogeneity in driving behavior, captured using driver-specific speed-spacing relations, i.e., parametric uncertainty. It also results in smooth vehicle trajectories in a stochastic context, which is in agreement with real-world traffic dynamics, thereby overcoming issues with aggressive oscillation typically observed in sample paths of stochastic traffic flow models. We utilize ensemble filtering techniques for data assimilation (traffic state estimation), but derive the mean and covariance dynamics as the ensemble sizes go to infinity, thereby bypassing the need to sample from the parameter distributions while estimating the traffic states. As a result, the estimation algorithm is just a standard Kalman-Bucy algorithm, which renders the proposed approach amenable to real-time applications using recursive data. Data assimilation examples are performed and our results indicate good agreement with out-of-sample data.

Submitted 31 May, 2018; originally announced June 2018.

Journal ref: Transportation Research Part B: Methodological Volume 115, September 2018, Pages 143-165

arXiv:1803.00886 [pdf, other]

Deep factorization for speech signal

Authors: Lantian Li, Dong Wang, Yixiang Chen, Ying Shi, Zhiyuan Tang, Thomas Fang Zheng

Abstract: Various informative factors are mixed in speech signals, leading to great difficulty when decoding any of the factors. An intuitive idea is to factorize each speech frame into individual informative factors, though it turns out to be highly difficult. Recently, we found that speaker traits, which were assumed to be long-term distributional properties, are actually short-time patterns, and can be learned by a carefully designed deep neural network (DNN). This discovery motivated a cascade deep factorization (CDF) framework that will be presented in this paper. The proposed framework infers speech factors in a sequential way, where factors previously inferred are used as conditional variables when inferring other factors. We will show that this approach can effectively factorize speech signals, and using these factors, the original speech spectrum can be recovered with high accuracy. This factorization and reconstruction approach provides potential value for many speech processing tasks, e.g., speaker recognition and emotion recognition, as will be demonstrated in the paper.

Submitted 27 February, 2018; originally announced March 2018.

Comments: Accepted by ICASSP 2018. arXiv admin note: substantial text overlap with arXiv:1706.01777

arXiv:1711.00366 [pdf, other]

Full-info Training for Deep Speaker Feature Learning

Authors: Lantian Li, Zhiyuan Tang, Dong Wang, Thomas Fang Zheng

Abstract: In recent studies, it has been shown that speaker patterns can be learned from very short speech segments (e.g., 0.3 seconds) by a carefully designed convolutional & time-delay deep neural network (CT-DNN) model. By enforcing the model to discriminate the speakers in the training data, frame-level speaker features can be derived from the last hidden layer. In spite of its good performance, a potential problem of the present model is that it involves a parametric classifier, i.e., the last affine layer, which may consume some discriminative knowledge, thus leading to 'information leak' for the feature learning. This paper presents a full-info training approach that discards the parametric classifier and enforces that all the discriminative knowledge is learned by the feature net. Our experiments on the Fisher database demonstrate that this new training scheme can produce more coherent features, leading to consistent and notable performance improvement on the speaker verification task.

Submitted 27 February, 2018; v1 submitted 31 October, 2017; originally announced November 2017.

Comments: Accepted by ICASSP 2018

Showing 1–38 of 38 results for author: Zheng, F