Search | arXiv e-print repository

One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model

Authors: Zhaoqing Li, Haoning Xu, Tianzi Wang, Shoukang Hu, Zengrui **, Shujie Hu, Jiajun Deng, Mingyu Cui, Mengzhe Geng, Xunying Liu

Abstract: We propose a novel one-pass multiple ASR systems joint compression and quantization approach using an all-in-one neural model. A single compression cycle allows multiple nested systems with varying Encoder depths, widths, and quantization precision settings to be simultaneously constructed without the need to train and store individual target systems separately. Experiments consistently demonstrat… ▽ More We propose a novel one-pass multiple ASR systems joint compression and quantization approach using an all-in-one neural model. A single compression cycle allows multiple nested systems with varying Encoder depths, widths, and quantization precision settings to be simultaneously constructed without the need to train and store individual target systems separately. Experiments consistently demonstrate the multiple ASR systems compressed in a single all-in-one model produced a word error rate (WER) comparable to, or lower by up to 1.01\% absolute (6.98\% relative) than individually trained systems of equal complexity. A 3.4x overall system compression and training time speed-up was achieved. Maximum model size compression ratios of 12.8x and 3.93x were obtained over the baseline Switchboard-300hr Conformer and LibriSpeech-100hr fine-tuned wav2vec2.0 models, respectively, incurring no statistically significant WER increase. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.10152 [pdf, other]

Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

Authors: Guinan Li, Jiajun Deng, Youjun Chen, Mengzhe Geng, Shujie Hu, Zhe Li, Zengrui **, Tianzi Wang, Xurong Xie, Helen Meng, Xunying Liu

Abstract: This paper proposes joint speaker feature learning methods for zero-shot adaptation of audio-visual multichannel speech separation and recognition systems. xVector and ECAPA-TDNN speaker encoders are connected using purpose-built fusion blocks and tightly integrated with the complete system training. Experiments conducted on LRS3-TED data simulated multichannel overlapped speech suggest that joint… ▽ More This paper proposes joint speaker feature learning methods for zero-shot adaptation of audio-visual multichannel speech separation and recognition systems. xVector and ECAPA-TDNN speaker encoders are connected using purpose-built fusion blocks and tightly integrated with the complete system training. Experiments conducted on LRS3-TED data simulated multichannel overlapped speech suggest that joint speaker feature learning consistently improves speech separation and recognition performance over the baselines without joint speaker feature estimation. Further analyses reveal performance improvements are strongly correlated with increased inter-speaker discrimination measured using cosine similarity. The best-performing joint speaker feature learning adapted system outperformed the baseline fine-tuned WavLM model by statistically significant WER reductions of 21.6% and 25.3% absolute (67.5% and 83.5% relative) on Dev and Test sets after incorporating WavLM features and video modality. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.10034 [pdf, other]

Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask

Authors: Tianzi Wang, Xurong Xie, Zhaoqing Li, Shoukang Hu, Zengrui **g, Jiajun Deng, Mingyu Cui, Shujie Hu, Mengzhe Geng, Guinan Li, Helen Meng, Xunying Liu

Abstract: This paper proposes a novel non-autoregressive (NAR) block-based Attention Mask Decoder (AMD) that flexibly balances performance-efficiency trade-offs for Conformer ASR systems. AMD performs parallel NAR inference within contiguous blocks of output labels that are concealed using attention masks, while conducting left-to-right AR prediction and history context amalgamation between blocks. A beam s… ▽ More This paper proposes a novel non-autoregressive (NAR) block-based Attention Mask Decoder (AMD) that flexibly balances performance-efficiency trade-offs for Conformer ASR systems. AMD performs parallel NAR inference within contiguous blocks of output labels that are concealed using attention masks, while conducting left-to-right AR prediction and history context amalgamation between blocks. A beam search algorithm is designed to leverage a dynamic fusion of CTC, AR Decoder, and AMD probabilities. Experiments on the LibriSpeech-100hr corpus suggest the tripartite Decoder incorporating the AMD module produces a maximum decoding speed-up ratio of 1.73x over the baseline CTC+AR decoding, while incurring no statistically significant word error rate (WER) increase on the test sets. When operating with the same decoding real time factors, statistically significant WER reductions of up to 0.7% and 0.3% absolute (5.3% and 6.1% relative) were obtained over the CTC+AR baseline. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: 5 pages, 2 figures, 2 tables, Interspeech24 conference

arXiv:2406.08920 [pdf, other]

AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis

Authors: Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu

Abstract: Novel view acoustic synthesis (NVAS) aims to render binaural audio at any target viewpoint, given a mono audio emitted by a sound source at a 3D scene. Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing binaural audio. However, in addition to low efficiency originating from heavy NeRF rendering, these methods all have a limited ability… ▽ More Novel view acoustic synthesis (NVAS) aims to render binaural audio at any target viewpoint, given a mono audio emitted by a sound source at a 3D scene. Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing binaural audio. However, in addition to low efficiency originating from heavy NeRF rendering, these methods all have a limited ability of characterizing the entire scene environment such as room geometry, material properties, and the spatial relation between the listener and sound source. To address these issues, we propose a novel Audio-Visual Gaussian Splatting (AV-GS) model. To obtain a material-aware and geometry-aware condition for audio synthesis, we learn an explicit point-based scene representation with an audio-guidance parameter on locally initialized Gaussian points, taking into account the space relation from the listener and sound source. To make the visual scene model audio adaptive, we propose a point densification and pruning strategy to optimally distribute the Gaussian points, with the per-point contribution in sound propagation (e.g., more points needed for texture-less wall surfaces as they affect sound path diversion). Extensive experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets. △ Less

Submitted 14 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

arXiv:2404.17903 [pdf, other]

3D Extended Object Tracking by Fusing Roadside Sparse Radar Point Clouds and Pixel Keypoints

Authors: Jiayin Deng, Zhiqun Hu, Yuxuan Xia, Zhaoming Lu, Xiangming Wen

Abstract: Roadside perception is a key component in intelligent transportation systems. In this paper, we present a novel three-dimensional (3D) extended object tracking (EOT) method, which simultaneously estimates the object kinematics and extent state, in roadside perception using both the radar and camera data. Because of the influence of sensor viewing angle and limited angle resolution, radar measureme… ▽ More Roadside perception is a key component in intelligent transportation systems. In this paper, we present a novel three-dimensional (3D) extended object tracking (EOT) method, which simultaneously estimates the object kinematics and extent state, in roadside perception using both the radar and camera data. Because of the influence of sensor viewing angle and limited angle resolution, radar measurements from objects are sparse and non-uniformly distributed, leading to inaccuracies in object extent and position estimation. To address this problem, we present a novel spherical Gaussian function weighted Gaussian mixture model. This model assumes that radar measurements originate from a series of probabilistic weighted radar reflectors on the vehicle's extent. Additionally, we utilize visual detection of vehicle keypoints to provide additional information on the positions of radar reflectors. Since keypoints may not always correspond to radar reflectors, we propose an elastic skeleton fusion mechanism, which constructs a virtual force to establish the relationship between the radar reflectors on the vehicle and its extent. Furthermore, to better describe the kinematic state of the vehicle and constrain its extent state, we develop a new 3D constant turn rate and velocity motion model, considering the complex 3D motion of the vehicle relative to the roadside sensor. Finally, we apply variational Bayesian approximation to the intractable measurement update step to enable recursive Bayesian estimation of the object's state. Simulation results using the Carla simulator and experimental results on the nuScenes dataset demonstrate the effectiveness and superiority of the proposed method in comparison to several state-of-the-art 3D EOT methods. △ Less

Submitted 27 April, 2024; originally announced April 2024.

arXiv:2404.14693 [pdf, other]

Double Privacy Guard: Robust Traceable Adversarial Watermarking against Face Recognition

Authors: Yunming Zhang, Dengpan Ye, Sipeng Shen, Caiyun Xie, Ziyi Liu, Jiacheng Deng, Long Tang

Abstract: The wide deployment of Face Recognition (FR) systems poses risks of privacy leakage. One countermeasure to address this issue is adversarial attacks, which deceive malicious FR searches but simultaneously interfere the normal identity verification of trusted authorizers. In this paper, we propose the first Double Privacy Guard (DPG) scheme based on traceable adversarial watermarking. DPG employs a… ▽ More The wide deployment of Face Recognition (FR) systems poses risks of privacy leakage. One countermeasure to address this issue is adversarial attacks, which deceive malicious FR searches but simultaneously interfere the normal identity verification of trusted authorizers. In this paper, we propose the first Double Privacy Guard (DPG) scheme based on traceable adversarial watermarking. DPG employs a one-time watermark embedding to deceive unauthorized FR models and allows authorizers to perform identity verification by extracting the watermark. Specifically, we propose an information-guided adversarial attack against FR models. The encoder embeds an identity-specific watermark into the deep feature space of the carrier, guiding recognizable features of the image to deviate from the source identity. We further adopt a collaborative meta-optimization strategy compatible with sub-tasks, which regularizes the joint optimization direction of the encoder and decoder. This strategy enhances the representation of universal carrier features, mitigating multi-objective optimization conflicts in watermarking. Experiments confirm that DPG achieves significant attack success rates and traceability accuracy on state-of-the-art FR models, exhibiting remarkable robustness that outperforms the existing privacy protection methods using adversarial attacks and deep watermarking, or simple combinations of the two. Our work potentially opens up new insights into proactive protection for FR privacy. △ Less

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.02663 [pdf]

Ground-to-UAV sub-Terahertz channel measurement and modeling

Authors: Da Li, Peian Li, Jiabiao Zhao, Jianjian Liang, Jiacheng Liu, Guohao Liu, Yuanshuai Lei, Wenbo Liu, Jianqin Deng, Fuyong Liu, Jianjun Ma

Abstract: Unmanned Aerial Vehicle (UAV) assisted terahertz (THz) wireless communications have been expected to play a vital role in the next generation of wireless networks. UAVs can serve as either repeaters or data collectors within the communication link, thereby potentially augmenting the efficacy of communication systems. Despite their promise, the channel analysis and modeling specific to THz wireless… ▽ More Unmanned Aerial Vehicle (UAV) assisted terahertz (THz) wireless communications have been expected to play a vital role in the next generation of wireless networks. UAVs can serve as either repeaters or data collectors within the communication link, thereby potentially augmenting the efficacy of communication systems. Despite their promise, the channel analysis and modeling specific to THz wireless channels leveraging UAVs remain under explored. This work delves into a ground-to-UAV channel at 140 GHz, with a specific focus on the influence of UAV hovering behavior on channel performance. Employing experimental measurements through an unmodulated channel setup and a geometry-based stochastic model (GBSM) that integrates three-dimensional positional coordinates and beamwidth, this work evaluates the impact of UAV dynamic movements and antenna orientation on channel performance. Our findings highlight the minimal impact of UAV orientation adjustments on channel performance and underscore the diminishing necessity for precise alignment between UAVs and ground stations as beamwidth increases. △ Less

Submitted 28 June, 2024; v1 submitted 3 April, 2024; originally announced April 2024.

Comments: Submitted to Optics Express

arXiv:2404.00549 [pdf]

Pneumonia App: a mobile application for efficient pediatric pneumonia diagnosis using explainable convolutional neural networks (CNN)

Authors: Jiaming Deng, Zhenglin Chen, Minjiang Chen, Lulu Xu, Jiaqi Yang, Zhendong Luo, Peiwu Qin

Abstract: Mycoplasma pneumoniae pneumonia (MPP) poses significant diagnostic challenges in pediatric healthcare, especially in regions like China where it's prevalent. We introduce PneumoniaAPP, a mobile application leveraging deep learning techniques for rapid MPP detection. Our approach capitalizes on convolutional neural networks (CNNs) trained on a comprehensive dataset comprising 3345 chest X-ray (CXR)… ▽ More Mycoplasma pneumoniae pneumonia (MPP) poses significant diagnostic challenges in pediatric healthcare, especially in regions like China where it's prevalent. We introduce PneumoniaAPP, a mobile application leveraging deep learning techniques for rapid MPP detection. Our approach capitalizes on convolutional neural networks (CNNs) trained on a comprehensive dataset comprising 3345 chest X-ray (CXR) images, which includes 833 CXR images revealing MPP and additionally augmented with samples from a public dataset. The CNN model achieved an accuracy of 88.20% and an AUROC of 0.9218 across all classes, with a specific accuracy of 97.64% for the mycoplasma class, as demonstrated on the testing dataset. Furthermore, we integrated explainability techniques into PneumoniaAPP to aid respiratory physicians in lung opacity localization. Our contribution extends beyond existing research by targeting pediatric MPP, emphasizing the age group of 0-12 years, and prioritizing deployment on mobile devices. This work signifies a significant advancement in pediatric pneumonia diagnosis, offering a reliable and accessible tool to alleviate diagnostic burdens in healthcare settings. △ Less

Submitted 30 March, 2024; originally announced April 2024.

Comments: 27 Pages,7 figures

MSC Class: 68 ACM Class: J.3

arXiv:2403.16643 [pdf, other]

Self-Adaptive Reality-Guided Diffusion for Artifact-Free Super-Resolution

Authors: Qing** Zheng, Ling Zheng, Yuanfan Guo, Ying Li, Songcen Xu, Jiankang Deng, Hang Xu

Abstract: Artifact-free super-resolution (SR) aims to translate low-resolution images into their high-resolution counterparts with a strict integrity of the original content, eliminating any distortions or synthetic details. While traditional diffusion-based SR techniques have demonstrated remarkable abilities to enhance image detail, they are prone to artifact introduction during iterative procedures. Such… ▽ More Artifact-free super-resolution (SR) aims to translate low-resolution images into their high-resolution counterparts with a strict integrity of the original content, eliminating any distortions or synthetic details. While traditional diffusion-based SR techniques have demonstrated remarkable abilities to enhance image detail, they are prone to artifact introduction during iterative procedures. Such artifacts, ranging from trivial noise to unauthentic textures, deviate from the true structure of the source image, thus challenging the integrity of the super-resolution process. In this work, we propose Self-Adaptive Reality-Guided Diffusion (SARGD), a training-free method that delves into the latent space to effectively identify and mitigate the propagation of artifacts. Our SARGD begins by using an artifact detector to identify implausible pixels, creating a binary mask that highlights artifacts. Following this, the Reality Guidance Refinement (RGR) process refines artifacts by integrating this mask with realistic latent representations, improving alignment with the original image. Nonetheless, initial realistic-latent representations from lower-quality images result in over-smoothing in the final output. To address this, we introduce a Self-Adaptive Guidance (SAG) mechanism. It dynamically computes a reality score, enhancing the sharpness of the realistic latent. These alternating mechanisms collectively achieve artifact-free super-resolution. Extensive experiments demonstrate the superiority of our method, delivering detailed artifact-free high-resolution images while reducing sampling steps by 2X. We release our code at https://github.com/ProAirVerse/Self-Adaptive-Guidance-Diffusion.git. △ Less

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.02307 [pdf, other]

Harnessing Intra-group Variations Via a Population-Level Context for Pathology Detection

Authors: P. Bilha Githinji, Xi Yuan, Zhenglin Chen, Ijaz Gul, Dingqi Shang, Wen Liang, Jianming Deng, Dan Zeng, Dongmei yu, Chenggang Yan, Peiwu Qin

Abstract: Realizing sufficient separability between the distributions of healthy and pathological samples is a critical obstacle for pathology detection convolutional models. Moreover, these models exhibit a bias for contrast-based images, with diminished performance on texture-based medical images. This study introduces the notion of a population-level context for pathology detection and employs a graph th… ▽ More Realizing sufficient separability between the distributions of healthy and pathological samples is a critical obstacle for pathology detection convolutional models. Moreover, these models exhibit a bias for contrast-based images, with diminished performance on texture-based medical images. This study introduces the notion of a population-level context for pathology detection and employs a graph theoretic approach to model and incorporate it into the latent code of an autoencoder via a refinement module we term PopuSense. PopuSense seeks to capture additional intra-group variations inherent in biomedical data that a local or global context of the convolutional model might miss or smooth out. Experiments on contrast-based and texture-based images, with minimal adaptation, encounter the existing preference for intensity-based input. Nevertheless, PopuSense demonstrates improved separability in contrast-based images, presenting an additional avenue for refining representations learned by a model. △ Less

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2312.08641 [pdf, other]

Towards Automatic Data Augmentation for Disordered Speech Recognition

Authors: Zengrui **, Xurong Xie, Tianzi Wang, Mengzhe Geng, Jiajun Deng, Guinan Li, Shujie Hu, Xunying Liu

Abstract: Automatic recognition of disordered speech remains a highly challenging task to date due to data scarcity. This paper presents a reinforcement learning (RL) based on-the-fly data augmentation approach for training state-of-the-art PyChain TDNN and end-to-end Conformer ASR systems on such data. The handcrafted temporal and spectral mask operations in the standard SpecAugment method that are task an… ▽ More Automatic recognition of disordered speech remains a highly challenging task to date due to data scarcity. This paper presents a reinforcement learning (RL) based on-the-fly data augmentation approach for training state-of-the-art PyChain TDNN and end-to-end Conformer ASR systems on such data. The handcrafted temporal and spectral mask operations in the standard SpecAugment method that are task and system dependent, together with additionally introduced minimum and maximum cut-offs of these time-frequency masks, are now automatically learned using an RNN-based policy controller and tightly integrated with ASR system training. Experiments on the UASpeech corpus suggest the proposed RL-based data augmentation approach consistently produced performance superior or comparable that obtained using expert or handcrafted SpecAugment policies. Our RL auto-augmented PyChain TDNN system produced an overall WER of 28.79% on the UASpeech test set of 16 dysarthric speakers. △ Less

Submitted 13 December, 2023; originally announced December 2023.

Comments: To appear at IEEE ICASSP 2024

arXiv:2308.07293 [pdf, other]

DiffSED: Sound Event Detection with Denoising Diffusion

Authors: Swapnil Bhosale, Sauradip Nag, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu

Abstract: Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels, given an unconstrained audio sample. Taking either the splitand-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods consider the SED problem from the discriminative learning perspective. In this work, we reformulate t… ▽ More Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels, given an unconstrained audio sample. Taking either the splitand-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods consider the SED problem from the discriminative learning perspective. In this work, we reformulate the SED problem by taking a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process, conditioned on a target audio sample. During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions in the elegant Transformer decoder framework. Doing so enables the model generate accurate event boundaries from even noisy queries during inference. Extensive experiments on the Urban-SED and EPIC-Sounds datasets demonstrate that our model significantly outperforms existing alternatives, with 40+% faster convergence in training. △ Less

Submitted 16 August, 2023; v1 submitted 14 August, 2023; originally announced August 2023.

arXiv:2307.02909 [pdf, other]

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Authors: Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui **, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu

Abstract: Accurate recognition of cocktail party speech containing overlap** speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is pro… ▽ More Accurate recognition of cocktail party speech containing overlap** speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is proposed in this paper. The efficacy of the video input is consistently demonstrated in mask-based MVDR speech separation, DNN-WPE or spectral map** (SpecM) based speech dereverberation front-end and Conformer ASR back-end. Audio-visual integrated front-end architectures performing speech separation and dereverberation in a pipelined or joint fashion via mask-based WPD are investigated. The error cost mismatch between the speech enhancement front-end and ASR back-end components is minimized by end-to-end jointly fine-tuning using either the ASR cost function alone, or its interpolation with the speech enhancement loss. Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel speech separation, dereverberation and recognition systems consistently outperformed the comparable audio-only baseline by 9.1% and 6.2% absolute (41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech enhancement improvements were also obtained on PESQ, STOI and SRMR scores. △ Less

Submitted 6 July, 2023; originally announced July 2023.

Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing

arXiv:2306.15265 [pdf, other]

Hyper-parameter Adaptation of Conformer ASR Systems for Elderly and Dysarthric Speech Recognition

Authors: Tianzi Wang, Shoukang Hu, Jiajun Deng, Zengrui **, Mengzhe Geng, Yi Wang, Helen Meng, Xunying Liu

Abstract: Automatic recognition of disordered and elderly speech remains highly challenging tasks to date due to data scarcity. Parameter fine-tuning is often used to exploit the large quantities of non-aged and healthy speech pre-trained models, while neural architecture hyper-parameters are set using expert knowledge and remain unchanged. This paper investigates hyper-parameter adaptation for Conformer AS… ▽ More Automatic recognition of disordered and elderly speech remains highly challenging tasks to date due to data scarcity. Parameter fine-tuning is often used to exploit the large quantities of non-aged and healthy speech pre-trained models, while neural architecture hyper-parameters are set using expert knowledge and remain unchanged. This paper investigates hyper-parameter adaptation for Conformer ASR systems that are pre-trained on the Librispeech corpus before being domain adapted to the DementiaBank elderly and UASpeech dysarthric speech datasets. Experimental results suggest that hyper-parameter adaptation produced word error rate (WER) reductions of 0.45% and 0.67% over parameter-only fine-tuning on DBank and UASpeech tasks respectively. An intuitive correlation is found between the performance improvements by hyper-parameter domain adaptation and the relative utterance length ratio between the source and target domain data. △ Less

Submitted 27 June, 2023; originally announced June 2023.

Comments: 5 pages, 3 figures, 3 tables, accepted by Interspeech2023

arXiv:2306.14608 [pdf, other]

Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems

Authors: Jiajun Deng, Guinan Li, Xurong Xie, Zengrui **, Mingyu Cui, Tianzi Wang, Shujie Hu, Mengzhe Geng, Xunying Liu

Abstract: Rich sources of variability in natural speech present significant challenges to current data intensive speech recognition technologies. To model both speaker and environment level diversity, this paper proposes a novel Bayesian factorised speaker-environment adaptive training and test time adaptation approach for Conformer ASR models. Speaker and environment level characteristics are separately mo… ▽ More Rich sources of variability in natural speech present significant challenges to current data intensive speech recognition technologies. To model both speaker and environment level diversity, this paper proposes a novel Bayesian factorised speaker-environment adaptive training and test time adaptation approach for Conformer ASR models. Speaker and environment level characteristics are separately modeled using compact hidden output transforms, which are then linearly or hierarchically combined to represent any speaker-environment combination. Bayesian learning is further utilized to model the adaptation parameter uncertainty. Experiments on the 300-hr WHAM noise corrupted Switchboard data suggest that factorised adaptation consistently outperforms the baseline and speaker label only adapted Conformers by up to 3.1% absolute (10.4% relative) word error rate reductions. Further analysis shows the proposed method offers potential for rapid adaption to unseen speaker-environment conditions. △ Less

Submitted 26 June, 2023; originally announced June 2023.

Comments: Accepted by INTERSPEECH 2023

arXiv:2306.13307 [pdf, other]

Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems

Authors: Mingyu Cui, Jiawen Kang, Jiajun Deng, Xi Yin, Yutao Xie, Xie Chen, Xunying Liu

Abstract: Current ASR systems are mainly trained and evaluated at the utterance level. Long range cross utterance context can be incorporated. A key task is to derive a suitable compact representation of the most relevant history contexts. In contrast to previous researches based on either LSTM-RNN encoded histories that attenuate the information from longer range contexts, or frame level concatenation of t… ▽ More Current ASR systems are mainly trained and evaluated at the utterance level. Long range cross utterance context can be incorporated. A key task is to derive a suitable compact representation of the most relevant history contexts. In contrast to previous researches based on either LSTM-RNN encoded histories that attenuate the information from longer range contexts, or frame level concatenation of transformer context embeddings, in this paper compact low-dimensional cross utterance contextual features are learned in the Conformer-Transducer Encoder using specially designed attention pooling layers that are applied over efficiently cached preceding utterances history vectors. Experiments on the 1000-hr Gigaspeech corpus demonstrate that the proposed contextualized streaming Conformer-Transducers outperform the baseline using utterance internal context only with statistically significant WER reductions of 0.7% to 0.5% absolute (4.3% to 3.1% relative) on the dev and test data. △ Less

Submitted 25 June, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

Comments: Accepted by INTERSPEECH 2023

arXiv:2305.10659 [pdf, other]

Use of Speech Impairment Severity for Dysarthric Speech Recognition

Authors: Mengzhe Geng, Zengrui **, Tianzi Wang, Shujie Hu, Jiajun Deng, Mingyu Cui, Guinan Li, Jianwei Yu, Xurong Xie, Xunying Liu

Abstract: A key challenge in dysarthric speech recognition is the speaker-level diversity attributed to both speaker-identity associated factors such as gender, and speech impairment severity. Most prior researches on addressing this issue focused on using speaker-identity only. To this end, this paper proposes a novel set of techniques to use both severity and speaker-identity in dysarthric speech recognit… ▽ More A key challenge in dysarthric speech recognition is the speaker-level diversity attributed to both speaker-identity associated factors such as gender, and speech impairment severity. Most prior researches on addressing this issue focused on using speaker-identity only. To this end, this paper proposes a novel set of techniques to use both severity and speaker-identity in dysarthric speech recognition: a) multitask training incorporating severity prediction error; b) speaker-severity aware auxiliary feature adaptation; and c) structured LHUC transforms separately conditioned on speaker-identity and severity. Experiments conducted on UASpeech suggest incorporating additional speech impairment severity into state-of-the-art hybrid DNN, E2E Conformer and pre-trained Wav2vec 2.0 ASR systems produced statistically significant WER reductions up to 4.78% (14.03% relative). Using the best system the lowest published WER of 17.82% (51.25% on very low intelligibility) was obtained on UASpeech. △ Less

Submitted 17 May, 2023; originally announced May 2023.

Comments: Accepted to INTERSPEECH2023

arXiv:2302.14564 [pdf, other]

Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

Authors: Shujie Hu, Xurong Xie, Zengrui **, Mengzhe Geng, Yi Wang, Mingyu Cui, Jiajun Deng, Xunying Liu, Helen Meng

Abstract: Automatic recognition of disordered and elderly speech remains a highly challenging task to date due to the difficulty in collecting such data in large quantities. This paper explores a series of approaches to integrate domain adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition: a) input feature fusion between standard acoustic frontends… ▽ More Automatic recognition of disordered and elderly speech remains a highly challenging task to date due to the difficulty in collecting such data in large quantities. This paper explores a series of approaches to integrate domain adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition: a) input feature fusion between standard acoustic frontends and domain adapted wav2vec2.0 speech representations; b) frame-level joint decoding of TDNN systems separately trained using standard acoustic features alone and with additional wav2vec2.0 features; and c) multi-pass decoding involving the TDNN/Conformer system outputs to be rescored using domain adapted wav2vec2.0 models. In addition, domain adapted wav2vec2.0 representations are utilized in acoustic-to-articulatory (A2A) inversion to construct multi-modal dysarthric and elderly speech recognition systems. Experiments conducted on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest TDNN and Conformer ASR systems integrated domain adapted wav2vec2.0 models consistently outperform the standalone wav2vec2.0 models by statistically significant WER reductions of 8.22% and 3.43% absolute (26.71% and 15.88% relative) on the two tasks respectively. The lowest published WERs of 22.56% (52.53% on very low intelligibility, 39.09% on unseen words) and 18.17% are obtained on the UASpeech test set of 16 dysarthric speakers, and the DementiaBank Pitt test set respectively. △ Less

Submitted 22 June, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

Comments: accepted by ICASSP 2023

arXiv:2302.12434 [pdf, other]

Catch You and I Can: Revealing Source Voiceprint Against Voice Conversion

Authors: Jiangyi Deng, Yanjiao Chen, Yinan Zhong, Qianhao Miao, Xueluan Gong, Wenyuan Xu

Abstract: Voice conversion (VC) techniques can be abused by malicious parties to transform their audios to sound like a target speaker, making it hard for a human being or a speaker verification/identification system to trace the source speaker. In this paper, we make the first attempt to restore the source voiceprint from audios synthesized by voice conversion methods with high credit. However, unveiling t… ▽ More Voice conversion (VC) techniques can be abused by malicious parties to transform their audios to sound like a target speaker, making it hard for a human being or a speaker verification/identification system to trace the source speaker. In this paper, we make the first attempt to restore the source voiceprint from audios synthesized by voice conversion methods with high credit. However, unveiling the features of the source speaker from a converted audio is challenging since the voice conversion operation intends to disentangle the original features and infuse the features of the target speaker. To fulfill our goal, we develop Revelio, a representation learning model, which learns to effectively extract the voiceprint of the source speaker from converted audio samples. We equip Revelio with a carefully-designed differential rectification algorithm to eliminate the influence of the target speaker by removing the representation component that is parallel to the voiceprint of the target speaker. We have conducted extensive experiments to evaluate the capability of Revelio in restoring voiceprint from audios converted by VQVC, VQVC+, AGAIN, and BNE. The experiments verify that Revelio is able to rebuild voiceprints that can be traced to the source speaker by speaker verification and identification systems. Revelio also exhibits robust performance under inter-gender conversion, unseen languages, and telephony networks. △ Less

Submitted 23 February, 2023; originally announced February 2023.

Comments: Accepted by USENIX Security Symposium 2023. Please cite this paper as "Jiangyi Deng, Yanjiao Chen, Yinan Zhong, Qianhao Miao, Xueluan Gong, Wenyuan Xu. Catch You and I Can: Revealing Source Voiceprint Against Voice Conversion. In 32nd USENIX Security Symposium (USENIX Security 23)."

arXiv:2302.07521 [pdf, other]

Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems

Authors: Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui **, Guinan Li, Shujie Hu, Xunying Liu

Abstract: Speaker adaptation techniques provide a powerful solution to customise automatic speech recognition (ASR) systems for individual users. Practical application of unsupervised model-based speaker adaptation techniques to data intensive end-to-end ASR systems is hindered by the scarcity of speaker-level data and performance sensitivity to transcription errors. To address these issues, a set of compac… ▽ More Speaker adaptation techniques provide a powerful solution to customise automatic speech recognition (ASR) systems for individual users. Practical application of unsupervised model-based speaker adaptation techniques to data intensive end-to-end ASR systems is hindered by the scarcity of speaker-level data and performance sensitivity to transcription errors. To address these issues, a set of compact and data efficient speaker-dependent (SD) parameter representations are used to facilitate both speaker adaptive training and test-time unsupervised speaker adaptation of state-of-the-art Conformer ASR systems. The sensitivity to supervision quality is reduced using a confidence score-based selection of the less erroneous subset of speaker-level adaptation data. Two lightweight confidence score estimation modules are proposed to produce more reliable confidence scores. The data sparsity issue, which is exacerbated by data selection, is addressed by modelling the SD parameter uncertainty using Bayesian learning. Experiments on the benchmark 300-hour Switchboard and the 233-hour AMI datasets suggest that the proposed confidence score-based adaptation schemes consistently outperformed the baseline speaker-independent (SI) Conformer model and conventional non-Bayesian, point estimate-based adaptation using no speaker data selection. Similar consistent performance improvements were retained after external Transformer and LSTM language model rescoring. In particular, on the 300-hour Switchboard corpus, statistically significant WER reductions of 1.0%, 1.3%, and 1.4% absolute (9.5%, 10.9%, and 11.3% relative) were obtained over the baseline SI Conformer on the NIST Hub5'00, RT02, and RT03 evaluation sets respectively. Similar WER reductions of 2.7% and 3.3% absolute (8.9% and 10.2% relative) were also obtained on the AMI development and evaluation sets. △ Less

Submitted 15 February, 2023; originally announced February 2023.

Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing

arXiv:2212.14354 [pdf]

A Fault Location Method Based on Electromagnetic Transient Convolution Considering Frequency-Dependent Parameters and Lossy Ground

Authors: Guanbo Wang, Chijie Zhuang, Jun Deng, Zhicheng Xie

Abstract: As the capacity of power systems grows, the need for quick and precise short-circuit fault location becomes increasingly vital for ensuring the safe and continuous supply of power. In this paper, we propose a fault location method that utilizes electromagnetic transient convolution (EMTC). We assess the performance of a naive EMTC implementation in multi-phase power lines by using frequency-depend… ▽ More As the capacity of power systems grows, the need for quick and precise short-circuit fault location becomes increasingly vital for ensuring the safe and continuous supply of power. In this paper, we propose a fault location method that utilizes electromagnetic transient convolution (EMTC). We assess the performance of a naive EMTC implementation in multi-phase power lines by using frequency-dependent parameters in real fault simulation, while using constant parameters in pre-calculation. Our results show that the location error increases as the distance between the fault location and the measurement location increases. Therefore, we adopt the aerial mode transients after phase-mode transformation to perform the convolution, which reduces the influence of frequency-dependence and ground loss. We conduct numerical experiments in a 3-phase 100-km transmission line, a radial distribution network and IEEE 9-bus system under different fault conditions. Our results show that the proposed method achieves tolerable location errors and operates efficiently through direct convolution of the real fault-generated transient signals and the pre-stored calculated transient signals. △ Less

Submitted 31 December, 2023; v1 submitted 29 December, 2022; originally announced December 2022.

arXiv:2212.00014 [pdf, other]

Attentional Ptycho-Tomography (APT) for three-dimensional nanoscale X-ray imaging with minimal data acquisition and computation time

Authors: Iksung Kang, Ziling Wu, Yi Jiang, Yudong Yao, Jun**g Deng, Jeffrey Klug, Stefan Vogt, George Barbastathis

Abstract: Noninvasive X-ray imaging of nanoscale three-dimensional objects, e.g. integrated circuits (ICs), generally requires two types of scanning: ptychographic, which is translational and returns estimates of complex electromagnetic field through ICs; and tomographic scanning, which collects complex field projections from multiple angles. Here, we present Attentional Ptycho-Tomography (APT), an approach… ▽ More Noninvasive X-ray imaging of nanoscale three-dimensional objects, e.g. integrated circuits (ICs), generally requires two types of scanning: ptychographic, which is translational and returns estimates of complex electromagnetic field through ICs; and tomographic scanning, which collects complex field projections from multiple angles. Here, we present Attentional Ptycho-Tomography (APT), an approach trained to provide accurate reconstructions of ICs despite incomplete measurements, using a dramatically reduced amount of angular scanning. Training process includes regularizing priors based on typical IC patterns and the physics of X-ray propagation. We demonstrate that APT with 12-time reduced angles achieves fidelity comparable to the gold standard with the original set of angles. With the same set of reduced angles, APT also outperforms baseline reconstruction methods. In our experiments, APT achieves 108-time aggregate reduction in data acquisition and computation without compromising quality. We expect our physics-assisted machine learning framework could also be applied to other branches of nanoscale imaging. △ Less

Submitted 29 November, 2022; originally announced December 2022.

Comments: 27 pages, 7 figures

arXiv:2211.12845 [pdf, other]

doi 10.1007/s41095-023-0387-8

Super-resolution Reconstruction of Single Image for Latent features

Authors: Xin Wang, **g-Ke Yan, **g-Ye Cai, Jian-Hua Deng, Qin Qin, Yao Cheng

Abstract: Single-image super-resolution (SISR) typically focuses on restoring various degraded low-resolution (LR) images to a single high-resolution (HR) image. However, during SISR tasks, it is often challenging for models to simultaneously maintain high quality and rapid sampling while preserving diversity in details and texture features. This challenge can lead to issues such as model collapse, lack of… ▽ More Single-image super-resolution (SISR) typically focuses on restoring various degraded low-resolution (LR) images to a single high-resolution (HR) image. However, during SISR tasks, it is often challenging for models to simultaneously maintain high quality and rapid sampling while preserving diversity in details and texture features. This challenge can lead to issues such as model collapse, lack of rich details and texture features in the reconstructed HR images, and excessive time consumption for model sampling. To address these problems, this paper proposes a Latent Feature-oriented Diffusion Probability Model (LDDPM). First, we designed a conditional encoder capable of effectively encoding LR images, reducing the solution space for model image reconstruction and thereby improving the quality of the reconstructed images. We then employed a normalized flow and multimodal adversarial training, learning from complex multimodal distributions, to model the denoising distribution. Doing so boosts the generative modeling capabilities within a minimal number of sampling steps. Experimental comparisons of our proposed model with existing SISR methods on mainstream datasets demonstrate that our model reconstructs more realistic HR images and achieves better performance on multiple evaluation metrics, providing a fresh perspective for tackling SISR tasks. △ Less

Submitted 9 November, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

Journal ref: Computational Visual Media,2023

arXiv:2211.01646 [pdf, other]

Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition

Authors: Zengrui **, Xurong Xie, Mengzhe Geng, Tianzi Wang, Shujie Hu, Jiajun Deng, Guinan Li, Xunying Liu

Abstract: Automatic recognition of disordered speech remains a highly challenging task to date. The underlying neuro-motor conditions, often compounded with co-occurring physical disabilities, lead to the difficulty in collecting large quantities of impaired speech required for ASR system development. This paper presents novel variational auto-encoder generative adversarial network (VAE-GAN) based personali… ▽ More Automatic recognition of disordered speech remains a highly challenging task to date. The underlying neuro-motor conditions, often compounded with co-occurring physical disabilities, lead to the difficulty in collecting large quantities of impaired speech required for ASR system development. This paper presents novel variational auto-encoder generative adversarial network (VAE-GAN) based personalized disordered speech augmentation approaches that simultaneously learn to encode, generate and discriminate synthesized impaired speech. Separate latent features are derived to learn dysarthric speech characteristics and phoneme context representations. Self-supervised pre-trained Wav2vec 2.0 embedding features are also incorporated. Experiments conducted on the UASpeech corpus suggest the proposed adversarial data augmentation approach consistently outperformed the baseline speed perturbation and non-VAE GAN augmentation methods with trained hybrid TDNN and End-to-end Conformer systems. After LHUC speaker adaptation, the best system using VAE-GAN based augmentation produced an overall WER of 27.78% on the UASpeech test set of 16 dysarthric speakers, and the lowest published WER of 57.31% on the subset of speakers with "Very Low" intelligibility. △ Less

Submitted 19 March, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

Comments: Submitted to ICASSP 2023

arXiv:2210.16539 [pdf, other]

Exploiting prompt learning with pre-trained language models for Alzheimer's Disease detection

Authors: Yi Wang, Jiajun Deng, Tianzi Wang, Bo Zheng, Shoukang Hu, Xunying Liu, Helen Meng

Abstract: Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating preventive care and to delay further progression. Speech based automatic AD screening systems provide a non-intrusive and more scalable alternative to other clinical screening techniques. Textual embedding features produced by pre-trained language models (PLMs) such as BERT are widely used in such systems. However, PLM domain f… ▽ More Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating preventive care and to delay further progression. Speech based automatic AD screening systems provide a non-intrusive and more scalable alternative to other clinical screening techniques. Textual embedding features produced by pre-trained language models (PLMs) such as BERT are widely used in such systems. However, PLM domain fine-tuning is commonly based on the masked word or sentence prediction costs that are inconsistent with the back-end AD detection task. To this end, this paper investigates the use of prompt-based fine-tuning of PLMs that consistently uses AD classification errors as the training objective function. Disfluency features based on hesitation or pause filler token frequencies are further incorporated into prompt phrases during PLM fine-tuning. The decision voting based combination among systems using different PLMs (BERT and RoBERTa) or systems with different fine-tuning paradigms (conventional masked-language modelling fine-tuning and prompt-based fine-tuning) is further applied. Mean, standard deviation and the maximum among accuracy scores over 15 experiment runs are adopted as performance measurements for the AD detection system. Mean detection accuracy of 84.20% (with std 2.09%, best 87.5%) and 82.64% (with std 4.0%, best 89.58%) were obtained using manual and ASR speech transcripts respectively on the ADReSS20 test set consisting of 48 elderly speakers. △ Less

Submitted 31 March, 2023; v1 submitted 29 October, 2022; originally announced October 2022.

Comments: Accepted ICASSP 2023 (will update with IEEE vision later)

arXiv:2210.15140 [pdf, other]

V-Cloak: Intelligibility-, Naturalness- & Timbre-Preserving Real-Time Voice Anonymization

Authors: Jiangyi Deng, Fei Teng, Yanjiao Chen, Xiaofu Chen, Zhaohui Wang, Wenyuan Xu

Abstract: Voice data generated on instant messaging or social media applications contains unique user voiceprints that may be abused by malicious adversaries for identity inference or identity theft. Existing voice anonymization techniques, e.g., signal processing and voice conversion/synthesis, suffer from degradation of perceptual quality. In this paper, we develop a voice anonymization system, named V-Cl… ▽ More Voice data generated on instant messaging or social media applications contains unique user voiceprints that may be abused by malicious adversaries for identity inference or identity theft. Existing voice anonymization techniques, e.g., signal processing and voice conversion/synthesis, suffer from degradation of perceptual quality. In this paper, we develop a voice anonymization system, named V-Cloak, which attains real-time voice anonymization while preserving the intelligibility, naturalness and timbre of the audio. Our designed anonymizer features a one-shot generative model that modulates the features of the original audio at different frequency levels. We train the anonymizer with a carefully-designed loss function. Apart from the anonymity loss, we further incorporate the intelligibility loss and the psychoacoustics-based naturalness loss. The anonymizer can realize untargeted and targeted anonymization to achieve the anonymity goals of unidentifiability and unlinkability. We have conducted extensive experiments on four datasets, i.e., LibriSpeech (English), AISHELL (Chinese), CommonVoice (French) and CommonVoice (Italian), five Automatic Speaker Verification (ASV) systems (including two DNN-based, two statistical and one commercial ASV), and eleven Automatic Speech Recognition (ASR) systems (for different languages). Experiment results confirm that V-Cloak outperforms five baselines in terms of anonymity performance. We also demonstrate that V-Cloak trained only on the VoxCeleb1 dataset against ECAPA-TDNN ASV and DeepSpeech2 ASR has transferable anonymity against other ASVs and cross-language intelligibility for other ASRs. Furthermore, we verify the robustness of V-Cloak against various de-noising techniques and adaptive attacks. Hopefully, V-Cloak may provide a cloak for us in a prism world. △ Less

Submitted 26 October, 2022; originally announced October 2022.

Comments: Accepted by USENIX Security Symposium 2023

arXiv:2210.11422 [pdf, other]

A Hybrid Millimeter-wave Channel Simulator for Joint Communication and Localization

Authors: Junquan Deng

Abstract: Joint communication and localization~(JCL) is envisioned to be a key feature in future millimeter-wave~(mmWave) wireless networks for context-aware applications. A map-based channel model considering both site-specific radio environment and statistical channel characteristics is essential to facilitate JCL research and to evaluate the performance of various JCL systems. To this end, this paper pre… ▽ More Joint communication and localization~(JCL) is envisioned to be a key feature in future millimeter-wave~(mmWave) wireless networks for context-aware applications. A map-based channel model considering both site-specific radio environment and statistical channel characteristics is essential to facilitate JCL research and to evaluate the performance of various JCL systems. To this end, this paper presents an open-source hybrid mmWave channel simulator called OmniSIM for site-specific JCL research, which uses digital map, network layout and user trajectories as inputs to predict the channel responses between users and base stations. A fast shooting-bouncing rays~(FSBR) algorithm combined with Computational Electromagnetic, has been developed to generate channel parameters relevant to JCL, considering mmWave reflection, diffusing, diffraction and scattering. △ Less

Submitted 4 October, 2022; originally announced October 2022.

Comments: 6 pages,8 figures,submitted to ICCC 2022

arXiv:2208.14007 [pdf, other]

Finding neural signatures for obesity through feature selection on source-localized EEG

Authors: Yuan Yue, Dirk De Ridder, Patrick Manning, Samantha Ross, Jeremiah D. Deng

Abstract: Obesity is a serious issue in the modern society and is often associated to significantly reduced quality of life. Current research conducted to explore obesity-related neurological evidences using electroencephalography (EEG) data are limited to traditional approaches. In this study, we developed a novel machine learning model to identify brain networks of obese females using alpha band functiona… ▽ More Obesity is a serious issue in the modern society and is often associated to significantly reduced quality of life. Current research conducted to explore obesity-related neurological evidences using electroencephalography (EEG) data are limited to traditional approaches. In this study, we developed a novel machine learning model to identify brain networks of obese females using alpha band functional connectivity features derived from EEG data. An overall classification accuracy of 0.937 is achieved. Our finding suggests that the obese brain is characterized by a dysfunctional network in which the areas that responsible for processing self-referential information and environmental context information are impaired. △ Less

Submitted 21 June, 2023; v1 submitted 30 August, 2022; originally announced August 2022.

Comments: 4 pages, 3 figures, conference submission

arXiv:2206.14940 [pdf, other]

Physics-Inspired Unsupervised Classification for Region of Interest in X-Ray Ptychography

Authors: Dergan Lin, Yi Jiang, Jun**g Deng, Zichao Wendy Di

Abstract: X-ray ptychography allows for large fields to be imaged at high resolution at the cost of additional computational expense due to the large volume of data. Given limited information regarding the object, the acquired data often has an excessive amount of information that is outside the region of interest (RoI). In this work we propose a physics-inspired unsupervised learning algorithm to identify… ▽ More X-ray ptychography allows for large fields to be imaged at high resolution at the cost of additional computational expense due to the large volume of data. Given limited information regarding the object, the acquired data often has an excessive amount of information that is outside the region of interest (RoI). In this work we propose a physics-inspired unsupervised learning algorithm to identify the RoI of an object using only diffraction patterns from a ptychography dataset before committing computational resources to reconstruction. Obtained diffraction patterns that are automatically identified as not within the RoI are filtered out, allowing efficient reconstruction by focusing only on important data within the RoI while preserving image quality. △ Less

Submitted 29 June, 2022; originally announced June 2022.

arXiv:2206.13232 [pdf, other]

Conformer Based Elderly Speech Recognition System for Alzheimer's Disease Detection

Authors: Tianzi Wang, Jiajun Deng, Mengzhe Geng, Zi Ye, Shoukang Hu, Yi Wang, Mingyu Cui, Zengrui **, Xunying Liu, Helen Meng

Abstract: Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating preventive care to delay further progression. This paper presents the development of a state-of-the-art Conformer based speech recognition system built on the DementiaBank Pitt corpus for automatic AD detection. The baseline Conformer system trained with speed perturbation and SpecAugment based data augmentation is significantl… ▽ More Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating preventive care to delay further progression. This paper presents the development of a state-of-the-art Conformer based speech recognition system built on the DementiaBank Pitt corpus for automatic AD detection. The baseline Conformer system trained with speed perturbation and SpecAugment based data augmentation is significantly improved by incorporating a set of purposefully designed modeling features, including neural architecture search based auto-configuration of domain-specific Conformer hyper-parameters in addition to parameter fine-tuning; fine-grained elderly speaker adaptation using learning hidden unit contributions (LHUC); and two-pass cross-system rescoring based combination with hybrid TDNN systems. An overall word error rate (WER) reduction of 13.6% absolute (34.8% relative) was obtained on the evaluation data of 48 elderly speakers. Using the final systems' recognition outputs to extract textual features, the best-published speech recognition based AD detection accuracy of 91.7% was obtained. △ Less

Submitted 23 June, 2022; originally announced June 2022.

Comments: 5 pages, 1 figure, accepted by INTERSPEECH 2022

arXiv:2206.12045 [pdf, other]

Confidence Score Based Conformer Speaker Adaptation for Speech Recognition

Authors: Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui **, Mengzhe Geng, Guinan Li, Xunying Liu, Helen Meng

Abstract: A key challenge for automatic speech recognition (ASR) systems is to model the speaker level variability. In this paper, compact speaker dependent learning hidden unit contributions (LHUC) are used to facilitate both speaker adaptive training (SAT) and test time unsupervised speaker adaptation for state-of-the-art Conformer based end-to-end ASR systems. The sensitivity during adaptation to supervi… ▽ More A key challenge for automatic speech recognition (ASR) systems is to model the speaker level variability. In this paper, compact speaker dependent learning hidden unit contributions (LHUC) are used to facilitate both speaker adaptive training (SAT) and test time unsupervised speaker adaptation for state-of-the-art Conformer based end-to-end ASR systems. The sensitivity during adaptation to supervision error rate is reduced using confidence score based selection of the more "trustworthy" subset of speaker specific data. A confidence estimation module is used to smooth the over-confident Conformer decoder output probabilities before serving as confidence scores. The increased data sparsity due to speaker level data selection is addressed using Bayesian estimation of LHUC parameters. Experiments on the 300-hour Switchboard corpus suggest that the proposed LHUC-SAT Conformer with confidence score based test time unsupervised adaptation outperformed the baseline speaker independent and i-vector adapted Conformer systems by up to 1.0%, 1.0%, and 1.2% absolute (9.0%, 7.9%, and 8.9% relative) word error rate (WER) reductions on the NIST Hub5'00, RT02, and RT03 evaluation sets respectively. Consistent performance improvements were retained after external Transformer and LSTM language models were used for rescoring. △ Less

Submitted 23 June, 2022; originally announced June 2022.

Comments: It's accepted to INTERSPEECH 2022. arXiv admin note: text overlap with arXiv:2206.11596

arXiv:2206.11596 [pdf, other]

doi 10.21437/Interspeech.2022-696

Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems

Authors: Mingyu Cui, Jiajun Deng, Shoukang Hu, Xurong Xie, Tianzi Wang, Shujie Hu, Mengzhe Geng, Boyang Xue, Xunying Liu, Helen Meng

Abstract: Fundamental modelling differences between hybrid and end-to-end (E2E) automatic speech recognition (ASR) systems create large diversity and complementarity among them. This paper investigates multi-pass rescoring and cross adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems. In multi-pass rescoring, state-of-the-art hybrid LF-MMI trained CNN-TDNN system fea… ▽ More Fundamental modelling differences between hybrid and end-to-end (E2E) automatic speech recognition (ASR) systems create large diversity and complementarity among them. This paper investigates multi-pass rescoring and cross adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems. In multi-pass rescoring, state-of-the-art hybrid LF-MMI trained CNN-TDNN system featuring speed perturbation, SpecAugment and Bayesian learning hidden unit contributions (LHUC) speaker adaptation was used to produce initial N-best outputs before being rescored by the speaker adapted Conformer system using a 2-way cross system score interpolation. In cross adaptation, the hybrid CNN-TDNN system was adapted to the 1-best output of the Conformer system or vice versa. Experiments on the 300-hour Switchboard corpus suggest that the combined systems derived using either of the two system combination approaches outperformed the individual systems. The best combined system obtained using multi-pass rescoring produced statistically significant word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9% relative) over the stand alone Conformer system on the NIST Hub5'00, Rt03 and Rt02 evaluation data. △ Less

Submitted 23 June, 2022; originally announced June 2022.

Comments: It' s accepted to ISCA 2022

arXiv:2206.07327 [pdf, other]

Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition

Authors: Shujie Hu, Xurong Xie, Mengzhe Geng, Mingyu Cui, Jiajun Deng, Guinan Li, Tianzi Wang, Xunying Liu, Helen Meng

Abstract: Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition (ASR) systems designed for normal speech. Their practical application to atypical task domains such as elderly and disordered speech across languages is often limited by the difficulty in collecting such specialist data from target speakers. This pa… ▽ More Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition (ASR) systems designed for normal speech. Their practical application to atypical task domains such as elderly and disordered speech across languages is often limited by the difficulty in collecting such specialist data from target speakers. This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training before being cross-domain and cross-lingual adapted to three datasets across two languages: the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora; and the English TORGO dysarthric speech data, to produce UTI based articulatory features. Experiments conducted on three tasks suggested incorporating the generated articulatory features consistently outperformed the baseline TDNN and Conformer ASR systems constructed using acoustic features only by statistically significant word or character error rate reductions up to 4.75%, 2.59% and 2.07% absolute (14.69%, 10.64% and 22.72% relative) after data augmentation, speaker adaptation and cross system multi-pass decoding were applied. △ Less

Submitted 22 June, 2023; v1 submitted 15 June, 2022; originally announced June 2022.

Comments: accepted by INTERSPEECH 2023

arXiv:2205.09895 [pdf]

doi 10.1016/j.applthermaleng.2022.118685

A BCS-GDE Multi-objective Optimization Algorithm for Combined Cooling, Heating and Power Model with Decision Strategies

Authors: Jiaze Sun, Jiahui Deng, Yang Li, Nan Han

Abstract: District energy systems can not only reduce energy consumption but also set energy supply dispatching schemes according to demand. In addition to economic cost, energy consumption and pollutant are more worthy of attention when evaluating combined cooling, heating and power (CCHP) models. In this paper, the CCHP model is established with the objective of economic cost, primary energy consumption,… ▽ More District energy systems can not only reduce energy consumption but also set energy supply dispatching schemes according to demand. In addition to economic cost, energy consumption and pollutant are more worthy of attention when evaluating combined cooling, heating and power (CCHP) models. In this paper, the CCHP model is established with the objective of economic cost, primary energy consumption, and pollutant emissions. The mathematical expression of the CCHP system is proposed, and a multi-objective optimization model with constraints is established. According to different usage requirements, two decision-making strategies are designed, which can adapt to different scenarios. Besides, a generalized differential evolution with the best compromise solution processing mechanism (BCS-GDE) algorithm is proposed to optimize the CCHP model for the first time. The algorithm provides the optimal energy scheduling scheme by optimizing the production capacity of different capacity equipment. The simulation is conducted in three application scenarios: hotels, offices, and residential buildings. The simulation results show that the model established in this paper can reduce economic cost by 72%, primary energy consumption by 73%, and pollutant emission by 88%. Concurrently, the Wilcoxon signed-rank test shows that BCSGDE is significantly better than OMOPSO, NSGA-II, and SPEA2 with greater than 95% confidence. △ Less

Submitted 19 May, 2022; originally announced May 2022.

Comments: Accpeted by Applied Thermal Engineering. arXiv admin note: substantial text overlap with arXiv:2108.07394

Journal ref: Applied Thermal Engineering 213 (2022) 118685

arXiv:2205.07711 [pdf, other]

Transferability of Adversarial Attacks on Synthetic Speech Detection

Authors: Jiacheng Deng, Shunyi Chen, Li Dong, Diqun Yan, Rangding Wang

Abstract: Synthetic speech detection is one of the most important research problems in audio security. Meanwhile, deep neural networks are vulnerable to adversarial attacks. Therefore, we establish a comprehensive benchmark to evaluate the transferability of adversarial attacks on the synthetic speech detection task. Specifically, we attempt to investigate: 1) The transferability of adversarial attacks betw… ▽ More Synthetic speech detection is one of the most important research problems in audio security. Meanwhile, deep neural networks are vulnerable to adversarial attacks. Therefore, we establish a comprehensive benchmark to evaluate the transferability of adversarial attacks on the synthetic speech detection task. Specifically, we attempt to investigate: 1) The transferability of adversarial attacks between different features. 2) The influence of varying extraction hyperparameters of features on the transferability of adversarial attacks. 3) The effect of clip** or self-padding operation on the transferability of adversarial attacks. By performing these analyses, we summarise the weaknesses of synthetic speech detectors and the transferability behaviours of adversarial attacks, which provide insights for future research. More details can be found at https://gitee.com/djc_QRICK/Attack-Transferability-On-Synthetic-Detection. △ Less

Submitted 16 May, 2022; originally announced May 2022.

Comments: 5 pages, submit to Interspeech2022

arXiv:2205.06445 [pdf, other]

doi 10.1109/TASLP.2023.3323888

Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition

Authors: Zengrui **, Mengzhe Geng, Jiajun Deng, Tianzi Wang, Shujie Hu, Guinan Li, Xunying Liu

Abstract: Despite the rapid progress of automatic speech recognition (ASR) technologies targeting normal speech, accurate recognition of dysarthric and elderly speech remains highly challenging tasks to date. It is difficult to collect large quantities of such data for ASR system development due to the mobility issues often found among these users. To this end, data augmentation techniques play a vital role… ▽ More Despite the rapid progress of automatic speech recognition (ASR) technologies targeting normal speech, accurate recognition of dysarthric and elderly speech remains highly challenging tasks to date. It is difficult to collect large quantities of such data for ASR system development due to the mobility issues often found among these users. To this end, data augmentation techniques play a vital role. In contrast to existing data augmentation techniques only modifying the speaking rate or overall shape of spectral contour, fine-grained spectro-temporal differences between dysarthric, elderly and normal speech are modelled using a novel set of speaker dependent (SD) generative adversarial networks (GAN) based data augmentation approaches in this paper. These flexibly allow both: a) temporal or speed perturbed normal speech spectra to be modified and closer to those of an impaired speaker when parallel speech data is available; and b) for non-parallel data, the SVD decomposed normal speech spectral basis features to be transformed into those of a target elderly speaker before being re-composed with the temporal bases to produce the augmented data for state-of-the-art TDNN and Conformer ASR system training. Experiments are conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora; the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets. The proposed GAN based data augmentation approaches consistently outperform the baseline speed perturbation method by up to 0.91% and 3.0% absolute (9.61% and 6.4% relative) WER reduction on the TORGO and DementiaBank data respectively. Consistent performance improvements are retained after applying LHUC based speaker adaptation. △ Less

Submitted 23 June, 2022; v1 submitted 13 May, 2022; originally announced May 2022.

Comments: arXiv admin note: text overlap with arXiv:2202.10290

arXiv:2204.03471 [pdf, other]

DynLight: Realize dynamic phase duration with multi-level traffic signal control

Authors: Liang Zhang, Shubin Xie, Jianming Deng

Abstract: We would like to withdraw this article for the following reasons: 1 this article is not satisfactory for limited language and theoretical description; 2 we have enriched and revised this article with the help of other authors; 3 we must update the author contribution information. We would like to withdraw this article for the following reasons: 1 this article is not satisfactory for limited language and theoretical description; 2 we have enriched and revised this article with the help of other authors; 3 we must update the author contribution information. △ Less

Submitted 13 March, 2023; v1 submitted 7 April, 2022; originally announced April 2022.

Comments: We would like to withdraw this article for the following reasons: 1 this article is not satisfactory for limited language and theoretical description; 2 we have enriched and revised this article with the help of other authors; 3 we must update the author contribution information. PLease see: arXiv:2211.01025

arXiv:2204.01977 [pdf, ps, other]

Audio-visual multi-channel speech separation, dereverberation and recognition

Authors: Guinan Li, Jianwei Yu, Jiajun Deng, Xunying Liu, Helen Meng

Abstract: Despite the rapid advance of automatic speech recognition (ASR) technologies, accurate recognition of cocktail party speech characterised by the interference from overlap** speakers, background noise and room reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, audio-visual speech enhancement techniques have been d… ▽ More Despite the rapid advance of automatic speech recognition (ASR) technologies, accurate recognition of cocktail party speech characterised by the interference from overlap** speakers, background noise and room reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, audio-visual speech enhancement techniques have been developed, although predominantly targeting overlap** speech separation and recognition tasks. In this paper, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all three stages of the system is proposed. The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches based on DNN-WPE and spectral map** respectively. The learning cost function mismatch between the separation and dereverberation models and their integration with the back-end recognition system is minimised using fine-tuning on the MSE and LF-MMI criteria. Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline audio-visual multi-channel speech separation and recognition system containing no dereverberation module by a statistically significant word error rate (WER) reduction of 2.06% absolute (8.77% relative). △ Less

Submitted 8 April, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: Accepted by ICASSP 2022

arXiv:2202.03144 [pdf, other]

Ptychopy: GPU framework for ptychographic data analysis

Authors: Ke Yue, Jun**g Deng, Yi Jiang, Youssef Nashed, David Vine, Stefan Vogt

Abstract: X-ray ptychography imaging at synchrotron facilities like the Advanced Photon Source (APS) involves controlling instrument hardwares to collect a set of diffraction patterns from overlap** coherent illumination spots on extended samples, managing data storage, reconstructing ptychographic images from acquired diffraction patterns, and providing the visualization of results and feedback. In addit… ▽ More X-ray ptychography imaging at synchrotron facilities like the Advanced Photon Source (APS) involves controlling instrument hardwares to collect a set of diffraction patterns from overlap** coherent illumination spots on extended samples, managing data storage, reconstructing ptychographic images from acquired diffraction patterns, and providing the visualization of results and feedback. In addition to the complicated workflow, ptychography instrument could produce up to several TB's of data per second that is needed to be processed in real time. This brings up the need to develop a high performance, robust and user friendly processing software package for ptychographic data analysis. In this paper we present a software framework which provides functionality of visualization, work flow control, and data reconstruction. To accelerate the computation and large datasets process, the data reconstruction part is implemented with three algorithms, ePIE, DM and LSQML using CUDA-C on GPU. △ Less

Submitted 24 January, 2022; originally announced February 2022.

Comments: X-Ray Nanoimaging: Instruments and Methods V

arXiv:2201.03943 [pdf, other]

Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks

Authors: Shoukang Hu, Xurong Xie, Mingyu Cui, Jiajun Deng, Shansong Liu, Jianwei Yu, Mengzhe Geng, Xunying Liu, Helen Meng

Abstract: State-of-the-art automatic speech recognition (ASR) system development is data and computation intensive. The optimal design of deep neural networks (DNNs) for these systems often require expert knowledge and empirical evaluation. In this paper, a range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of factored time delay neural network… ▽ More State-of-the-art automatic speech recognition (ASR) system development is data and computation intensive. The optimal design of deep neural networks (DNNs) for these systems often require expert knowledge and empirical evaluation. In this paper, a range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of factored time delay neural networks (TDNN-Fs): i) the left and right splicing context offsets; and ii) the dimensionality of the bottleneck linear projection at each hidden layer. These techniques include the differentiable neural architecture search (DARTS) method integrating architecture learning with lattice-free MMI training; Gumbel-Softmax and pipelined DARTS methods reducing the confusion over candidate architectures and improving the generalization of architecture selection; and Penalized DARTS incorporating resource constraints to balance the trade-off between performance and system complexity. Parameter sharing among TDNN-F architectures allows an efficient search over up to 7^28 different systems. Statistically significant word error rate (WER) reductions of up to 1.2% absolute and relative model size reduction of 31% were obtained over a state-of-the-art 300-hour Switchboard corpus trained baseline LF-MMI TDNN-F system featuring speed perturbation, i-Vector and learning hidden unit contribution (LHUC) based speaker adaptation as well as RNNLM rescoring. Performance contrasts on the same task against recent end-to-end systems reported in the literature suggest the best NAS auto-configured system achieves state-of-the-art WERs of 9.9% and 11.1% on the NIST Hub5' 00 and Rt03s test sets respectively with up to 96% model size reduction. Further analysis using Bayesian learning shows that ... △ Less

Submitted 28 March, 2022; v1 submitted 8 January, 2022; originally announced January 2022.

Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP). arXiv admin note: text overlap with arXiv:2007.08818

arXiv:2201.00006 [pdf, ps, other]

Leveraging Queue Length and Attention Mechanisms for Enhanced Traffic Signal Control Optimization

Authors: Liang Zhang, Shubin Xie, Jianming Deng

Abstract: Reinforcement learning (RL) techniques for traffic signal control (TSC) have gained increasing popularity in recent years. However, most existing RL-based TSC methods tend to focus primarily on the RL model structure while neglecting the significance of proper traffic state representation. Furthermore, some RL-based methods heavily rely on expert-designed traffic signal phase competition. In this… ▽ More Reinforcement learning (RL) techniques for traffic signal control (TSC) have gained increasing popularity in recent years. However, most existing RL-based TSC methods tend to focus primarily on the RL model structure while neglecting the significance of proper traffic state representation. Furthermore, some RL-based methods heavily rely on expert-designed traffic signal phase competition. In this paper, we present a novel approach to TSC that utilizes queue length as an efficient state representation. We propose two new methods: (1) Max Queue-Length (M-QL), an optimization-based traditional method designed based on the property of queue length; and (2) AttentionLight, an RL model that employs the self-attention mechanism to capture the signal phase correlation without requiring human knowledge of phase relationships. Comprehensive experiments on multiple real-world datasets demonstrate the effectiveness of our approach: (1) the M-QL method outperforms the latest RL-based methods; (2) AttentionLight achieves a new state-of-the-art performance; and (3) our results highlight the significance of proper state representation, which is as crucial as neural network design in TSC methods. Our findings have important implications for advancing the development of more effective and efficient TSC methods. Our code is released on Github (https://github. com/LiangZhang1996/AttentionLight). △ Less

Submitted 25 September, 2023; v1 submitted 30 December, 2021; originally announced January 2022.

Comments: 16 pages, 5 figures

Journal ref: "Leveraging Queue Length and Attention Mechanisms for Enhanced Traffic Signal Control Optimization." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Cham: Springer Nature Switzerland, 2023

arXiv:2111.13573 [pdf, other]

Semi-supervised t-SNE for Millimeter-wave Wireless Localization

Authors: Junquan Deng, Wei Shi, Jian Hu, Xianlong Jiao

Abstract: We consider the mobile localization problem in future millimeter-wave wireless networks with distributed Base Stations (BSs) based on multi-antenna channel state information (CSI). For this problem, we propose a Semi-supervised tdistributed Stochastic Neighbor Embedding (St-SNE) algorithm to directly embed the high-dimensional CSI samples into the 2D geographical map. We evaluate the performance o… ▽ More We consider the mobile localization problem in future millimeter-wave wireless networks with distributed Base Stations (BSs) based on multi-antenna channel state information (CSI). For this problem, we propose a Semi-supervised tdistributed Stochastic Neighbor Embedding (St-SNE) algorithm to directly embed the high-dimensional CSI samples into the 2D geographical map. We evaluate the performance of St-SNE in a simulated urban outdoor millimeter-wave radio access network. Our results show that St-SNE achieves a mean localization error of 6.8 m with only 5% of labeled CSI samples in a 200*200 m^2 area with a ray-tracing channel model. St-SNE does not require accurate synchronization among multiple BSs, and is promising for future large-scale millimeter-wave localization. △ Less

Submitted 26 November, 2021; originally announced November 2021.

Comments: 5 pages,6 figures, accepted to 7th International Conference on Computer and Communications

arXiv:2110.11998 [pdf, other]

Semi-Supervised Semantic Segmentation of Vessel Images using Leaking Perturbations

Authors: **yong Hou, Xuejie Ding, Jeremiah D. Deng

Abstract: Semantic segmentation based on deep learning methods can attain appealing accuracy provided large amounts of annotated samples. However, it remains a challenging task when only limited labelled data are available, which is especially common in medical imaging. In this paper, we propose to use Leaking GAN, a GAN-based semi-supervised architecture for retina vessel semantic segmentation. Our key ide… ▽ More Semantic segmentation based on deep learning methods can attain appealing accuracy provided large amounts of annotated samples. However, it remains a challenging task when only limited labelled data are available, which is especially common in medical imaging. In this paper, we propose to use Leaking GAN, a GAN-based semi-supervised architecture for retina vessel semantic segmentation. Our key idea is to pollute the discriminator by leaking information from the generator. This leads to more moderate generations that benefit the training of GAN. As a result, the unlabelled examples can be better utilized to boost the learning of the discriminator, which eventually leads to stronger classification performance. In addition, to overcome the variations in medical images, the mean-teacher mechanism is utilized as an auxiliary regularization of the discriminator. Further, we modify the focal loss to fit it as the consistency objective for mean-teacher regularizer. Extensive experiments demonstrate that the Leaking GAN framework achieves competitive performance compared to the state-of-the-art methods when evaluated on benchmark datasets including DRIVE, STARE and CHASE\_DB1, using as few as 8 labelled images in the semi-supervised setting. It also outperforms existing algorithms on cross-domain segmentation tasks. △ Less

Submitted 22 October, 2021; originally announced October 2021.

Comments: To appear in WACV'22

arXiv:2110.09966 [pdf, other]

SleepPriorCL: Contrastive Representation Learning with Prior Knowledge-based Positive Mining and Adaptive Temperature for Sleep Staging

Authors: Hongjun Zhang, **g Wang, Qinfeng Xiao, Jiaoxue Deng, Youfang Lin

Abstract: The objective of this paper is to learn semantic representations for sleep stage classification from raw physiological time series. Although supervised methods have gained remarkable performance, they are limited in clinical situations due to the requirement of fully labeled data. Self-supervised learning (SSL) based on contrasting semantically similar (positive) and dissimilar (negative) pairs of… ▽ More The objective of this paper is to learn semantic representations for sleep stage classification from raw physiological time series. Although supervised methods have gained remarkable performance, they are limited in clinical situations due to the requirement of fully labeled data. Self-supervised learning (SSL) based on contrasting semantically similar (positive) and dissimilar (negative) pairs of samples have achieved promising success. However, existing SSL methods suffer the problem that many semantically similar positives are still uncovered and even treated as negatives. In this paper, we propose a novel SSL approach named SleepPriorCL to alleviate the above problem. Advances of our approach over existing SSL methods are two-fold: 1) by incorporating prior domain knowledge into the training regime of SSL, more semantically similar positives are discovered without accessing ground-truth labels; 2) via investigating the influence of the temperature in contrastive loss, an adaptive temperature mechanism for each sample according to prior domain knowledge is further proposed, leading to better performance. Extensive experiments demonstrate that our method achieves state-of-the-art performance and consistently outperforms baselines. △ Less

Submitted 15 October, 2021; originally announced October 2021.

arXiv:2108.07394 [pdf]

doi 10.1109/ICPSAsia52756.2021.9621668

A BCS-GDE Algorithm for Multi-objective Optimization of Combined Cooling, Heating and Power Model

Authors: Jiaze Sun, Jiahui Deng, Yang Li, Shuaiyin Ma, Nan Han

Abstract: District energy systems can not only reduce energy consumption but also set energy supply dispatching schemes according to demand. In this paper, the combined cooling heating and power economic emission dispatch (CCHPEED) model is established with the objective of economic cost, primary energy consumption, and pollutant emissions, as well as three decision-making strategies, are proposed to meet t… ▽ More District energy systems can not only reduce energy consumption but also set energy supply dispatching schemes according to demand. In this paper, the combined cooling heating and power economic emission dispatch (CCHPEED) model is established with the objective of economic cost, primary energy consumption, and pollutant emissions, as well as three decision-making strategies, are proposed to meet the demand for energy supply. Besides, a generalized differential evolution with the best compromise solution processing mechanism (BCS-GDE) is proposed to solve the model, also, the best compromise solution processing mechanism is put forward in the algorithm. In the simulation, the resource dispatching is performed according to the different energy demands of hotels, offices, and residential buildings on the whole day. The simulation results show that the model established in this paper can reduce the economic cost, energy consumption, and pollutant emission, in which the maximum reduction rate of economic cost is 72%, the maximum reduction rate of primary energy consumption is 73%, and the maximum reduction rate of pollutant emission is 88%. Concurrently, BCS-GDE also has better convergence and diversity than the classic algorithms. △ Less

Submitted 16 August, 2021; originally announced August 2021.

Comments: Accepted by 2021 IEEE/IAS Industrial and Commercial Power System Asia (IEEE I&CPS Asia 2021)

Journal ref: 2021 IEEE/IAS Industrial and Commercial Power System Asia (I&CPS Asia)

arXiv:2106.08505 [pdf, other]

Dynamically Grown Generative Adversarial Networks

Authors: Lanlan Liu, Yuting Zhang, Jia Deng, Stefano Soatto

Abstract: Recent work introduced progressive network growing as a promising way to ease the training for large GANs, but the model design and architecture-growing strategy still remain under-explored and needs manual design for different image data. In this paper, we propose a method to dynamically grow a GAN during training, optimizing the network architecture and its parameters together with automation. T… ▽ More Recent work introduced progressive network growing as a promising way to ease the training for large GANs, but the model design and architecture-growing strategy still remain under-explored and needs manual design for different image data. In this paper, we propose a method to dynamically grow a GAN during training, optimizing the network architecture and its parameters together with automation. The method embeds architecture search techniques as an interleaving step with gradient-based training to periodically seek the optimal architecture-growing strategy for the generator and discriminator. It enjoys the benefits of both eased training because of progressive growing and improved performance because of broader architecture design space. Experimental results demonstrate new state-of-the-art of image generation. Observations in the search procedure also provide constructive insights into the GAN model design such as generator-discriminator balance and convolutional layer choices. △ Less

Submitted 15 June, 2021; originally announced June 2021.

Comments: Accepted to AAAI 2021

arXiv:2012.12686 [pdf, other]

doi 10.1364/OE.418296

Adorym: A multi-platform generic x-ray image reconstruction framework based on automatic differentiation

Authors: Ming Du, Saugat Kandel, Jun**g Deng, Xiao**g Huang, Arnaud Demortiere, Tuan Tu Nguyen, Remi Tucoulou, Vincent De Andrade, Qiaoling **, Chris Jacobsen

Abstract: We describe and demonstrate an optimization-based x-ray image reconstruction framework called Adorym. Our framework provides a generic forward model, allowing one code framework to be used for a wide range of imaging methods ranging from near-field holography to and fly-scan ptychographic tomography. By using automatic differentiation for optimization, Adorym has the flexibility to refine experime… ▽ More We describe and demonstrate an optimization-based x-ray image reconstruction framework called Adorym. Our framework provides a generic forward model, allowing one code framework to be used for a wide range of imaging methods ranging from near-field holography to and fly-scan ptychographic tomography. By using automatic differentiation for optimization, Adorym has the flexibility to refine experimental parameters including probe positions, multiple hologram alignment, and object tilts. It is written with strong support for parallel processing, allowing large datasets to be processed on high-performance computing systems. We demonstrate its use on several experimental datasets to show improved image quality through parameter refinement. △ Less

Submitted 22 December, 2020; originally announced December 2020.

MSC Class: 78-04

arXiv:2012.11105 [pdf, other]

Resting-state EEG sex classification using selected brain connectivity representation

Authors: Jean Li, Jeremiah D. Deng, Divya Adhia, Dirk de Ridder

Abstract: Effective analysis of EEG signals for potential clinical applications remains a challenging task. So far, the analysis and conditioning of EEG have largely remained sex-neutral. This paper employs a machine learning approach to explore the evidence of sex effects on EEG signals, and confirms the generality of these effects by achieving successful sex prediction of resting-state EEG signals. We hav… ▽ More Effective analysis of EEG signals for potential clinical applications remains a challenging task. So far, the analysis and conditioning of EEG have largely remained sex-neutral. This paper employs a machine learning approach to explore the evidence of sex effects on EEG signals, and confirms the generality of these effects by achieving successful sex prediction of resting-state EEG signals. We have found that the brain connectivity represented by the coherence between certain sensor channels are good predictors of sex. △ Less

Submitted 20 December, 2020; originally announced December 2020.

Comments: 11 pages, 6 figures, book chapter to be published by Springer

arXiv:2007.06341 [pdf, other]

DeU-Net: Deformable U-Net for 3D Cardiac MRI Video Segmentation

Authors: Shunjie Dong, **long Zhao, Maojun Zhang, Zhengxue Shi, Jianing Deng, Yiyu Shi, Mei Tian, Cheng Zhuo

Abstract: Automatic segmentation of cardiac magnetic resonance imaging (MRI) facilitates efficient and accurate volume measurement in clinical applications. However, due to anisotropic resolution and ambiguous border (e.g., right ventricular endocardium), existing methods suffer from the degradation of accuracy and robustness in 3D cardiac MRI video segmentation. In this paper, we propose a novel Deformable… ▽ More Automatic segmentation of cardiac magnetic resonance imaging (MRI) facilitates efficient and accurate volume measurement in clinical applications. However, due to anisotropic resolution and ambiguous border (e.g., right ventricular endocardium), existing methods suffer from the degradation of accuracy and robustness in 3D cardiac MRI video segmentation. In this paper, we propose a novel Deformable U-Net (DeU-Net) to fully exploit spatio-temporal information from 3D cardiac MRI video, including a Temporal Deformable Aggregation Module (TDAM) and a Deformable Global Position Attention (DGPA) network. First, the TDAM takes a cardiac MRI video clip as input with temporal information extracted by an offset prediction network. Then we fuse extracted temporal information via a temporal aggregation deformable convolution to produce fused feature maps. Furthermore, to aggregate meaningful features, we devise the DGPA network by employing deformable attention U-Net, which can encode a wider range of multi-dimensional contextual information into global and local features. Experimental results show that our DeU-Net achieves the state-of-the-art performance on commonly used evaluation metrics, especially for cardiac marginal information (ASSD and HD). △ Less

Submitted 13 July, 2020; originally announced July 2020.

arXiv:2004.04975 [pdf, other]

Supervised Learning Based Online Tracking Filters: An XGBoost Implementation

Authors: Jie Deng, Wei Yi

Abstract: The target state filter is an important module in the traditional target tracking framework. In order to get satisfactory tracking results, traditional Bayesian methods usually need accurate motion models, which require the complicated prior information and parameter estimation. Therefore, the modeling process has a key impact on traditional Bayesian filters for target tracking. However, when enco… ▽ More The target state filter is an important module in the traditional target tracking framework. In order to get satisfactory tracking results, traditional Bayesian methods usually need accurate motion models, which require the complicated prior information and parameter estimation. Therefore, the modeling process has a key impact on traditional Bayesian filters for target tracking. However, when encountering unknown prior information or the complicated environment, traditional Bayesian filters have the limitation of greatly reduced accuracy. In this paper, we propose a supervised learning based online tracking filter(SLF). First, a complete tracking filter framework based on supervised learning is established, which is directly based on data-driven and establishes the map** relationship between data. In other words, the proposed filter does not require the prior information about target dynamics and clutter distribution. Then, an implementation based on eXtreme Gradient Boosting (XGBoost) is provided, which proves the portability and applicability of the SLF framework. Meanwhile, the proposed framework will encourage other researchers to continue to expand the field of combining traditional filters with supervised learning. Finally, numerical simulation experiments prove the effectiveness of the proposed filter. △ Less

Submitted 4 May, 2020; v1 submitted 10 April, 2020; originally announced April 2020.

Showing 1–50 of 57 results for author: Deng, J