Search | arXiv e-print repository

Towards Audio Codec-based Speech Separation

Authors: Jia Qi Yip, Shengkui Zhao, Dianwen Ng, Eng Siong Chng, Bin Ma

Abstract: Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet been applied to the speech separation (SS) task. SS can benefit from high compression because the compute required for traditional SS models makes them imp… ▽ More Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet been applied to the speech separation (SS) task. SS can benefit from high compression because the compute required for traditional SS models makes them impractical for many edge computing use cases. However, SS is a waveform-masking task where compression tends to introduce distortions that severely impact performance. Here we propose a novel task of Audio Codec-based SS, where SS is performed within the embedding space of a NAC, and propose a new model, Codecformer, to address this task. At inference, Codecformer achieves a 52x reduction in MAC while producing separation performance comparable to a cloud deployment of Sepformer. This method charts a new direction for performing efficient SS in practical scenarios. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: This paper was accepted by Interspeech 2024, Blue Sky Track

arXiv:2406.02009 [pdf, other]

Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis

Authors: Kun Zhou, Shengkui Zhao, Yukun Ma, Chong Zhang, Hao Wang, Dianwen Ng, Chongjia Ni, Nguyen Trung Hieu, Jia Qi Yip, Bin Ma

Abstract: Recent language model-based text-to-speech (TTS) frameworks demonstrate scalability and in-context learning capabilities. However, they suffer from robustness issues due to the accumulation of errors in speech unit predictions during autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage self-su… ▽ More Recent language model-based text-to-speech (TTS) frameworks demonstrate scalability and in-context learning capabilities. However, they suffer from robustness issues due to the accumulation of errors in speech unit predictions during autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage self-supervised representations that are phonetically rich as the training target for the autoregressive language model. Subsequently, a non-autoregressive model is employed to predict discrete acoustic codecs that contain fine-grained acoustic details. The TTS model focuses solely on linguistic modeling during autoregressive training, thereby reducing the error propagation that occurs in non-autoregressive training. Both objective and subjective evaluations validate the effectiveness of our proposed method. △ Less

Submitted 11 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2405.02208 [pdf, other]

Reference-Free Image Quality Metric for Degradation and Reconstruction Artifacts

Authors: Han Cui, Alfredo De Goyeneche, Efrat Shimron, Boyuan Ma, Michael Lustig

Abstract: Image Quality Assessment (IQA) is essential in various Computer Vision tasks such as image deblurring and super-resolution. However, most IQA methods require reference images, which are not always available. While there are some reference-free IQA metrics, they have limitations in simulating human perception and discerning subtle image quality variations. We hypothesize that the JPEG quality facto… ▽ More Image Quality Assessment (IQA) is essential in various Computer Vision tasks such as image deblurring and super-resolution. However, most IQA methods require reference images, which are not always available. While there are some reference-free IQA metrics, they have limitations in simulating human perception and discerning subtle image quality variations. We hypothesize that the JPEG quality factor is representatives of image quality measurement, and a well-trained neural network can learn to accurately evaluate image quality without requiring a clean reference, as it can recognize image degradation artifacts based on prior knowledge. Thus, we developed a reference-free quality evaluation network, dubbed "Quality Factor (QF) Predictor", which does not require any reference. Our QF Predictor is a lightweight, fully convolutional network comprising seven layers. The model is trained in a self-supervised manner: it receives JPEG compressed image patch with a random QF as input, is trained to accurately predict the corresponding QF. We demonstrate the versatility of the model by applying it to various tasks. First, our QF Predictor can generalize to measure the severity of various image artifacts, such as Gaussian Blur and Gaussian noise. Second, we show that the QF Predictor can be trained to predict the undersampling rate of images reconstructed from Magnetic Resonance Imaging (MRI) data. △ Less

Submitted 1 May, 2024; originally announced May 2024.

arXiv:2403.04626 [pdf, other]

MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder

Authors: Lei Li, Tianfang Zhang, Xinglin Zhang, Jiaqi Liu, Bingqi Ma, Yan Luo, Tao Chen

Abstract: Within the domain of medical analysis, extensive research has explored the potential of mutual learning between Masked Autoencoders(MAEs) and multimodal data. However, the impact of MAEs on intermodality remains a key challenge. We introduce MedFLIP, a Fast Language-Image Pre-training method for Medical analysis. We explore MAEs for zero-shot learning with crossed domains, which enhances the model… ▽ More Within the domain of medical analysis, extensive research has explored the potential of mutual learning between Masked Autoencoders(MAEs) and multimodal data. However, the impact of MAEs on intermodality remains a key challenge. We introduce MedFLIP, a Fast Language-Image Pre-training method for Medical analysis. We explore MAEs for zero-shot learning with crossed domains, which enhances the model's ability to learn from limited data, a common scenario in medical diagnostics. We verify that masking an image does not affect inter-modal learning. Furthermore, we propose the SVD loss to enhance the representation learning for characteristics of medical images, aiming to improve classification accuracy by leveraging the structural intricacies of such data. Our theory posits that masking encourages semantic preservation, robust feature extraction, regularization, domain adaptation, and invariance learning. Lastly, we validate using language will improve the zero-shot performance for the medical image analysis. MedFLIP's scaling of the masking process marks an advancement in the field, offering a pathway to rapid and precise medical image analysis without the traditional computational bottlenecks. Through experiments and validation, MedFLIP demonstrates efficient performance improvements, helps for future research and application in medical diagnostics. △ Less

Submitted 30 May, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

arXiv:2312.11825 [pdf, other]

MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation

Authors: Shengkui Zhao, Yukun Ma, Chongjia Ni, Chong Zhang, Hao Wang, Trung Hieu Nguyen, Kun Zhou, Jiaqi Yip, Dianwen Ng, Bin Ma

Abstract: Our previously proposed MossFormer has achieved promising performance in monaural speech separation. However, it predominantly adopts a self-attention-based MossFormer module, which tends to emphasize longer-range, coarser-scale dependencies, with a deficiency in effectively modelling finer-scale recurrent patterns. In this paper, we introduce a novel hybrid model that provides the capabilities to… ▽ More Our previously proposed MossFormer has achieved promising performance in monaural speech separation. However, it predominantly adopts a self-attention-based MossFormer module, which tends to emphasize longer-range, coarser-scale dependencies, with a deficiency in effectively modelling finer-scale recurrent patterns. In this paper, we introduce a novel hybrid model that provides the capabilities to model both long-range, coarse-scale dependencies and fine-scale recurrent patterns by integrating a recurrent module into the MossFormer framework. Instead of applying the recurrent neural networks (RNNs) that use traditional recurrent connections, we present a recurrent module based on a feedforward sequential memory network (FSMN), which is considered "RNN-free" recurrent network due to the ability to capture recurrent patterns without using recurrent connections. Our recurrent module mainly comprises an enhanced dilated FSMN block by using gated convolutional units (GCU) and dense connections. In addition, a bottleneck layer and an output layer are also added for controlling information flow. The recurrent module relies on linear projections and convolutions for seamless, parallel processing of the entire sequence. The integrated MossFormer2 hybrid model demonstrates remarkable enhancements over MossFormer and surpasses other state-of-the-art methods in WSJ0-2/3mix, Libri2Mix, and WHAM!/WHAMR! benchmarks. △ Less

Submitted 18 December, 2023; originally announced December 2023.

Comments: 5 pages, 3 figures, accepted by ICASSP 2024

arXiv:2309.12608 [pdf, other]

SPGM: Prioritizing Local Features for enhanced speech separation performance

Authors: Jia Qi Yip, Shengkui Zhao, Yukun Ma, Chongjia Ni, Chong Zhang, Hao Wang, Trung Hieu Nguyen, Kun Zhou, Dianwen Ng, Eng Siong Chng, Bin Ma

Abstract: Dual-path is a popular architecture for speech separation models (e.g. Sepformer) which splits long sequences into overlap** chunks for its intra- and inter-blocks that separately model intra-chunk local features and inter-chunk global relationships. However, it has been found that inter-blocks, which comprise half a dual-path model's parameters, contribute minimally to performance. Thus, we pro… ▽ More Dual-path is a popular architecture for speech separation models (e.g. Sepformer) which splits long sequences into overlap** chunks for its intra- and inter-blocks that separately model intra-chunk local features and inter-chunk global relationships. However, it has been found that inter-blocks, which comprise half a dual-path model's parameters, contribute minimally to performance. Thus, we propose the Single-Path Global Modulation (SPGM) block to replace inter-blocks. SPGM is named after its structure consisting of a parameter-free global pooling module followed by a modulation module comprising only 2% of the model's total parameters. The SPGM block allows all transformer layers in the model to be dedicated to local feature modelling, making the overall model single-path. SPGM achieves 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4 dB SI-SDRi on Libri2Mix, exceeding the performance of Sepformer by 0.5 dB and 0.3 dB respectively and matches the performance of recent SOTA models with up to 8 times fewer parameters. Model and weights are available at huggingface.co/yipjiaqi/spgm △ Less

Submitted 10 March, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

Comments: This paper was accepted by ICASSP 2024

arXiv:2309.09413 [pdf, other]

Are Soft Prompts Good Zero-shot Learners for Speech Recognition?

Authors: Dianwen Ng, Chong Zhang, Ruixi Zhang, Yukun Ma, Fabian Ritter-Gutierrez, Trung Hieu Nguyen, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma

Abstract: Large self-supervised pre-trained speech models require computationally expensive fine-tuning for downstream tasks. Soft prompt tuning offers a simple parameter-efficient alternative by utilizing minimal soft prompt guidance, enhancing portability while also maintaining competitive performance. However, not many people understand how and why this is so. In this study, we aim to deepen our understa… ▽ More Large self-supervised pre-trained speech models require computationally expensive fine-tuning for downstream tasks. Soft prompt tuning offers a simple parameter-efficient alternative by utilizing minimal soft prompt guidance, enhancing portability while also maintaining competitive performance. However, not many people understand how and why this is so. In this study, we aim to deepen our understanding of this emerging method by investigating the role of soft prompts in automatic speech recognition (ASR). Our findings highlight their role as zero-shot learners in improving ASR performance but also make them vulnerable to malicious modifications. Soft prompts aid generalization but are not obligatory for inference. We also identify two primary roles of soft prompts: content refinement and noise information enhancement, which enhances robustness against background noise. Additionally, we propose an effective modification on noise prompts to show that they are capable of zero-shot learning on adapting to out-of-distribution noise environments. △ Less

Submitted 17 September, 2023; originally announced September 2023.

arXiv:2309.07458 [pdf, other]

Analysis of Speech Separation Performance Degradation on Emotional Speech Mixtures

Authors: Jia Qi Yip, Dianwen Ng, Bin Ma, Chng Eng Siong

Abstract: Despite recent strides made in Speech Separation, most models are trained on datasets with neutral emotions. Emotional speech has been known to degrade performance of models in a variety of speech tasks, which reduces the effectiveness of these models when deployed in real-world scenarios. In this paper we perform analysis to differentiate the performance degradation arising from the emotions in s… ▽ More Despite recent strides made in Speech Separation, most models are trained on datasets with neutral emotions. Emotional speech has been known to degrade performance of models in a variety of speech tasks, which reduces the effectiveness of these models when deployed in real-world scenarios. In this paper we perform analysis to differentiate the performance degradation arising from the emotions in speech from the impact of out-of-domain inference. This is measured using a carefully designed test dataset, Emo2Mix, consisting of balanced data across all emotional combinations. We show that even models with strong out-of-domain performance such as Sepformer can still suffer significant degradation of up to 5.1 dB SI-SDRi on mixtures with strong emotions. This demonstrates the importance of accounting for emotions in real-world speech separation applications. △ Less

Submitted 14 September, 2023; originally announced September 2023.

Comments: Accepted by APSIPA ASC 2023

arXiv:2305.12121 [pdf, other]

ACA-Net: Towards Lightweight Speaker Verification using Asymmetric Cross Attention

Authors: Jia Qi Yip, Tuan Truong, Dianwen Ng, Chong Zhang, Yukun Ma, Trung Hieu Nguyen, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma

Abstract: In this paper, we propose ACA-Net, a lightweight, global context-aware speaker embedding extractor for Speaker Verification (SV) that improves upon existing work by using Asymmetric Cross Attention (ACA) to replace temporal pooling. ACA is able to distill large, variable-length sequences into small, fixed-sized latents by attending a small query to large key and value matrices. In ACA-Net, we buil… ▽ More In this paper, we propose ACA-Net, a lightweight, global context-aware speaker embedding extractor for Speaker Verification (SV) that improves upon existing work by using Asymmetric Cross Attention (ACA) to replace temporal pooling. ACA is able to distill large, variable-length sequences into small, fixed-sized latents by attending a small query to large key and value matrices. In ACA-Net, we build a Multi-Layer Aggregation (MLA) block using ACA to generate fixed-sized identity vectors from variable-length inputs. Through global attention, ACA-Net acts as an efficient global feature extractor that adapts to temporal variability unlike existing SV models that apply a fixed function for pooling over the temporal dimension which may obscure information about the signal's non-stationary temporal variability. Our experiments on the WSJ0-1talker show ACA-Net outperforms a strong baseline by 5\% relative improvement in EER using only 1/5 of the parameters. △ Less

Submitted 20 May, 2023; originally announced May 2023.

Comments: Accepted to INTERSPEECH 2023

arXiv:2305.08542 [pdf, other]

Design and Implementation of Emergency Simulated Lighting System Based on Tello UAV

Authors: Yexin Pan, Yong Xu, Bo Ma, Chuanhuang Li

Abstract: In recent years, with the increasing maturity of UAV technology, the application of UAV in the civilian field has seen explosive growth due to their low cost, high flexibility, and wide adaptability. In order to address the drawbacks of current tethered UAV lighting, which necessitates manual operation and coordination with tethered cables, this paper presents a rapid reaction and autonomous deplo… ▽ More In recent years, with the increasing maturity of UAV technology, the application of UAV in the civilian field has seen explosive growth due to their low cost, high flexibility, and wide adaptability. In order to address the drawbacks of current tethered UAV lighting, which necessitates manual operation and coordination with tethered cables, this paper presents a rapid reaction and autonomous deployment emergency lighting system prototype based on Tello UAV. The system design is divided into three modules. First, the lighting module has designed the protogype lighting extensions for the Tello UAV. By selecting and installing LED light sources reasonably, the stability of the UAV's takeoff, landing, and flight is not affected, while ensuring good lighting effects; Second, the addressing module extracts the optimization objective function based on the optimization deployment requirements of UAV, and uses simulated annealing algorithm to iteratively optimize for different distribution scenarios and user needs, calculating the optimal deployment location of UAV and planning flight missions for UAV. Third, the flight control module has designed a specialized command control framework based on Tello UAV's API, which converts the planned flight path into command statements, forms flight text, and controls the flight of unmanned aerial vehicles accordingly. The test results of the emergency lighting system show that the system can respond quickly to both densely distributed and sparsely distributed user scenarios, deploying UAV to the best service locations, and providing high-quality services to users. This work represents an early attempt of Python-based real-world UAV lighting system prototype design to associate the UAV hardware and Python algorithms. △ Less

Submitted 15 May, 2023; originally announced May 2023.

Comments: plan to submit to IEEE conference

arXiv:2305.01170 [pdf, other]

Contrastive Speech Mixup for Low-resource Keyword Spotting

Authors: Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Chong Zhang, Yukun Ma, Trung Hieu Nguyen, Chongjia Ni, Eng Siong Chng, Bin Ma

Abstract: Most of the existing neural-based models for keyword spotting (KWS) in smart devices require thousands of training samples to learn a decent audio representation. However, with the rising demand for smart devices to become more personalized, KWS models need to adapt quickly to smaller user samples. To tackle this challenge, we propose a contrastive speech mixup (CosMix) learning algorithm for low-… ▽ More Most of the existing neural-based models for keyword spotting (KWS) in smart devices require thousands of training samples to learn a decent audio representation. However, with the rising demand for smart devices to become more personalized, KWS models need to adapt quickly to smaller user samples. To tackle this challenge, we propose a contrastive speech mixup (CosMix) learning algorithm for low-resource KWS. CosMix introduces an auxiliary contrastive loss to the existing mixup augmentation technique to maximize the relative similarity between the original pre-mixed samples and the augmented samples. The goal is to inject enhancing constraints to guide the model towards simpler but richer content-based speech representations from two augmented views (i.e. noisy mixed and clean pre-mixed utterances). We conduct our experiments on the Google Speech Command dataset, where we trim the size of the training set to as small as 2.5 mins per keyword to simulate a low-resource condition. Our experimental results show a consistent improvement in the performance of multiple models, which exhibits the effectiveness of our method. △ Less

Submitted 1 May, 2023; originally announced May 2023.

Comments: Accepted by ICASSP 2023

arXiv:2303.03600 [pdf, other]

Adaptive Knowledge Distillation between Text and Speech Pre-trained Models

Authors: **jie Ni, Yukun Ma, Wen Wang, Qian Chen, Dianwen Ng, Han Lei, Trung Hieu Nguyen, Chong Zhang, Bin Ma, Erik Cambria

Abstract: Learning on a massive amount of speech corpus leads to the recent success of many self-supervised speech models. With knowledge distillation, these models may also benefit from the knowledge encoded by language models that are pre-trained on rich sources of texts. The distillation process, however, is challenging due to the modal disparity between textual and speech embedding spaces. This paper st… ▽ More Learning on a massive amount of speech corpus leads to the recent success of many self-supervised speech models. With knowledge distillation, these models may also benefit from the knowledge encoded by language models that are pre-trained on rich sources of texts. The distillation process, however, is challenging due to the modal disparity between textual and speech embedding spaces. This paper studies metric-based distillation to align the embedding space of text and speech with only a small amount of data without modifying the model structure. Since the semantic and granularity gap between text and speech has been omitted in literature, which impairs the distillation, we propose the Prior-informed Adaptive knowledge Distillation (PAD) that adaptively leverages text/speech units of variable granularity and prior distributions to achieve better global and local alignments between text and speech pre-trained models. We evaluate on three spoken language understanding benchmarks to show that PAD is more effective in transferring linguistic knowledge than other metric-based distillation approaches. △ Less

Submitted 6 March, 2023; originally announced March 2023.

arXiv:2302.14597 [pdf, other]

deHuBERT: Disentangling Noise in a Self-supervised Model for Robust Speech Recognition

Authors: Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Zhao Yang, **jie Ni, Chong Zhang, Yukun Ma, Chongjia Ni, Eng Siong Chng, Bin Ma

Abstract: Existing self-supervised pre-trained speech models have offered an effective way to leverage massive unannotated corpora to build good automatic speech recognition (ASR). However, many current models are trained on a clean corpus from a single source, which tends to do poorly when noise is present during testing. Nonetheless, it is crucial to overcome the adverse influence of noise for real-world… ▽ More Existing self-supervised pre-trained speech models have offered an effective way to leverage massive unannotated corpora to build good automatic speech recognition (ASR). However, many current models are trained on a clean corpus from a single source, which tends to do poorly when noise is present during testing. Nonetheless, it is crucial to overcome the adverse influence of noise for real-world applications. In this work, we propose a novel training framework, called deHuBERT, for noise reduction encoding inspired by H. Barlow's redundancy-reduction principle. The new framework improves the HuBERT training algorithm by introducing auxiliary losses that drive the self- and cross-correlation matrix between pairwise noise-distorted embeddings towards identity matrix. This encourages the model to produce noise-agnostic speech representations. With this method, we report improved robustness in noisy environments, including unseen noises, without impairing the performance on the clean set. △ Less

Submitted 28 February, 2023; originally announced February 2023.

Comments: Accepted by ICASSP 2023

arXiv:2302.11832 [pdf, other]

D2Former: A Fully Complex Dual-Path Dual-Decoder Conformer Network using Joint Complex Masking and Complex Spectral Map** for Monaural Speech Enhancement

Authors: Shengkui Zhao, Bin Ma

Abstract: Monaural speech enhancement has been widely studied using real networks in the time-frequency (TF) domain. However, the input and the target are naturally complex-valued in the TF domain, a fully complex network is highly desirable for effectively learning the feature representation and modelling the sequence in the complex domain. Moreover, phase, an important factor for perceptual quality of spe… ▽ More Monaural speech enhancement has been widely studied using real networks in the time-frequency (TF) domain. However, the input and the target are naturally complex-valued in the TF domain, a fully complex network is highly desirable for effectively learning the feature representation and modelling the sequence in the complex domain. Moreover, phase, an important factor for perceptual quality of speech, has been proved learnable together with magnitude from noisy speech using complex masking or complex spectral map**. Many recent studies focus on either complex masking or complex spectral map**, ignoring their performance boundaries. To address above issues, we propose a fully complex dual-path dual-decoder conformer network (D2Former) using joint complex masking and complex spectral map** for monaural speech enhancement. In D2Former, we extend the conformer network into the complex domain and form a dual-path complex TF self-attention architecture for effectively modelling the complex-valued TF sequence. We further boost the TF feature representation in the encoder and the decoders using a dual-path learning structure by exploiting complex dilated convolutions on time dependency and complex feedforward sequential memory networks (CFSMN) for frequency recurrence. In addition, we improve the performance boundaries of complex masking and complex spectral map** by combining the strengths of the two training targets into a joint-learning framework. As a consequence, D2Former takes fully advantages of the complex-valued operations, the dual-path processing, and the joint-training targets. Compared to the previous models, D2Former achieves state-of-the-art results on the VoiceBank+Demand benchmark with the smallest model size of 0.87M parameters. △ Less

Submitted 23 February, 2023; originally announced February 2023.

Comments: 5 pages, 3 figures, accepted by ICASSP 2023

arXiv:2302.11824 [pdf, other]

MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions

Authors: Shengkui Zhao, Bin Ma

Abstract: Transformer based models have provided significant performance improvements in monaural speech separation. However, there is still a performance gap compared to a recent proposed upper bound. The major limitation of the current dual-path Transformer models is the inefficient modelling of long-range elemental interactions and local feature patterns. In this work, we achieve the upper bound by propo… ▽ More Transformer based models have provided significant performance improvements in monaural speech separation. However, there is still a performance gap compared to a recent proposed upper bound. The major limitation of the current dual-path Transformer models is the inefficient modelling of long-range elemental interactions and local feature patterns. In this work, we achieve the upper bound by proposing a gated single-head transformer architecture with convolution-augmented joint self-attentions, named \textit{MossFormer} (\textit{Mo}naural \textit{s}peech \textit{s}eparation Trans\textit{Former}). To effectively solve the indirect elemental interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs a full-computation self-attention on local chunks and a linearised low-cost self-attention over the full sequence. The joint attention enables MossFormer model full-sequence elemental interaction directly. In addition, we employ a powerful attentive gating mechanism with simplified single-head self-attentions. Besides the attentive long-range modelling, we also augment MossFormer with convolutions for the position-wise local pattern modelling. As a consequence, MossFormer significantly outperforms the previous models and achieves the state-of-the-art results on WSJ0-2/3mix and WHAM!/WHAMR! benchmarks. Our model achieves the SI-SDRi upper bound of 21.2 dB on WSJ0-3mix and only 0.3 dB below the upper bound of 23.1 dB on WSJ0-2mix. △ Less

Submitted 23 February, 2023; originally announced February 2023.

Comments: 5 pages, 3 figures, accepted by ICASSP 2023

arXiv:2212.00329 [pdf, ps, other]

A Novel Speech Feature Fusion Algorithm for Text-Independent Speaker Recognition

Authors: Biao Ma, Chengben Xu, Ye Zhang

Abstract: A novel speech feature fusion algorithm with independent vector analysis (IVA) and parallel convolutional neural network (PCNN) is proposed for text-independent speaker recognition. Firstly, some different feature types, such as the time domain (TD) features and the frequency domain (FD) features, can be extracted from a speaker's speech, and the TD and the FD features can be considered as the lin… ▽ More A novel speech feature fusion algorithm with independent vector analysis (IVA) and parallel convolutional neural network (PCNN) is proposed for text-independent speaker recognition. Firstly, some different feature types, such as the time domain (TD) features and the frequency domain (FD) features, can be extracted from a speaker's speech, and the TD and the FD features can be considered as the linear mixtures of independent feature components (IFCs) with an unknown mixing system. To estimate the IFCs, the TD and the FD features of the speaker's speech are concatenated to build the TD and the FD feature matrix, respectively. Then, a feature tensor of the speaker's speech is obtained by paralleling the TD and the FD feature matrix. To enhance the dependence on different feature types and remove the redundancies of the same feature type, the independent vector analysis (IVA) can be used to estimate the IFC matrices of TD and FD features with the feature tensor. The IFC matrices are utilized as the input of the PCNN to extract the deep features of the TD and FD features, respectively. The deep features can be integrated to obtain the fusion feature of the speaker's speech. Finally, the fusion feature of the speaker's speech is employed as the input of a deep convolutional neural network (DCNN) classifier for speaker recognition. The experimental results show the effectiveness and performances of the proposed speaker recognition system. △ Less

Submitted 1 December, 2022; originally announced December 2022.

arXiv:2210.16477 [pdf, other]

Adaptive Fuzzy Tracking Control with Global Prescribed-Time Prescribed Performance for Uncertain Strict-Feedback Nonlinear Systems

Authors: Bing Mao, Xiaoqun Wu, Hui Liu, Yuhua Xu, **hu Lü

Abstract: Adaptive fuzzy control strategies are established to achieve global prescribed performance with prescribed-time convergence for strict-feedback systems with mismatched uncertainties and unknown nonlinearities. Firstly, to quantify the transient and steady performance constraints of the tracking error, a class of prescribed-time prescribed performance functions are designed, and a novel error trans… ▽ More Adaptive fuzzy control strategies are established to achieve global prescribed performance with prescribed-time convergence for strict-feedback systems with mismatched uncertainties and unknown nonlinearities. Firstly, to quantify the transient and steady performance constraints of the tracking error, a class of prescribed-time prescribed performance functions are designed, and a novel error transformation function is introduced to remove the initial value constraints and solve the singularity problem in existing works. Secondly, based on dynamic surface control methods, controllers with or without approximating structures are established to guarantee that the tracking error achieves prescribed transient performance and converges into a prescribed bounded set within prescribed time. In particular, the settling time and initial value of the prescribed performance function are completely independent of initial conditions of the tracking error and system parameters, which improves existing results. Moreover, with a novel Lyapunov-like energy function, not only the differential explosion problem frequently occurring in backstep** techniques is solved, but the drawback of the semi-global boundedness of tracking error induced by dynamic surface control can be overcome. The validity and effectiveness of the main results are verified by numerical simulations on practical examples. △ Less

Submitted 27 December, 2022; v1 submitted 28 October, 2022; originally announced October 2022.

arXiv:2210.13756 [pdf, other]

Mixed-EVC: Mixed Emotion Synthesis and Control in Voice Conversion

Authors: Kun Zhou, Berrak Sisman, Carlos Busso, Bin Ma, Haizhou Li

Abstract: Emotional voice conversion (EVC) traditionally targets the transformation of spoken utterances from one emotional state to another, with previous research mainly focusing on discrete emotion categories. This paper departs from the norm by introducing a novel perspective: a nuanced rendering of mixed emotions and enhancing control over emotional expression. To achieve this, we propose a novel EVC f… ▽ More Emotional voice conversion (EVC) traditionally targets the transformation of spoken utterances from one emotional state to another, with previous research mainly focusing on discrete emotion categories. This paper departs from the norm by introducing a novel perspective: a nuanced rendering of mixed emotions and enhancing control over emotional expression. To achieve this, we propose a novel EVC framework, Mixed-EVC, which only leverages discrete emotion training labels. We construct an attribute vector that encodes the relationships among these discrete emotions, which is predicted using a ranking-based support vector machine and then integrated into a sequence-to-sequence (seq2seq) EVC framework. Mixed-EVC not only learns to characterize the input emotional style but also quantifies its relevance to other emotions during training. As a result, users have the ability to assign these attributes to achieve their desired rendering of mixed emotions. Objective and subjective evaluations confirm the effectiveness of our approach in terms of mixed emotion synthesis and control while surpassing traditional baselines in the conversion of discrete emotions from one to another. △ Less

Submitted 17 September, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

arXiv:2210.03580 [pdf]

doi 10.1109/ICOT.2017.8336109

Cloud-based Automatic Speech Recognition Systems for Southeast Asian Languages

Authors: Lei Wang, Rong Tong, Cheung Chi Leung, Sunil Sivadas, Chongjia Ni, Bin Ma

Abstract: This paper provides an overall introduction of our Automatic Speech Recognition (ASR) systems for Southeast Asian languages. As not much existing work has been carried out on such regional languages, a few difficulties should be addressed before building the systems: limitation on speech and text resources, lack of linguistic knowledge, etc. This work takes Bahasa Indonesia and Thai as examples to… ▽ More This paper provides an overall introduction of our Automatic Speech Recognition (ASR) systems for Southeast Asian languages. As not much existing work has been carried out on such regional languages, a few difficulties should be addressed before building the systems: limitation on speech and text resources, lack of linguistic knowledge, etc. This work takes Bahasa Indonesia and Thai as examples to illustrate the strategies of collecting various resources required for building ASR systems. △ Less

Submitted 7 October, 2022; originally announced October 2022.

Comments: Published by the 2017 IEEE International Conference on Orange Technologies (ICOT 2017)

ACM Class: I.2.7

arXiv:2209.06360 [pdf, other]

I2CR: Improving Noise Robustness on Keyword Spotting Using Inter-Intra Contrastive Regularization

Authors: Dianwen Ng, Jia Qi Yip, Tanmay Surana, Zhao Yang, Chong Zhang, Yukun Ma, Chongjia Ni, Eng Siong Chng, Bin Ma

Abstract: Noise robustness in keyword spotting remains a challenge as many models fail to overcome the heavy influence of noises, causing the deterioration of the quality of feature embeddings. We proposed a contrastive regularization method called Inter-Intra Contrastive Regularization (I2CR) to improve the feature representations by guiding the model to learn the fundamental speech information specific to… ▽ More Noise robustness in keyword spotting remains a challenge as many models fail to overcome the heavy influence of noises, causing the deterioration of the quality of feature embeddings. We proposed a contrastive regularization method called Inter-Intra Contrastive Regularization (I2CR) to improve the feature representations by guiding the model to learn the fundamental speech information specific to the cluster. This involves maximizing the similarity across Intra and Inter samples of the same class. As a result, it pulls the instances closer to more generalized representations that form more prominent clusters and reduces the adverse impact of noises. We show that our method provides consistent improvements in accuracy over different backbone model architectures under different noise environments. We also demonstrate that our proposed framework has improved the accuracy of unseen out-of-domain noises and unseen variant noise SNRs. This indicates the significance of our work with the overall refinement in noise robustness. △ Less

Submitted 13 September, 2022; originally announced September 2022.

arXiv:2209.01740 [pdf]

A Multi-scale Video Denoising Algorithm for Raw Image

Authors: Bin Ma, Yueli Hu, Xianxian Lv, Kai Li

Abstract: Video denoising for raw image has always been the difficulty of camera image processing. On the one hand, image denoising performance largely determines the image quality, moreover denoising effect in raw image will affect the accuracy of the following operations of ISP processing flow. On the other hand, compared with image, video have motion information in time sequence, thus motion estimation w… ▽ More Video denoising for raw image has always been the difficulty of camera image processing. On the one hand, image denoising performance largely determines the image quality, moreover denoising effect in raw image will affect the accuracy of the following operations of ISP processing flow. On the other hand, compared with image, video have motion information in time sequence, thus motion estimation which is complex and computationally expensive is needed in video denoising. In view of the above problems, this paper proposes a video denoising algorithm for raw image, performing multiple cascading processing stages on raw-RGB image based on convolutional neural network, and carries out implicit motion estimation in the network. The denoising performance is far superior to that of traditional algorithms with minimal computation and bandwidth, and has computational advantages compared with most deep learning algorithms. △ Less

Submitted 4 September, 2022; originally announced September 2022.

arXiv:2208.09096 [pdf, other]

Representation Learning for the Automatic Indexing of Sound Effects Libraries

Authors: Alison B. Ma, Alexander Lerch

Abstract: Labeling and maintaining a commercial sound effects library is a time-consuming task exacerbated by databases that continually grow in size and undergo taxonomy updates. Moreover, sound search and taxonomy creation are complicated by non-uniform metadata, an unrelenting problem even with the introduction of a new industry standard, the Universal Category System. To address these problems and overc… ▽ More Labeling and maintaining a commercial sound effects library is a time-consuming task exacerbated by databases that continually grow in size and undergo taxonomy updates. Moreover, sound search and taxonomy creation are complicated by non-uniform metadata, an unrelenting problem even with the introduction of a new industry standard, the Universal Category System. To address these problems and overcome dataset-dependent limitations that inhibit the successful training of deep learning models, we pursue representation learning to train generalized embeddings that can be used for a wide variety of sound effects libraries and are a taxonomy-agnostic representation of sound. We show that a task-specific but dataset-independent representation can successfully address data issues such as class imbalance, inconsistent class labels, and insufficient dataset size, outperforming established representations such as OpenL3. Detailed experimental results show the impact of metric learning approaches and different cross-dataset training methods on representational effectiveness. △ Less

Submitted 18 August, 2022; originally announced August 2022.

Comments: Accepted at the 23rd International Society for Music Information Retrieval Conference (ISMIR 2022), 10 pages, 7 figures

arXiv:2208.00935 [pdf, other]

Amino Acid Classification in 2D NMR Spectra via Acoustic Signal Embeddings

Authors: Jia Qi Yip, Dianwen Ng, Bin Ma, Konstantin Pervushin, Eng Siong Chng

Abstract: Nuclear Magnetic Resonance (NMR) is used in structural biology to experimentally determine the structure of proteins, which is used in many areas of biology and is an important part of drug development. Unfortunately, NMR data can cost thousands of dollars per sample to collect and it can take a specialist weeks to assign the observed resonances to specific chemical groups. There has thus been gro… ▽ More Nuclear Magnetic Resonance (NMR) is used in structural biology to experimentally determine the structure of proteins, which is used in many areas of biology and is an important part of drug development. Unfortunately, NMR data can cost thousands of dollars per sample to collect and it can take a specialist weeks to assign the observed resonances to specific chemical groups. There has thus been growing interest in the NMR community to use deep learning to automate NMR data annotation. Due to similarities between NMR and audio data, we propose that methods used in acoustic signal processing can be applied to NMR as well. Using a simulated amino acid dataset, we show that by swap** out filter banks with a trainable convolutional encoder, acoustic signal embeddings from speaker verification models can be used for amino acid classification in 2D NMR spectra by treating each amino acid as a unique speaker. On an NMR dataset comparable in size with of 46 hours of audio, we achieve a classification performance of 97.7% on a 20-class problem. We also achieve a 23% relative improvement by using an acoustic embedding model compared to an existing NMR-based model. △ Less

Submitted 1 August, 2022; originally announced August 2022.

arXiv:2206.07293 [pdf, other]

FRCRN: Boosting Feature Representation using Frequency Recurrence for Monaural Speech Enhancement

Authors: Shengkui Zhao, Bin Ma, Karn N. Watcharasupat, Woon-Seng Gan

Abstract: Convolutional recurrent networks (CRN) integrating a convolutional encoder-decoder (CED) structure and a recurrent structure have achieved promising performance for monaural speech enhancement. However, feature representation across frequency context is highly constrained due to limited receptive fields in the convolutions of CED. In this paper, we propose a convolutional recurrent encoder-decoder… ▽ More Convolutional recurrent networks (CRN) integrating a convolutional encoder-decoder (CED) structure and a recurrent structure have achieved promising performance for monaural speech enhancement. However, feature representation across frequency context is highly constrained due to limited receptive fields in the convolutions of CED. In this paper, we propose a convolutional recurrent encoder-decoder (CRED) structure to boost feature representation along the frequency axis. The CRED applies frequency recurrence on 3D convolutional feature maps along the frequency axis following each convolution, therefore, it is capable of catching long-range frequency correlations and enhancing feature representations of speech inputs. The proposed frequency recurrence is realized efficiently using a feedforward sequential memory network (FSMN). Besides the CRED, we insert two stacked FSMN layers between the encoder and the decoder to model further temporal dynamics. We name the proposed framework as Frequency Recurrent CRN (FRCRN). We design FRCRN to predict complex Ideal Ratio Mask (cIRM) in complex-valued domain and optimize FRCRN using both time-frequency-domain and time-domain losses. Our proposed approach achieved state-of-the-art performance on wideband benchmark datasets and achieved 2nd place for the real-time fullband track in terms of Mean Opinion Score (MOS) and Word Accuracy (WAcc) in the ICASSP 2022 Deep Noise Suppression (DNS) challenge (https://github.com/alibabasglab/FRCRN). △ Less

Submitted 24 November, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

Comments: The paper has been accepted by ICASSP 2022. 5 pages, 2 figures, 5 tables

arXiv:2204.01966 [pdf, other]

Time Efficient Joint UAV-BS Deployment and User Association based on Machine Learning

Authors: Bo Ma, Zitian Zhang, Jiliang Zhang, Jie Zhang

Abstract: This paper proposes a time-efficient mechanism to decrease the on-line computing time of solving the joint unmanned aerial vehicle base station (UAV-BS) deployment and user/sensor association (UDUA) problem aiming at maximizing the downlink sum transmission throughput. The joint UDUA problem is decoupled into two sub-problems: one is the user association sub-problem, which gets the optimal matchin… ▽ More This paper proposes a time-efficient mechanism to decrease the on-line computing time of solving the joint unmanned aerial vehicle base station (UAV-BS) deployment and user/sensor association (UDUA) problem aiming at maximizing the downlink sum transmission throughput. The joint UDUA problem is decoupled into two sub-problems: one is the user association sub-problem, which gets the optimal matching strategy between aerial and ground nodes for certain UAV-BS positions; and the other is the UAV-BS deployment sub-problem trying to find the best position combination of the UAV-BSs that make the solution of the first sub-problem optimal among all the possible position combinations of the UAV-BSs. In the proposed mechanism, we transform the user association sub-problem into an equivalent bipartite matching problem and solve it using the Kuhn-Munkres algorithm. For the UAV-BS deployment sub-problem, we theoretically prove that adopting the best UAV-BS deployment strategy of a previous user distribution for each new user distribution will introduce little performance decline compared with the new user distribution's ground true best strategy if the two user distributions are similar enough. Based on our mathematical analyses, the similarity level between user distributions is well defined and becomes the key to solve the second sub-problem. Numerical results indicate that the proposed UDUA mechanism can achieve near-optimal system performance in terms of average downlink sum transmission throughput and failure rate with enormously reduced computing time compared with benchmark approaches. △ Less

Submitted 4 April, 2022; originally announced April 2022.

Comments: 13 pages, this paper has been submitted to IEEE IoT Journal

arXiv:2202.03647 [pdf, other]

Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

Authors: Fan Yu, Shiliang Zhang, Pengcheng Guo, Yihui Fu, Zhihao Du, Siqi Zheng, Weilong Huang, Lei Xie, Zheng-Hua Tan, DeLiang Wang, Yanmin Qian, Kong Aik Lee, Zhijie Yan, Bin Ma, Xin Xu, Hui Bu

Abstract: The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and the most challenging scenarios of speech technologies. The M2MeT challenge has particularly set up two tracks, speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Ma… ▽ More The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and the most challenging scenarios of speech technologies. The M2MeT challenge has particularly set up two tracks, speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Mandarin meeting speech data with manual annotation, including far-field data collected by 8-channel microphone array as well as near-field data collected by each participants' headset microphone. We briefly describe the released dataset, track setups, baselines and summarize the challenge results and major techniques used in the submissions. △ Less

Submitted 25 February, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

Comments: Accepted by ICASSP 2022

arXiv:2111.07892 [pdf]

Data privacy protection in microscopic image analysis for material data mining

Authors: Boyuan Ma, Xiang Yin, Xiaojuan Ban, Haiyou Huang, Neng Zhang, Hao Wang, Weihua Xue

Abstract: Recent progress in material data mining has been driven by high-capacity models trained on large datasets. However, collecting experimental data has been extremely costly owing to the amount of human effort and expertise required. Therefore, material researchers are often reluctant to easily disclose their private data, which leads to the problem of data island, and it is difficult to collect a la… ▽ More Recent progress in material data mining has been driven by high-capacity models trained on large datasets. However, collecting experimental data has been extremely costly owing to the amount of human effort and expertise required. Therefore, material researchers are often reluctant to easily disclose their private data, which leads to the problem of data island, and it is difficult to collect a large amount of data to train high-quality models. In this study, a material microstructure image feature extraction algorithm FedTransfer based on data privacy protection is proposed. The core contributions are as follows: 1) the federated learning algorithm is introduced into the polycrystalline microstructure image segmentation task to make full use of different user data to carry out machine learning, break the data island and improve the model generalization ability under the condition of ensuring the privacy and security of user data; 2) A data sharing strategy based on style transfer is proposed. By sharing style information of images that is not urgent for user confidentiality, it can reduce the performance penalty caused by the distribution difference of data among different users. △ Less

Submitted 9 November, 2021; originally announced November 2021.

Comments: 14 pages

arXiv:2110.08545 [pdf, other]

A Unified Speaker Adaptation Approach for ASR

Authors: Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma

Abstract: Transformer models have been used in automatic speech recognition (ASR) successfully and yields state-of-the-art results. However, its performance is still affected by speaker mismatch between training and test data. Further finetuning a trained model with target speaker data is the most natural approach for adaptation, but it takes a lot of compute and may cause catastrophic forgetting to the exi… ▽ More Transformer models have been used in automatic speech recognition (ASR) successfully and yields state-of-the-art results. However, its performance is still affected by speaker mismatch between training and test data. Further finetuning a trained model with target speaker data is the most natural approach for adaptation, but it takes a lot of compute and may cause catastrophic forgetting to the existing speakers. In this work, we propose a unified speaker adaptation approach consisting of feature adaptation and model adaptation. For feature adaptation, we employ a speaker-aware persistent memory model which generalizes better to unseen test speakers by making use of speaker i-vectors to form a persistent memory. For model adaptation, we use a novel gradual pruning method to adapt to target speakers without changing the model architecture, which to the best of our knowledge, has never been explored in ASR. Specifically, we gradually prune less contributing parameters on model encoder to a certain sparsity level, and use the pruned parameters for adaptation, while freezing the unpruned parameters to keep the original model performance. We conduct experiments on the Librispeech dataset. Our proposed approach brings relative 2.74-6.52% word error rate (WER) reduction on general speaker adaptation. On target speaker adaptation, our method outperforms the baseline with up to 20.58% relative WER reduction, and surpasses the finetuning method by up to relative 2.54%. Besides, with extremely low-resource adaptation data (e.g., 1 utterance), our method could improve the WER by relative 6.53% with only a few epochs of training. △ Less

Submitted 16 October, 2021; originally announced October 2021.

Comments: Accepted by EMNLP 2021

arXiv:2110.07393 [pdf, other]

M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge

Authors: Fan Yu, Shiliang Zhang, Yihui Fu, Lei Xie, Siqi Zheng, Zhihao Du, Weilong Huang, Pengcheng Guo, Zhijie Yan, Bin Ma, Xin Xu, Hui Bu

Abstract: Recent development of speech processing, such as speech recognition, speaker diarization, etc., has inspired numerous applications of speech technologies. The meeting scenario is one of the most valuable and, at the same time, most challenging scenarios for the deployment of speech technologies. Specifically, two typical tasks, speaker diarization and multi-speaker automatic speech recognition hav… ▽ More Recent development of speech processing, such as speech recognition, speaker diarization, etc., has inspired numerous applications of speech technologies. The meeting scenario is one of the most valuable and, at the same time, most challenging scenarios for the deployment of speech technologies. Specifically, two typical tasks, speaker diarization and multi-speaker automatic speech recognition have attracted much attention recently. However, the lack of large public meeting data has been a major obstacle for the advancement of the field. Therefore, we make available the AliMeeting corpus, which consists of 120 hours of recorded Mandarin meeting data, including far-field data collected by 8-channel microphone array as well as near-field data collected by headset microphone. Each meeting session is composed of 2-4 speakers with different speaker overlap ratio, recorded in rooms with different size. Along with the dataset, we launch the ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (M2MeT) with two tracks, namely speaker diarization and multi-speaker ASR, aiming to provide a common testbed for meeting rich transcription and promote reproducible research in this field. In this paper we provide a detailed introduction of the AliMeeting dateset, challenge rules, evaluation methods and baseline systems. △ Less

Submitted 25 February, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

Comments: Accepted by ICASSP 2022

arXiv:2110.00745 [pdf, other]

doi 10.1109/ICASSP43922.2022.9747034

End-to-End Complex-Valued Multidilated Convolutional Neural Network for Joint Acoustic Echo Cancellation and Noise Suppression

Authors: Karn N. Watcharasupat, Thi Ngoc Tho Nguyen, Woon-Seng Gan, Shengkui Zhao, Bin Ma

Abstract: Echo and noise suppression is an integral part of a full-duplex communication system. Many recent acoustic echo cancellation (AEC) systems rely on a separate adaptive filtering module for linear echo suppression and a neural module for residual echo suppression. However, not only do adaptive filtering modules require convergence and remain susceptible to changes in acoustic environments, but this… ▽ More Echo and noise suppression is an integral part of a full-duplex communication system. Many recent acoustic echo cancellation (AEC) systems rely on a separate adaptive filtering module for linear echo suppression and a neural module for residual echo suppression. However, not only do adaptive filtering modules require convergence and remain susceptible to changes in acoustic environments, but this two-stage framework also often introduces unnecessary delays to the AEC system when neural modules are already capable of both linear and nonlinear echo suppression. In this paper, we exploit the offset-compensating ability of complex time-frequency masks and propose an end-to-end complex-valued neural network architecture. The building block of the proposed model is a pseudocomplex extension based on the densely-connected multidilated DenseNet (D3Net) building block, resulting in a very small network of only 354K parameters. The architecture utilized the multi-resolution nature of the D3Net building blocks to eliminate the need for pooling, allowing the network to extract features using large receptive fields without any loss of output resolution. We also propose a dual-mask technique for joint echo and noise suppression with simultaneous speech enhancement. Evaluation on both synthetic and real test sets demonstrated promising results across multiple energy-based metrics and perceptual proxies. △ Less

Submitted 22 January, 2022; v1 submitted 2 October, 2021; originally announced October 2021.

Comments: To be presented at the 2022 International Conference on Acoustics, Speech, & Signal Processing (ICASSP)

Journal ref: Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 656-660

arXiv:2107.14051 [pdf]

Improvement of image classification by multiple optical scattering

Authors: Xinyu Gao, Yi Li, Yanqing Qiu, Bangning Mao, Miaogen Chen, Yanlong Meng, Chunliu Zhao, Juan Kang, Yong Guo, Changyu Shen

Abstract: Multiple optical scattering occurs when light propagates in a non-uniform medium. During the multiple scattering, images were distorted and the spatial information they carried became scrambled. However, the image information is not lost but presents in the form of speckle patterns (SPs). In this study, we built up an optical random scattering system based on an LCD and an RGB laser source. We fou… ▽ More Multiple optical scattering occurs when light propagates in a non-uniform medium. During the multiple scattering, images were distorted and the spatial information they carried became scrambled. However, the image information is not lost but presents in the form of speckle patterns (SPs). In this study, we built up an optical random scattering system based on an LCD and an RGB laser source. We found that the image classification can be improved by the help of random scattering which is considered as a feedforward neural network to extracts features from image. Along with the ridge classification deployed on computer, we achieved excellent classification accuracy higher than 94%, for a variety of data sets covering medical, agricultural, environmental protection and other fields. In addition, the proposed optical scattering system has the advantages of high speed, low power consumption, and miniaturization, which is suitable for deploying in edge computing applications. △ Less

Submitted 12 July, 2021; originally announced July 2021.

arXiv:2107.05190 [pdf]

doi 10.1117/1.JEI.30.5.053014

Deep-learning-based Hyperspectral imaging through a RGB camera

Authors: Xinyu Gao, Tianlang Wang, **g Yang, **chao Tao, Yanqing Qiu, Yanlong Meng, Banging Mao, Pengwei Zhou, Yi Li

Abstract: Hyperspectral image (HSI) contains both spatial pattern and spectral information which has been widely used in food safety, remote sensing, and medical detection. However, the acquisition of hyperspectral images is usually costly due to the complicated apparatus for the acquisition of optical spectrum. Recently, it has been reported that HSI can be reconstructed from single RGB image using convolu… ▽ More Hyperspectral image (HSI) contains both spatial pattern and spectral information which has been widely used in food safety, remote sensing, and medical detection. However, the acquisition of hyperspectral images is usually costly due to the complicated apparatus for the acquisition of optical spectrum. Recently, it has been reported that HSI can be reconstructed from single RGB image using convolution neural network (CNN) algorithms. Compared with the traditional hyperspectral cameras, the method based on CNN algorithms is simple, portable and low cost. In this study, we focused on the influence of the RGB camera spectral sensitivity (CSS) on the HSI. A Xenon lamp incorporated with a monochromator were used as the standard light source to calibrate the CSS. And the experimental results show that the CSS plays a significant role in the reconstruction accuracy of an HSI. In addition, we proposed a new HSI reconstruction network where the dimensional structure of the original hyperspectral datacube was modified by 3D matrix transpose to improve the reconstruction accuracy. △ Less

Submitted 12 July, 2021; originally announced July 2021.

arXiv:2102.01993 [pdf, other]

Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time Frequency Losses

Authors: Shengkui Zhao, Trung Hieu Nguyen, Bin Ma

Abstract: Deep complex U-Net structure and convolutional recurrent network (CRN) structure achieve state-of-the-art performance for monaural speech enhancement. Both deep complex U-Net and CRN are encoder and decoder structures with skip connections, which heavily rely on the representation power of the complex-valued convolutional layers. In this paper, we propose a complex convolutional block attention mo… ▽ More Deep complex U-Net structure and convolutional recurrent network (CRN) structure achieve state-of-the-art performance for monaural speech enhancement. Both deep complex U-Net and CRN are encoder and decoder structures with skip connections, which heavily rely on the representation power of the complex-valued convolutional layers. In this paper, we propose a complex convolutional block attention module (CCBAM) to boost the representation power of the complex-valued convolutional layers by constructing more informative features. The CCBAM is a lightweight and general module which can be easily integrated into any complex-valued convolutional layers. We integrate CCBAM with the deep complex U-Net and CRN to enhance their performance for speech enhancement. We further propose a mixed loss function to jointly optimize the complex models in both time-frequency (TF) domain and time domain. By integrating CCBAM and the mixed loss, we form a new end-to-end (E2E) complex speech enhancement framework. Ablation experiments and objective evaluations show the superior performance of the proposed approaches. △ Less

Submitted 3 February, 2021; originally announced February 2021.

Comments: 5 pages, 4 figures, 2 tables, accepted by ICASSP 2021

arXiv:2102.01991 [pdf, other]

Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram

Authors: Shengkui Zhao, Hao Wang, Trung Hieu Nguyen, Bin Ma

Abstract: Cross-lingual voice conversion (VC) is an important and challenging problem due to significant mismatches of the phonetic set and the speech prosody of different languages. In this paper, we build upon the neural text-to-speech (TTS) model, i.e., FastSpeech, and LPCNet neural vocoder to design a new cross-lingual VC framework named FastSpeech-VC. We address the mismatches of the phonetic set and t… ▽ More Cross-lingual voice conversion (VC) is an important and challenging problem due to significant mismatches of the phonetic set and the speech prosody of different languages. In this paper, we build upon the neural text-to-speech (TTS) model, i.e., FastSpeech, and LPCNet neural vocoder to design a new cross-lingual VC framework named FastSpeech-VC. We address the mismatches of the phonetic set and the speech prosody by applying Phonetic PosteriorGrams (PPGs), which have been proved to bridge across speaker and language boundaries. Moreover, we add normalized logarithm-scale fundamental frequency (Log-F0) to further compensate for the prosodic mismatches and significantly improve naturalness. Our experiments on English and Mandarin languages demonstrate that with only mono-lingual corpus, the proposed FastSpeech-VC can achieve high quality converted speech with mean opinion score (MOS) close to the professional records while maintaining good speaker similarity. Compared to the baselines using Tacotron2 and Transformer TTS models, the FastSpeech-VC can achieve controllable converted speech rate and much faster inference speed. More importantly, the FastSpeech-VC can easily be adapted to a speaker with limited training utterances. △ Less

Submitted 3 February, 2021; originally announced February 2021.

Comments: 5 pages, 2 figures, 4 tables, accepted by ICASSP 2021

arXiv:2010.08136 [pdf, other]

Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion

Authors: Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, Bin Ma

Abstract: Recent state-of-the-art neural text-to-speech (TTS) synthesis models have dramatically improved intelligibility and naturalness of generated speech from text. However, building a good bilingual or code-switched TTS for a particular voice is still a challenge. The main reason is that it is not easy to obtain a bilingual corpus from a speaker who achieves native-level fluency in both languages. In t… ▽ More Recent state-of-the-art neural text-to-speech (TTS) synthesis models have dramatically improved intelligibility and naturalness of generated speech from text. However, building a good bilingual or code-switched TTS for a particular voice is still a challenge. The main reason is that it is not easy to obtain a bilingual corpus from a speaker who achieves native-level fluency in both languages. In this paper, we explore the use of Mandarin speech recordings from a Mandarin speaker, and English speech recordings from another English speaker to build high-quality bilingual and code-switched TTS for both speakers. A Tacotron2-based cross-lingual voice conversion system is employed to generate the Mandarin speaker's English speech and the English speaker's Mandarin speech, which show good naturalness and speaker similarity. The obtained bilingual data are then augmented with code-switched utterances synthesized using a Transformer model. With these data, three neural TTS models -- Tacotron2, Transformer and FastSpeech are applied for building bilingual and code-switched TTS. Subjective evaluation results show that all the three systems can produce (near-)native-level speech in both languages for each of the speaker. △ Less

Submitted 15 October, 2020; originally announced October 2020.

Comments: 5 pages, 2 figures, INTERSPEECH 2020

arXiv:2005.10407 [pdf, other]

Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning

Authors: Zhi** Zeng, Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Eng Siong Chng, Chongjia Ni, Bin Ma

Abstract: In this work, we study leveraging extra text data to improve low-resource end-to-end ASR under cross-lingual transfer learning setting. To this end, we extend our prior work [1], and propose a hybrid Transformer-LSTM based architecture. This architecture not only takes advantage of the highly effective encoding capacity of the Transformer network but also benefits from extra text data due to the L… ▽ More In this work, we study leveraging extra text data to improve low-resource end-to-end ASR under cross-lingual transfer learning setting. To this end, we extend our prior work [1], and propose a hybrid Transformer-LSTM based architecture. This architecture not only takes advantage of the highly effective encoding capacity of the Transformer network but also benefits from extra text data due to the LSTM-based independent language model network. We conduct experiments on our in-house Malay corpus which contains limited labeled data and a large amount of extra text. Results show that the proposed architecture outperforms the previous LSTM-based architecture [1] by 24.2% relative word error rate (WER) when both are trained using limited labeled data. Starting from this, we obtain further 25.4% relative WER reduction by transfer learning from another resource-rich language. Moreover, we obtain additional 13.6% relative WER reduction by boosting the LSTM decoder of the transferred model with the extra text data. Overall, our best model outperforms the vanilla Transformer ASR by 11.9% relative WER. Last but not least, the proposed hybrid architecture offers much faster inference compared to both LSTM and Transformer architectures. △ Less

Submitted 28 May, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

arXiv:2003.01084 [pdf, ps, other]

Distributed Leader-Follower Formation Tracking Control of Multiple Quad-rotors

Authors: Lixia Yan, Baoli Ma

Abstract: The leader-follower formation control analysis for multiple quad-rotor systems is investigated in this paper. To achieve predefined formation in the three-dimensional air space ($x,y$ and $z$), a novel local tracking control law and a distributed observer are obtained. The local tracking control law starts with finding a bounded continuous yet greater-than-zero control in $z$, based on which follo… ▽ More The leader-follower formation control analysis for multiple quad-rotor systems is investigated in this paper. To achieve predefined formation in the three-dimensional air space ($x,y$ and $z$), a novel local tracking control law and a distributed observer are obtained. The local tracking control law starts with finding a bounded continuous yet greater-than-zero control in $z$, based on which following a feedback linearization controls derived for errors associated with $x$ and $y$. The distributed observer achieves position coordination among followers, though there are only partial followers can know the leader's states and only neighboring communication is available. Simulation results validate the proposed formation scheme. △ Less

Submitted 2 March, 2020; originally announced March 2020.

arXiv:1912.00863 [pdf, other]

Independent language modeling architecture for end-to-end ASR

Authors: Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Zhi** Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma, Haizhou Li

Abstract: The attention-based end-to-end (E2E) automatic speech recognition (ASR) architecture allows for joint optimization of acoustic and language models within a single network. However, in a vanilla E2E ASR architecture, the decoder sub-network (subnet), which incorporates the role of the language model (LM), is conditioned on the encoder output. This means that the acoustic encoder and the language mo… ▽ More The attention-based end-to-end (E2E) automatic speech recognition (ASR) architecture allows for joint optimization of acoustic and language models within a single network. However, in a vanilla E2E ASR architecture, the decoder sub-network (subnet), which incorporates the role of the language model (LM), is conditioned on the encoder output. This means that the acoustic encoder and the language model are entangled that doesn't allow language model to be trained separately from external text data. To address this problem, in this work, we propose a new architecture that separates the decoder subnet from the encoder output. In this way, the decoupled subnet becomes an independently trainable LM subnet, which can easily be updated using the external text data. We study two strategies for updating the new architecture. Experimental results show that, 1) the independent LM architecture benefits from external text data, achieving 9.3% and 22.8% relative character and word error rate reduction on Mandarin HKUST and English NSC datasets respectively; 2)the proposed architecture works well with external LM and can be generalized to different amount of labelled data. △ Less

Submitted 25 November, 2019; originally announced December 2019.

arXiv:1905.04711 [pdf]

doi 10.1038/s41524-020-00392-6

Data augmentation in microscopic images for material data mining

Authors: Boyuan Ma, Xiaoyan Wei, Chuni Liu, Xiaojuan Ban, Haiyou Huang, Hao Wang, Weihua Xue, Stephen Wu, Mingfei Gao, Qing Shen, Adnan Omer Abuassba, Haokai Shen, Yan**g Su

Abstract: Recent progress in material data mining has been driven by high-capacity models trained on large datasets. However, collecting experimental data (real data) has been extremely costly since the amount of human effort and expertise required. Here, we develop a novel transfer learning strategy to address small or insufficient data problem. This strategy realizes the fusion of real and simulated data,… ▽ More Recent progress in material data mining has been driven by high-capacity models trained on large datasets. However, collecting experimental data (real data) has been extremely costly since the amount of human effort and expertise required. Here, we develop a novel transfer learning strategy to address small or insufficient data problem. This strategy realizes the fusion of real and simulated data, and the augmentation of training data in data mining procedure. For a specific task of image segmentation, this strategy can generate synthetic images by fusing physical mechanism of simulated images and "image style" of real images. The result shows that the model trained with the acquired synthetic images and 35% of the real images outperforms the model trained on all real images. As the time required to generate synthetic data is almost negligible, this strategy is able to reduce the time cost of real data preparation by roughly 65%. △ Less

Submitted 28 October, 2019; v1 submitted 12 May, 2019; originally announced May 2019.

Comments: 17 pages, technical report

Journal ref: npj computational materials 2020

arXiv:1810.08906 [pdf]

Analog-to-digital conversion revolutionized by deep learning

Authors: Shaofu Xu, Xiuting Zou, Bowen Ma, Jian** Chen, Lei Yu, Weiwen Zou

Abstract: As the bridge between the analog world and digital computers, analog-to-digital converters are generally used in modern information systems such as radar, surveillance, and communications. For the configuration of analog-to-digital converters in future high-frequency broadband systems, we introduce a revolutionary architecture that adopts deep learning technology to overcome tradeoffs between band… ▽ More As the bridge between the analog world and digital computers, analog-to-digital converters are generally used in modern information systems such as radar, surveillance, and communications. For the configuration of analog-to-digital converters in future high-frequency broadband systems, we introduce a revolutionary architecture that adopts deep learning technology to overcome tradeoffs between bandwidth, sampling rate, and accuracy. A photonic front-end provides broadband capability for direct sampling and speed multiplication. Trained deep neural networks learn the patterns of system defects, maintaining high accuracy of quantized data in a succinct and adaptive manner. Based on numerical and experimental demonstrations, we show that the proposed architecture outperforms state-of-the-art analog-to-digital converters, confirming the potential of our approach in future analog-to-digital converter design and performance enhancement of future information systems. △ Less

Submitted 21 October, 2018; originally announced October 2018.

Showing 1–40 of 40 results for author: Ma, B