Showing 1–40 of 40 results for author: Ma, B

  1. arXiv:2406.12434  [pdf, other]

    cs.SD cs.LG eess.AS

    Towards Audio Codec-based Speech Separation

    Authors: Jia Qi Yip, Shengkui Zhao, Dianwen Ng, Eng Siong Chng, Bin Ma

    Abstract: Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet been applied to the speech separation (SS) task. SS can benefit from high compression because the compute required for traditional SS models makes them imp…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: This paper was accepted by Interspeech 2024, Blue Sky Track

  2. arXiv:2406.02009  [pdf, other]

    eess.AS cs.CL cs.SD

    Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis

    Authors: Kun Zhou, Shengkui Zhao, Yukun Ma, Chong Zhang, Hao Wang, Dianwen Ng, Chongjia Ni, Nguyen Trung Hieu, Jia Qi Yip, Bin Ma

    Abstract: Recent language model-based text-to-speech (TTS) frameworks demonstrate scalability and in-context learning capabilities. However, they suffer from robustness issues due to the accumulation of errors in speech unit predictions during autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage self-su…

    Submitted 11 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  3. arXiv:2405.02208  [pdf, other]

    eess.IV cs.CV

    Reference-Free Image Quality Metric for Degradation and Reconstruction Artifacts

    Authors: Han Cui, Alfredo De Goyeneche, Efrat Shimron, Boyuan Ma, Michael Lustig

    Abstract: Image Quality Assessment (IQA) is essential in various Computer Vision tasks such as image deblurring and super-resolution. However, most IQA methods require reference images, which are not always available. While there are some reference-free IQA metrics, they have limitations in simulating human perception and discerning subtle image quality variations. We hypothesize that the JPEG quality facto…

    Submitted 1 May, 2024; originally announced May 2024.

  4. arXiv:2403.04626  [pdf, other]

    eess.IV cs.CL cs.CV cs.LG

    MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder

    Authors: Lei Li, Tianfang Zhang, Xinglin Zhang, Jiaqi Liu, Bingqi Ma, Yan Luo, Tao Chen

    Abstract: Within the domain of medical analysis, extensive research has explored the potential of mutual learning between Masked Autoencoders(MAEs) and multimodal data. However, the impact of MAEs on intermodality remains a key challenge. We introduce MedFLIP, a Fast Language-Image Pre-training method for Medical analysis. We explore MAEs for zero-shot learning with crossed domains, which enhances the model…

    Submitted 30 May, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

  5. arXiv:2312.11825  [pdf, other]

    cs.SD eess.AS

    MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation

    Authors: Shengkui Zhao, Yukun Ma, Chongjia Ni, Chong Zhang, Hao Wang, Trung Hieu Nguyen, Kun Zhou, Jiaqi Yip, Dianwen Ng, Bin Ma

    Abstract: Our previously proposed MossFormer has achieved promising performance in monaural speech separation. However, it predominantly adopts a self-attention-based MossFormer module, which tends to emphasize longer-range, coarser-scale dependencies, with a deficiency in effectively modelling finer-scale recurrent patterns. In this paper, we introduce a novel hybrid model that provides the capabilities to…

    Submitted 18 December, 2023; originally announced December 2023.

    Comments: 5 pages, 3 figures, accepted by ICASSP 2024

  6. arXiv:2309.12608  [pdf, other]

    eess.AS cs.SD

    SPGM: Prioritizing Local Features for enhanced speech separation performance

    Authors: Jia Qi Yip, Shengkui Zhao, Yukun Ma, Chongjia Ni, Chong Zhang, Hao Wang, Trung Hieu Nguyen, Kun Zhou, Dianwen Ng, Eng Siong Chng, Bin Ma

    Abstract: Dual-path is a popular architecture for speech separation models (e.g. Sepformer) which splits long sequences into overlapping chunks for its intra- and inter-blocks that separately model intra-chunk local features and inter-chunk global relationships. However, it has been found that inter-blocks, which comprise half a dual-path model's parameters, contribute minimally to performance. Thus, we pro…

    Submitted 10 March, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

    Comments: This paper was accepted by ICASSP 2024

  7. arXiv:2309.09413  [pdf, other]

    cs.SD eess.AS

    Are Soft Prompts Good Zero-shot Learners for Speech Recognition?

    Authors: Dianwen Ng, Chong Zhang, Ruixi Zhang, Yukun Ma, Fabian Ritter-Gutierrez, Trung Hieu Nguyen, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma

    Abstract: Large self-supervised pre-trained speech models require computationally expensive fine-tuning for downstream tasks. Soft prompt tuning offers a simple parameter-efficient alternative by utilizing minimal soft prompt guidance, enhancing portability while also maintaining competitive performance. However, not many people understand how and why this is so. In this study, we aim to deepen our understa…

    Submitted 17 September, 2023; originally announced September 2023.

  8. arXiv:2309.07458  [pdf, other]

    cs.SD eess.AS

    Analysis of Speech Separation Performance Degradation on Emotional Speech Mixtures

    Authors: Jia Qi Yip, Dianwen Ng, Bin Ma, Chng Eng Siong

    Abstract: Despite recent strides made in Speech Separation, most models are trained on datasets with neutral emotions. Emotional speech has been known to degrade performance of models in a variety of speech tasks, which reduces the effectiveness of these models when deployed in real-world scenarios. In this paper we perform analysis to differentiate the performance degradation arising from the emotions in s…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: Accepted by APSIPA ASC 2023

  9. arXiv:2305.12121  [pdf, other]

    cs.SD cs.LG eess.AS

    ACA-Net: Towards Lightweight Speaker Verification using Asymmetric Cross Attention

    Authors: Jia Qi Yip, Tuan Truong, Dianwen Ng, Chong Zhang, Yukun Ma, Trung Hieu Nguyen, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma

    Abstract: In this paper, we propose ACA-Net, a lightweight, global context-aware speaker embedding extractor for Speaker Verification (SV) that improves upon existing work by using Asymmetric Cross Attention (ACA) to replace temporal pooling. ACA is able to distill large, variable-length sequences into small, fixed-sized latents by attending a small query to large key and value matrices. In ACA-Net, we buil…

    Submitted 20 May, 2023; originally announced May 2023.

    Comments: Accepted to INTERSPEECH 2023

  10. arXiv:2305.08542  [pdf, other]

    eess.SY

    Design and Implementation of Emergency Simulated Lighting System Based on Tello UAV

    Authors: Yexin Pan, Yong Xu, Bo Ma, Chuanhuang Li

    Abstract: In recent years, with the increasing maturity of UAV technology, the application of UAV in the civilian field has seen explosive growth due to their low cost, high flexibility, and wide adaptability. In order to address the drawbacks of current tethered UAV lighting, which necessitates manual operation and coordination with tethered cables, this paper presents a rapid reaction and autonomous deplo…

    Submitted 15 May, 2023; originally announced May 2023.

    Comments: plan to submit to IEEE conference

  11. arXiv:2305.01170  [pdf, other]

    cs.SD eess.AS

    Contrastive Speech Mixup for Low-resource Keyword Spotting

    Authors: Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Chong Zhang, Yukun Ma, Trung Hieu Nguyen, Chongjia Ni, Eng Siong Chng, Bin Ma

    Abstract: Most of the existing neural-based models for keyword spotting (KWS) in smart devices require thousands of training samples to learn a decent audio representation. However, with the rising demand for smart devices to become more personalized, KWS models need to adapt quickly to smaller user samples. To tackle this challenge, we propose a contrastive speech mixup (CosMix) learning algorithm for low-…

    Submitted 1 May, 2023; originally announced May 2023.

    Comments: Accepted by ICASSP 2023

  12. arXiv:2303.03600  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Adaptive Knowledge Distillation between Text and Speech Pre-trained Models

    Authors: Jinjie Ni, Yukun Ma, Wen Wang, Qian Chen, Dianwen Ng, Han Lei, Trung Hieu Nguyen, Chong Zhang, Bin Ma, Erik Cambria

    Abstract: Learning on a massive amount of speech corpus leads to the recent success of many self-supervised speech models. With knowledge distillation, these models may also benefit from the knowledge encoded by language models that are pre-trained on rich sources of texts. The distillation process, however, is challenging due to the modal disparity between textual and speech embedding spaces. This paper st…

    Submitted 6 March, 2023; originally announced March 2023.

  13. arXiv:2302.14597  [pdf, other]

    cs.SD eess.AS

    deHuBERT: Disentangling Noise in a Self-supervised Model for Robust Speech Recognition

    Authors: Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Zhao Yang, Jinjie Ni, Chong Zhang, Yukun Ma, Chongjia Ni, Eng Siong Chng, Bin Ma

    Abstract: Existing self-supervised pre-trained speech models have offered an effective way to leverage massive unannotated corpora to build good automatic speech recognition (ASR). However, many current models are trained on a clean corpus from a single source, which tends to do poorly when noise is present during testing. Nonetheless, it is crucial to overcome the adverse influence of noise for real-world…

    Submitted 28 February, 2023; originally announced February 2023.

    Comments: Accepted by ICASSP 2023

  14. arXiv:2302.11832  [pdf, other]

    cs.SD eess.AS

    D2Former: A Fully Complex Dual-Path Dual-Decoder Conformer Network using Joint Complex Masking and Complex Spectral Mapping for Monaural Speech Enhancement

    Authors: Shengkui Zhao, Bin Ma

    Abstract: Monaural speech enhancement has been widely studied using real networks in the time-frequency (TF) domain. However, the input and the target are naturally complex-valued in the TF domain, a fully complex network is highly desirable for effectively learning the feature representation and modelling the sequence in the complex domain. Moreover, phase, an important factor for perceptual quality of spe…

    Submitted 23 February, 2023; originally announced February 2023.

    Comments: 5 pages, 3 figures, accepted by ICASSP 2023

  15. arXiv:2302.11824  [pdf, other]

    cs.SD cs.LG eess.AS

    MossFormer: Pushing the Performance Limit of Monaural Speech Separation using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions

    Authors: Shengkui Zhao, Bin Ma

    Abstract: Transformer based models have provided significant performance improvements in monaural speech separation. However, there is still a performance gap compared to a recent proposed upper bound. The major limitation of the current dual-path Transformer models is the inefficient modelling of long-range elemental interactions and local feature patterns. In this work, we achieve the upper bound by propo…

    Submitted 23 February, 2023; originally announced February 2023.

    Comments: 5 pages, 3 figures, accepted by ICASSP 2023

  16. arXiv:2212.00329  [pdf, ps, other]

    eess.AS

    A Novel Speech Feature Fusion Algorithm for Text-Independent Speaker Recognition

    Authors: Biao Ma, Chengben Xu, Ye Zhang

    Abstract: A novel speech feature fusion algorithm with independent vector analysis (IVA) and parallel convolutional neural network (PCNN) is proposed for text-independent speaker recognition. Firstly, some different feature types, such as the time domain (TD) features and the frequency domain (FD) features, can be extracted from a speaker's speech, and the TD and the FD features can be considered as the lin…

    Submitted 1 December, 2022; originally announced December 2022.

  17. arXiv:2210.16477  [pdf, other]

    eess.SY

    Adaptive Fuzzy Tracking Control with Global Prescribed-Time Prescribed Performance for Uncertain Strict-Feedback Nonlinear Systems

    Authors: Bing Mao, Xiaoqun Wu, Hui Liu, Yuhua Xu, Jinhu Lü

    Abstract: Adaptive fuzzy control strategies are established to achieve global prescribed performance with prescribed-time convergence for strict-feedback systems with mismatched uncertainties and unknown nonlinearities. Firstly, to quantify the transient and steady performance constraints of the tracking error, a class of prescribed-time prescribed performance functions are designed, and a novel error trans…

    Submitted 27 December, 2022; v1 submitted 28 October, 2022; originally announced October 2022.

  18. arXiv:2210.13756  [pdf, other]

    eess.AS cs.SD

    Mixed-EVC: Mixed Emotion Synthesis and Control in Voice Conversion

    Authors: Kun Zhou, Berrak Sisman, Carlos Busso, Bin Ma, Haizhou Li

    Abstract: Emotional voice conversion (EVC) traditionally targets the transformation of spoken utterances from one emotional state to another, with previous research mainly focusing on discrete emotion categories. This paper departs from the norm by introducing a novel perspective: a nuanced rendering of mixed emotions and enhancing control over emotional expression. To achieve this, we propose a novel EVC f…

    Submitted 17 September, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

  19. Cloud-based Automatic Speech Recognition Systems for Southeast Asian Languages

    Authors: Lei Wang, Rong Tong, Cheung Chi Leung, Sunil Sivadas, Chongjia Ni, Bin Ma

    Abstract: This paper provides an overall introduction of our Automatic Speech Recognition (ASR) systems for Southeast Asian languages. As not much existing work has been carried out on such regional languages, a few difficulties should be addressed before building the systems: limitation on speech and text resources, lack of linguistic knowledge, etc. This work takes Bahasa Indonesia and Thai as examples to…

    Submitted 7 October, 2022; originally announced October 2022.

    Comments: Published by the 2017 IEEE International Conference on Orange Technologies (ICOT 2017)

    ACM Class: I.2.7

  20. arXiv:2209.06360  [pdf, other]

    cs.SD eess.AS

    I2CR: Improving Noise Robustness on Keyword Spotting Using Inter-Intra Contrastive Regularization

    Authors: Dianwen Ng, Jia Qi Yip, Tanmay Surana, Zhao Yang, Chong Zhang, Yukun Ma, Chongjia Ni, Eng Siong Chng, Bin Ma

    Abstract: Noise robustness in keyword spotting remains a challenge as many models fail to overcome the heavy influence of noises, causing the deterioration of the quality of feature embeddings. We proposed a contrastive regularization method called Inter-Intra Contrastive Regularization (I2CR) to improve the feature representations by guiding the model to learn the fundamental speech information specific to…

    Submitted 13 September, 2022; originally announced September 2022.

  21. arXiv:2209.01740  [pdf]

    eess.IV

    A Multi-scale Video Denoising Algorithm for Raw Image

    Authors: Bin Ma, Yueli Hu, Xianxian Lv, Kai Li

    Abstract: Video denoising for raw image has always been the difficulty of camera image processing. On the one hand, image denoising performance largely determines the image quality, moreover denoising effect in raw image will affect the accuracy of the following operations of ISP processing flow. On the other hand, compared with image, video have motion information in time sequence, thus motion estimation w…

    Submitted 4 September, 2022; originally announced September 2022.

  22. arXiv:2208.09096  [pdf, other]

    cs.SD cs.LG eess.AS

    Representation Learning for the Automatic Indexing of Sound Effects Libraries

    Authors: Alison B. Ma, Alexander Lerch

    Abstract: Labeling and maintaining a commercial sound effects library is a time-consuming task exacerbated by databases that continually grow in size and undergo taxonomy updates. Moreover, sound search and taxonomy creation are complicated by non-uniform metadata, an unrelenting problem even with the introduction of a new industry standard, the Universal Category System. To address these problems and overc…

    Submitted 18 August, 2022; originally announced August 2022.

    Comments: Accepted at the 23rd International Society for Music Information Retrieval Conference (ISMIR 2022), 10 pages, 7 figures

  23. arXiv:2208.00935  [pdf, other]

    q-bio.QM eess.AS

    Amino Acid Classification in 2D NMR Spectra via Acoustic Signal Embeddings

    Authors: Jia Qi Yip, Dianwen Ng, Bin Ma, Konstantin Pervushin, Eng Siong Chng

    Abstract: Nuclear Magnetic Resonance (NMR) is used in structural biology to experimentally determine the structure of proteins, which is used in many areas of biology and is an important part of drug development. Unfortunately, NMR data can cost thousands of dollars per sample to collect and it can take a specialist weeks to assign the observed resonances to specific chemical groups. There has thus been gro…

    Submitted 1 August, 2022; originally announced August 2022.

  24. arXiv:2206.07293  [pdf, other]

    cs.SD eess.AS

    FRCRN: Boosting Feature Representation using Frequency Recurrence for Monaural Speech Enhancement

    Authors: Shengkui Zhao, Bin Ma, Karn N. Watcharasupat, Woon-Seng Gan

    Abstract: Convolutional recurrent networks (CRN) integrating a convolutional encoder-decoder (CED) structure and a recurrent structure have achieved promising performance for monaural speech enhancement. However, feature representation across frequency context is highly constrained due to limited receptive fields in the convolutions of CED. In this paper, we propose a convolutional recurrent encoder-decoder…

    Submitted 24 November, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

    Comments: The paper has been accepted by ICASSP 2022. 5 pages, 2 figures, 5 tables

  25. arXiv:2204.01966  [pdf, other]

    cs.IT cs.NI eess.SY

    Time Efficient Joint UAV-BS Deployment and User Association based on Machine Learning

    Authors: Bo Ma, Zitian Zhang, Jiliang Zhang, Jie Zhang

    Abstract: This paper proposes a time-efficient mechanism to decrease the on-line computing time of solving the joint unmanned aerial vehicle base station (UAV-BS) deployment and user/sensor association (UDUA) problem aiming at maximizing the downlink sum transmission throughput. The joint UDUA problem is decoupled into two sub-problems: one is the user association sub-problem, which gets the optimal matchin…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: 13 pages, this paper has been submitted to IEEE IoT Journal

  26. arXiv:2202.03647  [pdf, other]

    cs.SD eess.AS

    Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

    Authors: Fan Yu, Shiliang Zhang, Pengcheng Guo, Yihui Fu, Zhihao Du, Siqi Zheng, Weilong Huang, Lei Xie, Zheng-Hua Tan, DeLiang Wang, Yanmin Qian, Kong Aik Lee, Zhijie Yan, Bin Ma, Xin Xu, Hui Bu

    Abstract: The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and the most challenging scenarios of speech technologies. The M2MeT challenge has particularly set up two tracks, speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Ma…

    Submitted 25 February, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

    Comments: Accepted by ICASSP 2022

  27. arXiv:2111.07892  [pdf]

    eess.IV cond-mat.mtrl-sci cs.CR cs.CV cs.LG

    Data privacy protection in microscopic image analysis for material data mining

    Authors: Boyuan Ma, Xiang Yin, Xiaojuan Ban, Haiyou Huang, Neng Zhang, Hao Wang, Weihua Xue

    Abstract: Recent progress in material data mining has been driven by high-capacity models trained on large datasets. However, collecting experimental data has been extremely costly owing to the amount of human effort and expertise required. Therefore, material researchers are often reluctant to easily disclose their private data, which leads to the problem of data island, and it is difficult to collect a la…

    Submitted 9 November, 2021; originally announced November 2021.

    Comments: 14 pages

  28. arXiv:2110.08545  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    A Unified Speaker Adaptation Approach for ASR

    Authors: Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma

    Abstract: Transformer models have been used in automatic speech recognition (ASR) successfully and yields state-of-the-art results. However, its performance is still affected by speaker mismatch between training and test data. Further finetuning a trained model with target speaker data is the most natural approach for adaptation, but it takes a lot of compute and may cause catastrophic forgetting to the exi…

    Submitted 16 October, 2021; originally announced October 2021.

    Comments: Accepted by EMNLP 2021

  29. arXiv:2110.07393  [pdf, other]

    cs.SD eess.AS

    M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge

    Authors: Fan Yu, Shiliang Zhang, Yihui Fu, Lei Xie, Siqi Zheng, Zhihao Du, Weilong Huang, Pengcheng Guo, Zhijie Yan, Bin Ma, Xin Xu, Hui Bu

    Abstract: Recent development of speech processing, such as speech recognition, speaker diarization, etc., has inspired numerous applications of speech technologies. The meeting scenario is one of the most valuable and, at the same time, most challenging scenarios for the deployment of speech technologies. Specifically, two typical tasks, speaker diarization and multi-speaker automatic speech recognition hav…

    Submitted 25 February, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

    Comments: Accepted by ICASSP 2022

  30. arXiv:2110.00745  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    End-to-End Complex-Valued Multidilated Convolutional Neural Network for Joint Acoustic Echo Cancellation and Noise Suppression

    Authors: Karn N. Watcharasupat, Thi Ngoc Tho Nguyen, Woon-Seng Gan, Shengkui Zhao, Bin Ma

    Abstract: Echo and noise suppression is an integral part of a full-duplex communication system. Many recent acoustic echo cancellation (AEC) systems rely on a separate adaptive filtering module for linear echo suppression and a neural module for residual echo suppression. However, not only do adaptive filtering modules require convergence and remain susceptible to changes in acoustic environments, but this…

    Submitted 22 January, 2022; v1 submitted 2 October, 2021; originally announced October 2021.

    Comments: To be presented at the 2022 International Conference on Acoustics, Speech, & Signal Processing (ICASSP)

    Journal ref: Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 656-660

  31. arXiv:2107.14051  [pdf]

    cs.CV eess.IV

    Improvement of image classification by multiple optical scattering

    Authors: Xinyu Gao, Yi Li, Yanqing Qiu, Bangning Mao, Miaogen Chen, Yanlong Meng, Chunliu Zhao, Juan Kang, Yong Guo, Changyu Shen

    Abstract: Multiple optical scattering occurs when light propagates in a non-uniform medium. During the multiple scattering, images were distorted and the spatial information they carried became scrambled. However, the image information is not lost but presents in the form of speckle patterns (SPs). In this study, we built up an optical random scattering system based on an LCD and an RGB laser source. We fou…

    Submitted 12 July, 2021; originally announced July 2021.

  32. Deep-learning-based Hyperspectral imaging through a RGB camera

    Authors: Xinyu Gao, Tianlang Wang, Jing Yang, Jinchao Tao, Yanqing Qiu, Yanlong Meng, Banging Mao, Pengwei Zhou, Yi Li

    Abstract: Hyperspectral image (HSI) contains both spatial pattern and spectral information which has been widely used in food safety, remote sensing, and medical detection. However, the acquisition of hyperspectral images is usually costly due to the complicated apparatus for the acquisition of optical spectrum. Recently, it has been reported that HSI can be reconstructed from single RGB image using convolu…

    Submitted 12 July, 2021; originally announced July 2021.

  33. arXiv:2102.01993  [pdf, other]

    cs.SD cs.LG eess.AS

    Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time Frequency Losses

    Authors: Shengkui Zhao, Trung Hieu Nguyen, Bin Ma

    Abstract: Deep complex U-Net structure and convolutional recurrent network (CRN) structure achieve state-of-the-art performance for monaural speech enhancement. Both deep complex U-Net and CRN are encoder and decoder structures with skip connections, which heavily rely on the representation power of the complex-valued convolutional layers. In this paper, we propose a complex convolutional block attention mo…

    Submitted 3 February, 2021; originally announced February 2021.

    Comments: 5 pages, 4 figures, 2 tables, accepted by ICASSP 2021

  34. arXiv:2102.01991  [pdf, other]

    cs.SD cs.LG eess.AS

    Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram

    Authors: Shengkui Zhao, Hao Wang, Trung Hieu Nguyen, Bin Ma

    Abstract: Cross-lingual voice conversion (VC) is an important and challenging problem due to significant mismatches of the phonetic set and the speech prosody of different languages. In this paper, we build upon the neural text-to-speech (TTS) model, i.e., FastSpeech, and LPCNet neural vocoder to design a new cross-lingual VC framework named FastSpeech-VC. We address the mismatches of the phonetic set and t…

    Submitted 3 February, 2021; originally announced February 2021.

    Comments: 5 pages, 2 figures, 4 tables, accepted by ICASSP 2021

  35. arXiv:2010.08136  [pdf, other]

    cs.SD cs.LG eess.AS

    Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion

    Authors: Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, Bin Ma

    Abstract: Recent state-of-the-art neural text-to-speech (TTS) synthesis models have dramatically improved intelligibility and naturalness of generated speech from text. However, building a good bilingual or code-switched TTS for a particular voice is still a challenge. The main reason is that it is not easy to obtain a bilingual corpus from a speaker who achieves native-level fluency in both languages. In t…

    Submitted 15 October, 2020; originally announced October 2020.

    Comments: 5 pages, 2 figures, INTERSPEECH 2020

  36. arXiv:2005.10407  [pdf, other]

    eess.AS cs.LG cs.SD

    Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning

    Authors: Zhiping Zeng, Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Eng Siong Chng, Chongjia Ni, Bin Ma

    Abstract: In this work, we study leveraging extra text data to improve low-resource end-to-end ASR under cross-lingual transfer learning setting. To this end, we extend our prior work [1], and propose a hybrid Transformer-LSTM based architecture. This architecture not only takes advantage of the highly effective encoding capacity of the Transformer network but also benefits from extra text data due to the L…

    Submitted 28 May, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

  37. arXiv:2003.01084  [pdf, ps, other]

    eess.SY

    Distributed Leader-Follower Formation Tracking Control of Multiple Quad-rotors

    Authors: Lixia Yan, Baoli Ma

    Abstract: The leader-follower formation control analysis for multiple quad-rotor systems is investigated in this paper. To achieve predefined formation in the three-dimensional air space ($x,y$ and $z$), a novel local tracking control law and a distributed observer are obtained. The local tracking control law starts with finding a bounded continuous yet greater-than-zero control in $z$, based on which follo…

    Submitted 2 March, 2020; originally announced March 2020.

  38. arXiv:1912.00863  [pdf, other]

    cs.CL eess.AS

    Independent language modeling architecture for end-to-end ASR

    Authors: Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma, Haizhou Li

    Abstract: The attention-based end-to-end (E2E) automatic speech recognition (ASR) architecture allows for joint optimization of acoustic and language models within a single network. However, in a vanilla E2E ASR architecture, the decoder sub-network (subnet), which incorporates the role of the language model (LM), is conditioned on the encoder output. This means that the acoustic encoder and the language mo…

    Submitted 25 November, 2019; originally announced December 2019.

  39. arXiv:1905.04711  [pdf]

    cond-mat.mtrl-sci cs.CV eess.IV

    Data augmentation in microscopic images for material data mining

    Authors: Boyuan Ma, Xiaoyan Wei, Chuni Liu, Xiaojuan Ban, Haiyou Huang, Hao Wang, Weihua Xue, Stephen Wu, Mingfei Gao, Qing Shen, Adnan Omer Abuassba, Haokai Shen, Yanjing Su

    Abstract: Recent progress in material data mining has been driven by high-capacity models trained on large datasets. However, collecting experimental data (real data) has been extremely costly since the amount of human effort and expertise required. Here, we develop a novel transfer learning strategy to address small or insufficient data problem. This strategy realizes the fusion of real and simulated data,…

    Submitted 28 October, 2019; v1 submitted 12 May, 2019; originally announced May 2019.

    Comments: 17 pages, technical report

    Journal ref: npj computational materials 2020

  40. arXiv:1810.08906  [pdf]

    eess.SP physics.app-ph

    Analog-to-digital conversion revolutionized by deep learning

    Authors: Shaofu Xu, Xiuting Zou, Bowen Ma, Jianping Chen, Lei Yu, Weiwen Zou

    Abstract: As the bridge between the analog world and digital computers, analog-to-digital converters are generally used in modern information systems such as radar, surveillance, and communications. For the configuration of analog-to-digital converters in future high-frequency broadband systems, we introduce a revolutionary architecture that adopts deep learning technology to overcome tradeoffs between band…

    Submitted 21 October, 2018; originally announced October 2018.