Showing 1–16 of 16 results for author: Ni, C

Searching in archive eess.
  1. arXiv:2406.19666

    cs.CV eess.IV

    CSAKD: Knowledge Distillation with Cross Self-Attention for Hyperspectral and Multispectral Image Fusion

    Authors: Chih-Chung Hsu, Chih-Chien Ni, Chia-Ming Lee, Li-Wei Kang

    Abstract: Hyperspectral imaging, capturing detailed spectral information for each pixel, is pivotal in diverse scientific and industrial applications. Yet, the acquisition of high-resolution (HR) hyperspectral images (HSIs) often needs to be addressed due to the hardware limitations of existing imaging systems. A prevalent workaround involves capturing both a high-resolution multispectral image (HR-MSI) and…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: Submitted to TIP 2024

  2. arXiv:2406.02009

    eess.AS cs.CL cs.SD

    Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis

    Authors: Kun Zhou, Shengkui Zhao, Yukun Ma, Chong Zhang, Hao Wang, Dianwen Ng, Chongjia Ni, Nguyen Trung Hieu, Jia Qi Yip, Bin Ma

    Abstract: Recent language model-based text-to-speech (TTS) frameworks demonstrate scalability and in-context learning capabilities. However, they suffer from robustness issues due to the accumulation of errors in speech unit predictions during autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage self-su…

    Submitted 11 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  3. arXiv:2312.11825

    cs.SD eess.AS

    MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation

    Authors: Shengkui Zhao, Yukun Ma, Chongjia Ni, Chong Zhang, Hao Wang, Trung Hieu Nguyen, Kun Zhou, Jiaqi Yip, Dianwen Ng, Bin Ma

    Abstract: Our previously proposed MossFormer has achieved promising performance in monaural speech separation. However, it predominantly adopts a self-attention-based MossFormer module, which tends to emphasize longer-range, coarser-scale dependencies, with a deficiency in effectively modelling finer-scale recurrent patterns. In this paper, we introduce a novel hybrid model that provides the capabilities to…

    Submitted 18 December, 2023; originally announced December 2023.

    Comments: 5 pages, 3 figures, accepted by ICASSP 2024

  4. arXiv:2311.10261

    cs.CV eess.SP

    Vision meets mmWave Radar: 3D Object Perception Benchmark for Autonomous Driving

    Authors: Yizhou Wang, Jen-Hao Cheng, Jui-Te Huang, Sheng-Yao Kuan, Qiqian Fu, Chiming Ni, Shengyu Hao, Gaoang Wang, Guanbin Xing, Hui Liu, Jenq-Neng Hwang

    Abstract: Sensor fusion is crucial for an accurate and robust perception system on autonomous vehicles. Most existing datasets and perception solutions focus on fusing cameras and LiDAR. However, the collaboration between camera and radar is significantly under-exploited. The incorporation of rich semantic information from the camera, and reliable 3D information from the radar can potentially achieve an eff…

    Submitted 16 November, 2023; originally announced November 2023.

  5. arXiv:2309.12608

    eess.AS cs.SD

    SPGM: Prioritizing Local Features for enhanced speech separation performance

    Authors: Jia Qi Yip, Shengkui Zhao, Yukun Ma, Chongjia Ni, Chong Zhang, Hao Wang, Trung Hieu Nguyen, Kun Zhou, Dianwen Ng, Eng Siong Chng, Bin Ma

    Abstract: Dual-path is a popular architecture for speech separation models (e.g. Sepformer) which splits long sequences into overlapping chunks for its intra- and inter-blocks that separately model intra-chunk local features and inter-chunk global relationships. However, it has been found that inter-blocks, which comprise half a dual-path model's parameters, contribute minimally to performance. Thus, we pro…

    Submitted 10 March, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

    Comments: This paper was accepted by ICASSP 2024

  6. arXiv:2309.09413

    cs.SD eess.AS

    Are Soft Prompts Good Zero-shot Learners for Speech Recognition?

    Authors: Dianwen Ng, Chong Zhang, Ruixi Zhang, Yukun Ma, Fabian Ritter-Gutierrez, Trung Hieu Nguyen, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma

    Abstract: Large self-supervised pre-trained speech models require computationally expensive fine-tuning for downstream tasks. Soft prompt tuning offers a simple parameter-efficient alternative by utilizing minimal soft prompt guidance, enhancing portability while also maintaining competitive performance. However, not many people understand how and why this is so. In this study, we aim to deepen our understa…

    Submitted 17 September, 2023; originally announced September 2023.

  7. arXiv:2305.12121

    cs.SD cs.LG eess.AS

    ACA-Net: Towards Lightweight Speaker Verification using Asymmetric Cross Attention

    Authors: Jia Qi Yip, Tuan Truong, Dianwen Ng, Chong Zhang, Yukun Ma, Trung Hieu Nguyen, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma

    Abstract: In this paper, we propose ACA-Net, a lightweight, global context-aware speaker embedding extractor for Speaker Verification (SV) that improves upon existing work by using Asymmetric Cross Attention (ACA) to replace temporal pooling. ACA is able to distill large, variable-length sequences into small, fixed-sized latents by attending a small query to large key and value matrices. In ACA-Net, we buil…

    Submitted 20 May, 2023; originally announced May 2023.

    Comments: Accepted to INTERSPEECH 2023

  8. arXiv:2305.01170

    cs.SD eess.AS

    Contrastive Speech Mixup for Low-resource Keyword Spotting

    Authors: Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Chong Zhang, Yukun Ma, Trung Hieu Nguyen, Chongjia Ni, Eng Siong Chng, Bin Ma

    Abstract: Most of the existing neural-based models for keyword spotting (KWS) in smart devices require thousands of training samples to learn a decent audio representation. However, with the rising demand for smart devices to become more personalized, KWS models need to adapt quickly to smaller user samples. To tackle this challenge, we propose a contrastive speech mixup (CosMix) learning algorithm for low-…

    Submitted 1 May, 2023; originally announced May 2023.

    Comments: Accepted by ICASSP 2023

  9. arXiv:2303.15124

    cs.CV cs.LG eess.IV

    Blind Inpainting with Object-aware Discrimination for Artificial Marker Removal

    Authors: Xuechen Guo, Wenhao Hu, Chiming Ni, Wenhao Chai, Shiyan Li, Gaoang Wang

    Abstract: Medical images often contain artificial markers added by doctors, which can negatively affect the accuracy of AI-based diagnosis. To address this issue and recover the missing visual contents, inpainting techniques are highly needed. However, existing inpainting methods require manual mask input, limiting their application scenarios. In this paper, we introduce a novel blind inpainting method that…

    Submitted 27 March, 2023; originally announced March 2023.

  10. arXiv:2302.14597

    cs.SD eess.AS

    deHuBERT: Disentangling Noise in a Self-supervised Model for Robust Speech Recognition

    Authors: Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Zhao Yang, Jinjie Ni, Chong Zhang, Yukun Ma, Chongjia Ni, Eng Siong Chng, Bin Ma

    Abstract: Existing self-supervised pre-trained speech models have offered an effective way to leverage massive unannotated corpora to build good automatic speech recognition (ASR). However, many current models are trained on a clean corpus from a single source, which tends to do poorly when noise is present during testing. Nonetheless, it is crucial to overcome the adverse influence of noise for real-world…

    Submitted 28 February, 2023; originally announced February 2023.

    Comments: Accepted by ICASSP 2023

  11. Cloud-based Automatic Speech Recognition Systems for Southeast Asian Languages

    Authors: Lei Wang, Rong Tong, Cheung Chi Leung, Sunil Sivadas, Chongjia Ni, Bin Ma

    Abstract: This paper provides an overall introduction of our Automatic Speech Recognition (ASR) systems for Southeast Asian languages. As not much existing work has been carried out on such regional languages, a few difficulties should be addressed before building the systems: limitation on speech and text resources, lack of linguistic knowledge, etc. This work takes Bahasa Indonesia and Thai as examples to…

    Submitted 7 October, 2022; originally announced October 2022.

    Comments: Published by the 2017 IEEE International Conference on Orange Technologies (ICOT 2017)

    ACM Class: I.2.7

  12. arXiv:2209.06360

    cs.SD eess.AS

    I2CR: Improving Noise Robustness on Keyword Spotting Using Inter-Intra Contrastive Regularization

    Authors: Dianwen Ng, Jia Qi Yip, Tanmay Surana, Zhao Yang, Chong Zhang, Yukun Ma, Chongjia Ni, Eng Siong Chng, Bin Ma

    Abstract: Noise robustness in keyword spotting remains a challenge as many models fail to overcome the heavy influence of noises, causing the deterioration of the quality of feature embeddings. We proposed a contrastive regularization method called Inter-Intra Contrastive Regularization (I2CR) to improve the feature representations by guiding the model to learn the fundamental speech information specific to…

    Submitted 13 September, 2022; originally announced September 2022.

  13. arXiv:2205.03996

    cs.AR cs.CV cs.LG eess.IV

    Hardware-Robust In-RRAM-Computing for Object Detection

    Authors: Yu-Hsiang Chiang, Cheng En Ni, Yun Sung, Tuo-Hung Hou, Tian-Sheuan Chang, Shyh Jye Jou

    Abstract: In-memory computing is becoming a popular architecture for deep-learning hardware accelerators recently due to its highly parallel computing, low power, and low area cost. However, in-RRAM computing (IRC) suffered from large device variation and numerous nonideal effects in hardware. Although previous approaches including these effects in model training successfully improved variation tolerance, t…

    Submitted 8 May, 2022; originally announced May 2022.

    Comments: 10 pages, 18 figures

  14. arXiv:2110.08545

    eess.AS cs.CL cs.LG cs.SD

    A Unified Speaker Adaptation Approach for ASR

    Authors: Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma

    Abstract: Transformer models have been used in automatic speech recognition (ASR) successfully and yields state-of-the-art results. However, its performance is still affected by speaker mismatch between training and test data. Further finetuning a trained model with target speaker data is the most natural approach for adaptation, but it takes a lot of compute and may cause catastrophic forgetting to the exi…

    Submitted 16 October, 2021; originally announced October 2021.

    Comments: Accepted by EMNLP 2021

  15. arXiv:2005.10407

    eess.AS cs.LG cs.SD

    Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning

    Authors: Zhiping Zeng, Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Eng Siong Chng, Chongjia Ni, Bin Ma

    Abstract: In this work, we study leveraging extra text data to improve low-resource end-to-end ASR under cross-lingual transfer learning setting. To this end, we extend our prior work [1], and propose a hybrid Transformer-LSTM based architecture. This architecture not only takes advantage of the highly effective encoding capacity of the Transformer network but also benefits from extra text data due to the L…

    Submitted 28 May, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

  16. arXiv:1912.00863

    cs.CL eess.AS

    Independent language modeling architecture for end-to-end ASR

    Authors: Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma, Haizhou Li

    Abstract: The attention-based end-to-end (E2E) automatic speech recognition (ASR) architecture allows for joint optimization of acoustic and language models within a single network. However, in a vanilla E2E ASR architecture, the decoder sub-network (subnet), which incorporates the role of the language model (LM), is conditioned on the encoder output. This means that the acoustic encoder and the language mo…

    Submitted 25 November, 2019; originally announced December 2019.