Showing 1–28 of 28 results for author: Ge, M

Searching in archive eess.
  1. arXiv:2403.05772

    cs.SD cs.NE eess.AS

    sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks

    Authors: Qu Yang, Qianhui Liu, Nan Li, Meng Ge, Zeyang Song, Haizhou Li

    Abstract: Speech applications are expected to be low-power and robust under noisy conditions. An effective Voice Activity Detection (VAD) front-end lowers the computational need. Spiking Neural Networks (SNNs) are known to be biologically plausible and power-efficient. However, SNN-based VADs have yet to achieve noise robustness and often require large models for high performance. This paper introduces a no…

    Submitted 8 March, 2024; originally announced March 2024.

    Comments: Accepted by ICASSP 2024

  2. arXiv:2401.09686

    eess.AS cs.SD

    An Empirical Study on the Impact of Positional Encoding in Transformer-based Monaural Speech Enhancement

    Authors: Qiquan Zhang, Meng Ge, Hongxu Zhu, Eliathamby Ambikairajah, Qi Song, Zhaoheng Ni, Haizhou Li

    Abstract: Transformer architecture has enabled recent progress in speech enhancement. Since Transformers are position-agnostic, positional encoding is the de facto standard component used to enable Transformers to distinguish the order of elements in a sequence. However, it remains unclear how positional encoding exactly impacts speech enhancement based on Transformer architectures. In this paper, we perform…

    Submitted 13 February, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  3. arXiv:2401.02626

    cs.SD eess.AS

    Gradient weighting for speaker verification in extremely low Signal-to-Noise Ratio

    Authors: Yi Ma, Kong Aik Lee, Ville Hautamäki, Meng Ge, Haizhou Li

    Abstract: Speaker verification is hampered by background noise, particularly at extremely low Signal-to-Noise Ratio (SNR) under 0 dB. It is difficult to suppress noise without introducing unwanted artifacts, which adversely affects speaker verification. We proposed the mechanism called Gradient Weighting (Grad-W), which dynamically identifies and reduces artifact noise during prediction. The mechanism is ba…

    Submitted 4 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  4. arXiv:2312.16002

    eess.AS cs.AI

    The NUS-HLT System for ICASSP2024 ICMC-ASR Grand Challenge

    Authors: Meng Ge, Yizhou Peng, Yidi Jiang, Jingru Lin, Junyi Ao, Mehmet Sinan Yildirim, Shuai Wang, Haizhou Li, Mengling Feng

    Abstract: This paper summarizes our team's efforts in both tracks of the ICMC-ASR Challenge for in-car multi-channel automatic speech recognition. Our submitted systems for ICMC-ASR Challenge include the multi-channel front-end enhancement and diarization, training data augmentation, speech recognition modeling with multi-channel branches. Tested on the official Eval1 and Eval2 set, our best system achieves…

    Submitted 26 December, 2023; originally announced December 2023.

    Comments: Technical Report. 2 pages. For ICMC-ASR-2023 Challenge

  5. arXiv:2312.11201

    eess.AS cs.SD eess.SP

    A Refining Underlying Information Framework for Monaural Speech Enhancement

    Authors: Rui Cao, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang

    Abstract: Supervised speech enhancement has gained significantly from recent advancements in neural networks, especially due to their ability to non-linearly fit the diverse representations of target speech, such as waveform or spectrum. However, these direct-fitting solutions continue to face challenges with degraded speech and residual noise in hearing evaluations. By bridging the speech enhancement and t…

    Submitted 24 December, 2023; v1 submitted 18 December, 2023; originally announced December 2023.

    Comments: 5 pages

  6. arXiv:2311.04526

    eess.AS

    Selective HuBERT: Self-Supervised Pre-Training for Target Speaker in Clean and Mixture Speech

    Authors: Jingru Lin, Meng Ge, Wupeng Wang, Haizhou Li, Mengling Feng

    Abstract: Self-supervised pre-trained speech models were shown effective for various downstream speech processing tasks. Since they are mainly pre-trained to map input speech to pseudo-labels, the resulting representations are only effective for the type of pre-train data used, either clean or mixture speech. With the idea of selective auditory attention, we propose a novel pre-training solution called Sele…

    Submitted 8 November, 2023; originally announced November 2023.

  7. arXiv:2309.10674

    cs.SD eess.AS

    USED: Universal Speaker Extraction and Diarization

    Authors: Junyi Ao, Mehmet Sinan Yıldırım, Ruijie Tao, Meng Ge, Shuai Wang, Yanmin Qian, Haizhou Li

    Abstract: Speaker extraction and diarization are two enabling techniques for real-world speech applications. Speaker extraction aims to extract a target speaker's voice from a speech mixture, while speaker diarization demarcates speech segments by speaker, annotating 'who spoke when'. Previous studies have typically treated the two tasks independently. In practical applications, it is more meaningful to hav…

    Submitted 9 May, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

  8. arXiv:2309.08408

    cs.SD eess.AS

    Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech

    Authors: Junjie Li, Ruijie Tao, Zexu Pan, Meng Ge, Shuai Wang, Haizhou Li

    Abstract: Target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture as specified by an auxiliary reference. Most studies focus on the scenario where the target speech is highly overlapped with the interfering speech. However, this scenario only accounts for a small percentage of real-world conversations. In this paper, we aim at the sparsely overlapped scenarios…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  9. PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network

    Authors: Qinghua Liu, Meng Ge, Zhizheng Wu, Haizhou Li

    Abstract: It is common in everyday spoken communication that we look at the turning head of a talker to listen to his/her voice. Humans see the talker to listen better, so do machines. However, previous studies on audio-visual speaker extraction have not effectively handled the varying talking face. This paper studies how to take full advantage of the varying talking face. We propose a Pose-Invariant Audio-…

    Submitted 13 September, 2023; originally announced September 2023.

    Comments: Interspeech 2023

    Journal ref: Proc. INTERSPEECH 2023, 3719-3723

  10. arXiv:2306.02625

    cs.SD eess.AS

    Rethinking the visual cues in audio-visual speaker extraction

    Authors: Junjie Li, Meng Ge, Zexu Pan, Rui Cao, Longbiao Wang, Jianwu Dang, Shiliang Zhang

    Abstract: The Audio-Visual Speaker Extraction (AVSE) algorithm employs parallel video recording to leverage two visual cues, namely speaker identity and synchronization, to enhance performance compared to audio-only algorithms. However, the visual front-end in AVSE is often derived from a pre-trained model or end-to-end trained, making it unclear which visual cue contributes more to the speaker extraction p…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: Accepted in Interspeech 2023

  11. arXiv:2305.10821

    eess.AS

    Locate and Beamform: Two-dimensional Locating All-neural Beamformer for Multi-channel Speech Separation

    Authors: Yanjie Fu, Meng Ge, Honglong Wang, Nan Li, Haoran Yin, Longbiao Wang, Gaoyan Zhang, Jianwu Dang, Chengyun Deng, Fei Wang

    Abstract: Recently, stunning improvements on multi-channel speech separation have been achieved by neural beamformers when direction information is available. However, most of them neglect to utilize speaker's 2-dimensional (2D) location cues contained in mixture signal, which limits the performance when two sources come from close directions. In this paper, we propose an end-to-end beamforming network for…

    Submitted 2 June, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted by Interspeech 2023. arXiv admin note: substantial text overlap with arXiv:2212.03401

  12. arXiv:2304.04154

    astro-ph.IM eess.SY

    Review of X-ray pulsar spacecraft autonomous navigation

    Authors: Yidi Wang, Wei Zheng, Shuangnan Zhang, Minyu Ge, Liansheng Li, Kun Jiang, Xiaoqian Chen, Xiang Zhang, Shijie Zheng, Fangjun Lu

    Abstract: This article provides a review on X-ray pulsar-based navigation (XNAV). The review starts with the basic concept of XNAV, and briefly introduces the past, present and future projects concerning XNAV. This paper focuses on the advances of the key techniques supporting XNAV, including the navigation pulsar database, the X-ray detection system, and the pulse time of arrival estimation. Moreover, the…

    Submitted 9 April, 2023; originally announced April 2023.

    Comments: has been accepted by Chinese Journal of Aeronautics

    Journal ref: Chinese Journal of Aeronautics, 2023

  13. arXiv:2212.03401

    eess.AS cs.LG cs.SD

    MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware Beamforming Network for Speech Separation

    Authors: Yanjie Fu, Haoran Yin, Meng Ge, Longbiao Wang, Gaoyan Zhang, Jianwu Dang, Chengyun Deng, Fei Wang

    Abstract: Recently, many deep learning based beamformers have been proposed for multi-channel speech separation. Nevertheless, most of them rely on extra cues known in advance, such as speaker feature, face image or directional information. In this paper, we propose an end-to-end beamforming network for direction guided speech separation given merely the mixture signal, namely MIMO-DBnet. Specifically, we d…

    Submitted 6 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  14. arXiv:2210.06177

    cs.CV cs.CL cs.SD eess.AS

    VCSE: Time-Domain Visual-Contextual Speaker Extraction Network

    Authors: Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang, Jianwu Dang

    Abstract: Speaker extraction seeks to extract the target speech in a multi-talker scenario given an auxiliary reference. Such reference can be auditory, i.e., a pre-recorded speech, visual, i.e., lip movements, or contextual, i.e., phonetic sequence. References in different modalities provide distinct and complementary information that could be fused to form top-down attention on the target speaker. Previou…

    Submitted 9 October, 2022; originally announced October 2022.

  15. MIMO-DoAnet: Multi-channel Input and Multiple Outputs DoA Network with Unknown Number of Sound Sources

    Authors: Haoran Yin, Meng Ge, Yanjie Fu, Gaoyan Zhang, Longbiao Wang, Lei Zhang, Lin Qiu, Jianwu Dang

    Abstract: Recent neural network based Direction of Arrival (DoA) estimation algorithms have performed well on unknown number of sound sources scenarios. These algorithms are usually achieved by mapping the multi-channel audio input to the single output (i.e. overall spatial pseudo-spectrum (SPS) of all sources), that is called MISO. However, such MISO algorithms strongly depend on empirical threshold settin…

    Submitted 16 November, 2022; v1 submitted 15 July, 2022; originally announced July 2022.

    Comments: Accepted by Interspeech 2022

  16. arXiv:2206.14580

    cs.CL eess.AS

    Language-specific Characteristic Assistance for Code-switching Speech Recognition

    Authors: Tongtong Song, Qiang Xu, Meng Ge, Longbiao Wang, Hao Shi, Yongjie Lv, Yuqin Lin, Jianwu Dang

    Abstract: Dual-encoder structure successfully utilizes two language-specific encoders (LSEs) for code-switching speech recognition. Because LSEs are initialized by two pre-trained language-specific models (LSMs), the dual-encoder structure can exploit sufficient monolingual data and capture the individual language attributes. However, most existing methods have no language constraints on LSEs and underutili…

    Submitted 11 July, 2022; v1 submitted 29 June, 2022; originally announced June 2022.

    Comments: Accepted by Interspeech 2022

  17. arXiv:2206.12273

    eess.AS cs.LG

    Iterative Sound Source Localization for Unknown Number of Sources

    Authors: Yanjie Fu, Meng Ge, Haoran Yin, Xinyuan Qian, Longbiao Wang, Gaoyan Zhang, Jianwu Dang

    Abstract: Sound source localization aims to seek the direction of arrival (DOA) of all sound sources from the observed multi-channel audio. For the practical problem of unknown number of sources, existing localization algorithms attempt to predict a likelihood-based coding (i.e., spatial spectrum) and employ a pre-determined threshold to detect the source number and corresponding DOA value. However, these t…

    Submitted 24 June, 2022; originally announced June 2022.

    Comments: Accepted by Interspeech 2022

  18. arXiv:2203.16843

    eess.AS cs.SD

    A Hybrid Continuity Loss to Reduce Over-Suppression for Time-domain Target Speaker Extraction

    Authors: Zexu Pan, Meng Ge, Haizhou Li

    Abstract: The speaker extraction algorithm extracts the target speech from a mixture speech containing interference speech and background noise. The extraction process sometimes over-suppresses the extracted target speech, which not only creates artifacts during listening but also harms the performance of downstream automatic speech recognition algorithms. We propose a hybrid continuity loss function for ti…

    Submitted 20 June, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

    Comments: Accepted by Interspeech 2022

  19. arXiv:2202.09995

    eess.AS cs.SD

    L-SpEx: Localized Target Speaker Extraction

    Authors: Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

    Abstract: Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance. Recent studies show that speaker extraction benefits from the location or direction of the target speaker. However, these studies assume that the target speaker's location is known in advance or detected by an extra visual cue, e.g., face image or video. In this…

    Submitted 21 February, 2022; originally announced February 2022.

    Comments: Accepted in ICASSP 2022

  20. arXiv:2111.10596

    eess.SP cs.LG physics.geo-ph

    Semi-supervised Impedance Inversion by Bayesian Neural Network Based on 2-d CNN Pre-training

    Authors: Muyang Ge, Wenlong Wang, Wangxiangming Zheng

    Abstract: Seismic impedance inversion can be performed with a semi-supervised learning algorithm, which only needs a few logs as labels and is less likely to get overfitted. However, classical semi-supervised learning algorithm usually leads to artifacts on the predicted impedance image. In this article, we improve the semi-supervised learning from two aspects. First, by replacing 1-d convolutional neural n…

    Submitted 20 November, 2021; originally announced November 2021.

  21. arXiv:2109.14831

    eess.AS cs.SD

    USEV: Universal Speaker Extraction with Visual Cue

    Authors: Zexu Pan, Meng Ge, Haizhou Li

    Abstract: A speaker extraction algorithm seeks to extract the target speaker's speech from a multi-talker speech mixture. The prior studies focus mostly on speaker extraction from a highly overlapped multi-talker speech mixture. However, the target-interference speaker overlapping ratios could vary over a wide range from 0% to 100% in natural speech communication, furthermore, the target speaker could be ab…

    Submitted 30 August, 2022; v1 submitted 29 September, 2021; originally announced September 2021.

    Comments: Accepted by TASLP

  22. arXiv:2011.09624

    eess.AS cs.LG

    Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals

    Authors: Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

    Abstract: Speaker extraction requires a sample speech from the target speaker as the reference. However, enrolling a speaker with a long speech is not practical. We propose a speaker extraction technique that performs in multiple stages to take full advantage of short reference speech sample. The extracted speech in early stages is used as the reference speech for late stages. For the first time, we use fr…

    Submitted 2 April, 2021; v1 submitted 18 November, 2020; originally announced November 2020.

    Comments: Accepted in ICASSP 2021

  23. arXiv:2006.12372

    eess.SP

    Edge server deployment scheme of blockchain in IoVs

    Authors: Liya Xu, Mingzhu Ge, Weili Wu

    Abstract: With the development of intelligent vehicles, secure and reliable communication between vehicles has become a key problem to be solved in Internet of vehicles (IoVs). Blockchain is considered as a feasible solution due to its advantages of decentralization, unforgeability and collective maintenance. However, the computing power of nodes in IoVs is limited, while the consensus mechanism of bloc…

    Submitted 16 June, 2020; originally announced June 2020.

  24. arXiv:2005.04686

    eess.AS cs.SD

    SpEx+: A Complete Time Domain Speaker Extraction Network

    Authors: Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

    Abstract: Speaker extraction aims to extract the target speech signal from a multi-talker environment given a target speaker's reference speech. We recently proposed a time-domain solution, SpEx, that avoids the phase estimation in frequency-domain approaches. Unfortunately, SpEx is not fully a time-domain solution since it performs time-domain speech encoding for speaker extraction, while taking frequency-…

    Submitted 17 August, 2020; v1 submitted 10 May, 2020; originally announced May 2020.

    Comments: accepted in INTERSPEECH 2020

  25. arXiv:2001.04198

    eess.SY

    Predefined-time Terminal Sliding Mode Control of Robot Manipulators

    Authors: Chang-Duo Liang, Ming-Feng Ge, Zhi-Wei Liu, Yan-Wu Wang, Hamid Reza Karimi

    Abstract: In this paper, we present a new terminal sliding mode control to achieve predefined-time stability of robot manipulators. The proposed control is developed based on a novel predefined-time terminal sliding mode (PTSM) surface, on which the states are forced to reach the origin in a predefined time, i.e., the settling time is independent of the initial condition and can be explicitly user-defined v…

    Submitted 25 April, 2020; v1 submitted 13 January, 2020; originally announced January 2020.

    Comments: 10 pages, 9 figures, This draft is not intended for publication

  26. Task-space coordinated tracking of multiple heterogeneous manipulators via controller-estimator approaches

    Authors: Ming-Feng Ge, Zhi-Hong Guan, Chao Yang, Chao-Yang Chen, Ding-Fu Zheng, Ming Chi

    Abstract: This paper studies the task-space coordinated tracking of a time-varying leader for multiple heterogeneous manipulators (MHMs), containing redundant manipulators and nonredundant ones. Different from the traditional coordinated control, distributed controller-estimator algorithms (DCEA), which consist of local algorithms and networked algorithms, are developed for MHMs with parametric uncertaintie…

    Submitted 26 July, 2016; originally announced July 2016.

    Comments: 17 pages, 7 figures, Journal of the Franklin Institute

  27. Time-varying formation tracking of multiple manipulators via distributed finite-time control

    Authors: Ming-Feng Ge, Zhi-Hong Guan, Chao Yang, Tao Li, Yan-Wu Wang

    Abstract: Compared with traditional fixed formation for a group of dynamical systems, time-varying formation can produce the following benefits: i) covering the greater part of complex environments; ii) collision avoidance. This paper studies the time-varying formation tracking for multiple manipulator systems (MMSs) under fixed and switching directed graphs with a dynamic leader, whose acceleration cannot…

    Submitted 26 July, 2016; originally announced July 2016.

    Journal ref: Neurocomputing, 2016, 202: 20-26

  28. Distributed controller-estimator for target tracking of networked robotic systems under sampled interaction

    Authors: Ming-Feng Ge, Zhi-Hong Guan, Bin Hu, Ding-Xin He, Rui-Quan Liao

    Abstract: This paper investigates the target tracking problem for networked robotic systems (NRSs) under sampled interaction. The target is assumed to be time-varying and described by a second-order oscillator. Two novel distributed controller-estimator algorithms (DCEA), which consist of both continuous and discontinuous signals, are presented. Based on the properties of small-value norms and Lyapunov stab…

    Submitted 27 May, 2016; originally announced May 2016.

    Comments: 8 pages, 4 figures, Published in Automatica

    Journal ref: Automatica, 2016, 69: 410-417