Showing 1–36 of 36 results for author: Gu, R

  1. arXiv:2405.10246  [pdf, other]

    eess.IV cs.CV

    A Foundation Model for Brain Lesion Segmentation with Mixture of Modality Experts

    Authors: Xinru Zhang, Ni Ou, Berke Doga Basaran, Marco Visentin, Mengyun Qiao, Renyang Gu, Cheng Ouyang, Yaou Liu, Paul M. Matthew, Chuyang Ye, Wenjia Bai

    Abstract: Brain lesion segmentation plays an essential role in neurological research and diagnosis. As brain lesions can be caused by various pathological alterations, different types of brain lesions tend to manifest with different characteristics on different imaging modalities. Due to this complexity, brain lesion segmentation methods are often developed in a task-specific manner. A specific segmentation…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: The work has been early accepted by MICCAI 2024

  2. arXiv:2404.04947  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Gull: A Generative Multifunctional Audio Codec

    Authors: Yi Luo, Jianwei Yu, Hangting Chen, Rongzhi Gu, Chao Weng

    Abstract: We introduce Gull, a generative multifunctional audio codec. Gull is a general purpose neural audio compression and decompression model which can be applied to a wide range of tasks and applications such as real-time communication, audio super-resolution, and codec language models. The key components of Gull include (1) universal-sample-rate modeling via subband modeling schemes motivated by recen…

    Submitted 7 June, 2024; v1 submitted 7 April, 2024; originally announced April 2024.

    Comments: Demo page: https://yluo42.github.io/Gull/

  3. arXiv:2312.10381  [pdf, other]

    cs.SD eess.AS

    SECap: Speech Emotion Captioning with Large Language Model

    Authors: Yaoxun Xu, Hangting Chen, Jianwei Yu, Qiaochu Huang, Zhiyong Wu, Shixiong Zhang, Guangzhi Li, Yi Luo, Rongzhi Gu

    Abstract: Speech emotions are crucial in human communication and are extensively used in fields like speech synthesis and natural language understanding. Most prior studies, such as speech emotion recognition, have categorized speech emotions into a fixed set of classes. Yet, emotions expressed in human speech are often complex, and categorizing them into predefined groups can be insufficient to adequately…

    Submitted 23 December, 2023; v1 submitted 16 December, 2023; originally announced December 2023.

    Comments: Accepted by AAAI 2024

  4. arXiv:2311.07033  [pdf, other]

    eess.IV cs.CV

    TTMFN: Two-stream Transformer-based Multimodal Fusion Network for Survival Prediction

    Authors: Ruiquan Ge, Xiangyang Hu, Rungen Huang, Gangyong Jia, Yaqi Wang, Renshu Gu, Changmiao Wang, Elazab Ahmed, Linyan Wang, Juan Ye, Ye Li

    Abstract: Survival prediction plays a crucial role in assisting clinicians with the development of cancer treatment protocols. Recent evidence shows that multimodal data can help in the diagnosis of cancer disease and improve survival prediction. Currently, deep learning-based approaches have experienced increasing success in survival prediction by integrating pathological images and gene expression data. H…

    Submitted 12 November, 2023; originally announced November 2023.

  5. arXiv:2308.16892  [pdf, other]

    eess.AS cs.AI cs.SD

    ReZero: Region-customizable Sound Extraction

    Authors: Rongzhi Gu, Yi Luo

    Abstract: We introduce region-customizable sound extraction (ReZero), a general and flexible framework for the multi-channel region-wise sound extraction (R-SE) task. R-SE task aims at extracting all active target sounds (e.g., human speech) within a specific, user-defined spatial region, which is different from conventional and existing tasks where a blind separation or a fixed, predefined spatial region a…

    Submitted 31 August, 2023; originally announced August 2023.

    Comments: 13 pages, 11 figures

  6. Ultra Dual-Path Compression For Joint Echo Cancellation And Noise Suppression

    Authors: Hangting Chen, Jianwei Yu, Yi Luo, Rongzhi Gu, Weihua Li, Zhuocheng Lu, Chao Weng

    Abstract: Echo cancellation and noise reduction are essential for full-duplex communication, yet most existing neural networks have high computational costs and are inflexible in tuning model complexity. In this paper, we introduce time-frequency dual-path compression to achieve a wide range of compression ratios on computational cost. Specifically, for frequency compression, trainable filters are used to r…

    Submitted 10 October, 2023; v1 submitted 21 August, 2023; originally announced August 2023.

    Comments: Proceedings of INTERSPEECH

  7. arXiv:2308.06981  [pdf, other]

    eess.AS cs.SD

    The Sound Demixing Challenge 2023 – Cinematic Demixing Track

    Authors: Stefan Uhlich, Giorgio Fabbro, Masato Hirano, Shusuke Takahashi, Gordon Wichern, Jonathan Le Roux, Dipam Chakraborty, Sharada Mohanty, Kai Li, Yi Luo, Jianwei Yu, Rongzhi Gu, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Mikhail Sukhovei, Yuki Mitsufuji

    Abstract: This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. Especially, we detail CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most succes…

    Submitted 18 April, 2024; v1 submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted for Transactions of the International Society for Music Information Retrieval

  8. arXiv:2304.08052  [pdf, other]

    cs.SD eess.AS

    Fast Random Approximation of Multi-channel Room Impulse Response

    Authors: Yi Luo, Rongzhi Gu

    Abstract: Modern neural-network-based speech processing systems are typically required to be robust against reverberation, and the training of such systems thus needs a large amount of reverberant data. During the training of the systems, on-the-fly simulation pipeline is nowadays preferred as it allows the model to train on infinite number of data samples without pre-generating and saving them on harddisk.…

    Submitted 17 April, 2023; originally announced April 2023.

  9. arXiv:2302.13462  [pdf, other]

    cs.SD eess.AS

    3D Neural Beamforming for Multi-channel Speech Separation Against Location Uncertainty

    Authors: Rongzhi Gu, Shi-Xiong Zhang, Dong Yu

    Abstract: Multi-channel speech separation using speaker's directional information has demonstrated significant gains over blind speech separation. However, it has two limitations. First, substantial performance degradation is observed when the coming directions of two sounds are close. Second, the result highly relies on the precise estimation of the speaker's direction. To overcome these issues, this paper…

    Submitted 26 February, 2023; originally announced February 2023.

  10. arXiv:2212.08348  [pdf, other]

    cs.SD eess.AS

    Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation

    Authors: Rongzhi Gu, Shi-Xiong Zhang, Yuexian Zou, Dong Yu

    Abstract: Recently, frequency domain all-neural beamforming methods have achieved remarkable progress for multichannel speech separation. In parallel, the integration of time domain network structure and beamforming also gains significant attention. This study proposes a novel all-neural beamforming method in time domain and makes an attempt to unify the all-neural beamforming pipelines for time domain and…

    Submitted 23 December, 2022; v1 submitted 16 December, 2022; originally announced December 2022.

  11. arXiv:2212.07068  [pdf, other]

    eess.AS

    Probing Deep Speaker Embeddings for Speaker-related Tasks

    Authors: Zifeng Zhao, Ding Pan, Junyi Peng, Rongzhi Gu

    Abstract: Deep speaker embeddings have shown promising results in speaker recognition, as well as in other speaker-related tasks. However, some issues are still under explored, for instance, the information encoded in these representations and their influence on downstream tasks. Four deep speaker embeddings are studied in this paper, namely, d-vector, x-vector, ResNetSE-34 and ECAPA-TDNN. Inspired by human…

    Submitted 14 December, 2022; originally announced December 2022.

  12. arXiv:2212.00406  [pdf, other]

    eess.AS

    High Fidelity Speech Enhancement with Band-split RNN

    Authors: Jianwei Yu, Yi Luo, Hangting Chen, Rongzhi Gu, Chao Weng

    Abstract: Despite the rapid progress in speech enhancement (SE) research, enhancing the quality of desired speech in environments with strong noise and interfering speakers remains challenging. In this paper, we extend the application of the recently proposed band-split RNN (BSRNN) model to full-band SE and personalized SE (PSE) tasks. To mitigate the effects of unstable high-frequency components in full-ba…

    Submitted 6 June, 2023; v1 submitted 1 December, 2022; originally announced December 2022.

  13. arXiv:2210.16032  [pdf, other]

    eess.AS cs.SD eess.SP

    Parameter-efficient transfer learning of pre-trained Transformer models for speaker verification using adapters

    Authors: Junyi Peng, Themos Stafylakis, Rongzhi Gu, Oldřich Plchot, Ladislav Mošner, Lukáš Burget, Jan Černocký

    Abstract: Recently, the pre-trained Transformer models have received a rising interest in the field of speech processing thanks to their great success in various downstream tasks. However, most fine-tuning approaches update all the parameters of the pre-trained model, which becomes prohibitive as the model size grows and sometimes results in overfitting on small datasets. In this paper, we conduct a compreh…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: submitted to ICASSP2023

  14. PyMIC: A deep learning toolkit for annotation-efficient medical image segmentation

    Authors: Guotai Wang, Xiangde Luo, Ran Gu, Shuojue Yang, Yijie Qu, Shuwei Zhai, Qianfei Zhao, Kang Li, Shaoting Zhang

    Abstract: Background and Objective: Open-source deep learning toolkits are one of the driving forces for developing medical image segmentation models. Existing toolkits mainly focus on fully supervised segmentation and require full and accurate pixel-level annotations that are time-consuming and difficult to acquire for segmentation tasks, which makes learning from imperfect labels highly desired for reduci…

    Submitted 4 February, 2023; v1 submitted 19 August, 2022; originally announced August 2022.

    Comments: 12 pages, 6 figures

    Journal ref: Computer Methods and Programs in Biomedicine, Volume 231, April 2023, 107398

  15. arXiv:2206.06813  [pdf, other]

    eess.IV cs.CV cs.LG

    Learning towards Synchronous Network Memorizability and Generalizability for Continual Segmentation across Multiple Sites

    Authors: Jingyang Zhang, Peng Xue, Ran Gu, Yuning Gu, Mianxin Liu, Yongsheng Pan, Zhiming Cui, Jiawei Huang, Lei Ma, Dinggang Shen

    Abstract: In clinical practice, a segmentation network is often required to continually learn on a sequential data stream from multiple sites rather than a consolidated set, due to the storage cost and privacy restriction. However, during the continual learning process, existing methods are usually restricted in either network memorizability on previous sites or generalizability on unseen sites. This paper…

    Submitted 27 June, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

    Comments: Early accepted in MICCAI2022

  16. arXiv:2205.14833  [pdf, other]

    cs.LG cs.DC eess.SY

    Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning

    Authors: Chengfei Lv, Chaoyue Niu, Renjie Gu, Xiaotang Jiang, Zhaode Wang, Bin Liu, Ziqi Wu, Qiulin Yao, Congyu Huang, Panos Huang, Tao Huang, Hui Shu, Jinde Song, Bin Zou, Peng Lan, Guohuan Xu, Fei Wu, Shaojie Tang, Fan Wu, Guihai Chen

    Abstract: To break the bottlenecks of mainstream cloud-based machine learning (ML) paradigm, we adopt device-cloud collaborative ML and build the first end-to-end and general-purpose system, called Walle, as the foundation. Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing a c…

    Submitted 29 May, 2022; originally announced May 2022.

    Comments: Accepted by OSDI 2022

  17. Improving Dual-Microphone Speech Enhancement by Learning Cross-Channel Features with Multi-Head Attention

    Authors: Xinmeng Xu, Rongzhi Gu, Yuexian Zou

    Abstract: Hand-crafted spatial features, such as inter-channel intensity difference (IID) and inter-channel phase difference (IPD), play a fundamental role in recent deep learning based dual-microphone speech enhancement (DMSE) systems. However, learning the mutual relationship between artificially designed spatial and spectral features is hard in the end-to-end DMSE. In this work, a novel architecture for…

    Submitted 2 May, 2022; originally announced May 2022.

    Comments: Accepted by ICASSP 2022

  18. arXiv:2204.07375  [pdf, other]

    eess.AS cs.SD

    Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction

    Authors: Zifeng Zhao, Rongzhi Gu, Dongchao Yang, Jinchuan Tian, Yuexian Zou

    Abstract: Dominant researches adopt supervised training for speaker extraction, while the scarcity of ideally clean corpus and channel mismatch problem are rarely considered. To this end, we propose speaker-aware mixture of mixtures training (SAMoM), utilizing the consistency of speaker identity among target source, enrollment utterance and target estimate to weakly supervise the training of a deep speaker…

    Submitted 15 April, 2022; originally announced April 2022.

    Comments: 5 pages, 4 tables, 4 figures. Submitted to INTERSPEECH 2022

  19. arXiv:2204.01355  [pdf, other]

    eess.AS cs.SD

    Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches

    Authors: Zifeng Zhao, Dongchao Yang, Rongzhi Gu, Haoran Zhang, Yuexian Zou

    Abstract: Recently, end-to-end speaker extraction has attracted increasing attention and shown promising results. However, its performance is often inferior to that of a blind source separation (BSS) counterpart with a similar network architecture, due to the auxiliary speaker encoder may sometimes generate ambiguous speaker embeddings. Such ambiguous guidance information may confuse the separation network…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: 5 pages, 1 table, 5 figures. Submitted to INTERSPEECH 2022

  20. arXiv:2203.16772  [pdf, other]

    cs.SD cs.AI eess.AS

    Learning Decoupling Features Through Orthogonality Regularization

    Authors: Li Wang, Rongzhi Gu, Weiji Zhuang, Peng Gao, Yujun Wang, Yuexian Zou

    Abstract: Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications. Research shows that the state-of-art KWS and SV models are trained independently using different datasets since they expect to learn distinctive acoustic features. However, humans can distinguish language content and the speaker identity simultaneously. Motivated by this, we believe it is important…

    Submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted at ICASSP 2022

  21. arXiv:2111.10773  [pdf, other]

    eess.IV cs.CV

    One-shot Weakly-Supervised Segmentation in Medical Images

    Authors: Wenhui Lei, Qi Su, Ran Gu, Na Wang, Xinglong Liu, Guotai Wang, Xiaofan Zhang, Shaoting Zhang

    Abstract: Deep neural networks usually require accurate and a large number of annotations to achieve outstanding performance in medical image segmentation. One-shot segmentation and weakly-supervised learning are promising research directions that lower labeling effort by learning a new class from only one annotated image and utilizing coarse labels instead, respectively. Previous works usually fail to leve…

    Submitted 21 November, 2021; originally announced November 2021.

  22. arXiv:2111.10372  [pdf, other]

    eess.IV cs.CV

    Resistance-Time Co-Modulated PointNet for Temporal Super-Resolution Simulation of Blood Vessel Flows

    Authors: Zhizheng Jiang, Fei Gao, Renshu Gu, Jinlan Xu, Gang Xu, Timon Rabczuk

    Abstract: In this paper, a novel deep learning framework is proposed for temporal super-resolution simulation of blood vessel flows, in which a high-temporal-resolution time-varying blood vessel flow simulation is generated from a low-temporal-resolution flow simulation result. In our framework, point-cloud is used to represent the complex blood vessel model, resistance-time aided PointNet model is proposed…

    Submitted 19 November, 2021; originally announced November 2021.

  23. arXiv:2109.08852  [pdf, other]

    eess.IV cs.CV

    Domain Composition and Attention for Unseen-Domain Generalizable Medical Image Segmentation

    Authors: Ran Gu, Jingyang Zhang, Rui Huang, Wenhui Lei, Guotai Wang, Shaoting Zhang

    Abstract: Domain generalizable model is attracting increasing attention in medical image analysis since data is commonly acquired from different institutes with various imaging protocols and scanners. To tackle this challenging domain generalization problem, we propose a Domain Composition and Attention-based network (DCA-Net) to improve the ability of domain representation and generalization. First, we pre…

    Submitted 18 September, 2021; originally announced September 2021.

    Comments: Accepted by MICCAI 2021

  24. arXiv:2108.07014  [pdf, ps, other]

    cs.IT eess.SY

    Robust Beamforming Design for Rate Splitting Multiple Access-Aided MISO Visible Light Communications

    Authors: Shuai Ma, Guanjie Zhang, Zhi Zhang, Rongyan Gu

    Abstract: In this paper, we focus on the optimal beamformer design for rate splitting multiple access (RSMA)-aided multiple-input single-output (MISO) visible light communication (VLC) networks. First, we derive the closed-form lower bounds of the achievable rate of each user, which are the first theoretical bound of achievable rate for RSMA-aided VLC networks. Second, we investigate the optimal beamformer d…

    Submitted 1 November, 2022; v1 submitted 16 August, 2021; originally announced August 2021.

  25. arXiv:2105.02674  [pdf, other]

    eess.IV cs.CV

    SS-CADA: A Semi-Supervised Cross-Anatomy Domain Adaptation for Coronary Artery Segmentation

    Authors: Jingyang Zhang, Ran Gu, Guotai Wang, Hongzhi Xie, Lixu Gu

    Abstract: The segmentation of coronary arteries by convolutional neural network is promising yet requires a large amount of labor-intensive manual annotations. Transferring knowledge from retinal vessels in widely-available public labeled fundus images (FIs) has a potential to reduce the annotation requirement for coronary artery segmentation in X-ray angiograms (XAs) due to their common tubular structures.…

    Submitted 6 May, 2021; originally announced May 2021.

  26. arXiv:2105.00812  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Layer Reduction: Accelerating Conformer-Based Self-Supervised Model via Layer Consistency

    Authors: Jinchuan Tian, Rongzhi Gu, Helin Wang, Yuexian Zou

    Abstract: Transformer-based self-supervised models are trained as feature extractors and have empowered many downstream speech tasks to achieve state-of-the-art performance. However, both the training and inference process of these models may encounter prohibitively high computational cost and large parameter budget. Although Parameter Sharing Strategy (PSS) proposed in ALBERT paves the way for parameter re…

    Submitted 8 April, 2021; originally announced May 2021.

    Comments: 5 pages, 3 figures, submit to Interspeech2021

  27. Complex Neural Spatial Filter: Enhancing Multi-channel Target Speech Separation in Complex Domain

    Authors: Rongzhi Gu, Shi-Xiong Zhang, Yuexian Zou, Dong Yu

    Abstract: To date, mainstream target speech separation (TSS) approaches are formulated to estimate the complex ratio mask (cRM) of the target speech in time-frequency domain under supervised deep learning framework. However, the existing deep models for estimating cRM are designed in the way that the real and imaginary parts of the cRM are separately modeled using real-valued training data pairs. The resear…

    Submitted 26 April, 2021; originally announced April 2021.

    Comments: 5 pages, 3 figures

  28. arXiv:2102.01897  [pdf, other]

    eess.IV cs.CV

    Automatic Segmentation of Organs-at-Risk from Head-and-Neck CT using Separable Convolutional Neural Network with Hard-Region-Weighted Loss

    Authors: Wenhui Lei, Haochen Mei, Zhengwentai Sun, Shan Ye, Ran Gu, Huan Wang, Rui Huang, Shichuan Zhang, Shaoting Zhang, Guotai Wang

    Abstract: Nasopharyngeal Carcinoma (NPC) is a leading form of Head-and-Neck (HAN) cancer in the Arctic, China, Southeast Asia, and the Middle East/North Africa. Accurate segmentation of Organs-at-Risk (OAR) from Computed Tomography (CT) images with uncertainty information is critical for effective planning of radiation therapy for NPC treatment. Despite the state-of-the-art performance achieved by Convolutio…

    Submitted 3 February, 2021; originally announced February 2021.

    Comments: Accepted by Neurocomputing

  29. arXiv:2101.11254  [pdf, other]

    eess.IV cs.CV

    Automatic Segmentation of Gross Target Volume of Nasopharynx Cancer using Ensemble of Multiscale Deep Neural Networks with Spatial Attention

    Authors: Haochen Mei, Wenhui Lei, Ran Gu, Shan Ye, Zhengwentai Sun, Shichuan Zhang, Guotai Wang

    Abstract: Radiotherapy is the main treatment modality for nasopharynx cancer. Delineation of Gross Target Volume (GTV) from medical images such as CT and MRI images is a prerequisite for radiotherapy. As manual delineation is time-consuming and laborious, automatic segmentation of GTV has a potential to improve this process. Currently, most of the deep learning-based automatic delineation methods of GTV are…

    Submitted 27 January, 2021; originally announced January 2021.

  30. CA-Net: Comprehensive Attention Convolutional Neural Networks for Explainable Medical Image Segmentation

    Authors: Ran Gu, Guotai Wang, Tao Song, Rui Huang, Michael Aertsen, Jan Deprest, Sébastien Ourselin, Tom Vercauteren, Shaoting Zhang

    Abstract: Accurate medical image segmentation is essential for diagnosis and treatment planning of diseases. Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance for automatic medical image segmentation. However, they are still challenged by complicated conditions where the segmentation target has large variations of position, shape and scale, and existing CNNs have a poor explain…

    Submitted 22 September, 2020; v1 submitted 22 September, 2020; originally announced September 2020.

  31. arXiv:2005.08571  [pdf, other]

    eess.AS cs.CL cs.SD

    Audio-visual Multi-channel Recognition of Overlapped Speech

    Authors: Jianwei Yu, Bo Wu, Rongzhi Gu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Dong Yu, Xunying Liu, Helen Meng

    Abstract: Automatic speech recognition (ASR) of overlapped speech remains a highly challenging task to date. To this end, multi-channel microphone array data are widely used in state-of-the-art ASR systems. Motivated by the invariance of visual modality to acoustic signal corruption, this paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separatio…

    Submitted 18 November, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

    Comments: submitted to Interspeech 2020

  32. arXiv:2003.07032  [pdf, other]

    eess.AS cs.SD eess.IV

    Multi-modal Multi-channel Target Speech Separation

    Authors: Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Lianwu Chen, Yuexian Zou, Dong Yu

    Abstract: Target speech separation refers to extracting a target speaker's voice from an overlapped audio of simultaneous talkers. Previously the use of visual modality for target speech separation has demonstrated great potentials. This work proposes a general multi-modal framework for target speech separation by utilizing all the available information of the target speaker, including his/her spatial locat…

    Submitted 23 October, 2020; v1 submitted 16 March, 2020; originally announced March 2020.

    Comments: accepted in IEEE Journal of Selected Topics in Signal Processing

  33. arXiv:2003.03927  [pdf, other]

    eess.AS cs.SD

    Enhancing End-to-End Multi-channel Speech Separation via Spatial Feature Learning

    Authors: Rongzhi Gu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, Dong Yu

    Abstract: Hand-crafted spatial features (e.g., inter-channel phase difference, IPD) play a fundamental role in recent deep learning based multi-channel speech separation (MCSS) methods. However, these manually designed spatial features are hard to incorporate into the end-to-end optimized MCSS framework. In this work, we propose an integrated architecture for learning spatial features directly from the mult…

    Submitted 13 March, 2020; v1 submitted 9 March, 2020; originally announced March 2020.

    Comments: accepted in ICASSP 2020

  34. arXiv:2001.00391  [pdf, other]

    cs.SD cs.LG eess.AS

    Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation

    Authors: Rongzhi Gu, Yuexian Zou

    Abstract: Target speech separation refers to extracting the target speaker's speech from mixed signals. Despite the recent advances in deep learning based close-talk speech separation, the applications to real-world are still an open issue. Two main challenges are the complex acoustic environment and the real-time processing requirement. To address these challenges, we propose a temporal-spatial neural filt…

    Submitted 2 January, 2020; originally announced January 2020.

  35. arXiv:1905.07497  [pdf, other]

    cs.SD cs.LG eess.AS

    A comprehensive study of speech separation: spectrogram vs waveform separation

    Authors: Fahimeh Bahmaninezhad, Jian Wu, Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu

    Abstract: Speech separation has been studied widely for single-channel close-talk microphone recordings over the past few years; developed solutions are mostly in frequency-domain. Recently, a raw audio waveform separation network (TasNet) is introduced for single-channel data, with achieving high Si-SNR (scale-invariant source-to-noise ratio) and SDR (source-to-distortion ratio) comparing against the state…

    Submitted 23 July, 2019; v1 submitted 17 May, 2019; originally announced May 2019.

    Comments: INTERSPEECH 2019

  36. arXiv:1905.06286  [pdf, other]

    cs.SD cs.LG eess.AS

    End-to-End Multi-Channel Speech Separation

    Authors: Rongzhi Gu, Jian Wu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, Dong Yu

    Abstract: The end-to-end approach for single-channel speech separation has been studied recently and shown promising results. This paper extended the previous approach and proposed a new end-to-end model for multi-channel speech separation. The primary contributions of this work include 1) an integrated waveform-in waveform-out separation system in a single neural network architecture. 2) We reformulate the…

    Submitted 27 May, 2019; v1 submitted 15 May, 2019; originally announced May 2019.

    Comments: submitted to interspeech 2019