Showing 1–50 of 161 results for author: Yu, D

  1. arXiv:2406.11175  [pdf, other]

    cs.SD eess.AS

    SMRU: Split-and-Merge Recurrent-based UNet for Acoustic Echo Cancellation and Noise Suppression

    Authors: Zhihang Sun, Andong Li, Rilin Chen, Hao Zhang, Meng Yu, Yi Zhou, Dong Yu

    Abstract: The proliferation of deep neural networks has spawned the rapid development of acoustic echo cancellation and noise suppression, and plenty of prior arts have been proposed, which yield promising performance. Nevertheless, they rarely consider the deployment generality in different processing scenarios, such as edge devices, and cloud processing. To this end, this paper proposes a general model, t…

    Submitted 16 June, 2024; originally announced June 2024.

  2. arXiv:2406.09589  [pdf, other]

    eess.AS

    Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment

    Authors: Yiwen Shao, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Daniel Povey, Sanjeev Khudanpur

    Abstract: In the field of multi-channel, multi-speaker Automatic Speech Recognition (ASR), the task of discerning and accurately transcribing a target speaker's speech within background noise remains a formidable challenge. Traditional approaches often rely on microphone array configurations and the information of the target speaker's location or voiceprint. This study introduces the Solo Spatial Feature (S…

    Submitted 17 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted for presentation at Interspeech 2024

  3. arXiv:2406.04350  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Prompt-guided Precise Audio Editing with Diffusion Models

    Authors: Manjie Xu, Chenxing Li, Duzhen Zhang, Dan Su, Wei Liang, Dong Yu

    Abstract: Audio editing involves the arbitrary manipulation of audio content through precise control. Although text-guided diffusion models have made significant advancements in text-to-audio generation, they still face challenges in finding a flexible and precise way to modify target events within an audio track. We present a novel approach, referred to as PPAE, which serves as a general module for diffusi…

    Submitted 11 May, 2024; originally announced June 2024.

    Comments: Accepted by ICML 2024

  4. arXiv:2406.00976  [pdf, other]

    cs.CL cs.SD eess.AS

    Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer

    Authors: Yongxin Zhu, Dan Su, Liqiang He, Linli Xu, Dong Yu

    Abstract: While recent advancements in speech language models have achieved significant progress, they face remarkable challenges in modeling the long acoustic sequences of neural audio codecs. In this paper, we introduce Generative Pre-trained Speech Transformer (GPST), a hierarchical transformer designed for efficient speech language modeling. GPST quantizes audio wavef…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 (main conference)

  5. arXiv:2404.08549  [pdf]

    eess.IV cs.CV physics.bio-ph

    Benchmarking the Cell Image Segmentation Models Robustness under the Microscope Optical Aberrations

    Authors: Boyuan Peng, Jiaju Chen, Qihui Ye, Minjiang Chen, Peiwu Qin, Chenggang Yan, Dongmei Yu, Zhenglin Chen

    Abstract: Cell segmentation is essential in biomedical research for analyzing cellular morphology and behavior. Deep learning methods, particularly convolutional neural networks (CNNs), have revolutionized cell segmentation by extracting intricate features from images. However, the robustness of these methods under microscope optical aberrations remains a critical challenge. This study comprehensively evalu…

    Submitted 12 April, 2024; originally announced April 2024.

  6. arXiv:2404.08453  [pdf, other]

    cs.LG eess.SY

    Lightweight Multi-System Multivariate Interconnection and Divergence Discovery

    Authors: Mulugeta Weldezgina Asres, Christian Walter Omlin, Jay Dittmann, Pavel Parygin, Joshua Hiltbrand, Seth I. Cooper, Grace Cummings, David Yu

    Abstract: Identifying outlier behavior among sensors and subsystems is essential for discovering faults and facilitating diagnostics in large systems. At the same time, exploring large systems with numerous multivariate data sets is challenging. This study presents a lightweight interconnection and divergence discovery mechanism (LIDD) to identify abnormal behavior in multi-system environments. The approach…

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: 8 pages, 12 figures

  7. arXiv:2404.08285  [pdf]

    cs.CV cs.AI eess.SY

    A Survey of Neural Network Robustness Assessment in Image Recognition

    Authors: Jie Wang, Jun Ai, Minyan Lu, Haoran Su, Dan Yu, Yutao Zhang, Junda Zhu, Jingyu Liu

    Abstract: In recent years, there has been significant attention given to the robustness assessment of neural networks. Robustness plays a critical role in ensuring reliable operation of artificial intelligence (AI) systems in complex and uncertain environments. Deep learning's robustness problem is particularly significant, highlighted by the discovery of adversarial attacks on image classification models.…

    Submitted 15 April, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

    Comments: Corrected typos and grammatical errors in Section 5

  8. arXiv:2403.02307  [pdf, other]

    eess.IV cs.CV

    Harnessing Intra-group Variations Via a Population-Level Context for Pathology Detection

    Authors: P. Bilha Githinji, Xi Yuan, Zhenglin Chen, Ijaz Gul, Dingqi Shang, Wen Liang, Jianming Deng, Dan Zeng, Dongmei Yu, Chenggang Yan, Peiwu Qin

    Abstract: Realizing sufficient separability between the distributions of healthy and pathological samples is a critical obstacle for pathology detection convolutional models. Moreover, these models exhibit a bias for contrast-based images, with diminished performance on texture-based medical images. This study introduces the notion of a population-level context for pathology detection and employs a graph th…

    Submitted 4 March, 2024; originally announced March 2024.

  9. arXiv:2402.01828  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Retrieval Augmented End-to-End Spoken Dialog Models

    Authors: Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey

    Abstract: We recently developed SLM, a joint speech and language model, which fuses a pretrained foundational speech model and a large language model (LLM), while preserving the in-context learning capability intrinsic to the pretrained LLM. In this paper, we apply SLM to speech dialog applications where the dialog states are inferred directly from the audio signal. Task-oriented dialogs often contain dom…

    Submitted 2 February, 2024; originally announced February 2024.

    Journal ref: Proc. ICASSP 2024

  10. arXiv:2312.06101  [pdf, other]

    eess.IV cs.CV

    Hundred-Kilobyte Lookup Tables for Efficient Single-Image Super-Resolution

    Authors: Binxiao Huang, Jason Chun Lok Li, Jie Ran, Boyu Li, Jiajun Zhou, Dahai Yu, Ngai Wong

    Abstract: Conventional super-resolution (SR) schemes make heavy use of convolutional neural networks (CNNs), which involve intensive multiply-accumulate (MAC) operations, and require specialized hardware such as graphics processing units. This contradicts the regime of edge AI that often runs on devices strained by power, computing, and storage resources. Such a challenge has motivated a series of lookup ta…

    Submitted 8 May, 2024; v1 submitted 10 December, 2023; originally announced December 2023.

  11. arXiv:2311.13075  [pdf, other]

    eess.AS

    Deep Audio Zooming: Beamwidth-Controllable Neural Beamformer

    Authors: Meng Yu, Dong Yu

    Abstract: Audio zooming, a signal processing technique, enables selective focusing and enhancement of sound signals from a specified region, attenuating others. While traditional beamforming and neural beamforming techniques, centered on creating a directional array, necessitate the designation of a singular target direction, they often overlook the concept of a field of view (FOV), that defines an angular…

    Submitted 21 November, 2023; originally announced November 2023.

    Comments: 6 pages, 5 figures

  12. arXiv:2311.07202  [pdf, other]

    cs.LG cs.CE eess.SY

    Real-Time Machine-Learning-Based Optimization Using Input Convex LSTM

    Authors: Zihao Wang, Donghan Yu, Zhe Wu

    Abstract: Neural network-based optimization and control have gradually supplanted first-principles model-based approaches in energy and manufacturing systems due to their efficient, data-driven process modeling that requires fewer resources. However, their non-convex nature significantly slows down the optimization and control processes, limiting their application in real-time decision-making processes. To…

    Submitted 26 June, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

  13. arXiv:2311.00146  [pdf, other]

    eess.AS cs.AI

    RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios

    Authors: Yiwen Shao, Shi-Xiong Zhang, Dong Yu

    Abstract: Automatic speech recognition (ASR) on multi-talker recordings is challenging. Current methods using 3D spatial data from multi-channel audio and visual cues focus mainly on direct waves from the target speaker, overlooking reflection wave impacts, which hinders performance in reverberant environments. Our research introduces RIR-SF, a novel spatial feature based on room impulse response (RIR) that…

    Submitted 11 June, 2024; v1 submitted 31 October, 2023; originally announced November 2023.

    Comments: Accepted for presentation at Interspeech 2024

  14. arXiv:2310.16367  [pdf, other]

    eess.AS

    UniX-Encoder: A Universal $X$-Channel Speech Encoder for Ad-Hoc Microphone Array Speech Processing

    Authors: Zili Huang, Yiwen Shao, Shi-Xiong Zhang, Dong Yu

    Abstract: The speech field is evolving to solve more challenging scenarios, such as multi-channel recordings with multiple simultaneous talkers. Given the many types of microphone setups out there, we present the UniX-Encoder. It's a universal encoder designed for multiple tasks that works with any microphone array, in both solo and multi-talker environments. Our research enhances previous multi-channel sp…

    Submitted 25 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP 2024

  15. arXiv:2310.11954  [pdf, other]

    cs.CL cs.MM eess.AS

    MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models

    Authors: Dingyao Yu, Kaitao Song, Peiling Lu, Tianyu He, Xu Tan, Wei Ye, Shikun Zhang, Jiang Bian

    Abstract: AI-empowered music processing is a diverse field that encompasses dozens of tasks, ranging from generation tasks (e.g., timbre synthesis) to comprehension tasks (e.g., music classification). For developers and amateurs, it is very difficult to grasp all of these tasks to satisfy their requirements in music processing, especially considering the huge differences in the representations of music data…

    Submitted 25 October, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

  16. arXiv:2310.10992  [pdf, other]

    cs.SD eess.AS

    A High Fidelity and Low Complexity Neural Audio Coding

    Authors: Wenzhe Liu, Wei Xiao, Meng Wang, Shan Yang, Yupeng Shi, Yuyong Kang, Dan Su, Shidong Shang, Dong Yu

    Abstract: Audio coding is an essential module in the real-time communication system. Neural audio codecs can compress audio samples with a low bitrate due to the strong modeling and generative capabilities of deep neural networks. To address the poor high-frequency expression and high computational cost and storage consumption, we proposed an integrated framework that utilizes a neural network to model wide…

    Submitted 17 October, 2023; originally announced October 2023.

  17. arXiv:2310.01292  [pdf, other]

    cs.CV cs.LG eess.IV

    Efficient Remote Sensing Segmentation With Generative Adversarial Transformer

    Authors: Luyi Qiu, Dayu Yu, Xiaofeng Zhang, Chenxiao Zhang

    Abstract: Most deep learning methods that achieve high segmentation accuracy require deep network architectures that are too heavy and complex to run on embedded devices with limited storage and memory space. To address this issue, this paper proposes an efficient Generative Adversarial Transformer (GATrans) for achieving high-precision semantic segmentation while maintaining an extremely efficient size. The…

    Submitted 2 October, 2023; originally announced October 2023.

  18. arXiv:2310.00900  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models

    Authors: Muqiao Yang, Chunlei Zhang, Yong Xu, Zhongweiyang Xu, Heming Wang, Bhiksha Raj, Dong Yu

    Abstract: Speech enhancement aims to improve speech signals in terms of quality and intelligibility, and speech editing refers to the process of editing the speech according to specific user needs. In this paper, we propose a Unified Speech Enhancement and Editing (uSee) model with conditional diffusion models to handle various tasks at the same time in a generative manner. Specifically, by p…

    Submitted 2 October, 2023; originally announced October 2023.

  19. arXiv:2310.00230  [pdf, other]

    cs.CL cs.SD eess.AS

    SLM: Bridge the thin gap between speech and text foundation models

    Authors: Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Yongqiang Wang, Nanxin Chen, Yu Zhang, Hagen Soltau, Paul Rubenstein, Lukas Zilka, Dian Yu, Zhong Meng, Golan Pundak, Nikhil Siddhartha, Johan Schalkwyk, Yonghui Wu

    Abstract: We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. SLM freezes the pretrained foundation models to maximally preserve their capabilities, and only trains a simple adapter with just 1% (156M) of the foundation models' parameters. This adaptation not only leads SLM to achiev…

    Submitted 29 September, 2023; originally announced October 2023.

  20. arXiv:2309.16049  [pdf, other]

    eess.AS cs.SD eess.SP

    Neural Network Augmented Kalman Filter for Robust Acoustic Howling Suppression

    Authors: Yixuan Zhang, Hao Zhang, Meng Yu, Dong Yu

    Abstract: Acoustic howling suppression (AHS) is a critical challenge in audio communication systems. In this paper, we propose a novel approach that leverages the power of neural networks (NN) to enhance the performance of traditional Kalman filter algorithms for AHS. Specifically, our method involves the integration of NN modules into the Kalman filter, enabling refinement of the reference signal, a key factor in e…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Paper in submission

  21. arXiv:2309.16048  [pdf, other]

    eess.AS cs.SD eess.SP

    Advancing Acoustic Howling Suppression through Recursive Training of Neural Networks

    Authors: Hao Zhang, Yixuan Zhang, Meng Yu, Dong Yu

    Abstract: In this paper, we introduce a novel training framework designed to comprehensively address the acoustic howling issue by examining its fundamental formation process. This framework integrates a neural network (NN) module into the closed-loop system during training with signals generated recursively on the fly to closely mimic the streaming process of acoustic howling suppression (AHS). The propose…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Paper in submission

  22. arXiv:2309.09028  [pdf, other]

    eess.AS cs.SD

    Unifying Robustness and Fidelity: A Comprehensive Study of Pretrained Generative Methods for Speech Enhancement in Adverse Conditions

    Authors: Heming Wang, Meng Yu, Hao Zhang, Chunlei Zhang, Zhongweiyang Xu, Muqiao Yang, Yixuan Zhang, Dong Yu

    Abstract: Enhancing speech signal quality in adverse acoustic environments is a persistent challenge in speech processing. Existing deep learning based enhancement methods often struggle to effectively remove background noise and reverberation in real-world scenarios, hampering listening experiences. To address these challenges, we propose a novel approach that uses pre-trained generative methods to resynth…

    Submitted 16 September, 2023; originally announced September 2023.

    Comments: Paper in submission

  23. arXiv:2309.07432  [pdf, other]

    cs.SD eess.AS

    SpatialCodec: Neural Spatial Speech Coding

    Authors: Zhongweiyang Xu, Yong Xu, Vinay Kothapally, Heming Wang, Muqiao Yang, Dong Yu

    Abstract: In this work, we address the challenge of encoding speech captured by a microphone array using deep learning techniques with the aim of preserving and accurately reconstructing crucial spatial cues embedded in multi-channel recordings. We propose a neural spatial audio coding framework that achieves a high compression ratio, leveraging single-channel neural sub-band codec and SpatialCodec. Our app…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: Paper in Submission

  24. arXiv:2309.07416  [pdf, other]

    cs.SD eess.AS

    M3-AUDIODEC: Multi-channel multi-speaker multi-spatial audio codec

    Authors: Anton Ratnarajah, Shi-Xiong Zhang, Dong Yu

    Abstract: We introduce M3-AUDIODEC, an innovative neural spatial audio codec designed for efficient compression of multi-channel (binaural) speech in both single and multi-speaker scenarios, while retaining the spatial location information of each speaker. This model boasts versatility, allowing configuration and training tailored to a predetermined set of multi-channel, multi-speaker, and multi-spatial ove…

    Submitted 22 September, 2023; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: More results and source code are available at https://anton-jeran.github.io/MAD/

  25. Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

    Authors: Jiaxu Zhu, Weinan Tong, Yaoxun Xu, Changhe Song, Zhiyong Wu, Zhao You, Dan Su, Dong Yu, Helen Meng

    Abstract: Mapping two modalities, speech and text, into a shared representation space, is a research topic of using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, the length of speech representation and text representation is inconsistent. Although the previous method up-samples the text representation to align with acoustic modality, it may not…

    Submitted 7 October, 2023; v1 submitted 4 September, 2023; originally announced September 2023.

    Comments: Proceedings of Interspeech. arXiv admin note: text overlap with arXiv:2309.01437

  26. arXiv:2308.16551  [pdf]

    eess.IV cs.CV

    Object Detection for Caries or Pit and Fissure Sealing Requirement in Children's First Permanent Molars

    Authors: Chenyao Jiang, Shiyao Zhai, Hengrui Song, Yuqing Ma, Yachen Fan, Yancheng Fang, Dongmei Yu, Canyang Zhang, Sanyang Han, Runming Wang, Yong Liu, Jianbo Li, Peiwu Qin

    Abstract: Dental caries is one of the most common oral diseases that, if left untreated, can lead to a variety of oral problems. It mainly occurs inside the pits and fissures on the occlusal/buccal/palatal surfaces of molars and children are a high-risk group for pit and fissure caries in permanent molars. Pit and fissure sealing is one of the most effective methods that is widely used in prevention of pit…

    Submitted 31 August, 2023; originally announced August 2023.

  27. arXiv:2308.04126  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation

    Authors: Dongyang Yu, Shihao Wang, Yuan Fang, Wangpeng An

    Abstract: This paper presents OmniDataComposer, an innovative approach for multimodal data fusion and unlimited data generation with an intent to refine and uncomplicate interplay among diverse data modalities. Coming to the core breakthrough, it introduces a cohesive data structure proficient in processing and merging multimodal data inputs, which include video, audio, and text. Our crafted algorithm lev…

    Submitted 17 August, 2023; v1 submitted 8 August, 2023; originally announced August 2023.

  28. arXiv:2306.07944  [pdf, other]

    eess.AS cs.AI cs.CL

    Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding

    Authors: Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey

    Abstract: Large Language Models (LLMs) have been applied in the speech domain, often incurring a performance drop due to misalignment between speech and language representations. To bridge this gap, we propose a joint speech and language model (SLM) using a Speech2Text adapter, which maps speech into text token embedding space without speech information loss. Additionally, using a CTC-based blank-filtering, w…

    Submitted 8 June, 2023; originally announced June 2023.

  29. arXiv:2305.19467  [pdf]

    eess.IV cs.CV

    Synthetic CT Generation from MRI using 3D Transformer-based Denoising Diffusion Model

    Authors: Shaoyan Pan, Elham Abouei, Jacob Wynne, Tonghe Wang, Richard L. J. Qiu, Yuheng Li, Chih-Wei Chang, Junbo Peng, Justin Roper, Pretesh Patel, David S. Yu, Hui Mao, Xiaofeng Yang

    Abstract: Magnetic resonance imaging (MRI)-based synthetic computed tomography (sCT) simplifies radiation therapy treatment planning by eliminating the need for CT simulation and error-prone image registration, ultimately reducing patient radiation dose and setup uncertainty. We propose an MRI-to-CT transformer-based denoising diffusion probabilistic model (MC-DDPM) to transform MRI into high-quality sCT to…

    Submitted 30 May, 2023; originally announced May 2023.

  30. arXiv:2305.19269  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Make-A-Voice: Unified Voice Synthesis With Discrete Representation

    Authors: Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Luping Liu, Zhenhui Ye, Ziyue Jiang, Chao Weng, Zhou Zhao, Dong Yu

    Abstract: Various applications of voice synthesis have been developed independently despite the fact that they generate "voice" as output in common. In addition, the majority of voice synthesis models currently rely on annotated audio data, but it is crucial to scale them to self-supervised datasets in order to effectively capture the wide range of acoustic variations present in human voice, including speak…

    Submitted 30 May, 2023; originally announced May 2023.

  31. arXiv:2305.02583  [pdf, other]

    eess.AS cs.SD

    Hybrid AHS: A Hybrid of Kalman Filter and Deep Learning for Acoustic Howling Suppression

    Authors: Hao Zhang, Meng Yu, Yuzhong Wu, Tao Yu, Dong Yu

    Abstract: Deep learning has been recently introduced for efficient acoustic howling suppression (AHS). However, the recurrent nature of howling creates a mismatch between offline training and streaming inference, limiting the quality of enhanced speech. To address this limitation, we propose a hybrid method that combines a Kalman filter with a self-attentive recurrent neural network (SARNN) to leverage thei…

    Submitted 4 May, 2023; originally announced May 2023.

    Comments: submitted to INTERSPEECH 2023. arXiv admin note: text overlap with arXiv:2302.09252

  32. arXiv:2305.01637  [pdf, other]

    eess.AS cs.SD

    Deep Learning for Joint Acoustic Echo and Acoustic Howling Suppression in Hybrid Meetings

    Authors: Hao Zhang, Meng Yu, Dong Yu

    Abstract: Hybrid meetings have become increasingly necessary during the post-COVID period and also brought new challenges for solving audio-related problems. In particular, the interplay between acoustic echo and acoustic howling in a hybrid meeting makes the joint suppression of them difficult. This paper proposes a deep learning approach to tackle this problem by formulating a recurrent feedback suppressi…

    Submitted 4 May, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

  33. arXiv:2304.11029  [pdf, other]

    cs.SD cs.IR eess.AS

    CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic Music Information Retrieval

    Authors: Shangda Wu, Dingyao Yu, Xu Tan, Maosong Sun

    Abstract: We introduce CLaMP: Contrastive Language-Music Pre-training, which learns cross-modal representations between natural language and symbolic music using a music encoder and a text encoder trained jointly with a contrastive loss. To pre-train CLaMP, we collected a large dataset of 1.4 million music-text pairs. It employed text dropout as a data augmentation technique and bar patching to efficiently…

    Submitted 18 October, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

    Comments: 11 pages, 5 figures, 5 tables, accepted by ISMIR 2023

  34. arXiv:2303.02573  [pdf, ps, other]

    eess.SP cs.IT cs.LG

    Learning Decentralized Power Control in Cell-Free Massive MIMO Networks

    Authors: Daesung Yu, Hoon Lee, Seung-Eun Hong, Seok-Hwan Park

    Abstract: This paper studies learning-based decentralized power control methods for cell-free massive multiple-input multiple-output (MIMO) systems where a central processor (CP) controls access points (APs) through fronthaul coordination. To determine the transmission policy of distributed APs, it is essential to develop a network-wide collaborative optimization mechanism. To address this challenge, we des…

    Submitted 4 March, 2023; originally announced March 2023.

    Comments: Accepted for publication in IEEE Transactions on Vehicular Technology

  35. arXiv:2302.13462  [pdf, other]

    cs.SD eess.AS

    3D Neural Beamforming for Multi-channel Speech Separation Against Location Uncertainty

    Authors: Rongzhi Gu, Shi-Xiong Zhang, Dong Yu

    Abstract: Multi-channel speech separation using speaker's directional information has demonstrated significant gains over blind speech separation. However, it has two limitations. First, substantial performance degradation is observed when the coming directions of two sounds are close. Second, the result highly relies on the precise estimation of the speaker's direction. To overcome these issues, this paper…

    Submitted 26 February, 2023; originally announced February 2023.

  36. arXiv:2302.09252  [pdf, other]

    eess.AS cs.SD

    Deep AHS: A Deep Learning Approach to Acoustic Howling Suppression

    Authors: Hao Zhang, Meng Yu, Dong Yu

    Abstract: In this paper, we formulate acoustic howling suppression (AHS) as a supervised learning problem and propose a deep learning approach, called Deep AHS, to address it. Deep AHS is trained in a teacher forcing way which converts the recurrent howling suppression process into an instantaneous speech separation process to simplify the problem and accelerate the model training. The proposed method utili…

    Submitted 17 August, 2023; v1 submitted 18 February, 2023; originally announced February 2023.

    Comments: Accepted for publication in 2023 ICASSP

  37. Neural Target Speech Extraction: An Overview

    Authors: Katerina Zmolikova, Marc Delcroix, Tsubasa Ochiai, Keisuke Kinoshita, Jan Černocký, Dong Yu

    Abstract: Humans can listen to a target speaker even in challenging acoustic conditions that have noise, reverberation, and interfering speakers. This phenomenon is known as the cocktail-party effect. For decades, researchers have focused on approaching the listening ability of humans. One critical issue is handling interfering speakers because the target and non-target speech signals share similar characte…

    Submitted 30 January, 2023; originally announced January 2023.

    Comments: Submitted to IEEE Signal Processing Magazine on Apr. 25, 2022, and accepted on Jan. 12, 2023

  38. arXiv:2301.12363  [pdf, other]

    eess.AS cs.SD

    NeuralKalman: A Learnable Kalman Filter for Acoustic Echo Cancellation

    Authors: Yixuan Zhang, Meng Yu, Hao Zhang, Dong Yu, DeLiang Wang

    Abstract: The robustness of the Kalman filter to double talk and its rapid convergence make it a popular approach for addressing acoustic echo cancellation (AEC) challenges. However, the inability to model nonlinearity and the need to tune control parameters cast limitations on such adaptive filtering algorithms. In this paper, we integrate the frequency domain Kalman filter (FDKF) and deep neural networks…

    Submitted 26 December, 2023; v1 submitted 29 January, 2023; originally announced January 2023.

    Comments: The algorithm was renamed because its earlier name conflicted with the existing KalmanNet algorithm proposed by Revach et al. (arXiv:2107.10043); Accepted by ASRU 2023

  39. arXiv:2301.00656  [pdf, other]

    eess.AS cs.CL cs.LG

    TriNet: stabilizing self-supervised learning from complete or slow collapse on ASR

    Authors: Lixin Cao, Jun Wang, Ben Yang, Dan Su, Dong Yu

    Abstract: Self-supervised learning (SSL) models confront challenges of abrupt informational collapse or slow dimensional collapse. We propose TriNet, which introduces a novel triple-branch architecture for preventing collapse and stabilizing the pre-training. TriNet learns the SSL latent embedding space and incorporates it to a higher level space for predicting pseudo target vectors generated by a frozen te…

    Submitted 14 March, 2023; v1 submitted 12 December, 2022; originally announced January 2023.

    Comments: Accepted by ICASSP 2023

  40. arXiv:2212.08348  [pdf, other]

    cs.SD eess.AS

    Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation

    Authors: Rongzhi Gu, Shi-Xiong Zhang, Yuexian Zou, Dong Yu

    Abstract: Recently, frequency domain all-neural beamforming methods have achieved remarkable progress for multichannel speech separation. In parallel, the integration of time domain network structure and beamforming also gains significant attention. This study proposes a novel all-neural beamforming method in time domain and makes an attempt to unify the all-neural beamforming pipelines for time domain and…

    Submitted 23 December, 2022; v1 submitted 16 December, 2022; originally announced December 2022.

  41. arXiv:2211.12590  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Deep Neural Mel-Subband Beamformer for In-car Speech Separation

    Authors: Vinay Kothapally, Yong Xu, Meng Yu, Shi-Xiong Zhang, Dong Yu

    Abstract: While current deep learning (DL)-based beamforming techniques have been proved effective in speech separation, they are often designed to process narrow-band (NB) frequencies independently which results in higher computational costs and inference times, making them unsuitable for real-world use. In this paper, we propose DL-based mel-subband spatio-temporal beamformer to perform speech separation…

    Submitted 11 March, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP 2023

  42. arXiv:2211.05910  [pdf, other]

    eess.IV cs.CV

    Efficient and Accurate Quantized Image Super-Resolution on Mobile NPUs, Mobile AI & AIM 2022 challenge: Report

    Authors: Andrey Ignatov, Radu Timofte, Maurizio Denna, Abdel Younes, Ganzorig Gankhuyag, Jingang Huh, Myeong Kyun Kim, Kihwan Yoon, Hyeon-Cheol Moon, Seungho Lee, Yoonsik Choe, Jinwoo Jeong, Sungjei Kim, Maciej Smyl, Tomasz Latkowski, Pawel Kubik, Michal Sokolski, Yujie Ma, Jiahao Chao, Zhou Zhou, Hongfan Gao, Zhengfeng Yang, Zhenbing Zeng, Zhengyang Zhuge, Chenghua Li, et al. (71 additional authors not shown)

    Abstract: Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs having many computational and memory constraints. In this Mobile AI challenge, we address this problem and propose…

    Submitted 7 November, 2022; originally announced November 2022.

    Comments: arXiv admin note: text overlap with arXiv:2105.07825, arXiv:2105.08826, arXiv:2211.04470, arXiv:2211.03885, arXiv:2211.05256

  43. arXiv:2210.07553

    cs.RO cs.LG eess.SY

    Safe Model-Based Reinforcement Learning with an Uncertainty-Aware Reachability Certificate

    Authors: Dongjie Yu, Wenjun Zou, Yujie Yang, Haitong Ma, Shengbo Eben Li, Jingliang Duan, Jianyu Chen

    Abstract: Safe reinforcement learning (RL) that solves constraint-satisfactory policies provides a promising way to the broader safety-critical applications of RL in real-world problems such as robotics. Among all safe RL approaches, model-based methods reduce training time violations further due to their high sample efficiency. However, lacking safety robustness against the model uncertainties remains an i…

    Submitted 14 October, 2022; originally announced October 2022.

    Comments: 12 pages, 6 figures

  44. arXiv:2210.07499

    cs.CL cs.SD eess.AS

    Bayes risk CTC: Controllable CTC alignment in Sequence-to-Sequence tasks

    Authors: Jinchuan Tian, Brian Yan, Jianwei Yu, Chao Weng, Dong Yu, Shinji Watanabe

    Abstract: Sequence-to-Sequence (seq2seq) tasks transcribe the input sequence to a target sequence. The Connectionist Temporal Classification (CTC) criterion is widely used in multiple seq2seq tasks. Besides predicting the target sequence, a side product of CTC is to predict the alignment, which is the most probable input-long sequence that specifies a hard aligning relationship between the input and target…

    Submitted 31 January, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Journal ref: International Conference on Learning Representations (ICLR), 2023

  45. C3-DINO: Joint Contrastive and Non-contrastive Self-Supervised Learning for Speaker Verification

    Authors: Chunlei Zhang, Dong Yu

    Abstract: Self-supervised learning (SSL) has drawn an increased attention in the field of speech processing. Recent studies have demonstrated that contrastive learning is able to learn discriminative speaker embeddings in a self-supervised manner. However, base contrastive self-supervised learning (CSSL) assumes that the pairs generated from a view of anchor instance and any view of other instances are all…

    Submitted 15 August, 2022; originally announced August 2022.

    Comments: Accepted to IEEE Journal of Selected Topics in Signal Processing

  46. arXiv:2207.09983

    cs.SD cs.AI eess.AS

    Diffsound: Discrete Diffusion Model for Text-to-sound Generation

    Authors: Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu

    Abstract: Generating sound effects that humans want is an important topic. However, there are few studies in this area for sound generation. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder. The framework first uses t…

    Submitted 28 April, 2023; v1 submitted 20 July, 2022; originally announced July 2022.

    Comments: Accepted by TASLP2022

  47. arXiv:2206.07956

    cs.SD cs.CL eess.AS

    Automatic Prosody Annotation with Pre-Trained Text-Speech Model

    Authors: Ziqian Dai, Jianwei Yu, Yan Wang, Nuo Chen, Yanyao Bian, Guangzhi Li, Deng Cai, Dong Yu

    Abstract: Prosodic boundary plays an important role in text-to-speech synthesis (TTS) in terms of naturalness and readability. However, the acquisition of prosodic boundary labels relies on manual annotation, which is costly and time-consuming. In this paper, we propose to automatically extract prosodic boundary labels from text-audio data via a neural text-speech model with pre-trained audio encoders. This…

    Submitted 16 June, 2022; originally announced June 2022.

    Comments: Accepted by INTERSPEECH 2022

  48. arXiv:2206.02512

    eess.AS cs.AI cs.CL cs.LG cs.SD

    UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder

    Authors: Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, Dong Yu

    Abstract: In this paper, we propose a novel unsupervised text-to-speech (UTTS) framework which does not require text-audio pairs for the TTS acoustic modeling (AM). UTTS is a multi-speaker speech synthesizer that supports zero-shot voice cloning, it is developed from a perspective of disentangled speech representation learning. The framework offers a flexible choice of a speaker's duration model, timbre fea…

    Submitted 11 October, 2022; v1 submitted 6 June, 2022; originally announced June 2022.

    Comments: Under Review

  49. arXiv:2205.12007

    eess.AS cs.SD

    PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit

    Authors: Hui Zhang, Tian Yuan, Junkun Chen, Xintong Li, Renjie Zheng, Yuxin Huang, Xiaojie Chen, Enlei Gong, Zeyu Chen, Xiaoguang Hu, Dianhai Yu, Yanjun Ma, Liang Huang

    Abstract: PaddleSpeech is an open-source all-in-one speech toolkit. It aims at facilitating the development and research of speech processing technologies by providing an easy-to-use command-line interface and a simple code structure. This paper describes the design philosophy and core architecture of PaddleSpeech to support several essential speech-to-text and text-to-speech tasks. PaddleSpeech achieves co…

    Submitted 20 May, 2022; originally announced May 2022.

  50. arXiv:2205.10401

    eess.AS cs.SD

    NeuralEcho: A Self-Attentive Recurrent Neural Network For Unified Acoustic Echo Suppression And Speech Enhancement

    Authors: Meng Yu, Yong Xu, Chunlei Zhang, Shi-Xiong Zhang, Dong Yu

    Abstract: Acoustic echo cancellation (AEC) plays an important role in the full-duplex speech communication as well as the front-end speech enhancement for recognition in the conditions when the loudspeaker plays back. In this paper, we present an all-deep-learning framework that implicitly estimates the second order statistics of echo/noise and target speech, and jointly solves echo and noise suppression th…

    Submitted 20 May, 2022; originally announced May 2022.

    Comments: Submitted to INTERSPEECH 2022