Showing 1–50 of 74 results for author: Kong, Q

  1. arXiv:2406.11462  [pdf, other]

    cs.MM cs.GR cs.SD eess.AS

    MusicScore: A Dataset for Music Score Modeling and Generation

    Authors: Yuheng Lin, Zheqi Dai, Qiuqiang Kong

    Abstract: Music scores are written representations of music and contain rich information about musical components. The visual information on music scores includes notes, rests, staff lines, clefs, dynamics, and articulations. This visual information in music scores contains more semantic information than audio and symbolic representations of music. Previous music score datasets have limited sizes and are ma…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Dataset paper, dataset link: https://huggingface.co/datasets/ZheqiDAI/MusicScore

  2. arXiv:2406.02233  [pdf, other]

    eess.AS

    Towards Out-of-Distribution Detection in Vocoder Recognition via Latent Feature Reconstruction

    Authors: Renmingyue Du, Jixun Yao, Qiuqiang Kong, Yin Cao

    Abstract: Advancements in synthesized speech have created a growing threat of impersonation, making it crucial to develop deepfake algorithm recognition. One significant aspect is out-of-distribution (OOD) detection, which has gained notable attention due to its important role in deepfake algorithm recognition. However, most of the current approaches for detecting OOD in deepfake algorithm recognition rely…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: 5 pages, 4 figures

  3. arXiv:2403.09527  [pdf, other]

    eess.AS

    WavCraft: Audio Editing and Generation with Large Language Models

    Authors: Jinhua Liang, Huan Zhang, Haohe Liu, Yin Cao, Qiuqiang Kong, Xubo Liu, Wenwu Wang, Mark D. Plumbley, Huy Phan, Emmanouil Benetos

    Abstract: We introduce WavCraft, a collective system that leverages large language models (LLMs) to connect diverse task-specific models for audio content creation and editing. Specifically, WavCraft describes the content of raw audio materials in natural language and prompts the LLM conditioned on audio descriptions and user requests. WavCraft leverages the in-context learning ability of the LLM to decompo…

    Submitted 10 May, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

  4. arXiv:2312.16422  [pdf, other]

    eess.AS cs.SD

    Selective-Memory Meta-Learning with Environment Representations for Sound Event Localization and Detection

    Authors: Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang

    Abstract: Environment shifts and conflicts present significant challenges for learning-based sound event localization and detection (SELD) methods. SELD systems, when trained in particular acoustic settings, often show restricted generalization capabilities for diverse acoustic environments. Furthermore, it is notably costly to obtain annotated samples for spatial sound events. Deploying a SELD system in a…

    Submitted 27 December, 2023; originally announced December 2023.

    Comments: 13 pages, 11 figures

  5. arXiv:2310.10159  [pdf, other]

    cs.SD cs.CL eess.AS

    Joint Music and Language Attention Models for Zero-shot Music Tagging

    Authors: Xingjian Du, Zhesong Yu, Jiaju Lin, Bilei Zhu, Qiuqiang Kong

    Abstract: Music tagging is a task to predict the tags of music recordings. However, previous music tagging research primarily focuses on close-set music tagging tasks which can not be generalized to new tags. In this work, we propose a zero-shot music tagging system modeled by a joint music and language attention (JMLA) model to address the open-set music tagging problem. The JMLA model consists of an audio…

    Submitted 16 October, 2023; originally announced October 2023.

    Comments: Keywords: Music tagging, joint music and language attention models, Music Foundation Model.

  6. arXiv:2310.09853  [pdf, other]

    cs.SD cs.AI cs.LG cs.MM eess.AS

    MERTech: Instrument Playing Technique Detection Using Self-Supervised Pretrained Model With Multi-Task Finetuning

    Authors: Dichucheng Li, Yinghao Ma, Weixing Wei, Qiuqiang Kong, Yulun Wu, Mingjin Che, Fan Xia, Emmanouil Benetos, Wei Li

    Abstract: Instrument playing techniques (IPTs) constitute a pivotal component of musical expression. However, the development of automatic IPT detection methods suffers from limited labeled data and inherent class imbalance issues. In this paper, we propose to apply a self-supervised learning model pre-trained on large-scale unlabeled music data and finetune it on IPT detection tasks. This approach addresse…

    Submitted 15 October, 2023; originally announced October 2023.

    Comments: submitted to ICASSP 2024

  7. arXiv:2310.08950  [pdf, ps, other]

    cs.SD eess.AS

    Transformer-based Autoencoder with ID Constraint for Unsupervised Anomalous Sound Detection

    Authors: Jian Guan, Youde Liu, Qiuqiang Kong, Feiyang Xiao, Qiaoxi Zhu, Jiantong Tian, Wenwu Wang

    Abstract: Unsupervised anomalous sound detection (ASD) aims to detect unknown anomalous sounds of devices when only normal sound data is available. The autoencoder (AE) and self-supervised learning based methods are two mainstream methods. However, the AE-based methods could be limited as the feature learned from normal sounds can also fit with anomalous sounds, reducing the ability of the model in detectin…

    Submitted 13 October, 2023; originally announced October 2023.

    Comments: Accepted by EURASIP Journal on Audio, Speech, and Music Processing

  8. arXiv:2309.02612  [pdf, other]

    cs.SD eess.AS

    Music Source Separation with Band-Split RoPE Transformer

    Authors: Wei-Tsung Lu, Ju-Chiang Wang, Qiuqiang Kong, Yun-Ning Hung

    Abstract: Music source separation (MSS) aims to separate a music recording into multiple musically distinct stems, such as vocals, bass, drums, and more. Recently, deep learning approaches such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used, but the improvement is still limited. In this paper, we propose a novel frequency-domain approach based on a Band-Split RoP…

    Submitted 9 September, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: This paper explains the SAMI-ByteDance MSS system submitted to Sound Demixing Challenge (SDX23) Music Separation Track. Version 2 of paper fixed some typos

  9. arXiv:2308.05734  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS eess.SP

    AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

    Authors: Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley

    Abstract: Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learn…

    Submitted 11 May, 2024; v1 submitted 10 August, 2023; originally announced August 2023.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing. Project page is https://audioldm.github.io/audioldm2

  10. arXiv:2308.05037  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    Separate Anything You Describe

    Authors: Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang

    Abstract: Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instr…

    Submitted 27 October, 2023; v1 submitted 9 August, 2023; originally announced August 2023.

    Comments: Code, benchmark and pre-trained models: https://github.com/Audio-AGI/AudioSep

  11. arXiv:2307.14335  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    WavJourney: Compositional Audio Creation with Large Language Models

    Authors: Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang

    Abstract: Despite breakthroughs in audio generation models, their capabilities are often confined to domain-specific conditions such as speech transcriptions and audio captions. However, real-world audio creation aims to generate harmonious audio containing various elements such as speech, music, and sound effects with controllable conditions, which is challenging to address using existing audio generation…

    Submitted 26 November, 2023; v1 submitted 26 July, 2023; originally announced July 2023.

    Comments: GitHub: https://github.com/Audio-AGI/WavJourney

  12. arXiv:2305.10666  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    A unified front-end framework for English text-to-speech synthesis

    Authors: Zelin Ying, Chen Li, Yu Dong, Qiuqiang Kong, Qiao Tian, Yuanyuan Huo, Yuxuan Wang

    Abstract: The front-end is a critical component of English text-to-speech (TTS) systems, responsible for extracting linguistic features that are essential for a text-to-speech model to synthesize speech, such as prosodies and phonemes. The English TTS front-end typically consists of a text normalization (TN) module, a prosody word prosody phrase (PWPP) module, and a grapheme-to-phoneme (G2P) module. However…

    Submitted 25 March, 2024; v1 submitted 17 May, 2023; originally announced May 2023.

    Comments: Accepted in ICASSP 2024

  13. arXiv:2305.07447  [pdf, other]

    cs.SD eess.AS

    Universal Source Separation with Weakly Labelled Data

    Authors: Qiuqiang Kong, Ke Chen, Haohe Liu, Xingjian Du, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Mark D. Plumbley

    Abstract: Universal source separation (USS) is a fundamental research task for computational auditory scene analysis, which aims to separate mono recordings into individual source tracks. There are three potential challenges awaiting the solution to the audio source separation task. First, previous audio source separation systems mainly focus on separating one or a limited number of specific sources. There…

    Submitted 11 May, 2023; originally announced May 2023.

  14. arXiv:2305.07204  [pdf, other]

    eess.AS cs.SD

    Multi-level Temporal-channel Speaker Retrieval for Zero-shot Voice Conversion

    Authors: Zhichao Wang, Liumeng Xue, Qiuqiang Kong, Lei Xie, Yuanzhe Chen, Qiao Tian, Yuping Wang

    Abstract: Zero-shot voice conversion (VC) converts source speech into the voice of any desired speaker using only one utterance of the speaker without requiring additional model updates. Typical methods use a speaker representation from a pre-trained speaker verification (SV) model or learn speaker representation during VC training to achieve zero-shot VC. However, existing speaker modeling methods overlook…

    Submitted 18 May, 2024; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: Submitted to TASLP

  15. arXiv:2303.17395  [pdf, other]

    eess.AS cs.CL cs.MM cs.SD

    WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

    Authors: Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang

    Abstract: The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years. However, researchers face challenges due to the costly and time-consuming collection process of existing audio-language datasets, which are limited in size. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approx…

    Submitted 30 March, 2023; originally announced March 2023.

    Comments: 12 pages

  16. arXiv:2302.00286  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training

    Authors: Kin Wai Cheuk, Keunwoo Choi, Qiuqiang Kong, Bochen Li, Minz Won, Ju-Chiang Wang, Yun-Ning Hung, Dorien Herremans

    Abstract: In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilize…

    Submitted 1 February, 2023; v1 submitted 1 February, 2023; originally announced February 2023.

    Comments: arXiv admin note: text overlap with arXiv:2206.10805

  17. arXiv:2211.12195  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Ontology-aware Learning and Evaluation for Audio Tagging

    Authors: Haohe Liu, Qiuqiang Kong, Xubo Liu, Xinhao Mei, Wenwu Wang, Mark D. Plumbley

    Abstract: This study defines a new evaluation metric for audio tagging tasks to overcome the limitation of the conventional mean average precision (mAP) metric, which treats different kinds of sound as independent classes without considering their relations. Also, due to the ambiguities in sound labeling, the labels in the training and evaluation set are not guaranteed to be accurate and exhaustive, which p…

    Submitted 22 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023. The code is open-sourced at https://github.com/haoheliu/ontology-aware-audio-tagging

    Journal ref: Proc. Interspeech 2023

  18. arXiv:2211.04258  [pdf, other]

    eess.SP eess.SY

    MetaLoc: Learning to Learn Wireless Localization

    Authors: Jun Gao, Dongze Wu, Feng Yin, Qinglei Kong, Lexi Xu, Shuguang Cui

    Abstract: Existing localization methods that intensively leverage the environment-specific received signal strength (RSS) or channel state information (CSI) of wireless signals are rather accurate in certain environments. However, these methods, whether based on pure statistical signal processing or data-driven approaches, often struggle to generalize to new environments, which results in considerable time…

    Submitted 29 August, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

    Comments: to be published in IEEE JSAC (Special Issue: 5G/6G Precise Positioning on Cooperative Intelligent Transportation Systems (C-ITS) and Connected Automated Vehicles (CAV))

  19. arXiv:2211.02301  [pdf, other]

    cs.SD cs.AI eess.AS

    Binaural Rendering of Ambisonic Signals by Neural Networks

    Authors: Yin Zhu, Qiuqiang Kong, Junjie Shi, Shilei Liu, Xuzhou Ye, Ju-chiang Wang, Junping Zhang

    Abstract: Binaural rendering of ambisonic signals is of broad interest to virtual reality and immersive media. Conventional methods often require manually measured Head-Related Transfer Functions (HRTFs). To address this issue, we collect a paired ambisonic-binaural dataset and propose a deep learning framework in an end-to-end manner. Experimental results show that neural networks outperform the convention…

    Submitted 4 November, 2022; originally announced November 2022.

  20. arXiv:2210.16428  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

    Authors: Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

    Abstract: Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sound…

    Submitted 28 May, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: INTERSPEECH 2023

  21. arXiv:2210.15158  [pdf, other]

    eess.AS cs.SD

    Streaming Voice Conversion Via Intermediate Bottleneck Features And Non-streaming Teacher Guidance

    Authors: Yuanzhe Chen, Ming Tu, Tang Li, Xin Li, Qiuqiang Kong, Jiaxin Li, Zhichao Wang, Qiao Tian, Yuping Wang, Yuxuan Wang

    Abstract: Streaming voice conversion (VC) is the task of converting the voice of one person to another in real-time. Previous streaming VC methods use phonetic posteriorgrams (PPGs) extracted from automatic speech recognition (ASR) systems to represent speaker-independent information. However, PPGs lack the prosody and vocalization information of the source speaker, and streaming PPGs contain undesired leak…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: The paper has been submitted to ICASSP2023

  22. arXiv:2210.12345  [pdf, other]

    cs.SD eess.AS

    Neural Sound Field Decomposition with Super-resolution of Sound Direction

    Authors: Qiuqiang Kong, Shilei Liu, Junjie Shi, Xuzhou Ye, Yin Cao, Qiaoxi Zhu, Yong Xu, Yuxuan Wang

    Abstract: Sound field decomposition predicts waveforms in arbitrary directions using signals from a limited number of microphones as inputs. Sound field decomposition is fundamental to downstream tasks, including source localization, source separation, and spatial audio reproduction. Conventional sound field decomposition methods such as Ambisonics have limited spatial decomposition resolution. This paper p…

    Submitted 22 October, 2022; originally announced October 2022.

    Comments: 12 pages

  23. arXiv:2210.01719  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS eess.SP

    Learning Temporal Resolution in Spectrogram for Audio Classification

    Authors: Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley

    Abstract: The audio spectrogram is a time-frequency representation that has been widely used for audio classification. One of the key attributes of the audio spectrogram is the temporal resolution, which depends on the hop size used in the Short-Time Fourier Transform (STFT). Previous works generally assume the hop size should be a constant value (e.g., 10 ms). However, a fixed temporal resolution is not al…

    Submitted 12 January, 2024; v1 submitted 4 October, 2022; originally announced October 2022.

    Comments: Accepted by the 38th Annual AAAI Conference on Artificial Intelligence

  24. arXiv:2210.00943  [pdf, other]

    eess.AS cs.AI cs.SD eess.SP

    Simple Pooling Front-ends For Efficient Audio Classification

    Authors: Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Mark D. Plumbley, Wenwu Wang

    Abstract: Recently, there has been increasing interest in building efficient audio neural networks for on-device scenarios. Most existing approaches are designed to reduce the size of audio neural networks using methods such as model pruning. In this work, we show that instead of reducing model size using complex methods, eliminating the temporal redundancy in the input audio features (e.g., mel-spectrogram…

    Submitted 6 May, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

    Comments: ICASSP 2023

  25. arXiv:2209.01802  [pdf, other]

    eess.AS cs.SD

    Sound Event Localization and Detection for Real Spatial Sound Scenes: Event-Independent Network and Data Augmentation Chains

    Authors: Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang

    Abstract: Sound event localization and detection (SELD) is a joint task of sound event detection and direction-of-arrival estimation. In DCASE 2022 Task 3, types of data transform from computationally generated spatial recordings to recordings of real-sound scenes. Our system submitted to the DCASE 2022 Task 3 is based on our previous proposed Event-Independent Network V2 (EINV2) with a novel data augmentat…

    Submitted 9 September, 2022; v1 submitted 5 September, 2022; originally announced September 2022.

    Comments: Submitted to DCASE 2022 Workshop. Code is available at https://github.com/Jinbo-Hu/DCASE2022-TASK3

  26. arXiv:2207.10547  [pdf, other]

    cs.SD eess.AS

    Surrey System for DCASE 2022 Task 5: Few-shot Bioacoustic Event Detection with Segment-level Metric Learning

    Authors: Haohe Liu, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley

    Abstract: Few-shot audio event detection is a task that detects the occurrence time of a novel sound class given a few examples. In this work, we propose a system based on segment-level metric learning for the DCASE 2022 challenge of few-shot bioacoustic event detection (task 5). We make better utilization of the negative data within each sound class to build the loss function, and use transductive inferenc…

    Submitted 21 July, 2022; originally announced July 2022.

    Comments: Technical Report of the system that ranks 2nd in the DCASE Challenge Task 5. arXiv admin note: text overlap with arXiv:2207.07773

  27. arXiv:2207.07773  [pdf, other]

    eess.AS cs.AI cs.SD eess.SP

    Segment-level Metric Learning for Few-shot Bioacoustic Event Detection

    Authors: Haohe Liu, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley

    Abstract: Few-shot bioacoustic event detection is a task that detects the occurrence time of a novel sound given a few examples. Previous methods employ metric learning to build a latent space with the labeled part of different sound classes, also known as positive events. In this study, we propose a segment-level few-shot learning framework that utilizes both the positive and negative events during model o…

    Submitted 15 July, 2022; originally announced July 2022.

    Comments: 2nd place in the DCASE 2022 Challenge Task 5. Submitted to the DCASE 2022 workshop

  28. arXiv:2206.10805  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Jointist: Joint Learning for Multi-instrument Transcription and Its Applications

    Authors: Kin Wai Cheuk, Keunwoo Choi, Qiuqiang Kong, Bochen Li, Minz Won, Amy Hung, Ju-Chiang Wang, Dorien Herremans

    Abstract: In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of the instrument recognition module that conditions the other modules: the transcription module that outputs instrument-specific piano rolls, and the source separation module that utiliz…

    Submitted 28 June, 2022; v1 submitted 21 June, 2022; originally announced June 2022.

    Comments: Submitted to ISMIR

  29. VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration

    Authors: Haohe Liu, Xubo Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, Yuxuan Wang

    Abstract: Speech restoration aims to remove distortions in speech signals. Prior methods mainly focus on a single type of distortion, such as speech denoising or dereverberation. However, speech signals can be degraded by several different distortions simultaneously in the real world. It is thus important to extend speech restoration models to deal with multiple distortions. In this paper, we introduce Voic…

    Submitted 17 April, 2022; v1 submitted 12 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

    Journal ref: Proc. Interspeech 2022

  30. arXiv:2203.15147  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD eess.SP

    Separate What You Describe: Language-Queried Audio Source Separation

    Authors: Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

    Abstract: In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed by people laughing"). A unique challenge in LASS is associated with the complexity of natural language description and its relation with the audio sources. To…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022, 5 pages, 3 figures

  31. arXiv:2203.14941  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Neural Vocoder is All You Need for Speech Super-resolution

    Authors: Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, DeLiang Wang

    Abstract: Speech super-resolution (SR) is a task to increase speech sampling rate by generating high-frequency components. Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio. These strong constraints can potentially lead to poor generalization ability in mismatched real-world cases. In this paper, we propose a neural vocoder based speech super-resol…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022

    Journal ref: Proc. Interspeech 2022

  32. arXiv:2203.10228  [pdf, other]

    cs.SD eess.AS

    A Track-Wise Ensemble Event Independent Network for Polyphonic Sound Event Localization and Detection

    Authors: Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang

    Abstract: Polyphonic sound event localization and detection (SELD) aims at detecting types of sound events with corresponding temporal activities and spatial locations. In this paper, a track-wise ensemble event independent network with a novel data augmentation method is proposed. The proposed model is based on our previous proposed Event-Independent Network V2 and is extended by conformer blocks and dense…

    Submitted 18 March, 2022; originally announced March 2022.

    Comments: 6 pages, 2 figures, submitted to IEEE ICASSP 2022

  33. arXiv:2112.04685  [pdf, other]

    cs.SD cs.AI eess.AS

    CWS-PResUNet: Music Source Separation with Channel-wise Subband Phase-aware ResUNet

    Authors: Haohe Liu, Qiuqiang Kong, Jiafeng Liu

    Abstract: Music source separation (MSS) shows active progress with deep learning models in recent years. Many MSS models perform separations on spectrograms by estimating bounded ratio masks and reusing the phases of the mixture. When using convolutional neural networks (CNN), weights are usually shared within a spectrogram during convolution regardless of the different patterns between frequency bands. In…

    Submitted 8 December, 2021; originally announced December 2021.

    Comments: Published at MDX Workshop @ ISMIR 2021

  34. arXiv:2109.13731  [pdf, other]

    cs.SD eess.AS

    VoiceFixer: Toward General Speech Restoration with Neural Vocoder

    Authors: Haohe Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, Yuxuan Wang

    Abstract: Speech restoration aims to remove distortions in speech signals. Prior methods mainly focus on single-task speech restoration (SSR), such as speech denoising or speech declipping. However, SSR systems only focus on one task and do not address the general speech restoration problem. In addition, previous SSR systems show limited performance in some speech restoration tasks such as speech super-reso…

    Submitted 5 October, 2021; v1 submitted 28 September, 2021; originally announced September 2021.

  35. arXiv:2109.05418  [pdf, other]

    cs.SD eess.AS

    Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation

    Authors: Qiuqiang Kong, Yin Cao, Haohe Liu, Keunwoo Choi, Yuxuan Wang

    Abstract: Deep neural network based methods have been successfully applied to music source separation. They typically learn a mapping from a mixture spectrogram to a set of source spectrograms, all with magnitudes only. This approach has several limitations: 1) its incorrect phase reconstruction degrades the performance, 2) it limits the magnitude of masks between 0 and 1 while we observe that 22% of time-f…

    Submitted 11 September, 2021; originally announced September 2021.

    Comments: 6 pages

    Journal ref: International Society for Music Information Retrieval (ISMIR) 2021

  36. arXiv:2108.03456  [pdf, other]

    cs.SD cs.AI eess.AS

    A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis

    Authors: Liwei Lin, Qiuqiang Kong, Junyan Jiang, Gus Xia

    Abstract: We propose a unified model for three inter-related tasks: 1) to separate individual sound sources from a mixed music audio, 2) to transcribe each sound source to MIDI notes, and 3) to synthesize new pieces based on the timbre of separated sources. The model is inspired by the fact that when humans listen to music, our minds can not only separate the sounds of different i…

    Submitted 7 August, 2021; originally announced August 2021.

    Comments: Accepted by ISMIR2021

  37. arXiv:2104.01161  [pdf, ps, other]

    cs.SD eess.AS

    An Audio-Based Deep Learning Framework For BBC Television Programme Classification

    Authors: Lam Pham, Chris Baume, Qiuqiang Kong, Tassadaq Hussain, Wenwu Wang, Mark Plumbley

    Abstract: This paper proposes a deep learning framework for classification of BBC television programmes using audio. The audio is firstly transformed into spectrograms, which are fed into a pre-trained convolutional Neural Network (CNN), obtaining predicted probabilities of sound events occurring in the audio recording. Statistics for the predicted probabilities and detected sound events are then calculated…

    Submitted 11 February, 2022; v1 submitted 2 April, 2021; originally announced April 2021.

  38. arXiv:2103.16149  [pdf, other]

    cs.SD cs.LG eess.AS

    Time-domain Speech Enhancement with Generative Adversarial Learning

    Authors: Feiyang Xiao, Jian Guan, Qiuqiang Kong, Wenwu Wang

    Abstract: Speech enhancement aims to obtain speech signals with high intelligibility and quality from noisy speech. Recent work has demonstrated the excellent performance of time-domain deep learning methods, such as Conv-TasNet. However, these methods can be degraded by the arbitrary scales of the waveform induced by the scale-invariant signal-to-noise ratio (SI-SNR) loss. This paper proposes a new framewo…

    Submitted 19 September, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

  39. arXiv:2103.10134  [pdf, other]

    cs.LG eess.SY

    Recent Advances in Data-Driven Wireless Communication Using Gaussian Processes: A Comprehensive Survey

    Authors: Kai Chen, Qinglei Kong, Yijue Dai, Yue Xu, Feng Yin, Lexi Xu, Shuguang Cui

    Abstract: Data-driven paradigms are well-known and salient demands of future wireless communication. Empowered by big data and machine learning, next-generation data-driven communication systems will be intelligent with the characteristics of expressiveness, scalability, interpretability, and especially uncertainty modeling, which can confidently involve diversified latent demands and personalized services…

    Submitted 31 May, 2021; v1 submitted 18 March, 2021; originally announced March 2021.

  40. arXiv:2102.09971  [pdf, other]

    cs.SD eess.AS

    Speech enhancement with weakly labelled data from AudioSet

    Authors: Qiuqiang Kong, Haohe Liu, Xingjian Du, Li Chen, Rui Xia, Yuxuan Wang

    Abstract: Speech enhancement is a task to improve the intelligibility and perceptual quality of degraded speech signal. Recently, neural networks based methods have been applied to speech enhancement. However, many neural network based methods require noisy and clean speech pairs for training. We propose a speech enhancement framework that can be trained with large-scale weakly labelled AudioSet dataset. We… ▽ More

    Submitted 19 February, 2021; originally announced February 2021.

    Comments: 5 pages

  41. arXiv:2102.09966  [pdf, ps, other

    cs.SD eess.AS

    CatNet: music source separation system with mix-audio augmentation

    Authors: Xuchen Song, Qiuqiang Kong, Xingjian Du, Yuxuan Wang

    Abstract: Music source separation (MSS) is the task of separating a music piece into individual sources, such as vocals and accompaniment. Recently, neural network based methods have been applied to address the MSS problem, and can be categorized into spectrogram and time-domain based methods. However, there is a lack of research of using complementary information of spectrogram and time-domain inputs for M… ▽ More

    Submitted 19 February, 2021; originally announced February 2021.

    Comments: 5 pages

  42. CAA-Net: Conditional Atrous CNNs with Attention for Explainable Device-robust Acoustic Scene Classification

    Authors: Zhao Ren, Qiuqiang Kong, Jing Han, Mark D. Plumbley, Björn W. Schuller

    Abstract: Acoustic Scene Classification (ASC) aims to classify the environment in which the audio signals are recorded. Recently, Convolutional Neural Networks (CNNs) have been successfully applied to ASC. However, the data distributions of the audio signals recorded with multiple devices are different. There has been little research on the training of robust neural networks on acoustic scene datasets recor… ▽ More

    Submitted 18 November, 2020; originally announced November 2020.

    Comments: IEEE Transactions on Multimedia

  43. arXiv:2010.14805  [pdf, other

    cs.SD cs.CV cs.MM eess.AS

    Large-Scale MIDI-based Composer Classification

    Authors: Qiuqiang Kong, Keunwoo Choi, Yuxuan Wang

    Abstract: Music classification is a task to classify a music piece into labels such as genres or composers. We propose large-scale MIDI based composer classification systems using GiantMIDI-Piano, a transcription-based dataset. We propose to use piano rolls, onset rolls, and velocity rolls as input representations and use deep neural networks as classifiers. To our knowledge, we are the first to investigate… ▽ More

    Submitted 28 October, 2020; originally announced October 2020.

  44. arXiv:2010.13092  [pdf, other

    cs.SD eess.AS

    An Improved Event-Independent Network for Polyphonic Sound Event Localization and Detection

    Authors: Yin Cao, Turab Iqbal, Qiuqiang Kong, Fengyan An, Wenwu Wang, Mark D. Plumbley

    Abstract: Polyphonic sound event localization and detection (SELD), which jointly performs sound event detection (SED) and direction-of-arrival (DoA) estimation, detects the type and occurrence time of sound events as well as their corresponding DoA angles simultaneously. We study the SELD task from a multi-task learning perspective. Two open problems are addressed in this paper. Firstly, to detect overlapp… ▽ More

    Submitted 10 February, 2021; v1 submitted 25 October, 2020; originally announced October 2020.

    Comments: 5 pages, 2021 IEEE International Conference on Acoustics, Speech and Signal Processing

  45. arXiv:2010.07061  [pdf, other

    cs.IR cs.SD eess.AS

    GiantMIDI-Piano: A large-scale MIDI dataset for classical piano music

    Authors: Qiuqiang Kong, Bochen Li, Jitong Chen, Yuxuan Wang

    Abstract: Symbolic music datasets are important for music information retrieval and musical analysis. However, there is a lack of large-scale symbolic datasets for classical piano music. In this article, we create a GiantMIDI-Piano (GP) dataset containing 38,700,838 transcribed notes and 10,855 unique solo piano works composed by 2,786 composers. We extract the names of music works and the names of composer… ▽ More

    Submitted 21 April, 2022; v1 submitted 10 October, 2020; originally announced October 2020.

    Comments: 11 pages, 13 figures

  46. arXiv:2010.01815  [pdf, other

    cs.SD eess.AS

    High-resolution Piano Transcription with Pedals by Regressing Onset and Offset Times

    Authors: Qiuqiang Kong, Bochen Li, Xuchen Song, Yuan Wan, Yuxuan Wang

    Abstract: Automatic music transcription (AMT) is the task of transcribing audio recordings into symbolic representations. Recently, neural network-based methods have been applied to AMT, and have achieved state-of-the-art results. However, many previous systems only detect the onset and offset of notes frame-wise, so the transcription resolution is limited to the frame hop size. There is a lack of research… ▽ More

    Submitted 31 July, 2021; v1 submitted 5 October, 2020; originally announced October 2020.

    Comments: 12 pages

  47. arXiv:2010.00140  [pdf, other

    eess.AS cs.SD

    Event-Independent Network for Polyphonic Sound Event Localization and Detection

    Authors: Yin Cao, Turab Iqbal, Qiuqiang Kong, Yue Zhong, Wenwu Wang, Mark D. Plumbley

    Abstract: Polyphonic sound event localization and detection is not only detecting what sound events are happening but localizing corresponding sound sources. This series of tasks was first introduced in DCASE 2019 Task 3. In 2020, the sound event localization and detection task introduces additional challenges in moving sound sources and overlapping-event cases, which include two events of the same type wit… ▽ More

    Submitted 30 September, 2020; originally announced October 2020.

    Comments: conference

  48. arXiv:2007.12864  [pdf, other

    cs.SD cs.LG eess.AS

    DD-CNN: Depthwise Disout Convolutional Neural Network for Low-complexity Acoustic Scene Classification

    Authors: Jingqiao Zhao, Zhen-Hua Feng, Qiuqiang Kong, Xiaoning Song, Xiao-Jun Wu

    Abstract: This paper presents a Depthwise Disout Convolutional Neural Network (DD-CNN) for the detection and classification of urban acoustic scenes. Specifically, we use log-mel as feature representations of acoustic signals for the inputs of our network. In the proposed DD-CNN, depthwise separable convolution is used to reduce the network complexity. Besides, SpecAugment and Disout are used for further pe… ▽ More

    Submitted 25 July, 2020; originally announced July 2020.

  49. arXiv:2007.08165  [pdf, other

    eess.AS cs.LG cs.SD stat.ML

    Audio Tagging by Cross Filtering Noisy Labels

    Authors: Boqing Zhu, Kele Xu, Qiuqiang Kong, Huaimin Wang, Yuxing Peng

    Abstract: High quality labeled datasets have allowed deep learning to achieve impressive results on many sound analysis tasks. Yet, it is labor-intensive to accurately annotate large amount of audio data, and the dataset may contain noisy labels in the practical settings. Meanwhile, the deep neural networks are susceptive to those incorrect labeled data because of their outstanding memorization ability. In… ▽ More

    Submitted 16 July, 2020; originally announced July 2020.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing

  50. arXiv:2003.03697  [pdf, other

    cs.DC cs.LG eess.SP eess.SY stat.AP

    FedLoc: Federated Learning Framework for Data-Driven Cooperative Localization and Location Data Processing

    Authors: Feng Yin, Zhidi Lin, Yue Xu, Qinglei Kong, Deshi Li, Sergios Theodoridis, Shuguang Cui

    Abstract: In this overview paper, data-driven learning model-based cooperative localization and location data processing are considered, in line with the emerging machine learning and big data methods. We first review (1) state-of-the-art algorithms in the context of federated learning, (2) two widely used learning models, namely the deep neural network model and the Gaussian process model, and (3) various… ▽ More

    Submitted 25 May, 2020; v1 submitted 7 March, 2020; originally announced March 2020.