Showing 1–50 of 79 results for author: Plumbley, M D

Searching in archive eess.
  1. arXiv:2405.00233  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS eess.SP

    SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

    Authors: Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, Mark D. Plumbley

    Abstract: Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data. However, traditional codecs often operate at high bitrates or within narrow domains such as speech and lack the semantic clues required for efficient language modelling. Addressing these chal…

    Submitted 30 April, 2024; originally announced May 2024.

    Comments: Demo and code: https://haoheliu.github.io/SemantiCodec/

  2. arXiv:2404.17806  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining

    Authors: Yi Yuan, Zhuo Chen, Xubo Liu, Haohe Liu, Xuenan Xu, Dongya Jia, Yuanzhe Chen, Mark D. Plumbley, Wenwu Wang

    Abstract: Contrastive language-audio pretraining (CLAP) has been developed to align the representations of audio and language, achieving remarkable performance in retrieval and classification tasks. However, current CLAP struggles to capture temporal information within audio and text features, presenting substantial limitations for tasks such as audio retrieval and generation. To address this gap, we introd…

    Submitted 27 April, 2024; originally announced April 2024.

    Comments: Preprint submitted to IEEE MLSP 2024

  3. arXiv:2403.09527  [pdf, other]

    eess.AS

    WavCraft: Audio Editing and Generation with Large Language Models

    Authors: Jinhua Liang, Huan Zhang, Haohe Liu, Yin Cao, Qiuqiang Kong, Xubo Liu, Wenwu Wang, Mark D. Plumbley, Huy Phan, Emmanouil Benetos

    Abstract: We introduce WavCraft, a collective system that leverages large language models (LLMs) to connect diverse task-specific models for audio content creation and editing. Specifically, WavCraft describes the content of raw audio materials in natural language and prompts the LLM conditioned on audio descriptions and user requests. WavCraft leverages the in-context learning ability of the LLM to decompo…

    Submitted 10 May, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

  4. arXiv:2402.02694  [pdf, other]

    eess.AS cs.LG cs.SD

    Description on IEEE ICME 2024 Grand Challenge: Semi-supervised Acoustic Scene Classification under Domain Shift

    Authors: Jisheng Bai, Mou Wang, Haohe Liu, Han Yin, Yafei Jia, Siwei Huang, Yutong Du, Dongzhe Zhang, Dongyuan Shi, Woon-Seng Gan, Mark D. Plumbley, Susanto Rahardja, Bin Xiang, Jianfeng Chen

    Abstract: Acoustic scene classification (ASC) is a crucial research problem in computational auditory scene analysis, and it aims to recognize the unique acoustic characteristics of an environment. One of the challenges of the ASC task is the domain shift between training and testing data. Since 2018, ASC challenges have focused on the generalization of ASC models across different recording devices. Althoug…

    Submitted 28 February, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

  5. arXiv:2312.16422  [pdf, other]

    eess.AS cs.SD

    Selective-Memory Meta-Learning with Environment Representations for Sound Event Localization and Detection

    Authors: Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang

    Abstract: Environment shifts and conflicts present significant challenges for learning-based sound event localization and detection (SELD) methods. SELD systems, when trained in particular acoustic settings, often show restricted generalization capabilities for diverse acoustic environments. Furthermore, it is notably costly to obtain annotated samples for spatial sound events. Deploying a SELD system in a…

    Submitted 27 December, 2023; originally announced December 2023.

    Comments: 13 pages, 11 figures

  6. arXiv:2312.00249  [pdf, other]

    eess.AS

    Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities

    Authors: Jinhua Liang, Xubo Liu, Wenwu Wang, Mark D. Plumbley, Huy Phan, Emmanouil Benetos

    Abstract: The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown their promise in solving a wide variety of vision and language understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capacity. In this work, we…

    Submitted 30 November, 2023; originally announced December 2023.

  7. arXiv:2309.08051  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    Retrieval-Augmented Text-to-Audio Generation

    Authors: Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

    Abstract: Despite recent progress in text-to-audio (TTA) generation, we show that the state-of-the-art models, such as AudioLDM, trained on datasets with an imbalanced class distribution, such as AudioCaps, are biased in their generation performance. Specifically, they excel in generating common audio classes while underperforming in the rare ones, thus degrading the overall generation performance. We refer…

    Submitted 5 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Accepted by ICASSP 2024

  8. arXiv:2309.07314  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS eess.SP

    AudioSR: Versatile Audio Super-resolution at Scale

    Authors: Haohe Liu, Ke Chen, Qiao Tian, Wenwu Wang, Mark D. Plumbley

    Abstract: Audio super-resolution is a fundamental task that predicts high-frequency components for low-resolution audio, enhancing audio quality in digital applications. Previous methods have limitations such as the limited scope of audio types (e.g., music, speech) and specific bandwidth settings they can handle (e.g., 4kHz to 8kHz). In this paper, we introduce a diffusion-based generative model, AudioSR,…

    Submitted 13 September, 2023; originally announced September 2023.

    Comments: Under review. Demo and code: https://audioldm.github.io/audiosr

  9. arXiv:2308.08847  [pdf, other]

    eess.AS cs.SD

    META-SELD: Meta-Learning for Fast Adaptation to the new environment in Sound Event Localization and Detection

    Authors: Jinbo Hu, Yin Cao, Ming Wu, Feiran Yang, Ziying Yu, Wenwu Wang, Mark D. Plumbley, Jun Yang

    Abstract: For learning-based sound event localization and detection (SELD) methods, different acoustic environments in the training and test sets may result in large performance differences in the validation and evaluation stages. Different environments, such as different sizes of rooms, different reverberation times, and different background noise, may be reasons for a learning-based system to fail. On the…

    Submitted 17 August, 2023; originally announced August 2023.

    Comments: Submitted to DCASE 2023 Workshop

  10. arXiv:2308.05734  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS eess.SP

    AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

    Authors: Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley

    Abstract: Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learn…

    Submitted 11 May, 2024; v1 submitted 10 August, 2023; originally announced August 2023.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing. Project page is https://audioldm.github.io/audioldm2

  11. arXiv:2308.05037  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    Separate Anything You Describe

    Authors: Xubo Liu, Qiuqiang Kong, Yan Zhao, Haohe Liu, Yi Yuan, Yuzhuo Liu, Rui Xia, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang

    Abstract: Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instr…

    Submitted 27 October, 2023; v1 submitted 9 August, 2023; originally announced August 2023.

    Comments: Code, benchmark and pre-trained models: https://github.com/Audio-AGI/AudioSep

  12. arXiv:2307.14335  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    WavJourney: Compositional Audio Creation with Large Language Models

    Authors: Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang

    Abstract: Despite breakthroughs in audio generation models, their capabilities are often confined to domain-specific conditions such as speech transcriptions and audio captions. However, real-world audio creation aims to generate harmonious audio containing various elements such as speech, music, and sound effects with controllable conditions, which is challenging to address using existing audio generation…

    Submitted 26 November, 2023; v1 submitted 26 July, 2023; originally announced July 2023.

    Comments: GitHub: https://github.com/Audio-AGI/WavJourney

  13. arXiv:2306.10359  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Text-Driven Foley Sound Generation With Latent Diffusion Model

    Authors: Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Peipei Wu, Mark D. Plumbley, Wenwu Wang

    Abstract: Foley sound generation aims to synthesise the background sound for multimedia content. Previous models usually employ a large development set with labels as input (e.g., single numbers or one-hot vector). In this work, we propose a diffusion model based system for Foley sound generation with text conditions. To alleviate the data scarcity issue, our model is initially pre-trained with large-scale…

    Submitted 18 September, 2023; v1 submitted 17 June, 2023; originally announced June 2023.

    Comments: Submit to DCASE-workshop 2023, an extension and supersedes the previous technical report arXiv:2305.15905

  14. arXiv:2306.09106  [pdf, other]

    cs.SD cs.AI eess.AS eess.SY

    Audio Tagging on an Embedded Hardware Platform

    Authors: Gabriel Bibbo, Arshdeep Singh, Mark D. Plumbley

    Abstract: Convolutional neural networks (CNNs) have exhibited state-of-the-art performance in various audio classification tasks. However, their real-time deployment remains a challenge on resource-constrained devices like embedded systems. In this paper, we analyze how the performance of large-scale pretrained audio neural networks designed for audio pattern recognition changes when deployed on a hardware…

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: Submitted to DCASE 2023 Workshop

  15. arXiv:2305.18753  [pdf, other]

    eess.AS cs.SD

    Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning

    Authors: Jianyuan Sun, Xubo Liu, Xinhao Mei, Volkan Kılıç, Mark D. Plumbley, Wenwu Wang

    Abstract: Automated audio captioning (AAC) which generates textual descriptions of audio content. Existing AAC models achieve good results but only use the high-dimensional representation of the encoder. There is always insufficient information learning of high-dimensional methods owing to high-dimensional representations having a large amount of information. In this paper, a new encoder-decoder model calle…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: INTERSPEECH 2023. arXiv admin note: substantial text overlap with arXiv:2210.05037

  16. arXiv:2305.18665  [pdf, other]

    cs.SD cs.AI eess.AS eess.SP

    E-PANNs: Sound Recognition Using Efficient Pre-trained Audio Neural Networks

    Authors: Arshdeep Singh, Haohe Liu, Mark D. Plumbley

    Abstract: Sounds carry an abundance of information about activities and events in our everyday environment, such as traffic noise, road works, music, or people talking. Recent machine learning methods, such as convolutional neural networks (CNNs), have been shown to be able to automatically recognize sound activities, a task known as audio tagging. One such method, pre-trained audio neural networks (PANNs),…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: Accepted in Internoise 2023 conference

  17. arXiv:2305.17719  [pdf, other]

    eess.AS cs.SD

    Adapting Language-Audio Models as Few-Shot Audio Learners

    Authors: Jinhua Liang, Xubo Liu, Haohe Liu, Huy Phan, Emmanouil Benetos, Mark D. Plumbley, Wenwu Wang

    Abstract: We presented the Treff adapter, a training-efficient adapter for CLAP, to boost zero-shot classification performance by making use of a small set of labelled data. Specifically, we designed CALM to retrieve the probability distribution of text-audio clips over classes using a set of audio-label pairs and combined it with CLAP's zero-shot classification results. Furthermore, we designed a training-…

    Submitted 28 May, 2023; originally announced May 2023.

  18. arXiv:2305.15905  [pdf, other]

    cs.SD cs.MM eess.AS

    Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7

    Authors: Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Mark D. Plumbley, Wenwu Wang

    Abstract: Foley sound presents the background sound for multimedia content and the generation of Foley sound involves computationally modelling sound effects with specialized techniques. In this work, we proposed a system for DCASE 2023 challenge task 7: Foley Sound Synthesis. The proposed system is based on AudioLDM, which is a diffusion-based text-to-audio generation model. To alleviate the data-hungry pr…

    Submitted 15 September, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: DCASE 2023 task 7 technical report, ranked 1st in the challenge

  19. arXiv:2305.07447  [pdf, other]

    cs.SD eess.AS

    Universal Source Separation with Weakly Labelled Data

    Authors: Qiuqiang Kong, Ke Chen, Haohe Liu, Xingjian Du, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Mark D. Plumbley

    Abstract: Universal source separation (USS) is a fundamental research task for computational auditory scene analysis, which aims to separate mono recordings into individual source tracks. There are three potential challenges awaiting the solution to the audio source separation task. First, previous audio source separation systems mainly focus on separating one or a limited number of specific sources. There…

    Submitted 11 May, 2023; originally announced May 2023.

  20. arXiv:2305.03391  [pdf, other]

    cs.SD cs.AI eess.AS eess.SP

    Compressing audio CNNs with graph centrality based filter pruning

    Authors: James A King, Arshdeep Singh, Mark D. Plumbley

    Abstract: Convolutional neural networks (CNNs) are commonplace in high-performing solutions to many real-world problems, such as audio classification. CNNs have many parameters and filters, with some having a larger impact on the performance than others. This means that networks may contain many unnecessary filters, increasing a CNN's computation and memory requirements while providing limited performance b…

    Submitted 5 May, 2023; originally announced May 2023.

  21. arXiv:2304.02319  [pdf, other]

    cs.LG cs.AI cs.CV eess.SP

    Efficient CNNs via Passive Filter Pruning

    Authors: Arshdeep Singh, Mark D. Plumbley

    Abstract: Convolutional neural networks (CNNs) have shown state-of-the-art performance in various applications. However, CNNs are resource-hungry due to their requirement of high computational complexity and memory storage. Recent efforts toward achieving computational efficiency in CNNs involve filter pruning methods that eliminate some of the filters in CNNs based on the "importance" of the filter…

    Submitted 5 April, 2023; originally announced April 2023.

  22. arXiv:2303.17395  [pdf, other]

    eess.AS cs.CL cs.MM cs.SD

    WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

    Authors: Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang

    Abstract: The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years. However, researchers face challenges due to the costly and time-consuming collection process of existing audio-language datasets, which are limited in size. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approx…

    Submitted 30 March, 2023; originally announced March 2023.

    Comments: 12 pages

  23. arXiv:2303.03857  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    Leveraging Pre-trained AudioLDM for Text to Sound Generation: A Benchmark Study

    Authors: Yi Yuan, Haohe Liu, Jinhua Liang, Xubo Liu, Mark D. Plumbley, Wenwu Wang

    Abstract: Deep neural networks have recently achieved breakthroughs in sound generation with text prompts. Despite their promising performance, current text-to-sound generation models face issues on small-scale datasets (e.g., overfitting), significantly limiting their performance. In this paper, we investigate the use of pre-trained AudioLDM, the state-of-the-art model for text-to-audio generation, as the…

    Submitted 11 March, 2023; v1 submitted 7 March, 2023; originally announced March 2023.

    Comments: EUSIPCO 2023

  24. arXiv:2301.12503  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS eess.SP

    AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

    Authors: Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley

    Abstract: Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLA…

    Submitted 9 September, 2023; v1 submitted 29 January, 2023; originally announced January 2023.

    Comments: Accepted by ICML 2023. Demo and implementation at https://audioldm.github.io. Evaluation toolbox at https://github.com/haoheliu/audioldm_eval

  25. arXiv:2212.02033  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    Towards Generating Diverse Audio Captions via Adversarial Training

    Authors: Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang

    Abstract: Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences. This task has attracted increasing attention and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of audio clips, however, these machine-generated captions are often deterministic (e.g.…

    Submitted 28 June, 2024; v1 submitted 5 December, 2022; originally announced December 2022.

    Comments: Accepted to TASLP

  26. arXiv:2211.13189  [pdf, other]

    cs.SD cs.CV eess.AS

    ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification

    Authors: Sara Atito, Muhammad Awais, Wenwu Wang, Mark D Plumbley, Josef Kittler

    Abstract: Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by the data hungry nature of transformers and the limited amount of labelled data, most transformer-based models for audio tasks are finetuned from ImageNet…

    Submitted 10 March, 2024; v1 submitted 23 November, 2022; originally announced November 2022.

  27. arXiv:2211.12195  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Ontology-aware Learning and Evaluation for Audio Tagging

    Authors: Haohe Liu, Qiuqiang Kong, Xubo Liu, Xinhao Mei, Wenwu Wang, Mark D. Plumbley

    Abstract: This study defines a new evaluation metric for audio tagging tasks to overcome the limitation of the conventional mean average precision (mAP) metric, which treats different kinds of sound as independent classes without considering their relations. Also, due to the ambiguities in sound labeling, the labels in the training and evaluation set are not guaranteed to be accurate and exhaustive, which p…

    Submitted 22 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023. The code is open-sourced at https://github.com/haoheliu/ontology-aware-audio-tagging

    Journal ref: Proc. Interspeech 2023

  28. arXiv:2210.17416  [pdf, other]

    cs.CV cs.AI cs.LG cs.SD eess.AS

    Efficient Similarity-based Passive Filter Pruning for Compressing CNNs

    Authors: Arshdeep Singh, Mark D. Plumbley

    Abstract: Convolution neural networks (CNNs) have shown great success in various applications. However, the computational complexity and memory storage of CNNs is a bottleneck for their deployment on resource-constrained devices. Recent efforts towards reducing the computation cost and the memory overhead of CNNs involve similarity-based passive filter pruning methods. Similarity-based passive filter prunin…

    Submitted 27 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  29. arXiv:2210.16428  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

    Authors: Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

    Abstract: Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sound…

    Submitted 28 May, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: INTERSPEECH 2023

  30. arXiv:2210.05037  [pdf, other]

    cs.SD cs.LG eess.AS

    Automated Audio Captioning via Fusion of Low- and High- Dimensional Features

    Authors: Jianyuan Sun, Xubo Liu, Xinhao Mei, Mark D. Plumbley, Volkan Kilic, Wenwu Wang

    Abstract: Automated audio captioning (AAC) aims to describe the content of an audio clip using simple sentences. Existing AAC methods are developed based on an encoder-decoder architecture that success is attributed to the use of a pre-trained CNN10 called PANNs as the encoder to learn rich audio representations. AAC is a highly challenging task due to its high-dimensional talent space involves audio of var…

    Submitted 10 October, 2022; originally announced October 2022.

  31. arXiv:2210.01719  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS eess.SP

    Learning Temporal Resolution in Spectrogram for Audio Classification

    Authors: Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley

    Abstract: The audio spectrogram is a time-frequency representation that has been widely used for audio classification. One of the key attributes of the audio spectrogram is the temporal resolution, which depends on the hop size used in the Short-Time Fourier Transform (STFT). Previous works generally assume the hop size should be a constant value (e.g., 10 ms). However, a fixed temporal resolution is not al…

    Submitted 12 January, 2024; v1 submitted 4 October, 2022; originally announced October 2022.

    Comments: Accepted by the 38th Annual AAAI Conference on Artificial Intelligence

  32. arXiv:2210.00943  [pdf, other]

    eess.AS cs.AI cs.SD eess.SP

    Simple Pooling Front-ends For Efficient Audio Classification

    Authors: Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Mark D. Plumbley, Wenwu Wang

    Abstract: Recently, there has been increasing interest in building efficient audio neural networks for on-device scenarios. Most existing approaches are designed to reduce the size of audio neural networks using methods such as model pruning. In this work, we show that instead of reducing model size using complex methods, eliminating the temporal redundancy in the input audio features (e.g., mel-spectrogram…

    Submitted 6 May, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

    Comments: ICASSP 2023

  33. arXiv:2209.01802  [pdf, other]

    eess.AS cs.SD

    Sound Event Localization and Detection for Real Spatial Sound Scenes: Event-Independent Network and Data Augmentation Chains

    Authors: Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang

    Abstract: Sound event localization and detection (SELD) is a joint task of sound event detection and direction-of-arrival estimation. In DCASE 2022 Task 3, types of data transform from computationally generated spatial recordings to recordings of real-sound scenes. Our system submitted to the DCASE 2022 Task 3 is based on our previous proposed Event-Independent Network V2 (EINV2) with a novel data augmentat…

    Submitted 9 September, 2022; v1 submitted 5 September, 2022; originally announced September 2022.

    Comments: Submitted to DCASE 2022 Workshop. Code is available at https://github.com/Jinbo-Hu/DCASE2022-TASK3

  34. arXiv:2208.01555  [pdf, other]

    eess.AS cs.LG cs.SD

    Low-complexity CNNs for Acoustic Scene Classification

    Authors: Arshdeep Singh, James A King, Xubo Liu, Wenwu Wang, Mark D. Plumbley

    Abstract: This technical report describes the SurreyAudioTeam22's submission for DCASE 2022 ASC Task 1, Low-Complexity Acoustic Scene Classification (ASC). The task has two rules, (a) the ASC framework should have maximum 128K parameters, and (b) there should be a maximum of 30 millions multiply-accumulate operations (MACs) per inference. In this report, we present low-complexity systems for ASC that follow…

    Submitted 2 August, 2022; originally announced August 2022.

    Comments: Technical Report DCASE 2022 TASK 1. arXiv admin note: substantial text overlap with arXiv:2207.11529

  35. arXiv:2207.11529  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    Low-complexity CNNs for Acoustic Scene Classification

    Authors: Arshdeep Singh, Mark D. Plumbley

    Abstract: This paper presents a low-complexity framework for acoustic scene classification (ASC). Most of the frameworks designed for ASC use convolutional neural networks (CNNs) due to their learning ability and improved performance compared to hand-engineered features. However, CNNs are resource hungry due to their large size and high computational complexity. Therefore, CNNs are difficult to deploy on re…

    Submitted 23 July, 2022; originally announced July 2022.

    Comments: Submitted to DCASE 2022 Workshop

  36. arXiv:2207.10547  [pdf, other]

    cs.SD eess.AS

    Surrey System for DCASE 2022 Task 5: Few-shot Bioacoustic Event Detection with Segment-level Metric Learning

    Authors: Haohe Liu, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley

    Abstract: Few-shot audio event detection is a task that detects the occurrence time of a novel sound class given a few examples. In this work, we propose a system based on segment-level metric learning for the DCASE 2022 challenge of few-shot bioacoustic event detection (task 5). We make better utilization of the negative data within each sound class to build the loss function, and use transductive inferenc…

    Submitted 21 July, 2022; originally announced July 2022.

    Comments: Technical Report of the system that ranks 2nd in the DCASE Challenge Task 5. arXiv admin note: text overlap with arXiv:2207.07773

  37. arXiv:2207.07773  [pdf, other]

    eess.AS cs.AI cs.SD eess.SP

    Segment-level Metric Learning for Few-shot Bioacoustic Event Detection

    Authors: Haohe Liu, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley

    Abstract: Few-shot bioacoustic event detection is a task that detects the occurrence time of a novel sound given a few examples. Previous methods employ metric learning to build a latent space with the labeled part of different sound classes, also known as positive events. In this study, we propose a segment-level few-shot learning framework that utilizes both the positive and negative events during model o…

    Submitted 15 July, 2022; originally announced July 2022.

    Comments: 2nd place in the DCASE 2022 Challenge Task 5. Submitted to the DCASE 2022 workshop

  38. arXiv:2207.07429  [pdf, other]

    cs.SD cs.AI eess.AS

    Continual Learning For On-Device Environmental Sound Classification

    Authors: Yang Xiao, Xubo Liu, James King, Arshdeep Singh, Eng Siong Chng, Mark D. Plumbley, Wenwu Wang

    Abstract: Continuously learning new classes without catastrophic forgetting is a challenging problem for on-device environmental sound classification given the restrictions on computation resources (e.g., model size, running memory). To address this issue, we propose a simple and efficient continual learning method. Our method selects the historical data for the training by measuring the per-sample classifi…

    Submitted 18 July, 2022; v1 submitted 15 July, 2022; originally announced July 2022.

    Comments: The first two authors contributed equally, 5 pages one figure, submitted to DCASE2022 Workshop

  39. arXiv:2205.05949  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    Automated Audio Captioning: An Overview of Recent Progress and New Challenges

    Authors: Xinhao Mei, Xubo Liu, Mark D. Plumbley, Wenwu Wang

    Abstract: Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. This task has received increasing attention with the release of freely available datasets in recent years. The problem has been addressed predominantly with deep learning techniques. Numerous approaches have been proposed, such as investigating different neural ne…

    Submitted 26 September, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

    Comments: Accepted by EURASIP Journal on Audio Speech and Music Processing

  40. arXiv:2203.15751  [pdf, other]

    eess.AS cs.AI cs.CC cs.LG

    A Passive Similarity based CNN Filter Pruning for Efficient Acoustic Scene Classification

    Authors: Arshdeep Singh, Mark D. Plumbley

    Abstract: We present a method to develop low-complexity convolutional neural networks (CNNs) for acoustic scene classification (ASC). The large size and high computational complexity of typical CNNs is a bottleneck for their deployment on resource-constrained devices. We propose a passive filter pruning framework, where a few convolutional filters from the CNNs are eliminated to yield compressed CNNs. Our h…

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: Submitted to Interspeech 2022 conference

  41. arXiv:2203.15537  [pdf, ps, other]

    eess.AS cs.SD

    On Metric Learning for Audio-Text Cross-Modal Retrieval

    Authors: Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang

    Abstract: Audio-text retrieval aims at retrieving a target audio clip or caption from a pool of candidates given a query in another modality. Solving such cross-modal retrieval task is challenging because it not only requires learning robust feature representations for both modalities, but also requires capturing the fine-grained alignment between these two modalities. Existing cross-modal retrieval models…

    Submitted 30 June, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: 5 pages, accepted to InterSpeech2022

  42. arXiv:2203.15147  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD eess.SP

    Separate What You Describe: Language-Queried Audio Source Separation

    Authors: Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

    Abstract: In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed by people laughing"). A unique challenge in LASS is associated with the complexity of natural language description and its relation with the audio sources. To…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022, 5 pages, 3 figures

  43. arXiv:2203.10228  [pdf, other]

    cs.SD eess.AS

    A Track-Wise Ensemble Event Independent Network for Polyphonic Sound Event Localization and Detection

    Authors: Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang

    Abstract: Polyphonic sound event localization and detection (SELD) aims at detecting types of sound events with corresponding temporal activities and spatial locations. In this paper, a track-wise ensemble event independent network with a novel data augmentation method is proposed. The proposed model is based on our previous proposed Event-Independent Network V2 and is extended by conformer blocks and dense…

    Submitted 18 March, 2022; originally announced March 2022.

    Comments: 6 pages, 2 figures, submitted to IEEE ICASSP 2022

  44. arXiv:2203.03436  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    Deep Neural Decision Forest for Acoustic Scene Classification

    Authors: Jianyuan Sun, Xubo Liu, Xinhao Mei, Jinzheng Zhao, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

    Abstract: Acoustic scene classification (ASC) aims to classify an audio clip based on the characteristic of the recording environment. In this regard, deep learning based approaches have emerged as a useful tool for ASC problems. Conventional approaches to improving the classification accuracy include integrating auxiliary methods such as attention mechanism, pre-trained models and ensemble multiple sub-net…

    Submitted 7 March, 2022; originally announced March 2022.

    Comments: Submitted to the 30th European Signal Processing Conference (EUSIPCO), 5 pages, 2 figures

  45. arXiv:2203.02838  [pdf, other]

    eess.AS cs.AI cs.SD

    Leveraging Pre-trained BERT for Audio Captioning

    Authors: Xubo Liu, Xinhao Mei, Qiushi Huang, Jianyuan Sun, Jinzheng Zhao, Haohe Liu, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

    Abstract: Audio captioning aims at using natural language to describe the content of an audio clip. Existing audio captioning systems are generally based on an encoder-decoder architecture, in which acoustic information is extracted by an audio encoder and then a language decoder is used to generate the captions. Training an audio captioning system often encounters the problem of data scarcity. Transferring…

    Submitted 27 March, 2022; v1 submitted 5 March, 2022; originally announced March 2022.

    Comments: Submitted to the 30th European Signal Processing Conference (EUSIPCO), 5 pages, 2 figures

  46. arXiv:2110.06691  [pdf, other]

    eess.AS cs.SD

    Diverse Audio Captioning via Adversarial Training

    Authors: Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang

    Abstract: Audio captioning aims at generating natural language descriptions for audio clips automatically. Existing audio captioning models have shown promising improvement in recent years. However, these models are mostly trained via maximum likelihood estimation (MLE), which tends to make captions generic, simple and deterministic. As different people may describe an audio clip from different aspects using…

    Submitted 29 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: 5 pages, 1 figure, accepted by ICASSP 2022

  47. arXiv:2109.09227  [pdf, other]

    cs.SD cs.LG eess.AS

    ARCA23K: An audio dataset for investigating open-set label noise

    Authors: Turab Iqbal, Yin Cao, Andrew Bailey, Mark D. Plumbley, Wenwu Wang

    Abstract: The availability of audio data on sound sharing platforms such as Freesound gives users access to large amounts of annotated audio. Utilising such data for training is becoming increasingly popular, but the problem of label noise that is often prevalent in such datasets requires further investigation. This paper introduces ARCA23K, an Automatically Retrieved and Curated Audio dataset comprised of…

    Submitted 27 February, 2022; v1 submitted 19 September, 2021; originally announced September 2021.

    Comments: Accepted to the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021)

  48. arXiv:2108.02752  [pdf, other]

    eess.AS cs.SD

    An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning

    Authors: Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H Lilian Tang, Xi Shao, Mark D. Plumbley, Wenwu Wang

    Abstract: Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features extracted by the encoder. To improve the proposed system, transfer learning from either an upstream audio-related task or a large in-domain dataset is introduced t…

    Submitted 5 August, 2021; originally announced August 2021.

    Comments: 5 pages, 1 figure, submitted to DCASE 2021 workshop

  49. arXiv:2107.10880  [pdf, other]

    cs.SD eess.AS stat.CO

    Using UMAP to Inspect Audio Data for Unsupervised Anomaly Detection under Domain-Shift Conditions

    Authors: Andres Fernandez, Mark D. Plumbley

    Abstract: The goal of Unsupervised Anomaly Detection (UAD) is to detect anomalous signals under the condition that only non-anomalous (normal) data is available beforehand. In UAD under Domain-Shift Conditions (UAD-S), data is further exposed to contextual changes that are usually unknown beforehand. Motivated by the difficulties encountered in the UAD-S task presented at the 2021 edition of the Detection a…

    Submitted 15 October, 2021; v1 submitted 22 July, 2021; originally announced July 2021.

    Comments: Accepted at the DCASE2021 Workshop

  50. arXiv:2107.09998  [pdf, other]

    eess.AS cs.AI cs.SD

    Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learning

    Authors: Xubo Liu, Turab Iqbal, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

    Abstract: Deep generative models have recently achieved impressive performance in speech and music synthesis. However, compared to the generation of those domain-specific sounds, generating general sounds (such as siren, gunshots) has received less attention, despite their wide applications. In previous work, the SampleRNN method was considered for sound generation in the time domain. However, SampleRNN is…

    Submitted 6 October, 2021; v1 submitted 21 July, 2021; originally announced July 2021.

    Comments: Accepted by IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP) 2021, 6 pages, 1 figure