Showing 1–29 of 29 results for author: Mei, X

  1. arXiv:2402.01034  [pdf]

    eess.IV cs.CV

    VISION-MAE: A Foundation Model for Medical Image Segmentation and Classification

    Authors: Zelong Liu, Andrew Tieu, Nikhil Patel, Alexander Zhou, George Soultanidis, Zahi A. Fayad, Timothy Deyer, Xueyan Mei

    Abstract: Artificial Intelligence (AI) has the potential to revolutionize diagnosis and segmentation in medical imaging. However, development and clinical implementation face multiple challenges including limited data availability, lack of generalizability, and the necessity to incorporate multi-modal data effectively. A foundation model, which is a large-scale pre-trained AI model, offers a versatile base…

    Submitted 1 February, 2024; originally announced February 2024.

  2. arXiv:2402.01031  [pdf]

    eess.IV cs.CV

    MRAnnotator: A Multi-Anatomy Deep Learning Model for MRI Segmentation

    Authors: Alexander Zhou, Zelong Liu, Andrew Tieu, Nikhil Patel, Sean Sun, Anthony Yang, Peter Choi, Valentin Fauveau, George Soultanidis, Mingqian Huang, Amish Doshi, Zahi A. Fayad, Timothy Deyer, Xueyan Mei

    Abstract: Purpose: To develop a deep learning model for multi-anatomy and many-class segmentation of diverse anatomic structures on MRI. Materials and Methods: In this retrospective study, two datasets were curated and annotated for model development and evaluation. An internal dataset of 1022 MRI sequences from various clinical sites within a health system and an external dataset of 264 MRI sequenc…

    Submitted 1 February, 2024; originally announced February 2024.

  3. arXiv:2312.05953  [pdf]

    eess.IV cs.CV cs.LG

    RadImageGAN -- A Multi-modal Dataset-Scale Generative AI for Medical Imaging

    Authors: Zelong Liu, Alexander Zhou, Arnold Yang, Alara Yilmaz, Maxwell Yoo, Mikey Sullivan, Catherine Zhang, James Grant, Daiqing Li, Zahi A. Fayad, Sean Huver, Timothy Deyer, Xueyan Mei

    Abstract: Deep learning in medical imaging often requires large-scale, high-quality data or initiation with suitably pre-trained weights. However, medical datasets are limited by data availability, domain-specific knowledge, and privacy concerns, and the creation of large and diverse radiologic databases like RadImageNet is highly resource-intensive. To address these limitations, we introduce RadImageGAN, t…

    Submitted 10 December, 2023; originally announced December 2023.

  4. arXiv:2310.14173  [pdf, other]

    cs.SD eess.AS

    First-Shot Unsupervised Anomalous Sound Detection With Unknown Anomalies Estimated by Metadata-Assisted Audio Generation

    Authors: Hejing Zhang, Qiaoxi Zhu, Jian Guan, Haohe Liu, Feiyang Xiao, Jiantong Tian, Xinhao Mei, Xubo Liu, Wenwu Wang

    Abstract: First-shot (FS) unsupervised anomalous sound detection (ASD) is a new task introduced in DCASE 2023 Challenge Task 2, where the anomalous sounds for the target machine types are unseen in training. Existing methods often rely on the availability of normal and abnormal sound data from the target machines. However, due to the lack of anomalous sound data for the target machine types, it become…

    Submitted 11 March, 2024; v1 submitted 22 October, 2023; originally announced October 2023.

    Comments: Accepted at ICASSP 2024

  5. arXiv:2309.10537  [pdf, other]

    eess.AS cs.MM cs.SD

    FoleyGen: Visually-Guided Audio Generation

    Authors: Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra

    Abstract: Recent advancements in audio generation have been spurred by the evolution of large-scale deep learning models and expansive datasets. However, the task of video-to-audio (V2A) generation continues to be a challenge, principally because of the intricate relationship between the high-dimensional visual and auditory data, and the challenges associated with temporal synchronization. In this study, we…

    Submitted 19 September, 2023; originally announced September 2023.

  6. arXiv:2309.08773  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Enhance audio generation controllability through representation similarity regularization

    Authors: Yangyang Shi, Gael Le Lan, Varun Nagaraja, Zhaoheng Ni, Xinhao Mei, Ernie Chang, Forrest Iandola, Yang Liu, Vikas Chandra

    Abstract: This paper presents an innovative approach to enhance control over audio generation by emphasizing the alignment between audio and text representations during model training. In the context of language model-based audio generation, the model leverages input from both textual and audio token representations to predict subsequent audio tokens. However, the current configuration lacks explicit regula…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: 5 pages

  7. arXiv:2308.05734  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS eess.SP

    AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

    Authors: Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley

    Abstract: Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learn…

    Submitted 11 May, 2024; v1 submitted 10 August, 2023; originally announced August 2023.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing. Project page is https://audioldm.github.io/audioldm2

  8. arXiv:2307.15208  [pdf, other]

    eess.IV cs.CV

    Generative AI for Medical Imaging: extending the MONAI Framework

    Authors: Walter H. L. Pinaya, Mark S. Graham, Eric Kerfoot, Petru-Daniel Tudosiu, Jessica Dafflon, Virginia Fernandez, Pedro Sanchez, Julia Wolleb, Pedro F. da Costa, Ashay Patel, Hyungjin Chung, Can Zhao, Wei Peng, Zelong Liu, Xueyan Mei, Oeslle Lucena, Jong Chul Ye, Sotirios A. Tsaftaris, Prerna Dogra, Andrew Feng, Marc Modat, Parashkev Nachev, Sebastien Ourselin, M. Jorge Cardoso

    Abstract: Recent advances in generative AI have brought incredible breakthroughs in several areas, including medical imaging. These generative models have tremendous potential not only to help safely share medical data via synthetic datasets but also to perform an array of diverse applications, such as anomaly detection, image-to-image translation, denoising, and MRI reconstruction. However, due to the comp…

    Submitted 27 July, 2023; originally announced July 2023.

  9. arXiv:2305.18753  [pdf, other]

    eess.AS cs.SD

    Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning

    Authors: Jianyuan Sun, Xubo Liu, Xinhao Mei, Volkan Kılıç, Mark D. Plumbley, Wenwu Wang

    Abstract: Automated audio captioning (AAC) generates textual descriptions of audio content. Existing AAC models achieve good results but only use the high-dimensional representation of the encoder. High-dimensional methods often learn insufficient information because high-dimensional representations contain a large amount of information. In this paper, a new encoder-decoder model calle…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: INTERSPEECH 2023. arXiv admin note: substantial text overlap with arXiv:2210.05037

  10. arXiv:2303.17395  [pdf, other]

    eess.AS cs.CL cs.MM cs.SD

    WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

    Authors: Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang

    Abstract: The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years. However, researchers face challenges due to the costly and time-consuming collection process of existing audio-language datasets, which are limited in size. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approx…

    Submitted 30 March, 2023; originally announced March 2023.

    Comments: 12 pages

  11. arXiv:2301.12503  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS eess.SP

    AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

    Authors: Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley

    Abstract: Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLA…

    Submitted 9 September, 2023; v1 submitted 29 January, 2023; originally announced January 2023.

    Comments: Accepted by ICML 2023. Demo and implementation at https://audioldm.github.io. Evaluation toolbox at https://github.com/haoheliu/audioldm_eval

  12. arXiv:2212.02033  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    Towards Generating Diverse Audio Captions via Adversarial Training

    Authors: Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang

    Abstract: Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences. This task has attracted increasing attention and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of audio clips; however, these machine-generated captions are often deterministic (e.g.…

    Submitted 28 June, 2024; v1 submitted 5 December, 2022; originally announced December 2022.

    Comments: Accepted to TASLP

  13. arXiv:2211.12195  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Ontology-aware Learning and Evaluation for Audio Tagging

    Authors: Haohe Liu, Qiuqiang Kong, Xubo Liu, Xinhao Mei, Wenwu Wang, Mark D. Plumbley

    Abstract: This study defines a new evaluation metric for audio tagging tasks to overcome the limitation of the conventional mean average precision (mAP) metric, which treats different kinds of sound as independent classes without considering their relations. Also, due to the ambiguities in sound labeling, the labels in the training and evaluation set are not guaranteed to be accurate and exhaustive, which p…

    Submitted 22 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023. The code is open-sourced at https://github.com/haoheliu/ontology-aware-audio-tagging

    Journal ref: Proc. Interspeech 2023

  14. arXiv:2210.16428  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

    Authors: Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

    Abstract: Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sound…

    Submitted 28 May, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: INTERSPEECH 2023

  15. arXiv:2210.05037  [pdf, other]

    cs.SD cs.LG eess.AS

    Automated Audio Captioning via Fusion of Low- and High- Dimensional Features

    Authors: Jianyuan Sun, Xubo Liu, Xinhao Mei, Mark D. Plumbley, Volkan Kilic, Wenwu Wang

    Abstract: Automated audio captioning (AAC) aims to describe the content of an audio clip using simple sentences. Existing AAC methods are based on an encoder-decoder architecture whose success is attributed to the use of a pre-trained CNN10 called PANNs as the encoder to learn rich audio representations. AAC is a highly challenging task due to its high-dimensional latent space involves audio of var…

    Submitted 10 October, 2022; originally announced October 2022.

  16. arXiv:2210.00943  [pdf, other]

    eess.AS cs.AI cs.SD eess.SP

    Simple Pooling Front-ends For Efficient Audio Classification

    Authors: Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Mark D. Plumbley, Wenwu Wang

    Abstract: Recently, there has been increasing interest in building efficient audio neural networks for on-device scenarios. Most existing approaches are designed to reduce the size of audio neural networks using methods such as model pruning. In this work, we show that instead of reducing model size using complex methods, eliminating the temporal redundancy in the input audio features (e.g., mel-spectrogram…

    Submitted 6 May, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

    Comments: ICASSP 2023

  17. arXiv:2207.10547  [pdf, other]

    cs.SD eess.AS

    Surrey System for DCASE 2022 Task 5: Few-shot Bioacoustic Event Detection with Segment-level Metric Learning

    Authors: Haohe Liu, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley

    Abstract: Few-shot audio event detection is a task that detects the occurrence time of a novel sound class given a few examples. In this work, we propose a system based on segment-level metric learning for the DCASE 2022 challenge of few-shot bioacoustic event detection (task 5). We make better utilization of the negative data within each sound class to build the loss function, and use transductive inferenc…

    Submitted 21 July, 2022; originally announced July 2022.

    Comments: Technical Report of the system that ranks 2nd in the DCASE Challenge Task 5. arXiv admin note: text overlap with arXiv:2207.07773

  18. arXiv:2207.07773  [pdf, other]

    eess.AS cs.AI cs.SD eess.SP

    Segment-level Metric Learning for Few-shot Bioacoustic Event Detection

    Authors: Haohe Liu, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley

    Abstract: Few-shot bioacoustic event detection is a task that detects the occurrence time of a novel sound given a few examples. Previous methods employ metric learning to build a latent space with the labeled part of different sound classes, also known as positive events. In this study, we propose a segment-level few-shot learning framework that utilizes both the positive and negative events during model o…

    Submitted 15 July, 2022; originally announced July 2022.

    Comments: 2nd place in the DCASE 2022 Challenge Task 5. Submitted to the DCASE 2022 workshop

  19. arXiv:2205.05949  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    Automated Audio Captioning: An Overview of Recent Progress and New Challenges

    Authors: Xinhao Mei, Xubo Liu, Mark D. Plumbley, Wenwu Wang

    Abstract: Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. This task has received increasing attention with the release of freely available datasets in recent years. The problem has been addressed predominantly with deep learning techniques. Numerous approaches have been proposed, such as investigating different neural ne…

    Submitted 26 September, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

    Comments: Accepted by EURASIP Journal on Audio Speech and Music Processing

  20. arXiv:2203.15537  [pdf, ps, other]

    eess.AS cs.SD

    On Metric Learning for Audio-Text Cross-Modal Retrieval

    Authors: Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang

    Abstract: Audio-text retrieval aims at retrieving a target audio clip or caption from a pool of candidates given a query in another modality. Solving such a cross-modal retrieval task is challenging because it not only requires learning robust feature representations for both modalities, but also requires capturing the fine-grained alignment between these two modalities. Existing cross-modal retrieval models…

    Submitted 30 June, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: 5 pages, accepted to InterSpeech2022

  21. arXiv:2203.15147  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD eess.SP

    Separate What You Describe: Language-Queried Audio Source Separation

    Authors: Xubo Liu, Haohe Liu, Qiuqiang Kong, Xinhao Mei, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

    Abstract: In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed by people laughing"). A unique challenge in LASS is associated with the complexity of natural language description and its relation with the audio sources. To…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022, 5 pages, 3 figures

  22. arXiv:2203.03436  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    Deep Neural Decision Forest for Acoustic Scene Classification

    Authors: Jianyuan Sun, Xubo Liu, Xinhao Mei, Jinzheng Zhao, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

    Abstract: Acoustic scene classification (ASC) aims to classify an audio clip based on the characteristic of the recording environment. In this regard, deep learning based approaches have emerged as a useful tool for ASC problems. Conventional approaches to improving the classification accuracy include integrating auxiliary methods such as attention mechanism, pre-trained models and ensemble multiple sub-net…

    Submitted 7 March, 2022; originally announced March 2022.

    Comments: Submitted to the 30th European Signal Processing Conference (EUSIPCO), 5 pages, 2 figures

  23. arXiv:2203.02838  [pdf, other]

    eess.AS cs.AI cs.SD

    Leveraging Pre-trained BERT for Audio Captioning

    Authors: Xubo Liu, Xinhao Mei, Qiushi Huang, Jianyuan Sun, Jinzheng Zhao, Haohe Liu, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

    Abstract: Audio captioning aims at using natural language to describe the content of an audio clip. Existing audio captioning systems are generally based on an encoder-decoder architecture, in which acoustic information is extracted by an audio encoder and then a language decoder is used to generate the captions. Training an audio captioning system often encounters the problem of data scarcity. Transferring…

    Submitted 27 March, 2022; v1 submitted 5 March, 2022; originally announced March 2022.

    Comments: Submitted to the 30th European Signal Processing Conference (EUSIPCO), 5 pages, 2 figures

  24. arXiv:2110.06691  [pdf, other]

    eess.AS cs.SD

    Diverse Audio Captioning via Adversarial Training

    Authors: Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang

    Abstract: Audio captioning aims at generating natural language descriptions for audio clips automatically. Existing audio captioning models have shown promising improvement in recent years. However, these models are mostly trained via maximum likelihood estimation (MLE), which tends to make captions generic, simple and deterministic. As different people may describe an audio clip from different aspects using…

    Submitted 29 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: 5 pages, 1 figure, accepted by ICASSP 2022

  25. arXiv:2108.02752  [pdf, other]

    eess.AS cs.SD

    An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning

    Authors: Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H Lilian Tang, Xi Shao, Mark D. Plumbley, Wenwu Wang

    Abstract: Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features extracted by the encoder. To improve the proposed system, transfer learning from either an upstream audio-related task or a large in-domain dataset is introduced t…

    Submitted 5 August, 2021; originally announced August 2021.

    Comments: 5 pages, 1 figure, submitted to DCASE 2021 workshop

  26. arXiv:2107.09990  [pdf, other]

    eess.AS cs.AI cs.SD

    CL4AC: A Contrastive Loss for Audio Captioning

    Authors: Xubo Liu, Qiushi Huang, Xinhao Mei, Tom Ko, H Lilian Tang, Mark D. Plumbley, Wenwu Wang

    Abstract: Automated audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip. As shown in the submissions received for Task 6 of the DCASE 2021 Challenges, this problem has received increasing interest in the community. The existing AAC systems are usually based on an encoder-decoder architecture, where the audio signal is encoded in…

    Submitted 22 November, 2021; v1 submitted 21 July, 2021; originally announced July 2021.

    Comments: The first two authors contributed equally, 5 pages, 3 figures, accepted by DCASE2021 Workshop

  27. arXiv:2107.09817  [pdf, other]

    eess.AS cs.LG cs.SD

    Audio Captioning Transformer

    Authors: Xinhao Mei, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang

    Abstract: Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio encoder. However, CNNs can be limited in modelling…

    Submitted 20 July, 2021; originally announced July 2021.

    Comments: 5 pages, 1 figure

  28. arXiv:2001.07712  [pdf, other]

    eess.IV cs.LG stat.ML

    SMAPGAN: Generative Adversarial Network Based Semi-Supervised Styled Map Tiles Generating Method

    Authors: X. Chen, S. Chen, T. Xu, B. Yin, X. Mei, J. Peng, H. Li

    Abstract: Traditional online map tiles, widely used on the Internet in services such as Google Map and Baidu Map, are rendered from vector data. Updating online map tiles promptly from vector data is difficult because generating the vector data is time-consuming. A shortcut is to generate map tiles directly from remote sensing images, which can be acquired promptly without vector data. However, this mission used to be c…

    Submitted 1 April, 2021; v1 submitted 20 January, 2020; originally announced January 2020.

    Comments: in IEEE Transactions on Geoscience and Remote Sensing

  29. arXiv:2001.01377  [pdf, other]

    cs.RO cs.LG eess.SY

    High-speed Autonomous Drifting with Deep Reinforcement Learning

    Authors: Peide Cai, Xiaodong Mei, Lei Tai, Yuxiang Sun, Ming Liu

    Abstract: Drifting is a complicated task for autonomous vehicle control. Most traditional methods in this area are based on motion equations derived from an understanding of vehicle dynamics, which are difficult to model precisely. We propose a robust drift controller without explicit motion equations, which is based on the latest model-free deep reinforcement learning algorithm soft actor-critic. The dr…

    Submitted 5 January, 2020; originally announced January 2020.