Showing 1–23 of 23 results for author: Dinkel, H

Searching in archive eess.
  1. arXiv:2406.13275  [pdf, other]

    cs.SD cs.CL eess.AS

    Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

    Authors: Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

    Abstract: Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED)…

    Submitted 25 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  2. arXiv:2406.07012  [pdf, other]

    cs.SD cs.CL eess.AS

    Bridging Language Gaps in Audio-Text Retrieval

    Authors: Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang

    Abstract: Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions poses a limitation on the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose a language enhancement (LE), using a multi…

    Submitted 16 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024

  3. arXiv:2406.06992  [pdf, other]

    cs.SD eess.AS

    Scaling up masked audio encoder learning for general audio classification

    Authors: Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang

    Abstract: Despite progress in audio classification, a generalization gap remains between speech and other sound domains, such as environmental sounds and music. Models trained for speech tasks often fail to perform well on environmental or musical audio tasks, and vice versa. While self-supervised (SSL) audio representations offer an alternative, there has been limited exploration of scaling both model and…

    Submitted 13 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024

  4. arXiv:2308.11957  [pdf, other]

    cs.SD eess.AS

    CED: Consistent ensemble distillation for audio tagging

    Authors: Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang

    Abstract: Augmentation and knowledge distillation (KD) are well-established techniques employed in audio classification tasks, aimed at enhancing performance and reducing model sizes on the widely recognized Audioset (AS) benchmark. Although both techniques are effective individually, their combined use, called consistent teaching, hasn't been explored before. This paper proposes CED, a simple training fram…

    Submitted 7 September, 2023; v1 submitted 23 August, 2023; originally announced August 2023.

  5. arXiv:2306.16241  [pdf, other]

    cs.SD eess.AS

    Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

    Authors: Jiuxin Lin, Peng Wang, Heinrich Dinkel, Jun Chen, Zhiyong Wu, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang

    Abstract: Previously, Target Speaker Extraction (TSE) has yielded outstanding performance in certain application scenarios for speech enhancement and source separation. However, obtaining auxiliary speaker-related information is still challenging in noisy environments with significant reverberation. Inspired by the recently proposed distance-based sound separation, we propose the near sound (NS) extractor,…

    Submitted 7 October, 2023; v1 submitted 28 June, 2023; originally announced June 2023.

    Comments: Proc. INTERSPEECH 2023, 2488-2492, doi: 10.21437/Interspeech.2023-218

  6. arXiv:2306.14170  [pdf, other]

    cs.MM cs.SD eess.AS

    AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction

    Authors: Jiuxin Lin, Xinyu Cai, Heinrich Dinkel, Jun Chen, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Zhiyong Wu, Yujun Wang, Helen Meng

    Abstract: Visual information can serve as an effective cue for target speaker extraction (TSE) and is vital to improving extraction performance. In this paper, we propose AV-SepFormer, a SepFormer-based attention dual-scale model that utilizes cross- and self-attention to fuse and model features from audio and visual. AV-SepFormer splits the audio feature into a number of chunks, equivalent to the length of…

    Submitted 25 June, 2023; originally announced June 2023.

    Comments: Accepted by ICASSP2023

  7. arXiv:2305.18794  [pdf, other]

    cs.SD eess.AS

    Understanding temporally weakly supervised training: A case study for keyword spotting

    Authors: Heinrich Dinkel, Weiji Zhuang, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang

    Abstract: The currently most prominent algorithm to train keyword spotting (KWS) models with deep neural networks (DNNs) requires strong supervision i.e., precise knowledge of the spoken keyword location in time. Thus, most KWS approaches treat the presence of redundant data, such as noise, within their training set as an obstacle. A common training paradigm to deal with data redundancies is to use temporal…

    Submitted 30 May, 2023; originally announced May 2023.

  8. arXiv:2305.17834  [pdf, other]

    cs.SD eess.AS

    Streaming Audio Transformers for Online Audio Tagging

    Authors: Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang

    Abstract: Transformers have emerged as a prominent model framework for audio tagging (AT), boasting state-of-the-art (SOTA) performance on the widely-used Audioset dataset. However, their impressive performance often comes at the cost of high memory usage, slow inference speed, and considerable model delay, rendering them impractical for real-world AT applications. In this study, we introduce streaming audi…

    Submitted 10 June, 2024; v1 submitted 28 May, 2023; originally announced May 2023.

    Comments: Interspeech2024

  9. arXiv:2303.01812  [pdf, other]

    cs.SD eess.AS

    Unified Keyword Spotting and Audio Tagging on Mobile Devices with Transformers

    Authors: Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang

    Abstract: Keyword spotting (KWS) is a core human-machine-interaction front-end task for most modern intelligent assistants. Recently, a unified (UniKW-AT) framework has been proposed that adds additional capabilities in the form of audio tagging (AT) to a KWS model. However, previous work did not consider the real-world deployment of a UniKW-AT model, where factors such as model size and inference speed are…

    Submitted 3 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023

  10. An empirical study of weakly supervised audio tagging embeddings for general audio representations

    Authors: Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang

    Abstract: We study the usability of pre-trained weakly supervised audio tagging (AT) models as feature extractors for general audio representations. We mainly analyze the feasibility of transferring those embeddings to other tasks within the speech and sound domains. Specifically, we benchmark weakly supervised pre-trained models (MobileNetV2 and EfficientNet-B0) against modern self-supervised learning meth…

    Submitted 29 September, 2022; originally announced September 2022.

    Comments: Odyssey 2022

  11. UniKW-AT: Unified Keyword Spotting and Audio Tagging

    Authors: Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang

    Abstract: Within the audio research community and the industry, keyword spotting (KWS) and audio tagging (AT) are seen as two distinct tasks and research fields. However, from a technical point of view, both of these tasks are identical: they predict a label (keyword in KWS, sound event in AT) for some fixed-sized input audio segment. This work proposes UniKW-AT: An initial approach for jointly training bot…

    Submitted 22 September, 2022; originally announced September 2022.

    Comments: Accepted in Interspeech2022

  12. arXiv:2205.14340  [pdf, other]

    cs.RO eess.SY

    Insights from an Industrial Collaborative Assembly Project: Lessons in Research and Collaboration

    Authors: Tan Chen, Zhe Huang, James Motes, Junyi Geng, Quang Minh Ta, Holly Dinkel, Hameed Abdul-Rashid, Jessica Myers, Ye-Ji Mun, Wei-che Lin, Yuan-yung Huang, Sizhe Liu, Marco Morales, Nancy M. Amato, Katherine Driggs-Campbell, Timothy Bretl

    Abstract: Significant progress in robotics reveals new opportunities to advance manufacturing. Next-generation industrial automation will require both integration of distinct robotic technologies and their application to challenging industrial environments. This paper presents lessons from a collaborative assembly project between three academic research groups and an industry partner. The goal of the projec…

    Submitted 28 May, 2022; originally announced May 2022.

    Comments: Spotlight presentation at ICRA 2022 Workshop on Collaborative Robots and the Work of the Future (ICRA 2022 CoR-WotF); see the spotlight presentation at https://sites.google.com/view/icra22ws-cor-wotf/accepted-papers?authuser=0

  13. arXiv:2204.13430  [pdf, other]

    cs.SD eess.AS

    Pseudo strong labels for large scale weakly supervised audio tagging

    Authors: Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang

    Abstract: Large-scale audio tagging datasets inevitably contain imperfect labels, such as clip-wise annotated (temporally weak) tags with no exact on- and offsets, due to a high manual labeling cost. This work proposes pseudo strong labels (PSL), a simple label augmentation framework that enhances the supervision quality for large-scale weakly supervised audio tagging. A machine annotator is first trained o…

    Submitted 28 April, 2022; originally announced April 2022.

    Comments: Accepted by ICASSP 2022

  14. Voice activity detection in the wild: A data-driven approach using teacher-student training

    Authors: Heinrich Dinkel, Shuai Wang, Xuenan Xu, Mengyue Wu, Kai Yu

    Abstract: Voice activity detection is an essential pre-processing component for speech-related tasks such as automatic speech recognition (ASR). Traditional supervised VAD systems obtain frame-level labels from an ASR pipeline by using, e.g., a Hidden Markov model. These ASR models are commonly trained on clean and fully transcribed data, limiting VAD systems to be trained on clean or synthetically noised d…

    Submitted 9 May, 2021; originally announced May 2021.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1542-1555, 2021

  15. arXiv:2102.11474  [pdf, other]

    cs.SD eess.AS

    Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events

    Authors: Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu

    Abstract: Automated Audio Captioning is a cross-modal task, generating natural language descriptions to summarize the audio clips' sound events. However, grounding the actual sound events in the given audio based on its corresponding caption has not been investigated. This paper contributes an AudioGrounding dataset, which provides the correspondence between sound events and the captions provided in Audioca…

    Submitted 22 February, 2021; originally announced February 2021.

  16. arXiv:2102.11457  [pdf, other]

    cs.SD eess.AS

    Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning

    Authors: Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Zeyu Xie, Kai Yu

    Abstract: Automated audio captioning (AAC) aims at generating summarizing descriptions for audio clips. Multitudinous concepts are described in an audio caption, ranging from local information such as sound events to global information like acoustic scenery. Currently, the mainstream paradigm for AAC is the end-to-end encoder-decoder architecture, expecting the encoder to learn all levels of concepts embedd…

    Submitted 22 February, 2021; originally announced February 2021.

  17. Towards duration robust weakly supervised sound event detection

    Authors: Heinrich Dinkel, Mengyue Wu, Kai Yu

    Abstract: Sound event detection (SED) is the task of tagging the absence or presence of audio events and their corresponding interval within a given audio clip. While SED can be done using supervised machine learning, where training data is fully labeled with access to per event timestamps and duration, our work focuses on weakly-supervised sound event detection (WSSED), where prior knowledge about an event…

    Submitted 4 February, 2021; v1 submitted 19 January, 2021; originally announced January 2021.

  18. End-to-end spoofing detection with raw waveform CLDNNs

    Authors: Heinrich Dinkel, Nanxin Chen, Yanmin Qian, Kai Yu

    Abstract: Albeit recent progress in speaker verification generates powerful models, malicious attacks in the form of spoofed speech, are generally not coped with. Recent results in ASVSpoof2015 and BTAS2016 challenges indicate that spoof-aware features are a possible solution to this problem. Most successful methods in both challenges focus on spoof-aware features, rather than focusing on a powerful classif…

    Submitted 26 July, 2020; originally announced July 2020.

    Comments: 5 pages

    Journal ref: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  19. arXiv:2003.12222  [pdf, other]

    cs.SD eess.AS

    Voice activity detection in the wild via weakly supervised sound event detection

    Authors: Heinrich Dinkel, Yefei Chen, Mengyue Wu, Kai Yu

    Abstract: Traditional supervised voice activity detection (VAD) methods work well in clean and controlled scenarios, with performance severely degrading in real-world applications. One possible bottleneck is that speech in the wild contains unpredictable noise types, hence frame-level label prediction is difficult, which is required for traditional supervised VAD training. In contrast, we propose a general-…

    Submitted 16 August, 2020; v1 submitted 26 March, 2020; originally announced March 2020.

    Comments: Accepted in Interspeech 2020

  20. arXiv:1910.13028  [pdf, other]

    cs.HC cs.SD eess.AS

    DEPA: Self-Supervised Audio Embedding for Depression Detection

    Authors: Pingyue Zhang, Mengyue Wu, Heinrich Dinkel, Kai Yu

    Abstract: Depression detection research has increased over the last few decades, one major bottleneck of which is the limited data availability and representation learning. Recently, self-supervised learning has seen success in pretraining text embeddings and has been applied broadly on related tasks with sparse data, while pretrained audio embeddings based on self-supervised learning are rarely investigate…

    Submitted 28 October, 2021; v1 submitted 28 October, 2019; originally announced October 2019.

    Journal ref: In Proceedings of the 29th ACM International Conference on Multimedia. Association for Computing Machinery, New York, NY, USA, 2021

  21. arXiv:1905.13448  [pdf, other]

    cs.SD cs.CL eess.AS

    Audio Caption in a Car Setting with a Sentence-Level Loss

    Authors: Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu

    Abstract: Captioning has attracted much attention in image and video understanding while a small amount of work examines audio captioning. This paper contributes a Mandarin-annotated dataset for audio captioning within a car scene. A sentence-level loss is proposed to be used in tandem with a GRU encoder-decoder model to generate captions with higher semantic similarity to human annotations. We evaluate the…

    Submitted 23 October, 2020; v1 submitted 31 May, 2019; originally announced May 2019.

  22. Duration robust weakly supervised sound event detection

    Authors: Heinrich Dinkel, Kai Yu

    Abstract: Task 4 of the DCASE2018 challenge demonstrated that substantially more research is needed for a real-world application of sound event detection. Analyzing the challenge results it can be seen that most successful models are biased towards predicting long (e.g., over 5s) clips. This work aims to investigate the performance impact of fixed-sized window median filter post-processing and advocate the…

    Submitted 26 January, 2020; v1 submitted 8 April, 2019; originally announced April 2019.

    Comments: Accepted by ICASSP2020

  23. arXiv:1902.09254  [pdf, other]

    cs.SD cs.CL eess.AS

    Audio Caption: Listen and Tell

    Authors: Mengyue Wu, Heinrich Dinkel, Kai Yu

    Abstract: Increasing amount of research has shed light on machine perception of audio events, most of which concerns detection and classification tasks. However, human-like perception of audio scenes involves not only detecting and classifying audio sounds, but also summarizing the relationship between different audio events. Comparable research such as image caption has been conducted, yet the audio field…

    Submitted 30 May, 2019; v1 submitted 25 February, 2019; originally announced February 2019.

    Comments: Accepted by ICASSP 2019