Showing 1–10 of 10 results for author: Berghi, D

  1. arXiv:2406.16058  [pdf, other]

    eess.AS

    Text-Queried Target Sound Event Localization

    Authors: Jinzheng Zhao, Xinyuan Qian, Yong Xu, Haohe Liu, Yin Cao, Davide Berghi, Wenwu Wang

    Abstract: Sound event localization and detection (SELD) aims to determine the appearance of sound classes, together with their Direction of Arrival (DOA). However, current SELD systems can only predict the activities of specific classes, for example, 13 classes in DCASE challenges. In this paper, we propose text-queried target sound event localization (SEL), a new paradigm that allows the user to input the…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: Accepted by EUSIPCO 2024

  2. arXiv:2406.00495  [pdf, other]

    eess.AS cs.CV cs.SD

    Audio-Visual Talker Localization in Video for Spatial Sound Reproduction

    Authors: Davide Berghi, Philip J. B. Jackson

    Abstract: Object-based audio production requires the positional metadata to be defined for each point-source object, including the key elements in the foreground of the sound scene. In many media production use cases, both cameras and microphones are employed to make recordings, and the human voice is often a key element. In this research, we detect and locate the active speaker in the video, facilitating t…

    Submitted 1 June, 2024; originally announced June 2024.

  3. arXiv:2312.14021  [pdf, other]

    eess.AS cs.LG cs.SD eess.IV eess.SP

    Leveraging Visual Supervision for Array-based Active Speaker Detection and Localization

    Authors: Davide Berghi, Philip J. B. Jackson

    Abstract: Conventional audio-visual approaches for active speaker detection (ASD) typically rely on visually pre-extracted face tracks and the corresponding single-channel audio to find the speaker in a video. Therefore, they tend to fail every time the face of the speaker is not visible. We demonstrate that a simple audio convolutional recurrent neural network (CRNN) trained with spatial input features ext…

    Submitted 21 December, 2023; originally announced December 2023.

  4. arXiv:2312.09034  [pdf, other]

    eess.AS cs.SD eess.IV

    Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection

    Authors: Davide Berghi, Peipei Wu, Jinzheng Zhao, Wenwu Wang, Philip J. B. Jackson

    Abstract: Sound event localization and detection (SELD) combines two subtasks: sound event detection (SED) and direction of arrival (DOA) estimation. SELD is usually tackled as an audio-only problem, but visual information has been recently included. Few audio-visual (AV)-SELD works have been published and most employ vision via face/object bounding boxes, or human pose keypoints. In contrast, we explore th…

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  5. arXiv:2310.14778  [pdf, other]

    cs.MM cs.SD eess.AS

    Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions

    Authors: Jinzheng Zhao, Yong Xu, Xinyuan Qian, Davide Berghi, Peipei Wu, Meng Cui, Jianyuan Sun, Philip J. B. Jackson, Wenwu Wang

    Abstract: Audio-visual speaker tracking has drawn increasing attention over the past few years due to its academic values and wide application. Audio and visual modalities can provide complementary information for localization and tracking. With audio and visual information, the Bayesian-based filter can solve the problem of data association, audio-visual fusion and track management. In this paper, we condu…

    Submitted 17 December, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

  6. arXiv:2307.14739  [pdf, other]

    eess.AS eess.SP

    Audio Inputs for Active Speaker Detection and Localization via Microphone Array

    Authors: Davide Berghi, Philip J. B. Jackson

    Abstract: This study considers the problem of detecting and locating an active talker's horizontal position from multichannel audio captured by a microphone array. We refer to this as active speaker detection and localization (ASDL). Our goal was to investigate the performance of spatial acoustic features extracted from the multichannel audio as the input of a convolutional recurrent neural network (CRNN),…

    Submitted 27 July, 2023; originally announced July 2023.

  7. arXiv:2212.01892  [pdf, other]

    eess.AS cs.MM cs.SD

    Tragic Talkers: A Shakespearean Sound- and Light-Field Dataset for Audio-Visual Machine Learning Research

    Authors: Davide Berghi, Marco Volino, Philip J. B. Jackson

    Abstract: 3D audio-visual production aims to deliver immersive and interactive experiences to the consumer. Yet, faithfully reproducing real-world 3D scenes remains a challenging task. This is partly due to the lack of available datasets enabling audio-visual research in this direction. In most of the existing multi-view datasets, the accompanying audio is neglected. Similarly, datasets for spatial audio re…

    Submitted 4 December, 2022; originally announced December 2022.

  8. arXiv:2203.03291  [pdf, other]

    eess.AS cs.SD eess.IV

    Visually Supervised Speaker Detection and Localization via Microphone Array

    Authors: Davide Berghi, Adrian Hilton, Philip J. B. Jackson

    Abstract: Active speaker detection (ASD) is a multi-modal task that aims to identify who, if anyone, is speaking from a set of candidates. Current audio-visual approaches for ASD typically rely on visually pre-extracted face tracks (sequences of consecutive face crops) and the respective monaural audio. However, their recall rate is often low as only the visible faces are included in the set of candidates.…

    Submitted 7 March, 2022; originally announced March 2022.

    Comments: Erratum: Due to a bug in the evaluation script, the correct average distance (aD) metric is here reported in yellow. The analysis remains unchanged from the original paper as the trend between the old and new measures is perfectly monotonic. The bug was caused by an incorrect normalization factor.

    Journal ref: IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP), 2021

  9. arXiv:2105.00641  [pdf, other]

    cs.MM cs.SD eess.AS

    Naturalistic audio-visual volumetric sequences dataset of sounding actions for six degree-of-freedom interaction

    Authors: Hanne Stenzel, Davide Berghi, Marco Volino, Philip J. B. Jackson

    Abstract: As audio-visual systems increasingly bring immersive and interactive capabilities into our work and leisure activities, so the need for naturalistic test material grows. New volumetric datasets have captured high-quality 3D video, but accompanying audio is often neglected, making it hard to test an integrated bimodal experience. Designed to cover diverse sound types and features, the presented vol…

    Submitted 3 May, 2021; originally announced May 2021.

    Comments: for dataset visit cvssp.org/data/navvs; accepted as poster in IEEE VR 2021

  10. arXiv:2003.06656  [pdf, other]

    eess.AS cs.SD eess.IV

    Audio-Visual Spatial Alignment Requirements of Central and Peripheral Object Events

    Authors: Davide Berghi, Hanne Stenzel, Marco Volino, Adrian Hilton, Philip J. B. Jackson

    Abstract: Immersive audio-visual perception relies on the spatial integration of both auditory and visual information which are heterogeneous sensing modalities with different fields of reception and spatial resolution. This study investigates the perceived coherence of audiovisual object events presented either centrally or peripherally with horizontally aligned/misaligned sound. Various object events were…

    Submitted 14 March, 2020; originally announced March 2020.

    Comments: Two-page poster abstract

    Journal ref: IEEE VR 2020