Showing 1–13 of 13 results for author: Germain, F

  1. arXiv:2406.04212

    eess.AS cs.SD

    Sound Event Bounding Boxes

    Authors: Janek Ebbers, Francois G. Germain, Gordon Wichern, Jonathan Le Roux

    Abstract: Sound event detection is the task of recognizing sounds and determining their extent (onset/offset times) within an audio clip. Existing systems commonly predict sound presence confidence in short time frames. Then, thresholding produces binary frame-level presence decisions, with the extent of individual events determined by merging consecutive positive frames. In this paper, we show that frame-l…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted for publication at Interspeech 2024

  2. arXiv:2404.02252

    cs.SD eess.AS

    SMITIN: Self-Monitored Inference-Time INtervention for Generative Music Transformers

    Authors: Junghyun Koo, Gordon Wichern, Francois G. Germain, Sameer Khurana, Jonathan Le Roux

    Abstract: We introduce Self-Monitored Inference-Time INtervention (SMITIN), an approach for controlling an autoregressive generative music transformer using classifier probes. These simple logistic regression probes are trained on the output of each attention head in the transformer using a small dataset of audio examples both exhibiting and missing a specific musical trait (e.g., the presence/absence of dr…

    Submitted 2 April, 2024; originally announced April 2024.

  3. arXiv:2402.18407

    eess.AS

    Why does music source separation benefit from cacophony?

    Authors: Chang-Bin Jeon, Gordon Wichern, François G. Germain, Jonathan Le Roux

    Abstract: In music source separation, a standard training data augmentation procedure is to create new training samples by randomly combining instrument stems from different songs. These random mixes have mismatched characteristics compared to real music, e.g., the different stems do not have consistent beat or tonality, resulting in a cacophony. In this work, we investigate why random mixing is effective w…

    Submitted 28 February, 2024; originally announced February 2024.

    Comments: ICASSP 2024 Workshop on Explainable AI for Speech and Audio

  4. arXiv:2402.17907

    eess.AS cs.SD

    NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization

    Authors: Yoshiki Masuyama, Gordon Wichern, François G. Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux

    Abstract: Head-related transfer functions (HRTFs) are important for immersive audio, and their spatial interpolation has been studied to upsample finite measurements. Recently, neural fields (NFs) which map from sound source direction to HRTF have gained attention. Existing NF-based methods focused on estimating the magnitude of the HRTF from a given sound source direction, and the magnitude is converted to…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024

  5. arXiv:2312.07513

    eess.AS cs.SD

    NeuroHeed+: Improving Neuro-steered Speaker Extraction with Joint Auditory Attention Detection

    Authors: Zexu Pan, Gordon Wichern, Francois G. Germain, Sameer Khurana, Jonathan Le Roux

    Abstract: Neuro-steered speaker extraction aims to extract the listener's brain-attended speech signal from a multi-talker speech signal, in which the attention is derived from the cortical activity. This activity is usually recorded using electroencephalography (EEG) devices. Though promising, current methods often have a high speaker confusion error, where the interfering speaker is extracted instead of t…

    Submitted 12 December, 2023; originally announced December 2023.

  6. arXiv:2310.19644

    eess.AS cs.MM

    Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction

    Authors: Zexu Pan, Gordon Wichern, Yoshiki Masuyama, Francois G. Germain, Sameer Khurana, Chiori Hori, Jonathan Le Roux

    Abstract: Target speech extraction aims to extract, based on a given conditioning cue, a target speech signal that is corrupted by interfering sources, such as noise or competing speakers. Building upon the achievements of the state-of-the-art (SOTA) time-frequency speaker separation model TF-GridNet, we propose AV-GridNet, a visual-grounded variant that incorporates the face recording of a target speaker a…

    Submitted 30 October, 2023; originally announced October 2023.

    Comments: Accepted by ASRU 2023

  7. arXiv:2310.10604

    eess.AS cs.SD

    Generation or Replication: Auscultating Audio Latent Diffusion Models

    Authors: Dimitrios Bralios, Gordon Wichern, François G. Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux

    Abstract: The introduction of audio latent diffusion models possessing the ability to generate realistic sound clips on demand from a text description has the potential to revolutionize how we work with audio. In this work, we make an initial attempt at understanding the inner workings of audio latent diffusion models by investigating how their audio outputs compare with the training data, similar to how a…

    Submitted 16 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP 2024

  8. arXiv:2309.17352

    cs.SD eess.AS

    Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation

    Authors: Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, Shinji Watanabe

    Abstract: Automated audio captioning (AAC) aims to generate informative descriptions for various sounds from nature and/or human activities. In recent years, AAC has quickly attracted research interest, with state-of-the-art systems now relying on a sequence-to-sequence (seq2seq) backbone powered by strong models such as Transformers. Following the macro-trend of applied machine learning research, in this w…

    Submitted 9 January, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: ICASSP 2024 camera-ready paper. Winner of the DCASE 2023 Challenge Task 6A: Automated Audio Captioning (AAC)

  9. arXiv:2304.02160

    cs.SD cs.LG eess.AS

    Pac-HuBERT: Self-Supervised Music Source Separation via Primitive Auditory Clustering and Hidden-Unit BERT

    Authors: Ke Chen, Gordon Wichern, François G. Germain, Jonathan Le Roux

    Abstract: In spite of the progress in music source separation research, the small amount of publicly-available clean source data remains a constant limiting factor for performance. Thus, recent advances in self-supervised learning present a largely-unexplored opportunity for improving separation models by leveraging unlabelled music data. In this paper, we propose a self-supervised learning framework for mu…

    Submitted 4 April, 2023; originally announced April 2023.

    Comments: 5 pages, 2 figures, 3 tables

  10. arXiv:2211.02527

    eess.AS cs.SD

    Cold Diffusion for Speech Enhancement

    Authors: Hao Yen, François G. Germain, Gordon Wichern, Jonathan Le Roux

    Abstract: Diffusion models have recently shown promising results for difficult enhancement tasks such as the conditional and unconditional restoration of natural images and audio signals. In this work, we explore the possibility of leveraging a recently proposed advanced iterative diffusion model, namely cold diffusion, to recover clean speech signals from noisy signals. The unique mathematical properties o…

    Submitted 23 May, 2023; v1 submitted 4 November, 2022; originally announced November 2022.

    Comments: 5 pages, 1 figure, 1 table, 3 algorithms. To appear in ICASSP 2023. With corrected references

  11. arXiv:2211.01299

    eess.AS cs.CL cs.SD

    Late Audio-Visual Fusion for In-The-Wild Speaker Diarization

    Authors: Zexu Pan, Gordon Wichern, François G. Germain, Aswin Subramanian, Jonathan Le Roux

    Abstract: Speaker diarization is well studied for constrained audios but little explored for challenging in-the-wild videos, which have more speakers, shorter utterances, and inconsistent on-screen speakers. We address this gap by proposing an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion. For audio, we show that an attractor-based end-to-end system…

    Submitted 27 September, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

  12. arXiv:1806.10522

    eess.AS cs.SD

    Speech Denoising with Deep Feature Losses

    Authors: Francois G. Germain, Qifeng Chen, Vladlen Koltun

    Abstract: We present an end-to-end deep learning approach to denoising speech signals by processing the raw waveform directly. Given input audio containing speech corrupted by an additive background signal, the system aims to produce a processed signal that contains only the speech content. Recent approaches have shown promising results using various deep network architectures. In this paper, we propose to…

    Submitted 14 September, 2018; v1 submitted 27 June, 2018; originally announced June 2018.

    Comments: Code can be found at https://github.com/francoisgermain/SpeechDenoisingWithDeepFeatureLosses . Sound examples can be found at https://ccrma.stanford.edu/~francois/SpeechDenoisingWithDeepFeatureLosses/

  13. DPA on quasi delay insensitive asynchronous circuits: formalization and improvement

    Authors: G. F. Bouesse, M. Renaudin, S. Dumont, F. Germain

    Abstract: The purpose of this paper is to formally specify a flow devoted to the design of Differential Power Analysis (DPA) resistant QDI asynchronous circuits. The paper first proposes a formal modeling of the electrical signature of QDI asynchronous circuits. The DPA is then applied to the formal model in order to identify the source of leakage of this type of circuits. Finally, a complete design flow…

    Submitted 18 October, 2007; originally announced October 2007.

    Comments: Submitted on behalf of EDAA (http://www.edaa.com/)

    Journal ref: In Design, Automation and Test in Europe - DATE'05, Munich, Germany (2005)