
Showing 1–28 of 28 results for author: Hershey, J R

Searching in archive eess.
  1. arXiv:2406.05629  [pdf, other]

    cs.CV cs.CL cs.IR cs.LG cs.SD eess.AS

    Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language

    Authors: Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman

    Abstract: We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the ``meaning'' of words and the ``location'' of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types…

    Submitted 8 June, 2024; originally announced June 2024.

    Comments: Computer Vision and Pattern Recognition 2024

  2. arXiv:2402.01413  [pdf, other]

    cs.SD cs.LG eess.AS

    Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge

    Authors: Simon Leglaive, Matthieu Fraticelli, Hend ElGhazaly, Léonie Borne, Mostafa Sadeghi, Scott Wisdom, Manuel Pariente, John R. Hershey, Daniel Pressnitzer, Jon P. Barker

    Abstract: Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals. However, the synthetic training conditions may not accurately reflect real-world conditions encountered during testing. This discrepancy can result in poor performance when the test domain significantly differs from the synthetic training domain. To tackle this issue, the U…

    Submitted 2 February, 2024; originally announced February 2024.

  3. arXiv:2308.10415  [pdf, other]

    cs.SD cs.LG eess.AS

    TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition

    Authors: Hakan Erdogan, Scott Wisdom, Xuankai Chang, Zalán Borsos, Marco Tagliasacchi, Neil Zeghidour, John R. Hershey

    Abstract: We present TokenSplit, a speech separation model that acts on discrete token sequences. The model is trained on multiple tasks simultaneously: separate and transcribe each speech source, and generate speech from text. The model operates on transcripts and audio token sequences and achieves multiple tasks through masking of inputs. The model is a sequence-to-sequence encoder-decoder model that uses…

    Submitted 20 August, 2023; originally announced August 2023.

    Comments: INTERSPEECH 2023, project webpage with audio demos at https://google-research.github.io/sound-separation/papers/tokensplit

  4. The CHiME-7 UDASE task: Unsupervised domain adaptation for conversational speech enhancement

    Authors: Simon Leglaive, Léonie Borne, Efthymios Tzinis, Mostafa Sadeghi, Matthieu Fraticelli, Scott Wisdom, Manuel Pariente, Daniel Pressnitzer, John R. Hershey

    Abstract: Supervised speech enhancement models are trained using artificially generated mixtures of clean speech and noise signals, which may not match real-world recording conditions at test time. This mismatch can lead to poor performance if the test domain significantly differs from the synthetic training domain. This paper introduces the unsupervised domain adaptation for conversational speech enhanceme…

    Submitted 2 October, 2023; v1 submitted 7 July, 2023; originally announced July 2023.

    Journal ref: The 7th International Workshop on Speech Processing in Everyday Environments (CHiME), Dublin, Ireland, 2023

  5. arXiv:2305.11151  [pdf, other]

    cs.SD eess.AS

    Unsupervised Multi-channel Separation and Adaptation

    Authors: Cong Han, Kevin Wilson, Scott Wisdom, John R. Hershey

    Abstract: A key challenge in machine learning is to generalize from training data to an application domain of interest. This work generalizes the recently-proposed mixture invariant training (MixIT) algorithm to perform unsupervised learning in the multi-channel setting. We use MixIT to train a model on far-field microphone array recordings of overlapping reverberant and noisy speech from the AMI Corpus. Th…

    Submitted 22 March, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

  6. arXiv:2305.05591  [pdf, other]

    cs.SD cs.CV eess.AS

    AudioSlots: A slot-centric generative model for audio separation

    Authors: Pradyumna Reddy, Scott Wisdom, Klaus Greff, John R. Hershey, Thomas Kipf

    Abstract: In a range of recent works, object-centric architectures have been shown to be suitable for unsupervised scene decomposition in the vision domain. Inspired by these methods we present AudioSlots, a slot-centric generative model for blind source separation in the audio domain. AudioSlots is built using permutation-equivariant encoder and decoder networks. The encoder network based on the Transforme…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: Accepted at the Self-supervision in Audio, Speech and Beyond (SASB) Workshop at ICASSP 2023

  7. arXiv:2207.10141  [pdf, other]

    cs.SD cs.CV eess.AS

    AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation

    Authors: Efthymios Tzinis, Scott Wisdom, Tal Remez, John R. Hershey

    Abstract: We introduce AudioScopeV2, a state-of-the-art universal audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify several limitations of previous work on audio-visual on-screen sound separation, including the coarse resolution of spatio-temporal attention, poor convergence o…

    Submitted 20 July, 2022; originally announced July 2022.

    Comments: ECCV 2022

  8. arXiv:2207.00562  [pdf, other]

    cs.SD eess.AS

    Distance-Based Sound Separation

    Authors: Katharine Patterson, Kevin Wilson, Scott Wisdom, John R. Hershey

    Abstract: We propose the novel task of distance-based sound separation, where sounds are separated based only on their distance from a single microphone. In the context of assisted listening devices, proximity provides a simple criterion for sound selection in noisy environments that would allow the user to focus on sounds relevant to a local conversation. We demonstrate the feasibility of this approach by…

    Submitted 1 July, 2022; originally announced July 2022.

    Comments: Accepted for publication at Interspeech 2022

  9. arXiv:2203.15652  [pdf, other]

    eess.AS cs.SD

    CycleGAN-Based Unpaired Speech Dereverberation

    Authors: Hannah Muckenhirn, Aleksandr Safin, Hakan Erdogan, Felix de Chaumont Quitry, Marco Tagliasacchi, Scott Wisdom, John R. Hershey

    Abstract: Typically, neural network-based speech dereverberation models are trained on paired data, composed of a dry utterance and its corresponding reverberant utterance. The main limitation of this approach is that such models can only be trained on large amounts of data and a variety of room impulse responses when the data is synthetically reverberated, since acquiring real paired data is costly. In thi…

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: Submitted to Interspeech 2022

  10. arXiv:2110.10739  [pdf, other]

    cs.SD eess.AS

    Adapting Speech Separation to Real-World Meetings Using Mixture Invariant Training

    Authors: Aswin Sivaraman, Scott Wisdom, Hakan Erdogan, John R. Hershey

    Abstract: The recently-proposed mixture invariant training (MixIT) is an unsupervised method for training single-channel sound separation models in the sense that it does not require ground-truth isolated reference sources. In this paper, we investigate using MixIT to adapt a separation model on real far-field overlapping reverberant and noisy speech data from the AMI Corpus. The models are tested on real A…

    Submitted 20 October, 2021; originally announced October 2021.

  11. arXiv:2110.03209  [pdf, other]

    eess.AS

    Improving Bird Classification with Unsupervised Sound Separation

    Authors: Tom Denton, Scott Wisdom, John R. Hershey

    Abstract: This paper addresses the problem of species classification in bird song recordings. The massive amount of available field recordings of birds presents an opportunity to use machine learning to automatically track bird populations. However, it also poses a problem: such field recordings typically contain significant environmental noise and overlapping vocalizations that interfere with classificatio…

    Submitted 7 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures. Examples available at https://bird-mixit.github.io

  12. arXiv:2106.15813  [pdf, other]

    eess.AS cs.SD

    DF-Conformer: Integrated architecture of Conv-TasNet and Conformer using linear complexity self-attention for speech enhancement

    Authors: Yuma Koizumi, Shigeki Karita, Scott Wisdom, Hakan Erdogan, John R. Hershey, Llion Jones, Michiel Bacchiani

    Abstract: Single-channel speech enhancement (SE) is an important task in speech processing. A widely used framework combines an analysis/synthesis filterbank with a mask prediction network, such as the Conv-TasNet architecture. In such systems, the denoising performance and computational efficiency are mainly affected by the structure of the mask prediction network. In this study, we aim to improve the sequ…

    Submitted 5 August, 2021; v1 submitted 30 June, 2021; originally announced June 2021.

    Comments: 5 pages, 2 figures. Accepted for WASPAA 2021

  13. arXiv:2106.00847  [pdf, other]

    eess.AS cs.SD

    Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation

    Authors: Scott Wisdom, Aren Jansen, Ron J. Weiss, Hakan Erdogan, John R. Hershey

    Abstract: Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on open-domain tasks. The recent mixture invariant training (MixIT) method enables training on in-the-wild data; however, it suffers from two outstanding problems. F…

    Submitted 16 October, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

    Comments: 5 pages, 1 figure. WASPAA 2021

  14. arXiv:2105.02132  [pdf, other]

    cs.SD cs.LG eess.AS

    Self-Supervised Learning from Automatically Separated Sound Scenes

    Authors: Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing Moore, Xavier Serra

    Abstract: Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this…

    Submitted 14 September, 2021; v1 submitted 5 May, 2021; originally announced May 2021.

  15. arXiv:2105.02096  [pdf, other]

    cs.SD cs.LG eess.AS

    End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings

    Authors: Soumi Maiti, Hakan Erdogan, Kevin Wilson, Scott Wisdom, Shinji Watanabe, John R. Hershey

    Abstract: We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. End-to-end diarization models have the advantage of handling speaker overlap and enabling straightforward handling of discriminative training, unlike traditional clustering-based diarization methods. The proposed system is designed to handle meetings with unknown numbers of speakers,…

    Submitted 5 May, 2021; originally announced May 2021.

    Comments: 5 pages, 2 figures, ICASSP 2021

    Journal ref: ICASSP 2021, SPE-54.1

  16. arXiv:2012.09727  [pdf, other]

    eess.AS cs.SD eess.SP

    Continuous Speech Separation Using Speaker Inventory for Long Multi-talker Recording

    Authors: Cong Han, Yi Luo, Chenda Li, Tianyan Zhou, Keisuke Kinoshita, Shinji Watanabe, Marc Delcroix, Hakan Erdogan, John R. Hershey, Nima Mesgarani, Zhuo Chen

    Abstract: Leveraging additional speaker information to facilitate speech separation has received increasing attention in recent years. Recent research includes extracting target speech by using the target speaker's voice snippet and jointly separating all participating speakers by using a pool of additional speaker signals, which is known as speech separation using speaker inventory (SSUSI). However, all th…

    Submitted 18 December, 2020; v1 submitted 17 December, 2020; originally announced December 2020.

  17. arXiv:2011.02014  [pdf, other]

    eess.AS cs.SD

    Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis

    Authors: Desh Raj, Pavel Denisov, Zhuo Chen, Hakan Erdogan, Zili Huang, Maokui He, Shinji Watanabe, Jun Du, Takuya Yoshioka, Yi Luo, Naoyuki Kanda, Jinyu Li, Scott Wisdom, John R. Hershey

    Abstract: Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this pape…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Accepted to IEEE SLT 2021

  18. arXiv:2011.01143  [pdf, other]

    cs.SD cs.CV eess.AS

    Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

    Authors: Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Daniel P. W. Ellis, John R. Hershey

    Abstract: Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos. Pri…

    Submitted 29 May, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

    Comments: ICLR 2021, 27 pages

  19. arXiv:2006.12701  [pdf, other]

    eess.AS cs.LG cs.SD

    Unsupervised Sound Separation Using Mixture Invariant Training

    Authors: Scott Wisdom, Efthymios Tzinis, Hakan Erdogan, Ron J. Weiss, Kevin Wilson, John R. Hershey

    Abstract: In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. Reliance on this synthetic training data is problematic because good performance depends upon…

    Submitted 23 October, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

    Comments: Accepted for spotlight presentation at NeurIPS 2020

  20. arXiv:1911.07953  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement

    Authors: Zhong-Qiu Wang, Hakan Erdogan, Scott Wisdom, Kevin Wilson, Desh Raj, Shinji Watanabe, Zhuo Chen, John R. Hershey

    Abstract: This work introduces sequential neural beamforming, which alternates between neural network based spectral separation and beamforming based spatial separation. Our neural networks for separation use an advanced convolutional architecture trained with a novel stabilized signal-to-noise ratio loss function. For beamforming, we explore multiple ways of computing time-varying covariance matrices, incl…

    Submitted 3 November, 2020; v1 submitted 18 November, 2019; originally announced November 2019.

    Comments: 7 pages, 7 figures, IEEE SLT 2021 (slt2020.org)

  21. arXiv:1911.07951  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Improving Universal Sound Separation Using Sound Classification

    Authors: Efthymios Tzinis, Scott Wisdom, John R. Hershey, Aren Jansen, Daniel P. W. Ellis

    Abstract: Deep learning approaches have recently achieved impressive performance on both audio source separation and sound classification. Most audio source separation approaches focus only on separating sources belonging to a restricted domain of source classes, such as speech and music. However, recent work has demonstrated the possibility of "universal sound separation", which aims to separate acoustic s…

    Submitted 18 November, 2019; originally announced November 2019.

    Journal ref: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  22. arXiv:1905.03330  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Universal Sound Separation

    Authors: Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin Wilson, Jonathan Le Roux, John R. Hershey

    Abstract: Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a…

    Submitted 2 August, 2019; v1 submitted 8 May, 2019; originally announced May 2019.

    Comments: 5 pages, accepted to WASPAA 2019

  23. arXiv:1811.08521  [pdf, other]

    cs.SD eess.AS

    Differentiable Consistency Constraints for Improved Deep Speech Enhancement

    Authors: Scott Wisdom, John R. Hershey, Kevin Wilson, Jeremy Thorpe, Michael Chinen, Brian Patton, Rif A. Saurous

    Abstract: In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglec…

    Submitted 20 November, 2018; originally announced November 2018.

  24. arXiv:1811.02508  [pdf, other]

    cs.SD eess.AS

    SDR - half-baked or well done?

    Authors: Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, John R. Hershey

    Abstract: In speech enhancement and source separation, signal-to-noise ratio is a ubiquitous objective measure of denoising/separation quality. A decade ago, the BSS_eval toolkit was developed to give researchers worldwide a way to evaluate the quality of their algorithms in a simple, fair, and hopefully insightful way: it attempted to account for channel variations, and to not only evaluate the total disto…

    Submitted 6 November, 2018; originally announced November 2018.

  25. arXiv:1810.01395  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS stat.ML

    Phasebook and Friends: Leveraging Discrete Representations for Source Separation

    Authors: Jonathan Le Roux, Gordon Wichern, Shinji Watanabe, Andy Sarroff, John R. Hershey

    Abstract: Deep learning based speech enhancement and source separation systems have recently reached unprecedented levels of quality, to the point that performance is reaching a new ceiling. Most systems rely on estimating the magnitude of a target source by estimating a real-valued mask to be applied to a time-frequency representation of the mixture signal. A limiting factor in such approaches is a lack of…

    Submitted 7 March, 2019; v1 submitted 2 October, 2018; originally announced October 2018.

  26. arXiv:1805.05826  [pdf, other]

    cs.SD cs.CL eess.AS stat.ML

    A Purely End-to-end System for Multi-speaker Speech Recognition

    Authors: Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey

    Abstract: Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence fr…

    Submitted 15 May, 2018; originally announced May 2018.

    Comments: ACL 2018

  27. arXiv:1804.10204  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS stat.ML

    End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction

    Authors: Zhong-Qiu Wang, Jonathan Le Roux, DeLiang Wang, John R. Hershey

    Abstract: This paper proposes an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its inverse are represented as layers within a deep network. Previous approaches, rather than computing a loss on the reconstructed signal, used a surrogate loss based on the target STFT magnitudes. This…

    Submitted 26 April, 2018; originally announced April 2018.

    Comments: Submitted to Interspeech 2018

  28. arXiv:1711.08016  [pdf, other]

    eess.AS cs.CL cs.SD

    Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition

    Authors: Zhong Meng, Shinji Watanabe, John R. Hershey, Hakan Erdogan

    Abstract: Far-field speech recognition in noisy and reverberant conditions remains a challenging problem despite recent deep learning breakthroughs. This problem is commonly addressed by acquiring a speech signal from multiple microphones and performing beamforming over them. In this paper, we propose to use a recurrent neural network with long short-term memory (LSTM) architecture to adaptively estimate re…

    Submitted 21 November, 2017; originally announced November 2017.

    Comments: in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    Journal ref: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 2017, pp. 271-275