Showing 1–16 of 16 results for author: Mandel, M

  1. arXiv:2406.04615  [pdf, other]

    eess.AS cs.CL cs.SD

    What do MLLMs hear? Examining reasoning with text and sound components in Multimodal Large Language Models

    Authors: Enis Berk Çoban, Michael I. Mandel, Johanna Devaney

    Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, notably in connecting ideas and adhering to logical rules to solve problems. These models have evolved to accommodate various data modalities, including sound and images, known as multimodal LLMs (MLLMs), which are capable of describing images or sound recordings. Previous work has demonstrated that when the LLM comp…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: 9 pages

  2. arXiv:2309.16867  [pdf, other]

    eess.AS cs.SD

    Towards High Resolution Weather Monitoring with Sound Data

    Authors: Enis Berk Çoban, Megan Perra, Michael I. Mandel

    Abstract: Across various research domains, remotely-sensed weather products are valuable for answering many scientific questions; however, their temporal and spatial resolutions are often too coarse to answer many questions. For instance, in wildlife research, it's crucial to have fine-scaled, highly localized weather observations when studying animal movement and behavior. This paper harnesses acoustic dat…

    Submitted 28 September, 2023; originally announced September 2023.

    Comments: 5 pages, submitted to ICASSP 2024

  3. arXiv:2211.12232  [pdf, other]

    cs.SD cs.LG eess.AS

    AERO: Audio Super Resolution in the Spectral Domain

    Authors: Moshe Mandel, Or Tal, Yossi Adi

    Abstract: We present AERO, an audio super-resolution model that processes speech and music signals in the spectral domain. AERO is based on an encoder-decoder architecture with U-Net like skip connections. We optimize the model using both time and frequency domain loss functions. Specifically, we consider a set of reconstruction losses together with perceptual ones in the form of adversarial and feature disc…

    Submitted 26 February, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

  4. arXiv:2206.11000  [pdf, other]

    eess.AS cs.LG cs.SD

    A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement

    Authors: Or Tal, Moshe Mandel, Felix Kreuk, Yossi Adi

    Abstract: Speech enhancement has seen great improvement in recent years using end-to-end neural networks. However, most models are agnostic to the spoken phonetic content. Recently, several studies suggested phonetic-aware speech enhancement, mostly using perceptual supervision. Yet, injecting phonetic features during model optimization can take additional forms (e.g., model conditioning). In this paper, we…

    Submitted 22 June, 2022; originally announced June 2022.

    Comments: Published @ Interspeech 2022

  5. arXiv:2112.07156  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    ImportantAug: a data augmentation agent for speech

    Authors: Viet Anh Trinh, Hassan Salami Kavaki, Michael I Mandel

    Abstract: We introduce ImportantAug, a technique to augment training data for speech classification and recognition models by adding noise to unimportant regions of the speech and not to important regions. Importance is predicted for each utterance by a data augmentation agent that is trained to maximize the amount of noise it adds while minimizing its impact on recognition performance. The effectiveness of…

    Submitted 19 February, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

    Comments: To appear in Proceeding of ICASSP 2022, May 2022

  6. arXiv:2012.03388  [pdf, other]

    cs.SD cs.LG eess.AS

    Combining Spatial Clustering with LSTM Speech Models for Multichannel Speech Enhancement

    Authors: Felix Grezes, Zhaoheng Ni, Viet Anh Trinh, Michael Mandel

    Abstract: Recurrent neural networks using the LSTM architecture can achieve significant single-channel noise reduction. It is not obvious, however, how to apply them to multi-channel inputs in a way that can generalize to new microphone configurations. In contrast, spatial clustering techniques can achieve such generalization, but lack a strong signal model. This paper combines the two approaches to attain…

    Submitted 2 December, 2020; originally announced December 2020.

    Comments: arXiv admin note: text overlap with arXiv:2012.01576, arXiv:2012.02191

  7. arXiv:2012.02191  [pdf, other]

    cs.SD cs.LG eess.AS

    Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial Clustering Masks

    Authors: Zhaoheng Ni, Felix Grezes, Viet Anh Trinh, Michael I. Mandel

    Abstract: Spatial clustering techniques can achieve significant multi-channel noise reduction across relatively arbitrary microphone configurations, but have difficulty incorporating a detailed speech/noise model. In contrast, LSTM neural networks have successfully been trained to recognize speech from noise on single-channel inputs, but have difficulty taking full advantage of the information in multi-chan…

    Submitted 2 December, 2020; originally announced December 2020.

    Comments: arXiv admin note: substantial text overlap with arXiv:2012.01576

  8. arXiv:2012.01576  [pdf, other]

    cs.SD cs.LG eess.AS

    Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks

    Authors: Felix Grezes, Zhaoheng Ni, Viet Anh Trinh, Michael Mandel

    Abstract: Recent works have shown that Deep Recurrent Neural Networks using the LSTM architecture can achieve strong single-channel speech enhancement by estimating time-frequency masks. However, these models do not naturally generalize to multi-channel inputs from varying microphone configurations. In contrast, spatial clustering techniques can achieve such generalization but lack a strong signal model. Ou…

    Submitted 2 December, 2020; originally announced December 2020.

  9. arXiv:2011.09162  [pdf, other]

    eess.AS cs.SD

    WPD++: An Improved Neural Beamformer for Simultaneous Speech Separation and Dereverberation

    Authors: Zhaoheng Ni, Yong Xu, Meng Yu, Bo Wu, Shixiong Zhang, Dong Yu, Michael I Mandel

    Abstract: This paper aims at eliminating the interfering speakers' speech, additive noise, and reverberation from the noisy multi-talker speech mixture that benefits automatic speech recognition (ASR) backend. While the recently proposed Weighted Power minimization Distortionless response (WPD) beamformer can perform separation and dereverberation simultaneously, the noise cancellation component still has t…

    Submitted 18 November, 2020; originally announced November 2020.

    Comments: accepted by SLT 2021

  10. Large scale evaluation of importance maps in automatic speech recognition

    Authors: Viet Anh Trinh, Michael I Mandel

    Abstract: In this paper, we propose a metric that we call the structured saliency benchmark (SSBM) to evaluate importance maps computed for automatic speech recognizers on individual utterances. These maps indicate time-frequency points of the utterance that are most important for correct recognition of a target word. Our evaluation technique is not only suitable for standard classification tasks, but is al…

    Submitted 21 May, 2020; originally announced May 2020.

    Comments: submitted to INTERSPEECH 2020

    Journal ref: Proceedings of Interspeech 2020

  11. arXiv:2004.09249  [pdf, other]

    cs.SD cs.CL eess.AS

    CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

    Authors: Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant

    Abstract: Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous C…

    Submitted 2 May, 2020; v1 submitted 20 April, 2020; originally announced April 2020.

  12. arXiv:1911.06266  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Speaker independence of neural vocoders and their effect on parametric resynthesis speech enhancement

    Authors: Soumi Maiti, Michael I Mandel

    Abstract: Traditional speech enhancement systems produce speech with compromised quality. Here we propose to use the high quality speech generation capability of neural vocoders for better quality speech enhancement. We term this parametric resynthesis (PR). In previous work, we showed that PR systems generate high quality speech for a single speaker using two neural vocoders, WaveNet and WaveGlow. Both the…

    Submitted 14 November, 2019; originally announced November 2019.

  13. arXiv:1911.00982  [pdf, other]

    eess.AS cs.SD

    Onssen: an open-source speech separation and enhancement library

    Authors: Zhaoheng Ni, Michael I Mandel

    Abstract: Speech separation is an essential task for multi-talker speech recognition. Recently many deep learning approaches are proposed and have been constantly refreshing the state-of-the-art performances. The lack of algorithm implementations limits researchers to use the same dataset for comparison. Building a generic platform can benefit researchers by easily implementing novel separation algorithms a…

    Submitted 3 November, 2019; originally announced November 2019.

    Comments: Submitted to ICASSP 2020

  14. arXiv:1906.06762  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Parametric Resynthesis with neural vocoders

    Authors: Soumi Maiti, Michael I Mandel

    Abstract: Noise suppression systems generally produce output speech with compromised quality. We propose to utilize the high quality speech generation capability of neural vocoders for noise suppression. We use a neural network to predict clean mel-spectrogram features from noisy speech and then compare two neural vocoders, WaveNet and WaveGlow, for synthesizing clean speech from the predicted mel spectrogr…

    Submitted 14 November, 2019; v1 submitted 16 June, 2019; originally announced June 2019.

  15. arXiv:1904.01537  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Speech denoising by parametric resynthesis

    Authors: Soumi Maiti, Michael I Mandel

    Abstract: This work proposes the use of clean speech vocoder parameters as the target for a neural network performing speech enhancement. These parameters have been designed for text-to-speech synthesis so that they both produce high-quality resyntheses and also are straightforward to model with neural networks, but have not been utilized in speech enhancement until now. In comparison to a matched text-to-s…

    Submitted 2 April, 2019; originally announced April 2019.

  16. arXiv:1103.2832  [pdf, other]

    cs.LG cs.IR cs.SD

    Autotagging music with conditional restricted Boltzmann machines

    Authors: Michael Mandel, Razvan Pascanu, Hugo Larochelle, Yoshua Bengio

    Abstract: This paper describes two applications of conditional restricted Boltzmann machines (CRBMs) to the task of autotagging music. The first consists of training a CRBM to predict tags that a user would apply to a clip of a song based on tags already applied by other users. By learning the relationships between tags, this model is able to pre-process training data to significantly improve the performanc…

    Submitted 14 March, 2011; originally announced March 2011.