Showing 1–25 of 25 results for author: Tagliasacchi, M

  1. arXiv:2404.10419

    eess.AS cs.CL

    MAD Speech: Measures of Acoustic Diversity of Speech

    Authors: Matthieu Futeral, Andrea Agostinelli, Marco Tagliasacchi, Neil Zeghidour, Eugene Kharitonov

    Abstract: Generative spoken language models produce speech in a wide range of voices, prosody, and recording conditions, seemingly approaching the diversity of natural speech. However, the extent to which generated speech is acoustically diverse remains unclear due to a lack of appropriate metrics. We address this gap by developing lightweight metrics of acoustic diversity, which we collectively refer to as…

    Submitted 16 April, 2024; originally announced April 2024.

  2. arXiv:2308.10415

    cs.SD cs.LG eess.AS

    TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition

    Authors: Hakan Erdogan, Scott Wisdom, Xuankai Chang, Zalán Borsos, Marco Tagliasacchi, Neil Zeghidour, John R. Hershey

    Abstract: We present TokenSplit, a speech separation model that acts on discrete token sequences. The model is trained on multiple tasks simultaneously: separate and transcribe each speech source, and generate speech from text. The model operates on transcripts and audio token sequences and achieves multiple tasks through masking of inputs. The model is a sequence-to-sequence encoder-decoder model that uses…

    Submitted 20 August, 2023; originally announced August 2023.

    Comments: INTERSPEECH 2023, project webpage with audio demos at https://google-research.github.io/sound-separation/papers/tokensplit

  3. arXiv:2306.12925

    cs.CL cs.AI cs.SD eess.AS stat.ML

    AudioPaLM: A Large Language Model That Can Speak and Listen

    Authors: Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, et al. (5 additional authors not shown)

    Abstract: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the…

    Submitted 22 June, 2023; originally announced June 2023.

    Comments: Technical report

  4. arXiv:2305.09636

    cs.SD cs.LG eess.AS

    SoundStorm: Efficient Parallel Audio Generation

    Authors: Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi

    Abstract: We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consist…

    Submitted 16 May, 2023; originally announced May 2023.

  5. arXiv:2303.12984

    cs.SD eess.AS

    LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models

    Authors: Teerapat Jenrungrot, Michael Chinen, W. Bastiaan Kleijn, Jan Skoglund, Zalán Borsos, Neil Zeghidour, Marco Tagliasacchi

    Abstract: We introduce LMCodec, a causal neural speech codec that provides high quality audio at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the tran…

    Submitted 22 March, 2023; originally announced March 2023.

    Comments: 5 pages, accepted to ICASSP 2023, project page: https://mjenrungrot.github.io/chrome-media-audio-papers/publications/lmcodec

  6. arXiv:2302.03540

    cs.SD eess.AS

    Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

    Authors: Eugene Kharitonov, Damien Vincent, Zalán Borsos, Raphaël Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, Neil Zeghidour

    Abstract: We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to "reading") and from semantic tokens to low-level acoustic tokens ("speaking"). Decoupling these two tasks enables…

    Submitted 7 February, 2023; originally announced February 2023.

  7. arXiv:2301.11325

    cs.SD cs.LG eess.AS

    MusicLM: Generating Music From Text

    Authors: Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank

    Abstract: We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous s…

    Submitted 26 January, 2023; originally announced January 2023.

    Comments: Supplementary material at https://google-research.github.io/seanet/musiclm/examples and https://kaggle.com/datasets/googleai/musiccaps

  8. arXiv:2209.03143

    cs.SD cs.LG eess.AS

    AudioLM: a Language Modeling Approach to Audio Generation

    Authors: Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour

    Abstract: We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenizati…

    Submitted 25 July, 2023; v1 submitted 7 September, 2022; originally announced September 2022.

  9. arXiv:2204.05738

    eess.AS cs.SD

    Text-Driven Separation of Arbitrary Sounds

    Authors: Kevin Kilgour, Beat Gfeller, Qingqing Huang, Aren Jansen, Scott Wisdom, Marco Tagliasacchi

    Abstract: We propose a method of separating a desired sound source from a single-channel mixture, based on either a textual description or a short audio sample of the target source. This is achieved by combining two distinct models. The first model, SoundWords, is trained to jointly embed both an audio clip and its textual description to the same embedding in a shared representation. The second model, Sound…

    Submitted 12 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  10. arXiv:2203.15652

    eess.AS cs.SD

    CycleGAN-Based Unpaired Speech Dereverberation

    Authors: Hannah Muckenhirn, Aleksandr Safin, Hakan Erdogan, Félix de Chaumont Quitry, Marco Tagliasacchi, Scott Wisdom, John R. Hershey

    Abstract: Typically, neural network-based speech dereverberation models are trained on paired data, composed of a dry utterance and its corresponding reverberant utterance. The main limitation of this approach is that such models can only be trained on large amounts of data and a variety of room impulse responses when the data is synthetically reverberated, since acquiring real paired data is costly. In thi…

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: Submitted to Interspeech 2022

  11. arXiv:2203.15578

    cs.SD cs.LG eess.AS

    Disentangling speech from surroundings with neural embeddings

    Authors: Ahmed Omran, Neil Zeghidour, Zalán Borsos, Félix de Chaumont Quitry, Malcolm Slaney, Marco Tagliasacchi

    Abstract: We present a method to separate speech signals from noisy environments in the embedding space of a neural audio codec. We introduce a new training procedure that allows our model to produce structured encodings of audio waveforms given by embedding vectors, where one part of the embedding vector represents the speech signal, and the rest represent the environment. We achieve this by partitioning t…

    Submitted 4 June, 2023; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted at ICASSP 2023

  12. arXiv:2203.00756

    eess.AS cs.SD

    Real time spectrogram inversion on mobile phone

    Authors: Oleg Rybakov, Marco Tagliasacchi, Yunpeng Li, Liyang Jiang, Xia Zhang, Fadi Biadsy

    Abstract: We present two methods of real time magnitude spectrogram inversion: streaming Griffin Lim (GL) and streaming MelGAN. We demonstrate the impact of looking ahead on perceptual quality of MelGAN. As little as one hop size (12.5ms) of lookahead is able to significantly improve perceptual quality in comparison to its causal version. We compare streaming GL with the streaming MelGAN and show different t…

    Submitted 24 May, 2023; v1 submitted 1 March, 2022; originally announced March 2022.

  13. arXiv:2202.07273

    cs.SD cs.LG eess.AS

    SpeechPainter: Text-conditioned Speech Inpainting

    Authors: Zalán Borsos, Matt Sharifi, Marco Tagliasacchi

    Abstract: We propose SpeechPainter, a model for filling in gaps of up to one second in speech samples by leveraging an auxiliary textual input. We demonstrate that the model performs speech inpainting with the appropriate content, while maintaining speaker identity, prosody and recording environment conditions, and generalizing to unseen speakers. Our approach significantly outperforms baselines constructed…

    Submitted 30 March, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

    Comments: Submitted to Interspeech 2022

  14. arXiv:2107.03312

    cs.SD cs.LG eess.AS

    SoundStream: An End-to-End Neural Audio Codec

    Authors: Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, Marco Tagliasacchi

    Abstract: We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed of a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and s…

    Submitted 7 July, 2021; originally announced July 2021.

  15. arXiv:2105.02132

    cs.SD cs.LG eess.AS

    Self-Supervised Learning from Automatically Separated Sound Scenes

    Authors: Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing Moore, Xavier Serra

    Abstract: Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this…

    Submitted 14 September, 2021; v1 submitted 5 May, 2021; originally announced May 2021.

  16. arXiv:2101.08596

    cs.SD cs.LG eess.AS

    LEAF: A Learnable Frontend for Audio Classification

    Authors: Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry, Marco Tagliasacchi

    Abstract: Mel-filterbanks are fixed, engineered audio features which emulate human perception and have been used through the history of audio understanding up to today. However, their undeniable qualities are counterbalanced by the fundamental limitations of handmade representations. In this work we show that we can train a single learnable frontend that outperforms mel-filterbanks on a wide range of audio…

    Submitted 21 January, 2021; originally announced January 2021.

    Comments: Accepted at ICLR 2021

  17. arXiv:2011.02421

    eess.AS

    One-shot conditional audio filtering of arbitrary sounds

    Authors: Beat Gfeller, Dominik Roblek, Marco Tagliasacchi

    Abstract: We consider the problem of separating a particular sound source from a single-channel mixture, based on only a short sample of the target source. Using SoundFilter, a wave-to-wave neural network architecture, we can train a model without using any sound class labels. Using a conditioning encoder model which is learned jointly with the source separation network, the trained model can be "configured…

    Submitted 4 November, 2020; originally announced November 2020.

  18. arXiv:2010.10677

    eess.AS cs.SD

    Real-time Speech Frequency Bandwidth Extension

    Authors: Yunpeng Li, Marco Tagliasacchi, Oleg Rybakov, Victor Ungureanu, Dominik Roblek

    Abstract: In this paper we propose a lightweight model for frequency bandwidth extension of speech signals, increasing the sampling frequency from 8kHz to 16kHz while restoring the high frequency content to a level almost indistinguishable from the 16kHz ground truth. The model architecture is based on SEANet (Sound EnhAncement Network), a wave-to-wave fully convolutional model, which uses a combination of…

    Submitted 9 February, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

  19. arXiv:2010.09658

    cs.SD cs.LG eess.AS stat.ML

    MicAugment: One-shot Microphone Style Transfer

    Authors: Zalán Borsos, Yunpeng Li, Beat Gfeller, Marco Tagliasacchi

    Abstract: A crucial aspect for the successful deployment of audio-based models "in-the-wild" is the robustness to the transformations introduced by heterogeneous acquisition conditions. In this work, we propose a method to perform one-shot microphone style transfer. Given only a few seconds of audio recorded by a target device, MicAugment identifies the transformations associated with the input acquisition pi…

    Submitted 19 October, 2020; originally announced October 2020.

  20. arXiv:2009.02095

    eess.AS cs.LG cs.SD

    SEANet: A Multi-modal Speech Enhancement Network

    Authors: Marco Tagliasacchi, Yunpeng Li, Karolis Misiunas, Dominik Roblek

    Abstract: We explore the possibility of leveraging accelerometer data to perform speech enhancement in very noisy conditions. Although it is possible to only partially reconstruct the user's speech from the accelerometer, the latter provides a strong conditioning signal that is not influenced by noise sources in the environment. Based on this observation, we feed a multi-modal input to SEANet (Sound EnhAnceme…

    Submitted 1 October, 2020; v1 submitted 4 September, 2020; originally announced September 2020.

    Comments: Accepted to INTERSPEECH 2020

  21. arXiv:2008.02027

    eess.AS cs.LG

    Learning to Denoise Historical Music

    Authors: Yunpeng Li, Beat Gfeller, Marco Tagliasacchi, Dominik Roblek

    Abstract: We propose an audio-to-audio neural network model that learns to denoise old music recordings. Our model internally converts its input into a time-frequency representation by means of a short-time Fourier transform (STFT), and processes the resulting complex spectrogram using a convolutional neural network. The network is trained with both reconstruction and adversarial objectives on a synthetic n…

    Submitted 16 June, 2022; v1 submitted 5 August, 2020; originally announced August 2020.

    Comments: ISMIR 2020

  22. arXiv:2002.12764

    eess.AS cs.LG cs.SD stat.ML

    Towards Learning a Universal Non-Semantic Representation of Speech

    Authors: Joel Shor, Aren Jansen, Ronnie Maor, Oran Lang, Omry Tuval, Félix de Chaumont Quitry, Marco Tagliasacchi, Ira Shavitt, Dotan Emanuel, Yinnon Haviv

    Abstract: The ultimate goal of transfer learning is to reduce labeled data requirements by exploiting a pre-existing embedding model trained for different datasets or tasks. The visual and language communities have established benchmarks to compare embeddings, but the speech community has yet to do so. This paper proposes a benchmark for comparing speech representations on non-semantic tasks, and proposes a…

    Submitted 6 August, 2020; v1 submitted 25 February, 2020; originally announced February 2020.

    Journal ref: Proceedings of INTERSPEECH 2020

  23. arXiv:1910.11910

    eess.AS cs.LG cs.SD

    Learning audio representations via phase prediction

    Authors: Félix de Chaumont Quitry, Marco Tagliasacchi, Dominik Roblek

    Abstract: We learn audio representations by solving a novel self-supervised learning task, which consists of predicting the phase of the short-time Fourier transform from its magnitude. A convolutional encoder is used to map the magnitude spectrum of the input waveform to a lower dimensional embedding. A convolutional decoder is then used to predict the instantaneous frequency (i.e., the temporal rate of ch…

    Submitted 25 October, 2019; originally announced October 2019.

    Comments: Submitted to ICASSP 2020

  24. arXiv:1910.11664

    eess.AS cs.LG cs.SD

    SPICE: Self-supervised Pitch Estimation

    Authors: Beat Gfeller, Christian Frank, Dominik Roblek, Matt Sharifi, Marco Tagliasacchi, Mihajlo Velimirović

    Abstract: We propose a model to estimate the fundamental frequency in monophonic audio, often referred to as pitch estimation. We acknowledge the fact that obtaining ground truth annotations at the required temporal and frequency resolution is a particularly daunting task. Therefore, we propose to adopt a self-supervised learning technique, which is able to estimate pitch without any form of supervision. Th…

    Submitted 4 September, 2020; v1 submitted 25 October, 2019; originally announced October 2019.

    Comments: Accepted to IEEE Transactions on Audio, Speech and Language Processing

    Journal ref: in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1118-1128, 2020

  25. arXiv:1905.11796

    eess.AS cs.LG cs.SD stat.ML

    Self-supervised audio representation learning for mobile devices

    Authors: Marco Tagliasacchi, Beat Gfeller, Félix de Chaumont Quitry, Dominik Roblek

    Abstract: We explore self-supervised models that can be potentially deployed on mobile devices to learn general purpose audio representations. Specifically, we propose methods that exploit the temporal context in the spectrogram domain. One method estimates the temporal gap between two short audio segments extracted at random from the same audio clip. The other methods are inspired by Word2Vec, a popular te…

    Submitted 24 May, 2019; originally announced May 2019.