Showing 1–13 of 13 results for author: Roblek, D

  1. arXiv:2209.03143  [pdf, other]

    cs.SD cs.LG eess.AS

    AudioLM: a Language Modeling Approach to Audio Generation

    Authors: Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour

    Abstract: We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenizati…

    Submitted 25 July, 2023; v1 submitted 7 September, 2022; originally announced September 2022.

  2. arXiv:2010.10677  [pdf, other]

    eess.AS cs.SD

    Real-time Speech Frequency Bandwidth Extension

    Authors: Yunpeng Li, Marco Tagliasacchi, Oleg Rybakov, Victor Ungureanu, Dominik Roblek

    Abstract: In this paper we propose a lightweight model for frequency bandwidth extension of speech signals, increasing the sampling frequency from 8kHz to 16kHz while restoring the high frequency content to a level almost indistinguishable from the 16kHz ground truth. The model architecture is based on SEANet (Sound EnhAncement Network), a wave-to-wave fully convolutional model, which uses a combination of…

    Submitted 9 February, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

  3. arXiv:2009.02095  [pdf, other]

    eess.AS cs.LG cs.SD

    SEANet: A Multi-modal Speech Enhancement Network

    Authors: Marco Tagliasacchi, Yunpeng Li, Karolis Misiunas, Dominik Roblek

    Abstract: We explore the possibility of leveraging accelerometer data to perform speech enhancement in very noisy conditions. Although it is possible to only partially reconstruct user's speech from the accelerometer, the latter provides a strong conditioning signal that is not influenced from noise sources in the environment. Based on this observation, we feed a multi-modal input to SEANet (Sound EnhAnceme…

    Submitted 1 October, 2020; v1 submitted 4 September, 2020; originally announced September 2020.

    Comments: Accepted to INTERSPEECH 2020

  4. arXiv:2008.02027  [pdf, other]

    eess.AS cs.LG

    Learning to Denoise Historical Music

    Authors: Yunpeng Li, Beat Gfeller, Marco Tagliasacchi, Dominik Roblek

    Abstract: We propose an audio-to-audio neural network model that learns to denoise old music recordings. Our model internally converts its input into a time-frequency representation by means of a short-time Fourier transform (STFT), and processes the resulting complex spectrogram using a convolutional neural network. The network is trained with both reconstruction and adversarial objectives on a synthetic n…

    Submitted 16 June, 2022; v1 submitted 5 August, 2020; originally announced August 2020.

    Comments: ISMIR 2020

  5. arXiv:2002.01322  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Training Keyword Spotters with Limited and Synthesized Speech Data

    Authors: James Lin, Kevin Kilgour, Dominik Roblek, Matthew Sharifi

    Abstract: With the rise of low power speech-enabled devices, there is a growing demand to quickly produce models for recognizing arbitrary sets of keywords. As with many machine learning tasks, one of the most challenging parts in the model creation process is obtaining a sufficient amount of training data. In this paper, we explore the effectiveness of synthesized speech data in training small, spoken term…

    Submitted 31 January, 2020; originally announced February 2020.

  6. arXiv:1910.11910  [pdf, other]

    eess.AS cs.LG cs.SD

    Learning audio representations via phase prediction

    Authors: Félix de Chaumont Quitry, Marco Tagliasacchi, Dominik Roblek

    Abstract: We learn audio representations by solving a novel self-supervised learning task, which consists of predicting the phase of the short-time Fourier transform from its magnitude. A convolutional encoder is used to map the magnitude spectrum of the input waveform to a lower dimensional embedding. A convolutional decoder is then used to predict the instantaneous frequency (i.e., the temporal rate of ch…

    Submitted 25 October, 2019; originally announced October 2019.

    Comments: Submitted to ICASSP 2020

  7. arXiv:1910.11664  [pdf, other]

    eess.AS cs.LG cs.SD

    SPICE: Self-supervised Pitch Estimation

    Authors: Beat Gfeller, Christian Frank, Dominik Roblek, Matt Sharifi, Marco Tagliasacchi, Mihajlo Velimirović

    Abstract: We propose a model to estimate the fundamental frequency in monophonic audio, often referred to as pitch estimation. We acknowledge the fact that obtaining ground truth annotations at the required temporal and frequency resolution is a particularly daunting task. Therefore, we propose to adopt a self-supervised learning technique, which is able to estimate pitch without any form of supervision. Th…

    Submitted 4 September, 2020; v1 submitted 25 October, 2019; originally announced October 2019.

    Comments: Accepted to IEEE Transactions on Audio, Speech and Language Processing

    Journal ref: in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1118-1128, 2020

  8. arXiv:1905.11796  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Self-supervised audio representation learning for mobile devices

    Authors: Marco Tagliasacchi, Beat Gfeller, Félix de Chaumont Quitry, Dominik Roblek

    Abstract: We explore self-supervised models that can be potentially deployed on mobile devices to learn general purpose audio representations. Specifically, we propose methods that exploit the temporal context in the spectrogram domain. One method estimates the temporal gap between two short audio segments extracted at random from the same audio clip. The other methods are inspired by Word2Vec, a popular te…

    Submitted 24 May, 2019; originally announced May 2019.

  9. arXiv:1905.10240  [pdf, other]

    cs.CV cs.AI cs.LG

    From Here to There: Video Inbetweening Using Direct 3D Convolutions

    Authors: Yunpeng Li, Dominik Roblek, Marco Tagliasacchi

    Abstract: We consider the problem of generating plausible and diverse video sequences, when we are only given a start and an end frame. This task is also known as inbetweening, and it belongs to the broader area of stochastic video generation, which is generally approached by means of recurrent neural networks (RNN). In this paper, we propose instead a fully convolutional model to generate video sequences d…

    Submitted 4 June, 2019; v1 submitted 24 May, 2019; originally announced May 2019.

  10. arXiv:1812.08466  [pdf, other]

    eess.AS cs.SD

    Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms

    Authors: Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, Matthew Sharifi

    Abstract: We propose the Fréchet Audio Distance (FAD), a novel, reference-free evaluation metric for music enhancement algorithms. We demonstrate how typical evaluation metrics for speech enhancement and blind source separation can fail to accurately measure the perceived effect of a wide variety of distortions. As an alternative, we propose adapting the Fréchet Inception Distance (FID) metric used to evalu…

    Submitted 17 January, 2019; v1 submitted 20 December, 2018; originally announced December 2018.

  11. arXiv:1812.07909  [pdf, other]

    stat.ML cs.AI cs.LG

    An Empirical Study of Generative Models with Encoders

    Authors: Paul K. Rubenstein, Yunpeng Li, Dominik Roblek

    Abstract: Generative adversarial networks (GANs) are capable of producing high quality image samples. However, unlike variational autoencoders (VAEs), GANs lack encoders that provide the inverse mapping for the generators, i.e., encode images back to the latent space. In this work, we consider adversarially learned generative models that also have encoders. We evaluate models based on their ability to produ…

    Submitted 19 December, 2018; originally announced December 2018.

  12. arXiv:1811.00006  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Low-Dimensional Bottleneck Features for On-Device Continuous Speech Recognition

    Authors: David B. Ramsay, Kevin Kilgour, Dominik Roblek, Matthew Sharifi

    Abstract: Low power digital signal processors (DSPs) typically have a very limited amount of memory in which to cache data. In this paper we develop efficient bottleneck feature (BNF) extractors that can be run on a DSP, and retrain a baseline large-vocabulary continuous speech recognition (LVCSR) system to use these BNFs with only a minimal loss of accuracy. The small BNFs allow the DSP chip to cache more…

    Submitted 31 October, 2018; originally announced November 2018.

    Comments: Submitted to ICASSP 2019

  13. arXiv:1711.10958  [pdf, other]

    cs.SD cs.AI eess.AS

    Now Playing: Continuous low-power music recognition

    Authors: Blaise Agüera y Arcas, Beat Gfeller, Ruiqi Guo, Kevin Kilgour, Sanjiv Kumar, James Lyon, Julian Odell, Marvin Ritter, Dominik Roblek, Matthew Sharifi, Mihajlo Velimirović

    Abstract: Existing music recognition applications require a connection to a server that performs the actual recognition. In this paper we present a low-power music recognizer that runs entirely on a mobile device and automatically recognizes music without user interaction. To reduce battery consumption, a small music detector runs continuously on the mobile device's DSP chip and wakes up the main applicatio…

    Submitted 29 November, 2017; originally announced November 2017.

    Comments: Authors are listed in alphabetical order by last name