
Showing 1–23 of 23 results for author: Richard, G

Searching in archive eess.
  1. arXiv:2406.11540 [pdf, other]

    eess.SP

    Model-Based Deep Learning for Music Information Research

    Authors: Gael Richard, Vincent Lostanlen, Yi-Hsuan Yang, Meinard Müller

    Abstract: In this article, we investigate the notion of model-based deep learning in the realm of music information research (MIR). Loosely speaking, we refer to the term model-based deep learning for approaches that combine traditional knowledge-based methods with data-driven techniques, especially those based on deep learning, within a differentiable computing framework. In music, prior knowledge for ins…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: IEEE Signal Processing Magazine, In press

  2. arXiv:2406.04706 [pdf, other]

    cs.LG cs.NE eess.SP math.PR stat.ML

    Winner-takes-all learners are geometry-aware conditional density estimators

    Authors: Victor Letzelter, David Perera, Cédric Rommel, Mathieu Fontaine, Slim Essid, Gael Richard, Patrick Pérez

    Abstract: Winner-takes-all training is a simple learning paradigm, which handles ambiguous tasks by predicting a set of plausible hypotheses. Recently, a connection was established between Winner-takes-all training and centroidal Voronoi tessellations, showing that, once trained, hypotheses should quantize optimally the shape of the conditional distribution to predict. However, the best use of these hypothe…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: International Conference on Machine Learning, Jul 2024, Vienna, Austria

  3. arXiv:2402.15516 [pdf, other]

    cs.SD cs.LG eess.AS eess.SP

    GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model

    Authors: Haocheng Liu, Teysir Baoueb, Mathieu Fontaine, Jonathan Le Roux, Gael Richard

    Abstract: Diffusion models are receiving a growing interest for a variety of signal generation tasks such as speech or music synthesis. WaveGrad, for example, is a successful diffusion model that conditionally uses the mel spectrogram to guide a diffusion process for the generation of high-fidelity audio. However, such models face important challenges concerning the noise diffusion process for training and…

    Submitted 9 February, 2024; originally announced February 2024.

    Comments: Accepted at ICASSP 2024

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing, Apr 2024, Seoul (Korea), South Korea

  4. arXiv:2402.13301 [pdf, other]

    cs.SD cs.AI eess.AS

    Structure-informed Positional Encoding for Music Generation

    Authors: Manvi Agarwal, Changhong Wang, Gaël Richard

    Abstract: Music generated by deep learning methods often suffers from a lack of coherence and long-term organization. Yet, multi-scale hierarchical structure is a distinctive feature of music signals. To leverage this information, we propose a structure-informed positional encoding framework for music generation with Transformers. We design three variants in terms of absolute, relative and non-stationary po…

    Submitted 28 February, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024, Seoul, South Korea

  5. arXiv:2402.01753 [pdf, other]

    cs.SD cs.LG eess.AS eess.SP

    SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis

    Authors: Teysir Baoueb, Haocheng Liu, Mathieu Fontaine, Jonathan Le Roux, Gael Richard

    Abstract: Generative adversarial network (GAN) models can synthesize high-quality audio signals while ensuring fast sample generation. However, they are difficult to train and are prone to several issues including mode collapse and divergence. In this paper, we introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN, which was initially devised for speech synthesis from mel spectrogram. In our model, the…

    Submitted 30 January, 2024; originally announced February 2024.

    Comments: Accepted at ICASSP 2024

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing, Apr 2024, Seoul (Korea), South Korea

  6. arXiv:2401.16837 [pdf, other]

    eess.SP

    A fully differentiable model for unsupervised singing voice separation

    Authors: Gael Richard, Pierre Chouteau, Bernardo Torres

    Abstract: A novel model was recently proposed by Schulze-Forster et al. in [1] for unsupervised music source separation. This model allows to tackle some of the major shortcomings of existing source separation frameworks. Specifically, it eliminates the need for isolated sources during training, performs efficiently with limited data, and can handle homogeneous sources (such as singing voice). But, this mod…

    Submitted 30 January, 2024; originally announced January 2024.

    Journal ref: IEEE International Conference on Acoustics, Speech, and Signal Processing, Apr 2024, Seoul, South Korea

  7. arXiv:2401.05064 [pdf, other]

    cs.SD cs.LG eess.AS

    Singer Identity Representation Learning using Self-Supervised Techniques

    Authors: Bernardo Torres, Stefan Lattner, Gaël Richard

    Abstract: Significant strides have been made in creating voice identity representations using speech data. However, the same level of progress has not been achieved for singing voices. To bridge this gap, we suggest a framework for training singer identity encoders to extract representations suitable for various singing-related tasks, such as singing voice similarity and synthesis. We explore different self…

    Submitted 10 January, 2024; originally announced January 2024.

    Comments: Accepted at the ISMIR conference, Milan, Italy, 2023

    Journal ref: Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR 2023), Milan, Italy

  8. arXiv:2312.14507 [pdf, other]

    cs.SD cs.LG eess.AS eess.SP

    Unsupervised Harmonic Parameter Estimation Using Differentiable DSP and Spectral Optimal Transport

    Authors: Bernardo Torres, Geoffroy Peeters, Gaël Richard

    Abstract: In neural audio signal processing, pitch conditioning has been used to enhance the performance of synthesizers. However, jointly training pitch estimators and synthesizers is a challenge when using standard audio-to-audio reconstruction loss, leading to reliance on external pitch trackers. To address this issue, we propose using a spectral loss function inspired by optimal transportation theory th…

    Submitted 15 January, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

    Comments: Accepted in ICASSP 2024

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing, Apr 2024, Seoul, South Korea

  9. arXiv:2307.10834 [pdf, other]

    eess.AS cs.SD

    Transfer Learning and Bias Correction with Pre-trained Audio Embeddings

    Authors: Changhong Wang, Gaël Richard, Brian McFee

    Abstract: Deep neural network models have become the dominant approach to a large variety of tasks within music information retrieval (MIR). These models generally require large amounts of (annotated) training data to achieve high accuracy. Because not all applications in MIR have sufficient quantities of training data, it is becoming increasingly common to transfer models across domains. This approach allo…

    Submitted 20 July, 2023; originally announced July 2023.

    Comments: 7 pages, 3 figures, accepted to the conference of the International Society for Music Information Retrieval (ISMIR 2023)

  10. arXiv:2306.07187 [pdf, other]

    cs.MM cs.IR cs.LG cs.SD eess.AS

    Video-to-Music Recommendation using Temporal Alignment of Segments

    Authors: Laure Prétet, Gaël Richard, Clément Souchier, Geoffroy Peeters

    Abstract: We study cross-modal recommendation of music tracks to be used as soundtracks for videos. This problem is known as the music supervision task. We build on a self-supervised system that learns a content association between music and video. In addition to the adequacy of content, adequacy of structure is crucial in music supervision to obtain relevant recommendations. We propose a novel approach to…

    Submitted 12 June, 2023; originally announced June 2023.

    Journal ref: IEEE Transactions on Multimedia, 18 February 2022

  11. arXiv:2305.07132 [pdf, other]

    cs.SD cs.LG eess.AS

    Tackling Interpretability in Audio Classification Networks with Non-negative Matrix Factorization

    Authors: Jayneel Parekh, Sanjeel Parekh, Pavlo Mozharovskyi, Gaël Richard, Florence d'Alché-Buc

    Abstract: This paper tackles two major problem settings for interpretability of audio processing networks, post-hoc and by-design interpretation. For post-hoc interpretation, we aim to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user. This is extended to present an inherently interpretable model with high performance. To this end, we propose a n…

    Submitted 11 May, 2023; originally announced May 2023.

    Comments: Under submission at IEEE/ACM TASLP. arXiv admin note: text overlap with arXiv:2202.11479

  12. arXiv:2211.07250 [pdf, other]

    cs.SD cs.LG eess.AS

    Exploiting Device and Audio Data to Tag Music with User-Aware Listening Contexts

    Authors: Karim M. Ibrahim, Elena V. Epure, Geoffroy Peeters, Gaël Richard

    Abstract: As music has become more available especially on music streaming platforms, people have started to have distinct preferences to fit to their varying listening situations, also known as context. Hence, there has been a growing interest in considering the user's situation when recommending music to users. Previous works have proposed user-aware autotaggers to infer situation-related tags from music…

    Submitted 14 November, 2022; originally announced November 2022.

    Comments: Published in ISMIR

  13. arXiv:2202.11479 [pdf, other]

    cs.SD cs.LG eess.AS

    Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF

    Authors: Jayneel Parekh, Sanjeel Parekh, Pavlo Mozharovskyi, Florence d'Alché-Buc, Gaël Richard

    Abstract: This paper tackles post-hoc interpretability for audio processing networks. Our goal is to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user. To this end, we propose a novel interpreter design that incorporates non-negative matrix factorization (NMF). In particular, a carefully regularized interpreter module is trained to take hidden la…

    Submitted 24 October, 2022; v1 submitted 23 February, 2022; originally announced February 2022.

    Comments: Accepted at NeurIPS 2022

  14. arXiv:2201.09592 [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Unsupervised Music Source Separation Using Differentiable Parametric Source Models

    Authors: Kilian Schulze-Forster, Gaël Richard, Liam Kelley, Clement S. J. Doire, Roland Badeau

    Abstract: Supervised deep learning approaches to underdetermined audio source separation achieve state-of-the-art performance but require a dataset of mixtures along with their corresponding isolated source signals. Such datasets can be extremely costly to obtain for musical mixtures. This raises a need for unsupervised methods. We propose a novel unsupervised model-based deep learning approach to musical s…

    Submitted 31 January, 2023; v1 submitted 24 January, 2022; originally announced January 2022.

    Comments: Revised version of the submission

  15. arXiv:2108.01216 [pdf, other]

    cs.SD eess.AS

    DarkGAN: Exploiting Knowledge Distillation for Comprehensible Audio Synthesis with GANs

    Authors: Javier Nistal, Stefan Lattner, Gaël Richard

    Abstract: Generative Adversarial Networks (GANs) have achieved excellent audio synthesis quality in the last years. However, making them operable with semantically meaningful controls remains an open challenge. An obvious approach is to control the GAN by conditioning it on metadata contained in audio datasets. Unfortunately, audio datasets often lack the desired annotations, especially in the musical domai…

    Submitted 2 August, 2021; originally announced August 2021.

    Comments: 9 pages, 3 figures, 2 tables, accepted to ISMIR2021

    Journal ref: 22nd International Society for Music Information Retrieval (ISMIR 2021)

  16. arXiv:2105.08399 [pdf, other]

    cs.LG cs.CL cs.SD eess.AS stat.ML

    Relative Positional Encoding for Transformers with Linear Complexity

    Authors: Antoine Liutkus, Ondřej Cífka, Shih-Lun Wu, Umut Şimşekli, Yi-Hsuan Yang, Gaël Richard

    Abstract: Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers and consists in exploiting lags instead of absolute positions for inference. Still, RPE is not available for the recent linear-variants of the Transformer, because it requir…

    Submitted 10 June, 2021; v1 submitted 18 May, 2021; originally announced May 2021.

    Comments: ICML 2021 (long talk) camera-ready. 24 pages

  17. arXiv:2105.01531 [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    VQCPC-GAN: Variable-Length Adversarial Audio Synthesis Using Vector-Quantized Contrastive Predictive Coding

    Authors: Javier Nistal, Cyran Aouameur, Stefan Lattner, Gaël Richard

    Abstract: Influenced by the field of Computer Vision, Generative Adversarial Networks (GANs) are often adopted for the audio domain using fixed-size two-dimensional spectrogram representations as the "image data". However, in the (musical) audio domain, it is often desired to generate output of variable duration. This paper presents VQCPC-GAN, an adversarial framework for synthesizing variable-length audio…

    Submitted 30 July, 2021; v1 submitted 4 May, 2021; originally announced May 2021.

    Comments: 5 pages, 1 figure, 1 table; accepted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

    Journal ref: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021

  18. arXiv:2102.05749 [pdf, ps, other]

    cs.SD cs.LG eess.AS stat.ML

    Self-Supervised VQ-VAE for One-Shot Music Style Transfer

    Authors: Ondřej Cífka, Alexey Ozerov, Umut Şimşekli, Gaël Richard

    Abstract: Neural style transfer, allowing to apply the artistic style of one image to another, has become one of the most widely showcased computer vision applications shortly after its introduction. In contrast, related tasks in the music audio domain remained, until recently, largely untackled. While several style conversion methods tailored to musical signals have been proposed, most lack the 'one-shot'…

    Submitted 10 June, 2021; v1 submitted 10 February, 2021; originally announced February 2021.

    Comments: ICASSP 2021. Website: https://adasp.telecom-paris.fr/s/ss-vq-vae

    Journal ref: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (2021) 96-100

  19. arXiv:2008.12073 [pdf, other]

    eess.AS cs.SD

    DrumGAN: Synthesis of Drum Sounds With Timbral Feature Conditioning Using Generative Adversarial Networks

    Authors: J. Nistal, S. Lattner, G. Richard

    Abstract: Synthetic creation of drum sounds (e.g., in drum machines) is commonly performed using analog or digital synthesis, allowing a musician to sculpt the desired timbre modifying various parameters. Typically, such parameters control low-level features of the sound and often have no musical meaning or perceptual correspondence. With the rise of Deep Learning, data-driven processing of audio emerges as…

    Submitted 28 June, 2022; v1 submitted 27 August, 2020; originally announced August 2020.

    Comments: 8 pages, 1 figure, 3 tables, accepted in Proc. of the 21st International Society for Music Information Retrieval (ISMIR2020)

  20. arXiv:2006.09266 [pdf, other]

    eess.AS cs.SD

    Comparing Representations for Audio Synthesis Using Generative Adversarial Networks

    Authors: Javier Nistal, Stefan Lattner, Gaël Richard

    Abstract: In this paper, we compare different audio signal representations, including the raw audio waveform and a variety of time-frequency representations, for the task of audio synthesis with Generative Adversarial Networks (GANs). We conduct the experiments on a subset of the NSynth dataset. The architecture follows the benchmark Progressive Growing Wasserstein GAN. We perform experiments both in a full…

    Submitted 17 June, 2020; v1 submitted 16 June, 2020; originally announced June 2020.

    Comments: 5 pages, 1 figure, 5 tables, to be published in European Signal Processing Conference (EUSIPCO)

  21. arXiv:2005.12977 [pdf, other]

    cs.IR cs.CV cs.SD eess.AS

    Learning to rank music tracks using triplet loss

    Authors: Laure Prétet, Gaël Richard, Geoffroy Peeters

    Abstract: Most music streaming services rely on automatic recommendation algorithms to exploit their large music catalogs. These algorithms aim at retrieving a ranked list of music tracks based on their similarity with a target music track. In this work, we propose a method for direct recommendation based on the audio content without explicitly tagging the music tracks. To that aim, we propose several strat…

    Submitted 18 May, 2020; originally announced May 2020.

  22. arXiv:1907.02265 [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Supervised Symbolic Music Style Translation Using Synthetic Data

    Authors: Ondřej Cífka, Umut Şimşekli, Gaël Richard

    Abstract: Research on style transfer and domain translation has clearly demonstrated the ability of deep learning-based algorithms to manipulate images in terms of artistic style. More recently, several attempts have been made to extend such approaches to music (both symbolic and audio) in order to enable transforming musical style in a similar manner. In this study, we focus on symbolic music with the goal…

    Submitted 4 July, 2019; originally announced July 2019.

    Comments: ISMIR 2019 camera-ready

    Journal ref: Proceedings of the 20th International Society for Music Information Retrieval Conference (2019) 588-595

  23. arXiv:1804.07345 [pdf, other]

    cs.CV cs.SD eess.AS

    Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events

    Authors: Sanjeel Parekh, Slim Essid, Alexey Ozerov, Ngoc Q. K. Duong, Patrick Pérez, Gaël Richard

    Abstract: Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. We show that the learnt representations are useful for classifying events and localizing their characteristic audio-visual elements. The system is traine…

    Submitted 9 July, 2018; v1 submitted 19 April, 2018; originally announced April 2018.