
Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis

Authors: Théodor Lemerle, Nicolas Obin, Axel Roebel

Abstract: Recent advancements in text-to-speech (TTS) powered by language models have showcased remarkable capabilities in achieving naturalness and zero-shot voice cloning. Notably, the decoder-only transformer is the prominent architecture in this domain. However, transformers face challenges stemming from their quadratic complexity in sequence length, impeding training on lengthy sequences and resource-constrained hardware. Moreover, they lack specific inductive bias with regard to the monotonic nature of TTS alignments. In response, we propose to replace transformers with emerging recurrent architectures and introduce specialized cross-attention mechanisms for reducing repeating and skipping issues. Consequently, our architecture can be efficiently trained on long samples and achieve state-of-the-art zero-shot voice cloning against baselines of comparable size. Our implementation and demos are available at https://github.com/theodorblackbird/lina-speech.

Submitted 11 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

Comments: Interspeech

arXiv:2310.18320 [pdf, ps, other]

AI (r)evolution -- where are we heading? Thoughts about the future of music and sound technologies in the era of deep learning

Authors: Giovanni Bindi, Nils Demerlé, Rodrigo Diaz, David Genova, Aliénor Golvet, Ben Hayes, Jiawen Huang, Lele Liu, Vincent Martos, Sarah Nabi, Teresa Pelinski, Lenny Renault, Saurjya Sarkar, Pedro Sarmento, Cyrus Vahidi, Lewis Wolstanholme, Yixiao Zhang, Axel Roebel, Nick Bryan-Kinns, Jean-Louis Giavitto, Mathieu Barthet

Abstract: Artificial Intelligence (AI) technologies such as deep learning are evolving very quickly, bringing many changes to our everyday lives. To explore the future impact and potential of AI in the field of music and sound technologies, a doctoral day was held between Queen Mary University of London (QMUL, UK) and Sciences et Technologies de la Musique et du Son (STMS, France). Prompt questions about current trends in AI and music were generated by academics from QMUL and STMS. Students from the two institutions then debated these questions. This report presents a summary of the student debates on the topics of: Data, Impact, and the Environment; Responsible Innovation and Creative Practice; Creativity and Bias; and From Tools to the Singularity. The students represent the future generation of AI and music researchers. The academics represent the incumbent establishment. The student debates reported here capture visions, dreams, concerns, uncertainties, and contentious issues for the future of AI and music as the establishment is rightfully challenged by the next generation.

Submitted 20 September, 2023; originally announced October 2023.

arXiv:2310.03444 [pdf, other]

VaSAB: The variable size adaptive information bottleneck for disentanglement on speech and singing voice

Authors: Frederik Bous, Axel Roebel

Abstract: The information bottleneck auto-encoder is a tool for disentanglement commonly used for voice transformation. Successful disentanglement relies on the right choice of bottleneck size. Previous bottleneck auto-encoders created the bottleneck by the dimension of the latent space or through vector quantization and had no means to change the bottleneck size of a specific model. As the bottleneck removes information from the disentangled representation, the choice of bottleneck size is a trade-off between disentanglement and synthesis quality. We propose to build the information bottleneck using dropout, which allows us to change the bottleneck through the dropout rate, and investigate adapting the bottleneck size depending on the context. We experimentally explore using the adaptive bottleneck for pitch transformation and demonstrate that the adaptive bottleneck leads to improved disentanglement of the F0 parameter for both speech and singing voice, leading to improved synthesis quality. Using the variable bottleneck size, we were able to achieve disentanglement for singing voice including extremely high pitches and create a universal voice model that works on both speech and singing voice with improved synthesis quality.

Submitted 5 October, 2023; originally announced October 2023.

Comments: Submitted to ICASSP 2024

arXiv:2210.02647 [pdf, other]

Ensemble Kalman Filtering for Glacier Modeling

Authors: Emily Corcoran, Logan Knudsen, Talea Mayo, Hannah Park-Kaufmann, Alexander Robel

Abstract: Working with a two-stage ice sheet model, we explore how statistical data assimilation methods can be used to improve predictions of glacier melt and, relatedly, sea level rise. We find that the EnKF improves model runs initialized using incorrect initial conditions or parameters, providing us with better models of future glacier melt. We explore the number of observations needed to produce an accurate model run. Further, we determine that the deviations from the truth in output that stem from having few data points in the pre-satellite era can be corrected with modern observation data. Finally, using data derived from our improved model, we calculate sea level rise and model storm surges to understand the effect caused by sea level rise.

Submitted 20 May, 2023; v1 submitted 5 October, 2022; originally announced October 2022.

arXiv:2204.04006 [pdf, other]

doi 10.1109/ICASSP49357.2023.10095740

Analysis and transformations of voice level in singing voice

Authors: Frederik Bous, Axel Roebel

Abstract: We introduce a neural auto-encoder that transforms the musical dynamic in recordings of singing voice via changes in voice level. Since most recordings of singing voice are not annotated with voice level, we propose a means to estimate the voice level from the signal's timbre using a neural voice level estimator. We introduce the recording factor that relates the voice level to the recorded signal power as a proportionality constant. This unknown constant depends on the recording conditions and the post-processing and may thus be different for each recording (but is constant across each recording). We provide two approaches to estimate the voice level without knowing the recording factor. The unknown recording factor can either be learned alongside the weights of the voice level estimator, or a special loss function based on the scalar product can be used to only match the contour of the recorded signal's power. The voice level models are used to condition a previously introduced bottleneck auto-encoder that disentangles its input, the mel-spectrogram, from the voice level. We evaluate the voice level models on recordings annotated with musical dynamic and by their ability to provide useful information to the auto-encoder. A perceptive test is carried out that evaluates the perceived change in voice level in transformed recordings and the synthesis quality. The perceptive test confirms that changing the conditional input changes the perceived voice level accordingly, thus suggesting that the proposed voice level models encode information about the true voice level.

Submitted 22 November, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: Submitted to ICASSP 2023

arXiv:2204.00907 [pdf, other]

doi 10.5281/zenodo.6573360

StyleWaveGAN: Style-based synthesis of drum sounds with extensive controls using generative adversarial networks

Authors: Antoine Lavault, Axel Roebel, Matthieu Voiry

Abstract: In this paper we introduce StyleWaveGAN, a style-based drum sound generator that is a variation of StyleGAN, a state-of-the-art image generator. By conditioning StyleWaveGAN on both the type of drum and several audio descriptors, we are able to synthesize waveforms faster than real-time on a GPU directly in CD quality up to a duration of 1.5s while retaining a considerable amount of control over the generation. We also introduce an alternative to the progressive growing of GANs and experiment on the effect of dataset balancing for generative tasks. The experiments are carried out on an augmented subset of a publicly available dataset comprised of different drums and cymbals. We evaluate against two recent drum generators, WaveGAN and NeuroDrum, demonstrating significantly improved generation quality (measured with the Fréchet Audio Distance) and interesting results with perceptual features.

Submitted 2 April, 2022; originally announced April 2022.

Comments: Accepted for publication in Sound and Music Computing 2022

arXiv:2202.05718 [pdf, other]

Audio Defect Detection in Music with Deep Networks

Authors: Daniel Wolff, Rémi Mignot, Axel Roebel

Abstract: With increasing amounts of music being digitally transferred from production to distribution, automatic means of determining media quality are needed. Protection mechanisms in digital audio processing tools have not eliminated the need for production entities located downstream in the distribution chain to assess audio quality and detect defects inserted further upstream. Such analysis often relies on the received audio and scarce meta-data alone. Deliberate use of artefacts such as clicks in popular music, as well as more recent defects stemming from corruption in modern audio encodings, call for data-centric and context-sensitive solutions for detection. We present a convolutional network architecture following an end-to-end encoder-decoder configuration to develop detectors for two exemplary audio defects. A click detector is trained and compared to a traditional signal processing method, with a discussion on context sensitivity. Additional post-processing is used for data augmentation and workflow simulation. The ability of our models to capture variance is explored in a detector for artefacts from decompression of corrupted MP3 compressed audio. For both tasks we describe the synthetic generation of artefacts for controlled detector training and evaluation. We evaluate our detectors on the large open-source Free Music Archive (FMA) and genre-specific datasets.

Submitted 11 February, 2022; originally announced February 2022.

Comments: 6 pages

Journal ref: Proceedings of the 22nd International Society for Music Information Retrieval Conference, Online, 2021

arXiv:2110.03744 [pdf, other]

Voice Reenactment with F0 and timing constraints and adversarial learning of conversions

Authors: Frederik Bous, Laurent Benaroya, Nicolas Obin, Axel Roebel

Abstract: This paper introduces voice reenactment as the task of voice conversion (VC) in which the expressivity of the source speaker is preserved during conversion while the identity of a target speaker is transferred. To do so, an original neural VC architecture is proposed based on sequence-to-sequence voice conversion (S2S-VC) in which the speech prosody of the source speaker is preserved during conversion. First, the S2S-VC architecture is modified so as to synchronize the converted speech with the source speech by means of phonetic duration encoding; second, the decoder is conditioned on the desired sequence of F0 values and an explicit F0 loss is formulated between the F0 of the source speaker and that of the converted speech. In addition, an adversarial learning of conversions is integrated within the S2S-VC architecture so as to exploit the advantages of both reconstruction of original speech and converted speech with manipulated attributes during training, thereby reducing the inconsistency between training and conversion. An experimental evaluation on the VCTK speech database shows that the speech prosody can be efficiently preserved during conversion, and that the proposed adversarial learning consistently improves the conversion and the naturalness of the reenacted speech.

Submitted 31 May, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

Comments: arXiv admin note: text overlap with arXiv:2107.12346

arXiv:2110.03329 [pdf, other]

Towards Universal Neural Vocoding with a Multi-band Excited WaveNet

Authors: Axel Roebel, Frederik Bous

Abstract: This paper introduces the Multi-Band Excited WaveNet, a neural vocoder for speaking and singing voices. It aims to advance the state of the art towards a universal neural vocoder, which is a model that can generate voice signals from arbitrary mel spectrograms extracted from voice signals. Following the success of the DDSP model and the development of the recently proposed excitation vocoders, we propose a vocoder structure consisting of multiple specialized DNNs that are combined with dedicated signal processing components. All components are implemented as differentiable operators and therefore allow joint optimization of the model parameters. To prove the capacity of the model to reproduce high quality voice signals, we evaluate the model on single and multi speaker/singer datasets. We conduct a subjective evaluation demonstrating that the models support a wide range of domain variations (unseen voices, languages, expressivity), achieving perceptive quality that compares with a state-of-the-art universal neural vocoder while using significantly smaller training datasets and significantly fewer parameters. We also demonstrate remaining limits of the universality of neural vocoders, e.g. the creation of saturated singing voices.

Submitted 7 October, 2021; originally announced October 2021.

arXiv:2107.12346 [pdf, other]

Beyond Voice Identity Conversion: Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations

Authors: Laurent Benaroya, Nicolas Obin, Axel Roebel

Abstract: Voice conversion (VC) consists of digitally altering the voice of an individual to manipulate part of its content, primarily its identity, while maintaining the rest unchanged. Research in neural VC has accomplished considerable breakthroughs with the capacity to falsify a voice identity using a small amount of data with a highly realistic rendering. This paper goes beyond voice identity and presents a neural architecture that allows the manipulation of voice attributes (e.g., gender and age). Leveraging the latest advances on adversarial learning of structured speech representation, a novel structured neural network is proposed in which multiple auto-encoders are used to encode speech as a set of ideally independent linguistic and extra-linguistic representations, which are learned adversarially and can be manipulated during VC. Moreover, the proposed architecture is time-synchronized so that the original voice timing is preserved during conversion, which allows lip-sync applications. Applied to voice gender conversion on the real-world VCTK dataset, our proposed architecture can successfully learn a gender-independent representation and convert the voice gender with very high efficiency and naturalness.

Submitted 27 July, 2021; v1 submitted 26 July, 2021; originally announced July 2021.

arXiv:2104.07288 [pdf, other]

Speaker Attentive Speech Emotion Recognition

Authors: Clément Le Moine, Nicolas Obin, Axel Roebel

Abstract: The Speech Emotion Recognition (SER) task has seen significant improvements over recent years with the advent of Deep Neural Networks (DNNs). However, even the most successful methods still tend to fail when adaptation to specific speakers and scenarios is needed, inevitably leading to poorer performance when compared to humans. In this paper, we present novel work based on the idea of teaching the emotion recognition network about speaker identity. Our system is a combination of two ACRNN classifiers respectively dedicated to speaker and emotion recognition. The former informs the latter through a Self Speaker Attention (SSA) mechanism that is shown to considerably help to focus on emotional information of the speech signal. Experiments on the Att-HACK social attitudes database and the IEMOCAP corpus demonstrate the effectiveness of the proposed method and achieve state-of-the-art performance in terms of unweighted average recall.

Submitted 15 April, 2021; originally announced April 2021.

arXiv:2104.07283 [pdf, other]

Towards end-to-end F0 voice conversion based on Dual-GAN with convolutional wavelet kernels

Authors: Clément Le Moine Veillon, Nicolas Obin, Axel Roebel

Abstract: This paper presents an end-to-end framework for the F0 transformation in the context of expressive voice conversion. A single neural network is proposed, in which a first module is used to learn F0 representation over different temporal scales and a second adversarial module is used to learn the transformation from one emotion to another. The first module is composed of a convolution layer with wavelet kernels so that the various temporal scales of F0 variations can be efficiently encoded. The single decomposition/transformation network allows learning, in an end-to-end manner, the F0 decompositions that are optimal with respect to the transformation, directly from the raw F0 signal.

Submitted 15 April, 2021; originally announced April 2021.

arXiv:2006.08723 [pdf]

Threats and Countermeasures of Cyber Security in Direct and Remote Vehicle Communication Systems

Authors: Subrato Bharati, Prajoy Podder, M. Rubaiyat Hossain Mondal, Md. Robiul Alam Robel

Abstract: Traffic management, road safety, and environmental impact are important issues in the modern world. These challenges are addressed by the application of sensing, control and communication methods of intelligent transportation systems (ITS). A part of ITS is a vehicular ad-hoc network (VANET), which means a wireless network of vehicles. However, communication among vehicles in a VANET exposes several security threats which need to be studied and addressed. In this review, firstly, the basic flow of VANET is illustrated focusing on its communication methods, architecture, characteristics, standards, and security facilities. Next, the attacks and threats for VANET are discussed. Moreover, the authentication systems are described by which vehicular networks can be protected from fake messages and malicious nodes. Security threats and countermeasures are discussed for different remote vehicle communication methods, namely remote keyless entry system, dedicated short range communication, cellular scheme, Zigbee, Bluetooth, radio frequency identification, WiFi, and WiMAX, and for different direct vehicle communication methods, namely on-board diagnosis and universal serial bus.

Submitted 11 June, 2020; originally announced June 2020.

Comments: 12 pages, 7 figures

Journal ref: Journal of Information Assurance and Security (ISSN 1554-1010), Volume 15 (2020), pp. 153-164, MIR Labs, www.mirlabs.net/jias/index.html

arXiv:2003.01220 [pdf, ps, other]

Semi-supervised learning of glottal pulse positions in a neural analysis-synthesis framework

Authors: Frederik Bous, Luc Ardaillon, Axel Roebel

Abstract: This article investigates recently emerging approaches that use deep neural networks for the estimation of glottal closure instants (GCI). We build upon our previous approach that used synthetic speech exclusively to create perfectly annotated training data and that had been shown to compare favourably with other training approaches using electroglottograph (EGG) signals. Here we introduce a semi-supervised training strategy that allows refining the estimator by means of an analysis-synthesis setup using real speech signals, for which GCI ground truth does not exist. Evaluation of the analyser is performed by comparing the GCI extracted from the glottal flow signal generated by the analyser with the GCI extracted from EGG on the CMU arctic dataset, where EGG signals were recorded in addition to speech. We observe that (1) the artificial increase of the diversity of pulse shapes that has been used in our previous construction of the synthetic database is beneficial, (2) training the GCI network in the analysis-synthesis setup allows achieving a very significant improvement of the GCI analyser, and (3) additional regularisation strategies allow improving the final analysis network when trained in the analysis-synthesis setup.

Submitted 2 March, 2020; originally announced March 2020.

arXiv:1910.12614 [pdf, other]

CycleGAN Voice Conversion of Spectral Envelopes using Adversarial Weights

Authors: Rafael Ferro, Nicolas Obin, Axel Roebel

Abstract: This paper tackles GAN optimization and stability issues in the context of voice conversion. First, to simplify the conversion task, we propose to use spectral envelopes as inputs. Second, we propose two adversarial weight training paradigms, the generalized weighted GAN and the generator impact GAN, both of which aim at reducing the impact of the generator on the discriminator, so both can learn more gradually and efficiently during training. Applying an energy constraint to the cycleGAN paradigm considerably improved conversion quality. A subjective experiment conducted on a voice conversion task on the voice conversion challenge 2018 dataset shows first that, despite a significantly reduced network complexity, the proposed method achieves state-of-the-art results, and second that the proposed weighted GAN methods outperform a previously proposed one.

Submitted 11 July, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

Comments: 5 pages, 1 figure

arXiv:1910.10235 [pdf, other]

GCI detection from raw speech using a fully-convolutional network

Authors: Luc Ardaillon, Axel Roebel

Abstract: Glottal Closure Instants (GCI) detection consists in automatically detecting the temporal locations of the most significant excitation of the vocal tract from the speech signal. It is used in many speech analysis and processing applications, and various algorithms have been proposed for this purpose. Recently, new approaches using convolutional neural networks have emerged, with encouraging results. Following this trend, we propose a simple approach that performs a mapping from the speech waveform to a target signal from which the GCIs are obtained by peak-picking. However, the ground truth GCIs used for training and evaluation are usually extracted from EGG signals, which are not perfectly reliable and often not available. To overcome this problem, we propose to train our network on high-quality synthetic speech with perfect ground truth. The performance of the proposed algorithm is compared with three other state-of-the-art approaches using publicly available datasets, and the impact of using controlled synthetic or real speech signals in the training stage is investigated. The experimental results demonstrate that the proposed method obtains similar or better results than other state-of-the-art algorithms and that using large synthetic datasets with many speakers offers a better generalization ability than using a smaller database of real speech and EGG signals.

Submitted 20 February, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

Comments: Minor corrections after reviews of ICASSP 2020 (accepted paper). (Corrected typos, added funding acknowledgments, added some references, cleaned bibliography, added a few details)

arXiv:1910.09497 [pdf, other]

Sound texture synthesis using RI spectrograms

Authors: Hugo Caracalla, Axel Roebel

Abstract: This article introduces a new parametric synthesis method for sound textures based on existing works in visual and sound texture synthesis. Starting from a base sound signal, an optimization process is performed until the cross-correlations between the feature-maps of several untrained 2D Convolutional Neural Networks (CNN) resemble those of an original sound texture. We use compressed RI spectrograms as input to the CNN: this time-frequency representation is the stacking of the real and imaginary part of the Short Time Fourier Transform (STFT) and thus implicitly contains both the magnitude and phase information, allowing for convincing syntheses of various audio events. The optimization is however performed directly on the time signal to avoid any STFT consistency issue. The results of an online perceptual evaluation are also detailed, and show that this method achieves results that are more realistic-sounding than existing parametric methods on a wide array of textures.

Submitted 21 October, 2019; originally announced October 2019.

Comments: submitted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020)

arXiv:1905.03637 [pdf, ps, other]

Sound texture synthesis using convolutional neural networks

Authors: Hugo Caracalla, Axel Roebel

Abstract: The following article introduces a new parametric synthesis algorithm for sound textures inspired by existing methods used for visual textures. Using a 2D Convolutional Neural Network (CNN), a sound signal is modified until the temporal cross-correlations of the feature maps of its log-spectrogram resemble those of a target texture. We show that the resulting synthesized sound signal is both different from the original and of high quality, while being able to reproduce singular events appearing in the original. This process is performed in the time domain, discarding the harmful phase recovery step which usually concludes synthesis performed in the time-frequency domain. It is also straightforward and flexible, as it does not require any fine tuning between several losses when synthesizing diverse sound textures. A way of extending the synthesis in order to produce a sound of any length is also presented, after which synthesized spectrograms and sound signals are showcased. We also discuss the choice of CNN, border effects in our synthesized signals, and possible ways of modifying the algorithm in order to improve its current long computation time.

Submitted 9 May, 2019; originally announced May 2019.

Comments: submitted to Digital Audio Conference (DAFx 2019)

arXiv:1903.01416 [pdf, other]

Data Augmentation for Drum Transcription with Convolutional Neural Networks

Authors: Celine Jacques, Axel Roebel

Abstract: A recurrent issue in deep learning is the scarcity of data, in particular precisely annotated data. Few publicly available databases are correctly annotated and generating correct labels is very time consuming. The present article investigates data augmentation strategies for neural network training, particularly for tasks related to drum transcription. These tasks need very precise annotations. This article investigates state-of-the-art sound transformation algorithms for remixing noise and sinusoidal parts, remixing attacks, and transposing with and without time compensation, and compares them to basic regularization methods such as dropout and additive Gaussian noise. It also shows how a drum transcription algorithm based on CNN benefits from the proposed data augmentation strategy.

Submitted 4 March, 2019; originally announced March 2019.

Journal ref: Published in Proceedings of the 27th European Signal Processing Conference (EUSIPCO), 2019

arXiv:1903.01415 [pdf, other]

Improving singing voice separation using Deep U-Net and Wave-U-Net with data augmentation

Authors: Alice Cohen-Hadria, Axel Roebel, Geoffroy Peeters

Abstract: State-of-the-art singing voice separation is based on deep learning making use of CNN structures with skip connections (like U-net model, Wave-U-Net model, or MSDENSELSTM). A key to the success of these models is the availability of a large amount of training data. In the following study, we are interested in singing voice separation for mono signals and compare the U-Net and the Wave-U-Net, which are structurally similar but work on different input representations. First, we report a few results on variations of the U-Net model. Second, we discuss the potential of state-of-the-art speech and music transformation algorithms for augmentation of existing data sets and demonstrate that the effect of these augmentations depends on the signal representations used by the model. The results demonstrate a considerable improvement due to the augmentation for both models. However, pitch transposition is the most effective augmentation strategy for the U-Net model, while transposition, time stretching, and formant shifting have a much more balanced effect on the Wave-U-Net model. Finally, we compare the two models on the same dataset.

Submitted 4 March, 2019; originally announced March 2019.

Journal ref: Published in Proceedings of the 27th European Signal Processing Conference (EUSIPCO), 2019

arXiv:1903.01161 [pdf, ps, other]

Analysing Deep Learning-Spectral Envelope Prediction Methods for Singing Synthesis

Authors: Frederik Bous, Axel Roebel

Abstract: We conduct an investigation on various hyper-parameters regarding neural networks used to generate spectral envelopes for singing synthesis. Two perceptive tests, where the first compares two models directly and the other ranks models with a mean opinion score, are performed. With these tests we show that when learning to predict spectral envelopes, 2d-convolutions are superior to previously proposed 1d-convolutions and that predicting multiple frames in an iterated fashion during training is superior to injecting noise to the input data. An experimental investigation into whether learning to predict a probability distribution vs. single samples is preferable was performed but turned out to be inconclusive. A network architecture is proposed that incorporates the improvements which we found to be useful and we show in our experiments that this network produces better results than other state-of-the-art methods.

Submitted 4 March, 2019; originally announced March 2019.

Journal ref: Published in Proceedings of the 27th European Signal Processing Conference (EUSIPCO), 2019

arXiv:1502.00141 [pdf, other]

An evaluation framework for event detection using a morphological model of acoustic scenes

Authors: Mathieu Lagrange, Grégoire Lafay, Mathias Rossignol, Emmanouil Benetos, Axel Roebel

Abstract: This paper introduces a model of environmental acoustic scenes which adopts a morphological approach by abstracting temporal structures of acoustic scenes. To demonstrate its potential, this model is employed to evaluate the performance of a large set of acoustic event detection systems. This model allows us to explicitly control key morphological aspects of the acoustic scene and isolate their impact on the performance of the system under evaluation. Thus, more information can be gained on the behavior of evaluated systems, providing guidance for further improvements. The proposed model is validated using submitted systems from the IEEE DCASE Challenge; results indicate that the proposed scheme is able to successfully build datasets useful for evaluating some aspects of the performance of event detection systems, more particularly their robustness to new listening conditions and the increasing level of background sounds.

Submitted 31 January, 2015; originally announced February 2015.

arXiv:1109.6651 [pdf, other]

Sound Analysis and Synthesis Adaptive in Time and Two Frequency Bands

Authors: Marco Liuni, Peter Balazs, Axel Röbel

Abstract: We present an algorithm for sound analysis and resynthesis with local automatic adaptation of time-frequency resolution. There exist several algorithms that adapt the analysis window depending on its time or frequency location; in what follows we propose a method which selects the optimal resolution depending on both time and frequency. We consider an approach that we denote as analysis-weighting, from the point of view of Gabor frame theory. We analyze in particular the case of different adaptive time-varying resolutions within two complementary frequency bands; this is a typical case where perfect signal reconstruction cannot in general be achieved with fast algorithms, causing a certain error to be minimized. We provide examples of adaptive analyses of a music sound, and outline several possibilities that this work opens.

Submitted 29 September, 2011; originally announced September 2011.

Journal ref: Proc. of the 14th Int. Conference on Digital Audio Effects (DAFx-11), Paris, France, September 19-23, 2011

arXiv:1109.6314 [pdf, other]

An Entropy Based Method for Local Time-Adaptation of the Spectrogram

Authors: M. Liuni, A. Röbel, M. Romito, X. Rodet

Abstract: We propose a method for automatic local time-adaptation of the spectrogram of audio signals: it is based on the decomposition of a signal within a Gabor multi-frame through the STFT operator. The sparsity of the analysis in every individual frame of the multi-frame is evaluated through the Rényi entropy measures: the best local resolution is determined minimizing the entropy values. The overall spectrogram of the signal we obtain thus provides local optimal resolution adaptively evolving over time. We give examples of the performance of our algorithm with an instrumental sound and a synthetic one, showing the improvement in spectrogram displaying obtained with an automatic adaptation of the resolution. The analysis operator is invertible, thus leading to a perfect reconstruction of the original signal through the analysis coefficients.

Submitted 27 September, 2011; originally announced September 2011.

Journal ref: CMMR 2010, LNCS 6684, pp. 60-75, 2011

arXiv:1109.6313 [pdf, other]

A Reduced Multiple Gabor Frame for Local Time Adaptation of the Spectrogram

Authors: M. Liuni, A. Röbel, M. Romito, X. Rodet

Abstract: In this paper we propose a method for automatic local time adaptation of the spectrogram of an audio signal, based on its decomposition within a Gabor multi-frame. The sparsity of the analyses within each individual frame is evaluated through the Rényi entropy measures. According to the sparsity of the decompositions, an optimal resolution and a reduced multi-frame are determined, defining an adapted spectrogram with variable resolution and hop size. The composition of such a reduced multi-frame allows an immediate definition of a dual frame: re-synthesis techniques for this adapted analysis are easily derived by the traditional phase vocoder scheme.

Submitted 27 September, 2011; originally announced September 2011.

Journal ref: Proc. of the 13th Int. Conference on Digital Audio Effects (DAFx-10), Graz, Austria , September 6-10, 2010

arXiv:1109.5876 [pdf, other]

Rényi Information Measures for Spectral Change Detection

Authors: Marco Liuni, Axel Röbel, Marco Romito, Xavier Rodet

Abstract: Change detection within an audio stream is an important task in several domains, such as classification and segmentation of a sound or of a music piece, as well as indexing of broadcast news or surveillance applications. In this paper we propose two novel methods for spectral change detection without any assumption about the input sound: they are both based on the evaluation of information measures applied to a time-frequency representation of the signal, and in particular to the spectrogram. The class of measures we consider, the Rényi entropies, are obtained by extending the Shannon entropy definition: a biasing of the spectrogram coefficients is realized through the dependence of such measures on a parameter, which allows refined results compared to those obtained with standard divergences. These methods provide a low computational cost and are well-suited as a support for higher level analysis, segmentation and classification algorithms.

Submitted 27 September, 2011; originally announced September 2011.

Comments: 2011 IEEE Conference on Acoustics, Speech and Signal Processing

Showing 1–26 of 26 results for author: Roebel, A