
LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models

Authors: Teerapat Jenrungrot, Michael Chinen, W. Bastiaan Kleijn, Jan Skoglund, Zalán Borsos, Neil Zeghidour, Marco Tagliasacchi

Abstract: We introduce LMCodec, a causal neural speech codec that provides high quality audio at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test was conducted and shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at https://mjenrungrot.github.io/chrome-media-audio-papers/publications/lmcodec.

Submitted 22 March, 2023; originally announced March 2023.

Comments: 5 pages, accepted to ICASSP 2023, project page: https://mjenrungrot.github.io/chrome-media-audio-papers/publications/lmcodec
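
The coarse-to-fine token hierarchy mentioned in the abstract above comes from residual vector quantization. The following is a minimal sketch of that general technique only, not of LMCodec itself: the codebooks are random toy values (an assumption; a real codec learns them), and the names `nearest`, `rvq_encode` and `rvq_decode` are hypothetical. Each stage quantizes the residual left by the previous stage, so the first token is the coarsest description and later tokens refine it.

```python
import random

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    best_i, best_d = 0, float("inf")
    for i, entry in enumerate(codebook):
        d = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if d < best_d:
            best_i, best_d = i, d
    return best_i

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage codes the residual of the previous one."""
    residual = list(vec)
    tokens = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the selected entries; passing fewer tokens yields a coarser reconstruction."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

# Hypothetical toy setup: 3 stages of 8 random entries each, with the entry
# scale halved at every stage so that later stages model finer detail.
rng = random.Random(0)
DIM, STAGES, SIZE = 4, 3, 8
codebooks = [
    [[rng.gauss(0.0, 1.0 / 2 ** s) for _ in range(DIM)] for _ in range(SIZE)]
    for s in range(STAGES)
]
x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
tokens = rvq_encode(x, codebooks)          # one token per stage, coarse to fine
coarse = rvq_decode(tokens[:1], codebooks)  # reconstruction from the coarse token only
full = rvq_decode(tokens, codebooks)        # reconstruction from all tokens
```

In this sketch a receiver that only has `tokens[:1]` can still decode a coarse approximation; according to the abstract, LMCodec goes further and predicts the fine tokens from the coarse ones with a Transformer, which this toy example does not attempt.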

arXiv:2301.09198 [pdf, other]

Estimation of Source and Receiver Positions, Room Geometry and Reflection Coefficients From a Single Room Impulse Response

Authors: Wangyang Yu, W. Bastiaan Kleijn

Abstract: We propose an algorithm to estimate source and receiver positions, room geometry and reflection coefficients from a single room impulse response simultaneously. It is based on a symmetry analysis of the room impulse response. The proposed method utilizes the times of arrival of the direct path, first order reflections and second order reflections. The proposed method is robust to erroneous pulses and non-specular reflections. It can be applied to any room with parallel walls as long as the required arrival times of reflections are available. In contrast to the state-of-the-art method, we do not restrict the location of source and receiver.

Submitted 22 January, 2023; originally announced January 2023.

arXiv:2207.02262 [pdf, other]

Ultra-Low-Bitrate Speech Coding with Pretrained Transformers

Authors: Ali Siahkoohi, Michael Chinen, Tom Denton, W. Bastiaan Kleijn, Jan Skoglund

Abstract: Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network based speech codecs have recently demonstrated significant improvements in quality over traditional approaches. While this new generation of codecs is capable of synthesizing high-fidelity speech, their use of recurrent or convolutional layers often restricts their effective receptive fields, which prevents them from compressing speech efficiently. We propose to further reduce the bitrate of neural speech codecs through the use of pretrained Transformers, capable of exploiting long-range dependencies in the input signal due to their inductive bias. As such, we use a pretrained Transformer in tandem with a convolutional encoder, which is trained end-to-end with a quantizer and a generative adversarial net decoder. Our numerical experiments show that supplementing the convolutional encoder of a neural speech codec with Transformer speech embeddings yields a speech codec with a bitrate of $600\,\mathrm{bps}$ that outperforms the original neural speech codec in synthesized speech quality when trained at the same bitrate. Subjective human evaluations suggest that the quality of the resulting codec is comparable to or better than that of conventional codecs operating at three to four times the rate.

Submitted 5 July, 2022; originally announced July 2022.

Comments: Proceedings of INTERSPEECH 2022

arXiv:2204.02040 [pdf]

On the Relevance of Bandwidth Extension for Speaker Verification

Authors: Marcos Faundez-Zanuy, Mattias Nilsson, W. Bastiaan Kleijn

Abstract: In this paper, we consider the effect of a bandwidth extension of narrow-band speech signals (0.3-3.4 kHz) to 0.3-8 kHz on speaker verification. Using covariance matrix based verification systems together with detection error trade-off curves, we compare the performance between systems operating on narrow-band, wide-band (0-8 kHz), and bandwidth-extended speech. The experiments were conducted using different short-time spectral parameterizations derived from microphone and ISDN speech databases. The studied bandwidth-extension algorithm did not introduce artifacts that affected the speaker verification task, and we achieved improvements between 1 and 10 percent (depending on the model order) over the verification system designed for narrow-band speech when mel-frequency cepstral coefficients for the short-time spectral parameterization were used.

Submitted 5 April, 2022; originally announced April 2022.

Comments: 4 pages published in 7th International Conference on Spoken Language Processing, September 16-20, 2002, Denver, Colorado, USA. arXiv admin note: text overlap with arXiv:2202.13865

Journal ref: 7th International Conference on Spoken Language Processing (ICSLP2002), September 16-20, 2002

arXiv:2202.13865 [pdf]

On the relevance of bandwidth extension for speaker identification

Authors: Marcos Faundez-Zanuy, Mattias Nilsson, W. Bastiaan Kleijn

Abstract: In this paper we discuss the relevance of bandwidth extension for speaker identification tasks. Mainly we want to study if it is possible to recognize voices that have been bandwidth extended. For this purpose, we created two different databases (microphonic and ISDN) of speech signals that were bandwidth extended from telephone bandwidth ([300, 3400] Hz) to full bandwidth ([100, 8000] Hz). We have evaluated different parameterizations, and we have found that the MELCEPST parameterization can take advantage of the bandwidth extension algorithms in several situations.

Submitted 24 February, 2022; originally announced February 2022.

Comments: 4 pages

Journal ref: 2002 11th European Signal Processing Conference, 2002, pp. 1-4

arXiv:2102.11906 [pdf, other]

Handling Background Noise in Neural Speech Generation

Authors: Tom Denton, Alejandro Luebs, Felicia S. C. Lim, Andrew Storus, Hengchin Yeh, W. Bastiaan Kleijn, Jan Skoglund

Abstract: Recent advances in neural-network based generative modeling of speech have shown great potential for speech coding. However, the performance of such models drops when the input is not clean speech, e.g., in the presence of background noise, preventing their use in practical applications. In this paper we examine the reason and discuss methods to overcome this issue. Placing a denoising preprocessing stage when extracting features and target clean speech during training is shown to be the best performing strategy.

Submitted 23 February, 2021; originally announced February 2021.

Comments: 5 pages, 3 figures, presented at the Asilomar Conference on Signals, Systems, and Computers 2020

arXiv:2102.09660 [pdf, other]

Generative Speech Coding with Predictive Variance Regularization

Authors: W. Bastiaan Kleijn, Andrew Storus, Michael Chinen, Tom Denton, Felicia S. C. Lim, Alejandro Luebs, Jan Skoglund, Hengchin Yeh

Abstract: The recent emergence of machine-learning based generative models for speech suggests a significant reduction in bit rate for speech codecs is possible. However, the performance of generative models deteriorates significantly with the distortions present in real-world input signals. We argue that this deterioration is due to the sensitivity of the maximum likelihood criterion to outliers and the ineffectiveness of modeling a sum of independent signals with a single autoregressive model. We introduce predictive-variance regularization to reduce the sensitivity to outliers, resulting in a significant increase in performance. We show that noise reduction to remove unwanted signals can significantly increase performance. We provide extensive subjective performance evaluations that show that our system based on generative modeling provides state-of-the-art coding performance at 3 kb/s for real-world speech signals at reasonable computational complexity.

Submitted 18 February, 2021; originally announced February 2021.

MSC Class: 94

ACM Class: I.m

arXiv:1912.08308 [pdf, other]

Distributed Network Privacy using Error Correcting Codes

Authors: Matt O'Connor, W. Bastiaan Kleijn

Abstract: Most current distributed processing research deals with improving the flexibility and convergence speed of algorithms for networks of finite size with no constraints on information sharing and no concept for expected levels of signal privacy. In this work we investigate the concept of data privacy in unbounded public networks, where linear codes are used to create hard limits on the number of nodes contributing to a distributed task. We accomplish this by wrapping local observations in a linear code and intentionally applying symbol errors prior to transmission. If many nodes join the distributed task, a proportional number of symbol errors are introduced into the code leading to decoding failure if the code's predefined symbol error limit is exceeded.

Submitted 17 December, 2019; originally announced December 2019.

arXiv:1909.04776 [pdf, other]

Generative Speech Enhancement Based on Cloned Networks

Authors: Michael Chinen, W. Bastiaan Kleijn, Felicia S. C. Lim, Jan Skoglund

Abstract: We propose to implement speech enhancement by the regeneration of clean speech from a salient representation extracted from the noisy signal. The network that extracts salient features is trained using a set of weight-sharing clones of the extractor network. The clones receive mel-frequency spectra of different noisy versions of the same speech signal as input. By encouraging the outputs of the clones to be similar for these different input signals, we train a feature extractor network that is robust to noise. At inference, the salient features form the input to a WaveNet network that generates a natural and clean speech signal with the same attributes as the ground-truth clean signal. As the signal becomes noisier, our system produces natural sounding errors that stay on the speech manifold, in place of traditional artifacts found in other systems. Our experiments confirm that our generative enhancement system provides state-of-the-art enhancement performance within the generative class of enhancers according to a MUSHRA-like test. The clone-based system matches or outperforms the other systems at each input signal-to-noise ratio (SNR) range with statistical significance.

Submitted 10 September, 2019; originally announced September 2019.

Comments: Accepted WASPAA 2019

arXiv:1908.07045 [pdf, other]

Salient Speech Representations Based on Cloned Networks

Authors: W. Bastiaan Kleijn, Felicia S. C. Lim, Michael Chinen, Jan Skoglund

Abstract: We define salient features as features that are shared by signals that are defined as being equivalent by a system designer. The definition allows the designer to contribute qualitative information. We aim to find salient features that are useful as conditioning for generative networks. We extract salient features by jointly training a set of clones of an encoder network. Each network clone receives as input a different signal from a set of equivalent signals. The objective function encourages the network clones to map their input into a set of features that is identical across the clones. It additionally encourages feature independence and, optionally, reconstruction of a desired target signal by a decoder. As an application, we train a system that extracts a time-sequence of feature vectors of speech and uses it as a conditioning of a WaveNet generative system, facilitating both coding and enhancement.

Submitted 19 August, 2019; originally announced August 2019.

Comments: Interspeech 2019

arXiv:1904.00869 [pdf, ps, other]

Room Geometry Estimation from Room Impulse Responses using Convolutional Neural Networks

Authors: Wangyang Yu, W. Bastiaan Kleijn

Abstract: We describe a new method to estimate the geometry of a room given room impulse responses. The method utilises convolutional neural networks to estimate the room geometry and uses the mean square error as the loss function. In contrast to existing methods, we do not require the position or distance of sources or receivers in the room. The method can be used with only a single room impulse response between one source and one receiver for room geometry estimation. The proposed estimation method can achieve an average of six centimetre accuracy. In addition, the proposed method is shown to be computationally efficient compared to state-of-the-art methods.

Submitted 15 May, 2019; v1 submitted 1 April, 2019; originally announced April 2019.

arXiv:1807.11320 [pdf, other]

Kernel Density Estimation-Based Markov Models with Hidden State

Authors: Gustav Eje Henter, Arne Leijon, W. Bastiaan Kleijn

Abstract: We consider Markov models of stochastic processes where the next-step conditional distribution is defined by a kernel density estimator (KDE), similar to Markov forecast densities and certain time-series bootstrap schemes. The KDE Markov models (KDE-MMs) we discuss are nonlinear, nonparametric, fully probabilistic representations of stationary processes, based on techniques with strong asymptotic consistency properties. The models generate new data by concatenating points from the training data sequences in a context-sensitive manner, together with some additive driving noise. We present novel EM-type maximum-likelihood algorithms for data-driven bandwidth selection in KDE-MMs. Additionally, we augment the KDE-MMs with a hidden state, yielding a new model class, KDE-HMMs. The added state variable captures non-Markovian long memory and signal structure (e.g., slow oscillations), complementing the short-range dependences described by the Markov process. The resulting joint Markov and hidden-Markov structure is appealing for modelling complex real-world processes such as speech signals. We present guaranteed-ascent EM-update equations for model parameters in the case of Gaussian kernels, as well as relaxed update formulas that greatly accelerate training in practice. Experiments demonstrate increased held-out set probability for KDE-HMMs on several challenging natural and synthetic data series, compared to traditional techniques such as autoregressive models, HMMs, and their combinations.

Submitted 30 July, 2018; originally announced July 2018.

Comments: 14 pages, 6 figures

MSC Class: 62M10; 62G07

ACM Class: G.3
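
The data-generation step described in the abstract above (reusing training points in a context-sensitive manner, plus additive driving noise) can be illustrated with a rough sketch. It makes several assumptions that are not taken from the paper: a first-order scalar model, Gaussian kernels, hand-picked fixed bandwidths `h_ctx` and `h_out`, and a toy sine-wave training series. The function names are hypothetical, and neither the EM-based bandwidth selection nor the hidden-state extension (KDE-HMM) is attempted.

```python
import math
import random

def kde_markov_step(x_prev, train, h_ctx, h_out, rng):
    """Draw x_t given x_prev from a first-order Gaussian-kernel KDE Markov model."""
    # Weight every training transition train[n-1] -> train[n] by how closely
    # its context train[n-1] matches the current context x_prev.
    indices = range(1, len(train))
    weights = [
        math.exp(-0.5 * ((x_prev - train[n - 1]) / h_ctx) ** 2) for n in indices
    ]
    # Choose one transition in proportion to its weight, then add Gaussian
    # driving noise (standard deviation h_out) to the stored successor.
    n = rng.choices(indices, weights=weights, k=1)[0]
    return train[n] + rng.gauss(0.0, h_out)

def kde_markov_sample(train, length, h_ctx, h_out, rng):
    """Generate a new series, starting from the last training point."""
    x = train[-1]
    out = []
    for _ in range(length):
        x = kde_markov_step(x, train, h_ctx, h_out, rng)
        out.append(x)
    return out

# Hypothetical toy training series: a noiseless sine wave.
train = [math.sin(0.3 * t) for t in range(200)]
rng = random.Random(1)
sample = kde_markov_sample(train, 50, h_ctx=0.2, h_out=0.05, rng=rng)
```

Sampling a stored transition and perturbing its successor is equivalent to drawing from the kernel-weighted mixture that defines the next-step conditional density in this simplified setting. With a very small `h_ctx` and a context far from all training data, every weight can underflow to zero, which `random.choices` rejects; a more careful implementation would compute the weights in the log domain.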

arXiv:1803.06718 [pdf, other]

Directional emphasis in ambisonics

Authors: W. Bastiaan Kleijn

Abstract: We describe an ambisonics enhancement method that increases the signal strength in specified directions at low computational cost. The method can be used in a static setup to emphasize the signal arriving from a particular direction or set of directions. It can also be used in an adaptive arrangement where it sharpens directionality and reduces the distortion in timbre associated with low-degree ambisonics representations. The emphasis operator has very low computational complexity and can be applied to time-domain as well as time-frequency ambisonics representations. The operator upscales a low-degree ambisonics representation to a higher degree representation.

Submitted 24 May, 2018; v1 submitted 18 March, 2018; originally announced March 2018.

arXiv:1712.01120 [pdf, other]

Wavenet based low rate speech coding

Authors: W. Bastiaan Kleijn, Felicia S. C. Lim, Alejandro Luebs, Jan Skoglund, Florian Stimberg, Quan Wang, Thomas C. Walters

Abstract: Traditional parametric coding of speech facilitates low rate but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative model and show that approximating the signal waveform incurs a large rate penalty. Our experiments confirm the high performance of the WaveNet based coder and show that the speech produced by the system is able to additionally perform implicit bandwidth extension and does not significantly impair recognition of the original speaker for the human listener, even when that speaker has not been used during the training of the generative model.

Submitted 1 December, 2017; originally announced December 2017.

Comments: 5 pages, 2 figures

Showing 1–14 of 14 results for author: Kleijn, W B