Search | arXiv e-print repository

SimulTron: On-Device Simultaneous Speech to Speech Translation

Authors: Alex Agranovich, Eliya Nachmani, Oleg Rybakov, Yifan Ding, Ye Jia, Nadav Bar, Heiga Zen, Michelle Tadmor Ramanovich

Abstract: Simultaneous speech-to-speech translation (S2ST) holds the promise of breaking down communication barriers and enabling fluid conversations across languages. However, achieving accurate, real-time translation through mobile devices remains a major challenge. We introduce SimulTron, a novel S2ST architecture designed to tackle this task. SimulTron is a lightweight direct S2ST model that uses the strengths of the Translatotron framework while incorporating key modifications for streaming operation, and an adjustable fixed delay. Our experiments show that SimulTron surpasses Translatotron 2 in offline evaluations. Furthermore, real-time evaluations reveal that SimulTron improves upon the performance achieved by Translatotron 1. Additionally, SimulTron achieves superior BLEU scores and latency compared to the previous real-time S2ST method on the MuST-C dataset. Significantly, we have successfully deployed SimulTron on a Pixel 7 Pro device, showing its potential for simultaneous S2ST on-device.

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2306.05167 [pdf, other]

Decision S4: Efficient Sequence-Based RL via State Spaces Layers

Authors: Shmuel Bar-David, Itamar Zimerman, Eliya Nachmani, Lior Wolf

Abstract: Recently, sequence learning methods have been applied to the problem of off-policy Reinforcement Learning, including the seminal work on Decision Transformers, which employs transformers for this task. Since transformers are parameter-heavy, cannot benefit from history longer than a fixed window size, and are not computed using recurrence, we set out to investigate the suitability of the S4 family of models, which are based on state-space layers and have been shown to outperform transformers, especially in modeling long-range dependencies. In this work we present two main algorithms: (i) an off-policy training procedure that works with trajectories, while still maintaining the training efficiency of the S4 model, and (ii) an on-policy training procedure that is trained in a recurrent manner, benefits from long-range dependencies, and is based on a novel stable actor-critic mechanism. Our results indicate that our method outperforms multiple variants of decision transformers, as well as the other baseline methods on most tasks, while reducing the latency, number of parameters, and training time by several orders of magnitude, making our approach more suitable for real-world RL.

Submitted 8 June, 2023; originally announced June 2023.

Comments: 21 pages,13 figures

MSC Class: 14J60 ACM Class: F.2.2; I.2.7

arXiv:2305.17547 [pdf, other]

Translatotron 3: Speech to Speech Translation with Monolingual Data

Authors: Eliya Nachmani, Alon Levkovitch, Yifan Ding, Chulayuth Asawaroengchai, Heiga Zen, Michelle Tadmor Ramanovich

Abstract: This paper presents Translatotron 3, a novel approach to unsupervised direct speech-to-speech translation from monolingual speech-text datasets by combining masked autoencoder, unsupervised embedding mapping, and back-translation. Experimental results in speech-to-speech translation tasks between Spanish and English show that Translatotron 3 outperforms a baseline cascade system, reporting $18.14$ BLEU points improvement on the synthesized Unpaired-Conversational dataset. In contrast to supervised approaches that necessitate real paired data, or specialized modeling to replicate para-/non-linguistic information such as pauses, speaking rates, and speaker identity, Translatotron 3 showcases its capability to retain it. Audio samples can be found at http://google-research.github.io/lingvo-lab/translatotron3

Submitted 16 January, 2024; v1 submitted 27 May, 2023; originally announced May 2023.

Comments: To appear in ICASSP 2024

arXiv:2305.15255 [pdf, other]

Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM

Authors: Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, RJ Skerry-Ryan, Michelle Tadmor Ramanovich

Abstract: We present Spectron, a novel approach to adapting pre-trained large language models (LLMs) to perform spoken question answering (QA) and speech continuation. By endowing the LLM with a pre-trained speech encoder, our model becomes able to take speech inputs and generate speech outputs. The entire system is trained end-to-end and operates directly on spectrograms, simplifying our architecture. Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis using only paired speech-text pairs, enabling a 'cross-modal' chain-of-thought within a single decoding pass. Our method surpasses existing spoken language models in speaker preservation and semantic coherence. Furthermore, the proposed model improves upon direct initialization in retaining the knowledge of the original LLM as demonstrated through spoken QA datasets. We release our audio samples (https://michelleramanovich.github.io/spectron/spectron) and spoken QA dataset (https://github.com/google-research-datasets/LLAMA1-Test-Set).

Submitted 30 May, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: ICLR 2024 camera-ready

arXiv:2301.10752 [pdf, other]

Separate And Diffuse: Using a Pretrained Diffusion Model for Improving Source Separation

Authors: Shahar Lutati, Eliya Nachmani, Lior Wolf

Abstract: The problem of speech separation, also known as the cocktail party problem, refers to the task of isolating a single speech signal from a mixture of speech signals. Previous work on source separation derived an upper bound for the source separation task in the domain of human speech. This bound is derived for deterministic models. Recent advancements in generative models challenge this bound. We show how the upper bound can be generalized to the case of random generative models. Applying a diffusion model Vocoder that was pretrained to model single-speaker voices on the output of a deterministic separation model leads to state-of-the-art separation results. It is shown that this requires one to combine the output of the separation model with that of the diffusion model. In our method, a linear combination is performed, in the frequency domain, using weights that are inferred by a learned model. We show state-of-the-art results on 2, 3, 5, 10, and 20 speakers on multiple benchmarks. In particular, for two speakers, our method is able to surpass what was previously considered the upper performance bound.

Submitted 24 June, 2023; v1 submitted 25 January, 2023; originally announced January 2023.

arXiv:2206.02246 [pdf, other]

Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models

Authors: Alon Levkovitch, Eliya Nachmani, Lior Wolf

Abstract: We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training. The method requires a short (~3 seconds) sample from the target person, and generation is steered at inference time, without any training steps. At the heart of the method lies a sampling process that combines the estimation of the denoising model with a low-pass version of the new speaker's sample. The objective and subjective evaluations show that our sampling method can generate a voice similar to that of the target speaker in terms of frequency, with an accuracy comparable to state-of-the-art methods, and without training.

Submitted 22 June, 2022; v1 submitted 5 June, 2022; originally announced June 2022.

Comments: Accepted to Interspeech 2022

arXiv:2206.00786 [pdf, other]

Neural Decoding with Optimization of Node Activations

Authors: Eliya Nachmani, Yair Be'ery

Abstract: The problem of maximum likelihood decoding with a neural decoder for error-correcting code is considered. It is shown that the neural decoder can be improved with two novel loss terms on the node's activations. The first loss term imposes a sparse constraint on the node's activations, whereas the second loss term tries to mimic the node's activations from a teacher decoder which has better performance. The proposed method has the same run time complexity and model size as the neural Belief Propagation decoder, while improving the decoding performance by up to $1.1$ dB on BCH codes.

Submitted 11 August, 2022; v1 submitted 1 June, 2022; originally announced June 2022.

Comments: IEEE Communications Letters

arXiv:2205.11801 [pdf, other]

SepIt: Approaching a Single Channel Speech Separation Bound

Authors: Shahar Lutati, Eliya Nachmani, Lior Wolf

Abstract: We present an upper bound for the Single Channel Speech Separation task, which is based on an assumption regarding the nature of short segments of speech. Using the bound, we are able to show that while the recent methods have made significant progress for a few speakers, there is room for improvement for five and ten speakers. We then introduce a deep neural network, SepIt, that iteratively improves the different speakers' estimation. At test time, SepIt has a varying number of iterations per test sample, based on a mutual information criterion that arises from our analysis. In an extensive set of experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3, 5, and 10 speakers.

Submitted 21 May, 2023; v1 submitted 24 May, 2022; originally announced May 2022.

Comments: Accepted to INTERSPEECH 2022

arXiv:2204.02849 [pdf, other]

KNN-Diffusion: Image Generation via Large-Scale Retrieval

Authors: Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, Yaniv Taigman

Abstract: Recent text-to-image models have achieved impressive results. However, since they require large-scale datasets of text-image pairs, it is impractical to train them on new domains where data is scarce or not labeled. In this work, we propose using large-scale retrieval methods, in particular, efficient k-Nearest-Neighbors (kNN), which offers novel capabilities: (1) training a substantially small and efficient text-to-image diffusion model without any text, (2) generating out-of-distribution images by simply swapping the retrieval database at inference time, and (3) performing text-driven local semantic manipulations while preserving object identity. To demonstrate the robustness of our method, we apply our kNN approach on two state-of-the-art diffusion backbones, and show results on several different datasets. As evaluated by human studies and automatic metrics, our method achieves state-of-the-art results compared to existing approaches that train text-to-image generation models using images only (without paired text data).

Submitted 2 October, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

arXiv:2112.00390 [pdf, other]

SegDiff: Image Segmentation with Diffusion Probabilistic Models

Authors: Tomer Amit, Tal Shaharbany, Eliya Nachmani, Lior Wolf

Abstract: Diffusion Probabilistic Methods are employed for state-of-the-art image generation. In this work, we present a method for extending such models for performing image segmentation. The method learns end-to-end, without relying on a pre-trained backbone. The information in the input image and in the current estimation of the segmentation map is merged by summing the output of two encoders. Additional encoding layers and a decoder are then used to iteratively refine the segmentation map, using a diffusion model. Since the diffusion model is probabilistic, it is applied multiple times, and the results are merged into a final segmentation map. The new method produces state-of-the-art results on the Cityscapes validation set, the Vaihingen building segmentation benchmark, and the MoNuSeg dataset.

Submitted 7 September, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

arXiv:2111.12986 [pdf, other]

A-Muze-Net: Music Generation by Composing the Harmony based on the Generated Melody

Authors: Or Goren, Eliya Nachmani, Lior Wolf

Abstract: We present a method for the generation of Midi files of piano music. The method models the right and left hands using two networks, where the left hand is conditioned on the right hand. This way, the melody is generated before the harmony. The Midi is represented in a way that is invariant to the musical scale, and the melody is represented, for the purpose of conditioning the harmony, by the content of each bar, viewed as a chord. Finally, notes are added randomly, based on this chord representation, in order to enrich the generated audio. Our experiments show a significant improvement over the state of the art for training on such datasets, and demonstrate the contribution of each of the novel components.

Submitted 25 November, 2021; originally announced November 2021.

Comments: Accepted for publication at MMM 2022

arXiv:2111.01471 [pdf, other]

Zero-Shot Translation using Diffusion Models

Authors: Eliya Nachmani, Shaked Dovrat

Abstract: In this work, we show a novel method for neural machine translation (NMT), using a denoising diffusion probabilistic model (DDPM), adjusted for textual data, following recent advances in the field. We show that it's possible to translate sentences non-autoregressively using a diffusion model conditioned on the source sentence. We also show that our model is able to translate between pairs of languages unseen during training (zero-shot learning).

Submitted 2 November, 2021; originally announced November 2021.

Comments: preprint

arXiv:2110.05948 [pdf, other]

Denoising Diffusion Gamma Models

Authors: Eliya Nachmani, Robin San Roman, Lior Wolf

Abstract: Generative diffusion processes are an emerging and effective tool for image and speech generation. In the existing methods, the underlying noise distribution of the diffusion process is Gaussian noise. However, fitting distributions with more degrees of freedom could improve the performance of such generative models. In this work, we investigate other types of noise distribution for the diffusion process. Specifically, we introduce the Denoising Diffusion Gamma Model (DDGM) and show that noise from Gamma distribution provides improved results for image and speech generation. Our approach preserves the ability to efficiently sample state in the training diffusion process while using Gamma noise.

Submitted 10 October, 2021; originally announced October 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2106.07582

arXiv:2106.07582 [pdf, other]

Non Gaussian Denoising Diffusion Models

Authors: Eliya Nachmani, Robin San Roman, Lior Wolf

Abstract: Generative diffusion processes are an emerging and effective tool for image and speech generation. In the existing methods, the underlying noise distribution of the diffusion process is Gaussian noise. However, fitting distributions with more degrees of freedom could help the performance of such generative models. In this work, we investigate other types of noise distribution for the diffusion process. Specifically, we show that noise from Gamma distribution provides improved results for image and speech generation. Moreover, we show that using a mixture of Gaussian noise variables in the diffusion process improves the performance over a diffusion process that is based on a single distribution. Our approach preserves the ability to efficiently sample state in the training diffusion process while using Gamma noise and a mixture of noise.

Submitted 14 June, 2021; originally announced June 2021.

arXiv:2106.04876 [pdf, other]

Recovering AES Keys with a Deep Cold Boot Attack

Authors: Itamar Zimerman, Eliya Nachmani, Lior Wolf

Abstract: Cold boot attacks inspect the corrupted random access memory soon after the power has been shut down. While most of the bits have been corrupted, many bits, at random locations, have not. Since the keys in many encryption schemes are being expanded in memory into longer keys with fixed redundancies, the keys can often be restored. In this work, we combine a novel cryptographic variant of a deep error correcting code technique with a modified SAT solver scheme to apply the attack on AES keys. Even though AES consists of Rijndael S-box elements, that are specifically designed to be resistant to linear and differential cryptanalysis, our method provides a novel formalization of the AES key scheduling as a computational graph, which is implemented by a neural message passing network. Our results show that our methods outperform the state of the art attack methods by a very large margin.

Submitted 9 June, 2021; originally announced June 2021.

Comments: Accepted to ICML 2021

arXiv:2104.08955 [pdf, other]

Many-Speakers Single Channel Speech Separation with Optimal Permutation Training

Authors: Shaked Dovrat, Eliya Nachmani, Lior Wolf

Abstract: Single channel speech separation has experienced great progress in the last few years. However, training neural speech separation for a large number of speakers (e.g., more than 10 speakers) is out of reach for the current methods, which rely on the Permutation Invariant Loss (PIT). In this work, we present a permutation invariant training that employs the Hungarian algorithm in order to train with an $O(C^3)$ time complexity, where $C$ is the number of speakers, in comparison to $O(C!)$ of PIT based methods. Furthermore, we present a modified architecture that can handle the increased number of speakers. Our approach separates up to $20$ speakers and improves the previous results for large $C$ by a wide margin.

Submitted 7 November, 2021; v1 submitted 18 April, 2021; originally announced April 2021.

Comments: Accepted to Interspeech 2021, Data creation link added
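
The complexity comparison in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: the per-pair loss (mean squared error), the toy signals, and all function names are assumptions made for illustration. The sketch builds a C x C matrix of pairwise losses, then finds the best speaker assignment twice, once by enumerating all C! permutations and once with a textbook O(C^3) Hungarian (Kuhn-Munkres) routine.

```python
import itertools
import random


def pairwise_cost(estimates, targets):
    """cost[i][j] = mean squared error between estimate i and target j (assumed loss)."""
    return [[sum((e - t) ** 2 for e, t in zip(est, tgt)) / len(est)
             for tgt in targets] for est in estimates]


def total(cost, perm):
    """Summed loss when estimate i is matched to target perm[i]."""
    return sum(cost[i][perm[i]] for i in range(len(perm)))


def brute_force_pit(cost):
    """Classic PIT: try every permutation, O(C!)."""
    n = len(cost)
    return list(min(itertools.permutations(range(n)),
                    key=lambda perm: total(cost, perm)))


def hungarian_pit(cost):
    """Textbook Hungarian algorithm with potentials, O(C^3), square matrix."""
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials
    v = [0.0] * (n + 1)      # column potentials
    p = [0] * (n + 1)        # p[j] = row currently matched to column j (1-based)
    way = [0] * (n + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:       # flip matches along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    perm = [0] * n
    for j in range(1, n + 1):
        perm[p[j] - 1] = j - 1
    return perm


# Toy data: C target signals, estimates are shuffled noisy copies.
random.seed(0)
C, T = 6, 16
targets = [[random.gauss(0, 1) for _ in range(T)] for _ in range(C)]
shuffled = targets[:]
random.shuffle(shuffled)
estimates = [[x + random.gauss(0, 0.1) for x in s] for s in shuffled]

cost = pairwise_cost(estimates, targets)
perm_bf = brute_force_pit(cost)
perm_h = hungarian_pit(cost)
```

Both routines should reach the same minimum total loss; only the search cost differs. For C = 20 the brute-force search would need 20! (about 2.4e18) permutations, while the assignment solver stays at roughly C^3 operations. In practice a library routine such as SciPy's `linear_sum_assignment` could replace the hand-written solver.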

arXiv:2104.02600 [pdf, other]

Noise Estimation for Generative Diffusion Models

Authors: Robin San-Roman, Eliya Nachmani, Lior Wolf

Abstract: Generative diffusion models have emerged as leading models in speech and image generation. However, in order to perform well with a small number of denoising steps, a costly tuning of the set of noise parameters is needed. In this work, we present a simple and versatile learning scheme that can step-by-step adjust those noise parameters, for any given number of steps, while the previous work needs to retune for each number separately. Furthermore, without modifying the weights of the diffusion model, we are able to significantly improve the synthesis results, for a small number of steps. Our approach comes at a negligible computation cost.

Submitted 12 September, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

arXiv:2103.11780 [pdf, other]

Autoregressive Belief Propagation for Decoding Block Codes

Authors: Eliya Nachmani, Lior Wolf

Abstract: We revisit recent methods that employ graph neural networks for decoding error correcting codes and employ messages that are computed in an autoregressive manner. The outgoing messages of the variable nodes are conditioned not only on the incoming messages, but also on an estimation of the SNR and on the inferred codeword and on two downstream computations: (i) an extended vector of parity check outcomes, (ii) the mismatch between the inferred codeword and the re-encoding of the information bits of this codeword. Unlike most learned methods in the field, our method violates the symmetry conditions that enable the other methods to train exclusively with the zero-word. Despite not having the luxury of training on a single word, and the inability to train on more than a small fraction of the relevant sample space, we demonstrate effective training. The new method obtains a bit error rate that outperforms the latest methods by a sizable margin.

Submitted 23 January, 2021; originally announced March 2021.

arXiv:2011.02329 [pdf, other]

Single channel voice separation for unknown number of speakers under reverberant and noisy settings

Authors: Shlomo E. Chazan, Lior Wolf, Eliya Nachmani, Yossi Adi

Abstract: We present a unified network for voice separation of an unknown number of speakers. The proposed approach is composed of several separation heads optimized together with a speaker classification branch. The separation is carried out in the time domain, together with parameter sharing between all separation heads. The classification branch estimates the number of speakers while each head is specialized in separating a different number of speakers. We evaluate the proposed model under both clean and noisy reverberant settings. Results suggest that the proposed approach is superior to the baseline model by a significant margin. Additionally, we present a new noisy and reverberant dataset of up to five different speakers speaking simultaneously.

Submitted 4 November, 2020; originally announced November 2020.

arXiv:2009.01381 [pdf, other]

doi 10.1109/LSP.2020.3043977

SAGRNN: Self-Attentive Gated RNN for Binaural Speaker Separation with Interaural Cue Preservation

Authors: Ke Tan, Buye Xu, Anurag Kumar, Eliya Nachmani, Yossi Adi

Abstract: Most existing deep learning based binaural speaker separation systems focus on producing a monaural estimate for each of the target speakers, and thus do not preserve the interaural cues, which are crucial for human listeners to perform sound localization and lateralization. In this study, we address talker-independent binaural speaker separation with interaural cues preserved in the estimated binaural signals. Specifically, we extend a newly-developed gated recurrent neural network for monaural separation by additionally incorporating self-attention mechanisms and dense connectivity. We develop an end-to-end multiple-input multiple-output system, which directly maps from the binaural waveform of the mixture to those of the speech signals. The experimental results show that our proposed approach achieves significantly better separation performance than a recent binaural separation approach. In addition, our approach effectively preserves the interaural cues, which improves the accuracy of sound localization.

Submitted 14 November, 2020; v1 submitted 2 September, 2020; originally announced September 2020.

Comments: 5 pages, accepted by IEEE Signal Processing Letters

arXiv:2003.01531 [pdf, other]

Voice Separation with an Unknown Number of Multiple Speakers

Authors: Eliya Nachmani, Yossi Adi, Lior Wolf

Abstract: We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously. The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while maintaining the speaker in each output channel fixed. A different model is trained for every number of possible speakers, and the model with the largest number of speakers is employed to select the actual number of speakers in a given sample. Our method greatly outperforms the current state of the art, which, as we show, is not competitive for more than two speakers.

Submitted 1 September, 2020; v1 submitted 29 February, 2020; originally announced March 2020.

Comments: Accepted to ICML 2020. For associated audio samples, see http://enk100.github.io/speaker_separation

arXiv:2002.00240

Molecule Property Prediction and Classification with Graph Hypernetworks

Authors: Eliya Nachmani, Lior Wolf

Abstract: Graph neural networks are currently leading the performance charts in learning-based molecule property prediction and classification. Computational chemistry has, therefore, become a prominent testbed for generic graph neural networks, as well as for specialized message passing methods. In this work, we demonstrate that the replacement of the underlying networks with hypernetworks leads to a boost in performance, obtaining state of the art results in various benchmarks. A major difficulty in the application of hypernetworks is their lack of stability. We tackle this by combining the current message and the first message. A recent work has tackled the training instability of hypernetworks in the context of error correcting codes, by replacing the activation function of the message passing network with a low-order Taylor approximation of it. We demonstrate that our generic solution can replace this domain-specific solution.

Submitted 1 February, 2020; originally announced February 2020.

arXiv:1911.03229

A Gated Hypernet Decoder for Polar Codes

Authors: Eliya Nachmani, Lior Wolf

Abstract: Hypernetworks were recently shown to improve the performance of message passing algorithms for decoding error correcting codes. In this work, we demonstrate how hypernetworks can be applied to decode polar codes by employing a new formalization of the polar belief propagation decoding scheme. We demonstrate that our method improves the previous results of neural polar decoders and achieves, for large SNRs, the same bit-error-rate performance as the successive list cancellation method, which is known to be better than any belief propagation decoder and very close to the maximum likelihood decoder.

Submitted 10 February, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

Comments: Accepted to ICASSP 2020

arXiv:1909.09036

Hyper-Graph-Network Decoders for Block Codes

Authors: Eliya Nachmani, Lior Wolf

Abstract: Neural decoders were shown to outperform classical message passing techniques for short BCH codes. In this work, we extend these results to much larger families of algebraic block codes, by performing message passing with graph neural networks. The parameters of the sub-network at each variable-node in the Tanner graph are obtained from a hypernetwork that receives the absolute values of the current message as input. To add stability, we employ a simplified version of the arctanh activation that is based on a high order Taylor approximation of this activation function. Our results show that for a large number of algebraic block codes, from diverse families of codes (BCH, LDPC, Polar), the decoding obtained with our method outperforms the vanilla belief propagation method as well as other learning techniques from the literature.

Submitted 25 October, 2019; v1 submitted 5 September, 2019; originally announced September 2019.

Comments: Accepted to NeurIPS 2019. Camera Ready

arXiv:1904.06590

Unsupervised Singing Voice Conversion

Authors: Eliya Nachmani, Lior Wolf

Abstract: We present a deep learning method for singing voice conversion. The proposed network is not conditioned on the text or on the notes, and it directly converts the audio of one singer to the voice of another. Training is performed without any form of supervision: no lyrics or any kind of phonetic features, no notes, and no matching samples between singers. The proposed network employs a single CNN encoder for all singers, a single WaveNet decoder, and a classifier that enforces the latent representation to be singer-agnostic. Each singer is represented by one embedding vector, which the decoder is conditioned on. In order to deal with relatively small datasets, we propose a new data augmentation scheme, as well as new training losses and protocols that are based on backtranslation. Our evaluation presents evidence that the conversion produces natural singing voices that are highly recognizable as the target singer.

Submitted 25 September, 2019; v1 submitted 13 April, 2019; originally announced April 2019.

Comments: Accepted to Interspeech 2019

arXiv:1902.02263

Unsupervised Polyglot Text To Speech

Authors: Eliya Nachmani, Lior Wolf

Abstract: We present a TTS neural network that is able to produce speech in multiple languages. The proposed network is able to transfer a voice, which was presented as a sample in a source language, into one of several target languages. Training is done without using matching or parallel data, i.e., without samples of the same speaker in multiple languages, making the method much more applicable. The conversion is based on learning a polyglot network that has multiple per-language sub-networks and adding loss terms that preserve the speaker's identity in multiple languages. We evaluate the proposed polyglot neural network for three languages with a total of more than 400 speakers and demonstrate convincing conversion capabilities.

Submitted 6 February, 2019; originally announced February 2019.

Comments: The paper will be presented at ICASSP 2019

arXiv:1802.06984

Fitting New Speakers Based on a Short Untranscribed Sample

Authors: Eliya Nachmani, Adam Polyak, Yaniv Taigman, Lior Wolf

Abstract: Learning-based Text To Speech systems have the potential to generalize from one speaker to the next and thus require a relatively short sample of any new voice. However, this promise is currently largely unrealized. We present a method that is designed to capture a new speaker from a short untranscribed audio sample. This is done by employing an additional network that, given an audio sample, places the speaker in the embedding space. This network is trained as part of the speech synthesis system using various consistency losses. Our results demonstrate a greatly improved performance on both the dataset speakers, and, more importantly, when fitting new voices, even from very short samples.

Submitted 20 February, 2018; originally announced February 2018.

arXiv:1801.02726

Near Maximum Likelihood Decoding with Deep Learning

Authors: Eliya Nachmani, Yaron Bachar, Elad Marciano, David Burshtein, Yair Be'ery

Abstract: A novel and efficient neural decoder algorithm is proposed. The proposed decoder is based on the neural Belief Propagation algorithm and the Automorphism Group. By combining neural belief propagation with permutations from the Automorphism Group we achieve near maximum likelihood performance for High Density Parity Check codes. Moreover, the proposed decoder significantly improves the decoding complexity, compared to our earlier work on the topic. We also investigate the training process and show how it can be accelerated. Simulations of the Hessian and the condition number show why the learning process is accelerated. We demonstrate the decoding algorithm for various linear block codes of length up to 63 bits.

Submitted 8 January, 2018; originally announced January 2018.

Comments: The paper will be presented at IZS 2018

arXiv:1707.06588

VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop

Authors: Yaniv Taigman, Lior Wolf, Adam Polyak, Eliya Nachmani

Abstract: We present a new neural text to speech (TTS) method that is able to transform text to speech in voices that are sampled in the wild. Unlike other systems, our solution is able to deal with unconstrained voice samples without requiring aligned phonemes or linguistic features. The network architecture is simpler than those in the existing literature and is based on a novel shifting buffer working memory. The same buffer is used for estimating the attention, computing the output audio, and for updating the buffer itself. The input sentence is encoded using a context-free lookup table that contains one entry per character or phoneme. The speakers are similarly represented by a short vector that can also be fitted to new identities, even with only a few samples. Variability in the generated speech is achieved by priming the buffer prior to generating the audio. Experimental results on several datasets demonstrate convincing capabilities, making TTS accessible to a wider range of applications. In order to promote reproducibility, we release our source code and models.

Submitted 1 February, 2018; v1 submitted 20 July, 2017; originally announced July 2017.

arXiv:1706.07043

doi 10.1109/JSTSP.2017.2788405

Deep Learning Methods for Improved Decoding of Linear Codes

Authors: Eliya Nachmani, Elad Marciano, Loren Lugosch, Warren J. Gross, David Burshtein, Yair Beery

Abstract: The problem of low complexity, close to optimal, channel decoding of linear codes with short to moderate block length is considered. It is shown that deep learning methods can be used to improve a standard belief propagation decoder, despite the large example space. Similar improvements are obtained for the min-sum algorithm. It is also shown that tying the parameters of the decoders across iterations, so as to form a recurrent neural network architecture, can be implemented with comparable results. The advantage is that significantly fewer parameters are required. We also introduce a recurrent neural decoder architecture based on the method of successive relaxation. Improvements over standard belief propagation are also observed on sparser Tanner graph representations of the codes. Furthermore, we demonstrate that the neural belief propagation decoder can be used to improve the performance, or alternatively reduce the computational complexity, of a close to optimal decoder of short BCH codes.

Submitted 1 January, 2018; v1 submitted 21 June, 2017; originally announced June 2017.

Comments: Accepted To IEEE Journal Of Selected Topics In Signal Processing

arXiv:1702.07560

RNN Decoding of Linear Block Codes

Authors: Eliya Nachmani, Elad Marciano, David Burshtein, Yair Be'ery

Abstract: Designing a practical, low complexity, close to optimal, channel decoder for powerful algebraic codes with short to moderate block length is an open research problem. Recently it has been shown that a feed-forward neural network architecture can improve on standard belief propagation decoding, despite the large example space. In this paper we introduce a recurrent neural network architecture for decoding linear block codes. Our method shows bit error rate results comparable to those of the feed-forward neural network with significantly fewer parameters. We also demonstrate improved performance over belief propagation on sparser Tanner graph representations of the codes. Furthermore, we demonstrate that the RNN decoder can be used to improve the performance or alternatively reduce the computational complexity of the mRRD algorithm for low complexity, close to optimal, decoding of short BCH codes.

Submitted 24 February, 2017; originally announced February 2017.

arXiv:1607.04793

Learning to Decode Linear Codes Using Deep Learning

Authors: Eliya Nachmani, Yair Beery, David Burshtein

Abstract: A novel deep learning method for improving the belief propagation algorithm is proposed. The method generalizes the standard belief propagation algorithm by assigning weights to the edges of the Tanner graph. These edges are then trained using deep learning techniques. A well-known property of the belief propagation algorithm is the independence of the performance on the transmitted codeword. A crucial property of our new method is that our decoder preserves this property. Furthermore, this property allows us to learn only a single codeword instead of an exponential number of codewords. Improvements over the belief propagation algorithm are demonstrated for various high density parity check codes.

Submitted 30 September, 2016; v1 submitted 16 July, 2016; originally announced July 2016.

Comments: Presented at the Allerton Conference 2016
