Search | arXiv e-print repository

Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network

Authors: Yehoshua Dissen, Shiry Yonash, Israel Cohen, Joseph Keshet

Abstract: In the realm of automatic speech recognition (ASR), robustness in noisy environments remains a significant challenge. Recent ASR models, such as Whisper, have shown promise, but their efficacy in noisy conditions can be further enhanced. This study is focused on recovering from packet loss to improve the word error rate (WER) of ASR models. We propose using a front-end adaptation network connected… ▽ More In the realm of automatic speech recognition (ASR), robustness in noisy environments remains a significant challenge. Recent ASR models, such as Whisper, have shown promise, but their efficacy in noisy conditions can be further enhanced. This study is focused on recovering from packet loss to improve the word error rate (WER) of ASR models. We propose using a front-end adaptation network connected to a frozen ASR model. The adaptation network is trained to modify the corrupted input spectrum by minimizing the criteria of the ASR model in addition to an enhancement loss function. Our experiments demonstrate that the adaptation network, trained on Whisper's criteria, notably reduces word error rates across domains and languages in packet-loss scenarios. This improvement is achieved with minimal affect to Whisper model's foundational performance, underscoring our method's practicality and potential in enhancing ASR models in challenging acoustic environments. △ Less

Submitted 27 June, 2024; originally announced June 2024.

Comments: Accepted for publication at INTERSPEECH 2024

arXiv:2406.02649 [pdf, other]

Keyword-Guided Adaptation of Automatic Speech Recognition

Authors: Aviv Shamsian, Aviv Navon, Neta Glazer, Gill Hetz, Joseph Keshet

Abstract: Automatic Speech Recognition (ASR) technology has made significant progress in recent years, providing accurate transcription across various domains. However, some challenges remain, especially in noisy environments and specialized jargon. In this paper, we propose a novel approach for improved jargon word recognition by contextual biasing Whisper-based models. We employ a keyword spotting model t… ▽ More Automatic Speech Recognition (ASR) technology has made significant progress in recent years, providing accurate transcription across various domains. However, some challenges remain, especially in noisy environments and specialized jargon. In this paper, we propose a novel approach for improved jargon word recognition by contextual biasing Whisper-based models. We employ a keyword spotting model that leverages the Whisper encoder representation to dynamically generate prompts for guiding the decoder during the transcription process. We introduce two approaches to effectively steer the decoder towards these prompts: KG-Whisper, which is aimed at fine-tuning the Whisper decoder, and KG-Whisper-PT, which learns a prompt prefix. Our results show a significant improvement in the recognition accuracy of specified keywords and in reducing the overall word error rates. Specifically, in unseen language generalization, we demonstrate an average WER improvement of 5.1% over Whisper. △ Less

Submitted 4 June, 2024; originally announced June 2024.

Comments: Accepted to InterSpeech 2024

arXiv:2310.19708 [pdf, other]

Combining Language Models For Specialized Domains: A Colorful Approach

Authors: Daniel Eitan, Menachem Pirchi, Neta Glazer, Shai Meital, Gil Ayach, Gidon Krendel, Aviv Shamsian, Aviv Navon, Gil Hetz, Joseph Keshet

Abstract: General purpose language models (LMs) encounter difficulties when processing domain-specific jargon and terminology, which are frequently utilized in specialized fields such as medicine or industrial settings. Moreover, they often find it challenging to interpret mixed speech that blends general language with specialized jargon. This poses a challenge for automatic speech recognition systems opera… ▽ More General purpose language models (LMs) encounter difficulties when processing domain-specific jargon and terminology, which are frequently utilized in specialized fields such as medicine or industrial settings. Moreover, they often find it challenging to interpret mixed speech that blends general language with specialized jargon. This poses a challenge for automatic speech recognition systems operating within these specific domains. In this work, we introduce a novel approach that integrates domain-specific or secondary LM into general-purpose LM. This strategy involves labeling, or "coloring", each word to indicate its association with either the general or the domain-specific LM. We develop an optimized algorithm that enhances the beam search algorithm to effectively handle inferences involving colored words. Our evaluations indicate that this approach is highly effective in integrating jargon into language tasks. Notably, our method substantially lowers the error rate for domain-specific words without compromising performance in the general domain. △ Less

Submitted 1 November, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

Comments: Under Review

arXiv:2310.01381 [pdf, other]

DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation

Authors: Roi Benita, Michael Elad, Joseph Keshet

Abstract: Diffusion models have recently been shown to be relevant for high-quality speech generation. Most work has been focused on generating spectrograms, and as such, they further require a subsequent model to convert the spectrogram to a waveform (i.e., a vocoder). This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform. The proposed model is autoregressive, g… ▽ More Diffusion models have recently been shown to be relevant for high-quality speech generation. Most work has been focused on generating spectrograms, and as such, they further require a subsequent model to convert the spectrogram to a waveform (i.e., a vocoder). This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform. The proposed model is autoregressive, generating overlap** frames sequentially, where each frame is conditioned on a portion of the previously generated one. Hence, our model can effectively synthesize an unlimited speech duration while preserving high-fidelity synthesis and temporal coherence. We implemented the proposed model for unconditional and conditional speech generation, where the latter can be driven by an input sequence of phonemes, amplitudes, and pitch values. Working on the waveform directly has some empirical advantages. Specifically, it allows the creation of local acoustic behaviors, like vocal fry, which makes the overall waveform sounds more natural. Furthermore, the proposed diffusion model is stochastic and not deterministic; therefore, each inference generates a slightly different waveform variation, enabling abundance of valid realizations. Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems. △ Less

Submitted 10 March, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

arXiv:2309.08561 [pdf, other]

Open-vocabulary Keyword-spotting with Adaptive Instance Normalization

Authors: Aviv Navon, Aviv Shamsian, Neta Glazer, Gill Hetz, Joseph Keshet

Abstract: Open vocabulary keyword spotting is a crucial and challenging task in automatic speech recognition (ASR) that focuses on detecting user-defined keywords within a spoken utterance. Keyword spotting methods commonly map the audio utterance and keyword into a joint embedding space to obtain some affinity score. In this work, we propose AdaKWS, a novel method for keyword spotting in which a text encod… ▽ More Open vocabulary keyword spotting is a crucial and challenging task in automatic speech recognition (ASR) that focuses on detecting user-defined keywords within a spoken utterance. Keyword spotting methods commonly map the audio utterance and keyword into a joint embedding space to obtain some affinity score. In this work, we propose AdaKWS, a novel method for keyword spotting in which a text encoder is trained to output keyword-conditioned normalization parameters. These parameters are used to process the auditory input. We provide an extensive evaluation using challenging and diverse multi-lingual benchmarks and show significant improvements over recent keyword spotting and ASR baselines. Furthermore, we study the effectiveness of our approach on low-resource languages that were unseen during the training. The results demonstrate a substantial performance improvement compared to baseline methods. △ Less

Submitted 13 September, 2023; originally announced September 2023.

Comments: Under Review

arXiv:2207.05418 [pdf, other]

A Baseline for Detecting Out-of-Distribution Examples in Image Captioning

Authors: Gabi Shalev, Gal-Lev Shalev, Joseph Keshet

Abstract: Image captioning research achieved breakthroughs in recent years by develo** neural models that can generate diverse and high-quality descriptions for images drawn from the same distribution as training images. However, when facing out-of-distribution (OOD) images, such as corrupted images, or images containing unknown objects, the models fail in generating relevant captions. In this paper, we… ▽ More Image captioning research achieved breakthroughs in recent years by develo** neural models that can generate diverse and high-quality descriptions for images drawn from the same distribution as training images. However, when facing out-of-distribution (OOD) images, such as corrupted images, or images containing unknown objects, the models fail in generating relevant captions. In this paper, we consider the problem of OOD detection in image captioning. We formulate the problem and suggest an evaluation setup for assessing the model's performance on the task. Then, we analyze and show the effectiveness of the caption's likelihood score at detecting and rejecting OOD images, which implies that the relatedness between the input image and the generated caption is encapsulated within the score. △ Less

Submitted 12 July, 2022; originally announced July 2022.

Comments: Accepted to ACM Multimedia (MM) 2022

arXiv:2206.14639 [pdf, other]

DDKtor: Automatic Diadochokinetic Speech Analysis

Authors: Yael Segal, Kasia Hitczenko, Matthew Goldrick, Adam Buchwald, Angela Roberts, Joseph Keshet

Abstract: Diadochokinetic speech tasks (DDK), in which participants repeatedly produce syllables, are commonly used as part of the assessment of speech motor impairments. These studies rely on manual analyses that are time-intensive, subjective, and provide only a coarse-grained picture of speech. This paper presents two deep neural network models that automatically segment consonants and vowels from unanno… ▽ More Diadochokinetic speech tasks (DDK), in which participants repeatedly produce syllables, are commonly used as part of the assessment of speech motor impairments. These studies rely on manual analyses that are time-intensive, subjective, and provide only a coarse-grained picture of speech. This paper presents two deep neural network models that automatically segment consonants and vowels from unannotated, untranscribed speech. Both models work on the raw waveform and use convolutional layers for feature extraction. The first model is based on an LSTM classifier followed by fully connected layers, while the second model adds more convolutional layers followed by fully connected layers. These segmentations predicted by the models are used to obtain measures of speech rate and sound duration. Results on a young healthy individuals dataset show that our LSTM model outperforms the current state-of-the-art systems and performs comparably to trained human annotators. Moreover, the LSTM model also presents comparable results to trained human annotators when evaluated on unseen older individuals with Parkinson's Disease dataset. △ Less

Submitted 29 June, 2022; originally announced June 2022.

Comments: Accepted to Interspeech 2022

arXiv:2206.11632 [pdf, other]

Formant Estimation and Tracking using Probabilistic Heat-Maps

Authors: Yosi Shrem, Felix Kreuk, Joseph Keshet

Abstract: Formants are the spectral maxima that result from acoustic resonances of the human vocal tract, and their accurate estimation is among the most fundamental speech processing problems. Recent work has been shown that those frequencies can accurately be estimated using deep learning techniques. However, when presented with a speech from a different domain than that in which they have been trained on… ▽ More Formants are the spectral maxima that result from acoustic resonances of the human vocal tract, and their accurate estimation is among the most fundamental speech processing problems. Recent work has been shown that those frequencies can accurately be estimated using deep learning techniques. However, when presented with a speech from a different domain than that in which they have been trained on, these methods exhibit a decline in performance, limiting their usage as generic tools. The contribution of this paper is to propose a new network architecture that performs well on a variety of different speaker and speech domains. Our proposed model is composed of a shared encoder that gets as input a spectrogram and outputs a domain-invariant representation. Then, multiple decoders further process this representation, each responsible for predicting a different formant while considering the lower formant predictions. An advantage of our model is that it is based on heatmaps that generate a probability distribution over formant predictions. Results suggest that our proposed model better represents the signal over various domains and leads to better formant frequency tracking and estimation. △ Less

Submitted 23 June, 2022; originally announced June 2022.

Comments: interspeech 2022

arXiv:2205.04864 [pdf, other]

THOR: Threshold-Based Ranking Loss for Ordinal Regression

Authors: Tzeviya Sylvia Fuchs, Joseph Keshet

Abstract: In this work, we present a regression-based ordinal regression algorithm for supervised classification of instances into ordinal categories. In contrast to previous methods, in this work the decision boundaries between categories are predefined, and the algorithm learns to project the input examples onto their appropriate scores according to these predefined boundaries. This is achieved by adding… ▽ More In this work, we present a regression-based ordinal regression algorithm for supervised classification of instances into ordinal categories. In contrast to previous methods, in this work the decision boundaries between categories are predefined, and the algorithm learns to project the input examples onto their appropriate scores according to these predefined boundaries. This is achieved by adding a novel threshold-based pairwise loss function that aims at minimizing the regression error, which in turn minimizes the Mean Absolute Error (MAE) measure. We implemented our proposed architecture-agnostic method using the CNN-framework for feature extraction. Experimental results on five real-world benchmarks demonstrate that the proposed algorithm achieves the best MAE results compared to state-of-the-art ordinal regression algorithms. △ Less

Submitted 10 May, 2022; originally announced May 2022.

arXiv:2204.13094 [pdf, other]

Unsupervised Word Segmentation using K Nearest Neighbors

Authors: Tzeviya Sylvia Fuchs, Yedid Hoshen, Joseph Keshet

Abstract: In this paper, we propose an unsupervised kNN-based approach for word segmentation in speech utterances. Our method relies on self-supervised pre-trained speech representations, and compares each audio segment of a given utterance to its K nearest neighbors within the training set. Our main assumption is that a segment containing more than one word would occur less often than a segment containing… ▽ More In this paper, we propose an unsupervised kNN-based approach for word segmentation in speech utterances. Our method relies on self-supervised pre-trained speech representations, and compares each audio segment of a given utterance to its K nearest neighbors within the training set. Our main assumption is that a segment containing more than one word would occur less often than a segment containing a single word. Our method does not require phoneme discovery and is able to operate directly on pre-trained audio representations. This is in contrast to current methods that use a two-stage approach; first detecting the phonemes in the utterance and then detecting word-boundaries according to statistics calculated on phoneme patterns. Experiments on two datasets demonstrate improved results over previous single-stage methods and competitive results on state-of-the-art two-stage methods. △ Less

Submitted 27 April, 2022; originally announced April 2022.

Comments: Submitted to interspeech 2022

arXiv:2204.04166 [pdf, other]

Self-supervised Speaker Diarization

Authors: Yehoshua Dissen, Felix Kreuk, Joseph Keshet

Abstract: Over the last few years, deep learning has grown in popularity for speaker verification, identification, and diarization. Inarguably, a significant part of this success is due to the demonstrated effectiveness of their speaker representations. These, however, are heavily dependent on large amounts of annotated data and can be sensitive to new domains. This study proposes an entirely unsupervised d… ▽ More Over the last few years, deep learning has grown in popularity for speaker verification, identification, and diarization. Inarguably, a significant part of this success is due to the demonstrated effectiveness of their speaker representations. These, however, are heavily dependent on large amounts of annotated data and can be sensitive to new domains. This study proposes an entirely unsupervised deep-learning model for speaker diarization. Specifically, the study focuses on generating high-quality neural speaker representations without any annotated data, as well as on estimating secondary hyperparameters of the model without annotations. The speaker embeddings are represented by an encoder trained in a self-supervised fashion using pairs of adjacent segments assumed to be of the same speaker. The trained encoder model is then used to self-generate pseudo-labels to subsequently train a similarity score between different segments of the same call using probabilistic linear discriminant analysis (PLDA) and further to learn a clustering stop** threshold. We compared our model to state-of-the-art unsupervised as well as supervised baselines on the CallHome benchmarks. According to empirical results, our approach outperforms unsupervised methods when only two speakers are present in the call, and is only slightly worse than recent supervised models. △ Less

Submitted 8 April, 2022; originally announced April 2022.

Comments: Submitted to Interspeech 2022

arXiv:2204.03379 [pdf, ps, other]

Correcting Mispronunciations in Speech using Spectrogram Inpainting

Authors: Talia Ben-Simon, Felix Kreuk, Faten Awwad, Jacob T. Cohen, Joseph Keshet

Abstract: Learning a new language involves constantly comparing speech productions with reference productions from the environment. Early in speech acquisition, children make articulatory adjustments to match their caregivers' speech. Grownup learners of a language tweak their speech to match the tutor reference. This paper proposes a method to synthetically generate correct pronunciation feedback given inc… ▽ More Learning a new language involves constantly comparing speech productions with reference productions from the environment. Early in speech acquisition, children make articulatory adjustments to match their caregivers' speech. Grownup learners of a language tweak their speech to match the tutor reference. This paper proposes a method to synthetically generate correct pronunciation feedback given incorrect production. Furthermore, our aim is to generate the corrected production while maintaining the speaker's original voice. The system prompts the user to pronounce a phrase. The speech is recorded, and the samples associated with the inaccurate phoneme are masked with zeros. This waveform serves as an input to a speech generator, implemented as a deep learning inpainting system with a U-net architecture, and trained to output a reconstructed speech. The training set is composed of unimpaired proper speech examples, and the generator is trained to reconstruct the original proper speech. We evaluated the performance of our system on phoneme replacement of minimal pair words of English as well as on children with pronunciation disorders. Results suggest that human listeners slightly prefer our generated speech over a smoothed replacement of the inaccurate phoneme with a production of a different speaker. △ Less

Submitted 30 June, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

Comments: Accepted for publication at Interspeech 2022

arXiv:2203.17019 [pdf, other]

DeepFry: Identifying Vocal Fry Using Deep Neural Networks

Authors: Bronya R. Chernyak, Talia Ben Simon, Yael Segal, Jeremy Steffman, Eleanor Chodroff, Jennifer S. Cole, Joseph Keshet

Abstract: Vocal fry or creaky voice refers to a voice quality characterized by irregular glottal opening and low pitch. It occurs in diverse languages and is prevalent in American English, where it is used not only to mark phrase finality, but also sociolinguistic factors and affect. Due to its irregular periodicity, creaky voice challenges automatic speech processing and recognition systems, particularly f… ▽ More Vocal fry or creaky voice refers to a voice quality characterized by irregular glottal opening and low pitch. It occurs in diverse languages and is prevalent in American English, where it is used not only to mark phrase finality, but also sociolinguistic factors and affect. Due to its irregular periodicity, creaky voice challenges automatic speech processing and recognition systems, particularly for languages where creak is frequently used. This paper proposes a deep learning model to detect creaky voice in fluent speech. The model is composed of an encoder and a classifier trained together. The encoder takes the raw waveform and learns a representation using a convolutional neural network. The classifier is implemented as a multi-headed fully-connected network trained to detect creaky voice, voicing, and pitch, where the last two are used to refine creak prediction. The model is trained and tested on speech of American English speakers, annotated for creak by trained phoneticians. We evaluated the performance of our system using two encoders: one is tailored for the task, and the other is based on a state-of-the-art unsupervised representation. Results suggest our best-performing system has improved recall and F1 scores compared to previous methods on unseen data. △ Less

Submitted 26 June, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

Comments: Accepted to Interspeech 2022

arXiv:2103.08265 [pdf, other]

Constant Random Perturbations Provide Adversarial Robustness with Minimal Effect on Accuracy

Authors: Bronya Roni Chernyak, Bhiksha Raj, Tamir Hazan, Joseph Keshet

Abstract: This paper proposes an attack-independent (non-adversarial training) technique for improving adversarial robustness of neural network models, with minimal loss of standard accuracy. We suggest creating a neighborhood around each training example, such that the label is kept constant for all inputs within that neighborhood. Unlike previous work that follows a similar principle, we apply this idea b… ▽ More This paper proposes an attack-independent (non-adversarial training) technique for improving adversarial robustness of neural network models, with minimal loss of standard accuracy. We suggest creating a neighborhood around each training example, such that the label is kept constant for all inputs within that neighborhood. Unlike previous work that follows a similar principle, we apply this idea by extending the training set with multiple perturbations for each training example, drawn from within the neighborhood. These perturbations are model independent, and remain constant throughout the entire training process. We analyzed our method empirically on MNIST, SVHN, and CIFAR-10, under different attacks and conditions. Results suggest that the proposed approach improves standard accuracy over other defenses while having increased robustness compared to vanilla adversarial training. △ Less

Submitted 27 July, 2021; v1 submitted 15 March, 2021; originally announced March 2021.

Journal ref: Workshop of RobustML ICLR 2021

arXiv:2103.05468 [pdf, other]

CNN-based Spoken Term Detection and Localization without Dynamic Programming

Authors: Tzeviya Sylvia Fuchs, Yael Segal, Joseph Keshet

Abstract: In this paper, we propose a spoken term detection algorithm for simultaneous prediction and localization of in-vocabulary and out-of-vocabulary terms within an audio segment. The proposed algorithm infers whether a term was uttered within a given speech signal or not by predicting the word embeddings of various parts of the speech signal and comparing them to the word embedding of the desired term… ▽ More In this paper, we propose a spoken term detection algorithm for simultaneous prediction and localization of in-vocabulary and out-of-vocabulary terms within an audio segment. The proposed algorithm infers whether a term was uttered within a given speech signal or not by predicting the word embeddings of various parts of the speech signal and comparing them to the word embedding of the desired term. The algorithm utilizes an existing embedding space for this task and does not need to train a task-specific embedding space. At inference the algorithm simultaneously predicts all possible locations of the target term and does not need dynamic programming for optimal search. We evaluate our system on several spoken term detection tasks on read speech corpora. △ Less

Submitted 7 March, 2021; originally announced March 2021.

Journal ref: ICASSP 2021

arXiv:2011.08704 [pdf, other]

Redesigning the classification layer by randomizing the class representation vectors

Authors: Gabi Shalev, Gal-Lev Shalev, Joseph Keshet

Abstract: Neural image classification models typically consist of two components. The first is an image encoder, which is responsible for encoding a given raw image into a representative vector. The second is the classification component, which is often implemented by projecting the representative vector onto target class vectors. The target class vectors, along with the rest of the model parameters, are es… ▽ More Neural image classification models typically consist of two components. The first is an image encoder, which is responsible for encoding a given raw image into a representative vector. The second is the classification component, which is often implemented by projecting the representative vector onto target class vectors. The target class vectors, along with the rest of the model parameters, are estimated so as to minimize the loss function. In this paper, we analyze how simple design choices for the classification layer affect the learning dynamics. We show that the standard cross-entropy training implicitly captures visual similarities between different classes, which might deteriorate accuracy or even prevents some models from converging. We propose to draw the class vectors randomly and set them as fixed during training, thus invalidating the visual similarities encoded in these vectors. We analyze the effects of kee** the class vectors fixed and show that it can increase the inter-class separability, intra-class compactness, and the overall model accuracy, while maintaining the robustness to image corruptions and the generalization of the learned concepts. △ Less

Submitted 29 November, 2020; v1 submitted 16 November, 2020; originally announced November 2020.

Comments: 11 pages, 9 tables, 5 figures

arXiv:2009.01534 [pdf, other]

Fairness in the Eyes of the Data: Certifying Machine-Learning Models

Authors: Shahar Segal, Yossi Adi, Benny Pinkas, Carsten Baum, Chaya Ganesh, Joseph Keshet

Abstract: We present a framework that allows to certify the fairness degree of a model based on an interactive and privacy-preserving test. The framework verifies any trained model, regardless of its training process and architecture. Thus, it allows us to evaluate any deep learning model on multiple fairness definitions empirically. We tackle two scenarios, where either the test data is privately available… ▽ More We present a framework that allows to certify the fairness degree of a model based on an interactive and privacy-preserving test. The framework verifies any trained model, regardless of its training process and architecture. Thus, it allows us to evaluate any deep learning model on multiple fairness definitions empirically. We tackle two scenarios, where either the test data is privately available only to the tester or is publicly known in advance, even to the model creator. We investigate the soundness of the proposed approach using theoretical analysis and present statistical guarantees for the interactive test. Finally, we provide a cryptographic technique to automate fairness testing and certified inference with only black-box access to the model at hand while hiding the participants' sensitive data. △ Less

Submitted 25 June, 2021; v1 submitted 3 September, 2020; originally announced September 2020.

Comments: Accepted to AIES-2021

arXiv:2007.13465 [pdf, other]

Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation

Authors: Felix Kreuk, Joseph Keshet, Yossi Adi

Abstract: We propose a self-supervised representation learning model for the task of unsupervised phoneme boundary detection. The model is a convolutional neural network that operates directly on the raw waveform. It is optimized to identify spectral changes in the signal using the Noise-Contrastive Estimation principle. At test time, a peak detection algorithm is applied over the model outputs to produce t… ▽ More We propose a self-supervised representation learning model for the task of unsupervised phoneme boundary detection. The model is a convolutional neural network that operates directly on the raw waveform. It is optimized to identify spectral changes in the signal using the Noise-Contrastive Estimation principle. At test time, a peak detection algorithm is applied over the model outputs to produce the final boundaries. As such, the proposed model is trained in a fully unsupervised manner with no manual annotations in the form of target boundaries nor phonetic transcriptions. We compare the proposed approach to several unsupervised baselines using both TIMIT and Buckeye corpora. Results suggest that our approach surpasses the baseline models and reaches state-of-the-art performance on both data sets. Furthermore, we experimented with expanding the training set with additional examples from the Librispeech corpus. We evaluated the resulting model on distributions and languages that were not seen during the training phase (English, Hebrew and German) and showed that utilizing additional untranscribed data is beneficial for model performance. △ Less

Submitted 6 August, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

Comments: Interspeech 2020 paper

arXiv:2002.04992 [pdf, other]

Phoneme Boundary Detection using Learnable Segmental Features

Authors: Felix Kreuk, Yaniv Sheena, Joseph Keshet, Yossi Adi

Abstract: Phoneme boundary detection plays an essential first step for a variety of speech processing applications such as speaker diarization, speech science, keyword spotting, etc. In this work, we propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection. First, we evaluated our model when the spoken p… ▽ More Phoneme boundary detection plays an essential first step for a variety of speech processing applications such as speaker diarization, speech science, keyword spotting, etc. In this work, we propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection. First, we evaluated our model when the spoken phonemes were not given as input. Results on the TIMIT and Buckeye corpora suggest that the proposed model is superior to the baseline models and reaches state-of-the-art performance in terms of F1 and R-value. We further explore the use of phonetic transcription as additional supervision and show this yields minor improvements in performance but substantially better convergence rates. We additionally evaluate the model on a Hebrew corpus and demonstrate such phonetic supervision can be beneficial in a multi-lingual setting. △ Less

Submitted 16 February, 2020; v1 submitted 11 February, 2020; originally announced February 2020.

arXiv:1910.13255 [pdf, other]

Dr.VOT : Measuring Positive and Negative Voice Onset Time in the Wild

Authors: Yosi Shrem, Matthew Goldrick, Joseph Keshet

Abstract: Voice Onset Time (VOT), a key measurement of speech for basic research and applied medical studies, is the time between the onset of a stop burst and the onset of voicing. When the voicing onset precedes burst onset the VOT is negative; if voicing onset follows the burst, it is positive. In this work, we present a deep-learning model for accurate and reliable measurement of VOT in naturalistic spe… ▽ More Voice Onset Time (VOT), a key measurement of speech for basic research and applied medical studies, is the time between the onset of a stop burst and the onset of voicing. When the voicing onset precedes burst onset the VOT is negative; if voicing onset follows the burst, it is positive. In this work, we present a deep-learning model for accurate and reliable measurement of VOT in naturalistic speech. The proposed system addresses two critical issues: it can measure positive and negative VOT equally well, and it is trained to be robust to variation across annotations. Our approach is based on the structured prediction framework, where the feature functions are defined to be RNNs. These learn to capture segmental variation in the signal. Results suggest that our method substantially improves over the current state-of-the-art. In contrast to previous work, our Deep and Robust VOT annotator, Dr.VOT, can successfully estimate negative VOTs while maintaining state-of-the-art performance on positive VOTs. This high level of performance generalizes to new corpora without further retraining. Index Terms: structured prediction, multi-task learning, adversarial training, recurrent neural networks, sequence segmentation. △ Less

Submitted 27 October, 2019; originally announced October 2019.

Comments: interspeech 2019

Journal ref: interspeech 2019

arXiv:1904.07704 [pdf, other]

SpeechYOLO: Detection and Localization of Speech Objects

Authors: Yael Segal, Tzeviya Sylvia Fuchs, Joseph Keshet

Abstract: In this paper, we propose to apply object detection methods from the vision domain on the speech recognition domain, by treating audio fragments as objects. More specifically, we present SpeechYOLO, which is inspired by the YOLO algorithm for object detection in images. The goal of SpeechYOLO is to localize boundaries of utterances within the input signal, and to correctly classify them. Our syste… ▽ More In this paper, we propose to apply object detection methods from the vision domain on the speech recognition domain, by treating audio fragments as objects. More specifically, we present SpeechYOLO, which is inspired by the YOLO algorithm for object detection in images. The goal of SpeechYOLO is to localize boundaries of utterances within the input signal, and to correctly classify them. Our system is composed of a convolutional neural network, with a simple least-mean-squares loss function. We evaluated the system on several keyword spotting tasks, that include corpora of read speech and spontaneous speech. Our system compares favorably with other algorithms trained for both localization and classification. △ Less

Submitted 30 June, 2019; v1 submitted 14 April, 2019; originally announced April 2019.

Journal ref: Interspeech 2019, pp. 4210-4214

arXiv:1902.03083 [pdf, other]

Hide and Speak: Towards Deep Neural Networks for Speech Steganography

Authors: Felix Kreuk, Yossi Adi, Bhiksha Raj, Rita Singh, Joseph Keshet

Abstract: Steganography is the science of hiding a secret message within an ordinary public message, which is referred to as Carrier. Traditionally, digital signal processing techniques, such as least significant bit encoding, were used for hiding messages. In this paper, we explore the use of deep neural networks as steganographic functions for speech data. We showed that steganography models proposed for… ▽ More Steganography is the science of hiding a secret message within an ordinary public message, which is referred to as Carrier. Traditionally, digital signal processing techniques, such as least significant bit encoding, were used for hiding messages. In this paper, we explore the use of deep neural networks as steganographic functions for speech data. We showed that steganography models proposed for vision are less suitable for speech, and propose a new model that includes the short-time Fourier transform and inverse-short-time Fourier transform as differentiable layers within the network, thus imposing a vital constraint on the network outputs. We empirically demonstrated the effectiveness of the proposed method comparing to deep learning based on several speech datasets and analyzed the results quantitatively and qualitatively. Moreover, we showed that the proposed approach could be applied to conceal multiple messages in a single carrier using multiple decoders or a single conditional decoder. Lastly, we evaluated our model under different channel distortions. Qualitative experiments suggest that modifications to the carrier are unnoticeable by human listeners and that the decoded messages are highly intelligible. △ Less

Submitted 27 July, 2020; v1 submitted 7 February, 2019; originally announced February 2019.

arXiv:1808.06664 [pdf, ps, other]

Out-of-Distribution Detection using Multiple Semantic Label Representations

Authors: Gabi Shalev, Yossi Adi, Joseph Keshet

Abstract: Deep Neural Networks are powerful models that attained remarkable results on a variety of tasks. These models are shown to be extremely efficient when training and test data are drawn from the same distribution. However, it is not clear how a network will act when it is fed with an out-of-distribution example. In this work, we consider the problem of out-of-distribution detection in neural network… ▽ More Deep Neural Networks are powerful models that attained remarkable results on a variety of tasks. These models are shown to be extremely efficient when training and test data are drawn from the same distribution. However, it is not clear how a network will act when it is fed with an out-of-distribution example. In this work, we consider the problem of out-of-distribution detection in neural networks. We propose to use multiple semantic dense representations instead of sparse representation as the target label. Specifically, we propose to use several word representations obtained from different corpora or architectures as target labels. We evaluated the proposed model on computer vision, and speech commands detection tasks and compared it to previous methods. Results suggest that our method compares favorably with previous work. Besides, we present the efficiency of our approach for detecting wrongly classified and adversarial examples. △ Less

Submitted 10 January, 2019; v1 submitted 20 August, 2018; originally announced August 2018.

arXiv:1802.04633 [pdf, ps, other]

Turning Your Weakness Into a Strength: Watermarking Deep Neural Networks by Backdooring

Authors: Yossi Adi, Carsten Baum, Moustapha Cisse, Benny Pinkas, Joseph Keshet

Abstract: Deep Neural Networks have recently gained lots of success after enabling several breakthroughs in notoriously challenging problems. Training these networks is computationally expensive and requires vast amounts of training data. Selling such pre-trained models can, therefore, be a lucrative business model. Unfortunately, once the models are sold they can be easily copied and redistributed. To avoi… ▽ More Deep Neural Networks have recently gained lots of success after enabling several breakthroughs in notoriously challenging problems. Training these networks is computationally expensive and requires vast amounts of training data. Selling such pre-trained models can, therefore, be a lucrative business model. Unfortunately, once the models are sold they can be easily copied and redistributed. To avoid this, a tracking mechanism to identify models as the intellectual property of a particular vendor is necessary. In this work, we present an approach for watermarking Deep Neural Networks in a black-box way. Our scheme works for general classification tasks and can easily be combined with current learning algorithms. We show experimentally that such a watermark has no noticeable impact on the primary task that the model is designed for and evaluate the robustness of our proposal against a multitude of practical attacks. Moreover, we provide a theoretical analysis, relating our approach to previous work on backdooring. △ Less

Submitted 11 June, 2018; v1 submitted 13 February, 2018; originally announced February 2018.

arXiv:1802.04528 [pdf, other]

Deceiving End-to-End Deep Learning Malware Detectors using Adversarial Examples

Authors: Felix Kreuk, Assi Barak, Shir Aviv-Reuven, Moran Baruch, Benny Pinkas, Joseph Keshet

Abstract: In recent years, deep learning has shown performance breakthroughs in many applications, such as image detection, image segmentation, pose estimation, and speech recognition. However, this comes with a major concern: deep networks have been found to be vulnerable to adversarial examples. Adversarial examples are slightly modified inputs that are intentionally designed to cause a misclassification… ▽ More In recent years, deep learning has shown performance breakthroughs in many applications, such as image detection, image segmentation, pose estimation, and speech recognition. However, this comes with a major concern: deep networks have been found to be vulnerable to adversarial examples. Adversarial examples are slightly modified inputs that are intentionally designed to cause a misclassification by the model. In the domains of images and speech, the modifications are so small that they are not seen or heard by humans, but nevertheless greatly affect the classification of the model. Deep learning models have been successfully applied to malware detection. In this domain, generating adversarial examples is not straightforward, as small modifications to the bytes of the file could lead to significant changes in its functionality and validity. We introduce a novel loss function for generating adversarial examples specifically tailored for discrete input sets, such as executable bytes. We modify malicious binaries so that they would be detected as benign, while preserving their original functionality, by injecting a small sequence of bytes (payload) in the binary file. We applied this approach to an end-to-end convolutional deep learning malware detection model and show a high rate of detection evasion. Moreover, we show that our generated payload is robust enough to be transferable within different locations of the same file and across different files, and that its entropy is low and similar to that of benign data sections. △ Less

Submitted 10 January, 2019; v1 submitted 13 February, 2018; originally announced February 2018.

arXiv:1801.03339 [pdf, other]

Fooling End-to-end Speaker Verification by Adversarial Examples

Authors: Felix Kreuk, Yossi Adi, Moustapha Cisse, Joseph Keshet

Abstract: Automatic speaker verification systems are increasingly used as the primary means to authenticate costumers. Recently, it has been proposed to train speaker verification systems using end-to-end deep neural models. In this paper, we show that such systems are vulnerable to adversarial example attack. Adversarial examples are generated by adding a peculiar noise to original speaker examples, in suc… ▽ More Automatic speaker verification systems are increasingly used as the primary means to authenticate costumers. Recently, it has been proposed to train speaker verification systems using end-to-end deep neural models. In this paper, we show that such systems are vulnerable to adversarial example attack. Adversarial examples are generated by adding a peculiar noise to original speaker examples, in such a way that they are almost indistinguishable from the original examples by a human listener. Yet, the generated waveforms, which sound as speaker A can be used to fool such a system by claiming as if the waveforms were uttered by speaker B. We present white-box attacks on an end-to-end deep network that was either trained on YOHO or NTIMIT. We also present two black-box attacks: where the adversarial examples were generated with a system that was trained on YOHO, but the attack is on a system that was trained on NTIMIT; and when the adversarial examples were generated with a system that was trained on Mel-spectrum feature set, but the attack is on a system that was trained on MFCC. Results suggest that the accuracy of the attacked system was decreased and the false-positive rate was dramatically increased. △ Less

Submitted 16 February, 2018; v1 submitted 10 January, 2018; originally announced January 2018.

arXiv:1707.05373 [pdf, other]

Houdini: Fooling Deep Structured Prediction Models

Authors: Moustapha Cisse, Yossi Adi, Natalia Neverova, Joseph Keshet

Abstract: Generating adversarial examples is a critical step for evaluating and improving the robustness of learning machines. So far, most existing methods only work for classification and are not designed to alter the true performance measure of the problem at hand. We introduce a novel flexible approach named Houdini for generating adversarial examples specifically tailored for the final performance meas… ▽ More Generating adversarial examples is a critical step for evaluating and improving the robustness of learning machines. So far, most existing methods only work for classification and are not designed to alter the true performance measure of the problem at hand. We introduce a novel flexible approach named Houdini for generating adversarial examples specifically tailored for the final performance measure of the task considered, be it combinatorial and non-decomposable. We successfully apply Houdini to a range of applications such as speech recognition, pose estimation and semantic segmentation. In all cases, the attacks based on Houdini achieve higher success rate than those based on the traditional surrogates used to train the models while using a less perceptible adversarial perturbation. △ Less

Submitted 17 July, 2017; originally announced July 2017.

Comments: 12 pages, 8 figures, under review

arXiv:1704.01653 [pdf, other]

Automatic Measurement of Pre-aspiration

Authors: Yaniv Sheena, Míša Hejná, Yossi Adi, Joseph Keshet

Abstract: Pre-aspiration is defined as the period of glottal friction occurring in sequences of vocalic/consonantal sonorants and phonetically voiceless obstruents. We propose two machine learning methods for automatic measurement of pre-aspiration duration: a feedforward neural network, which works at the frame level; and a structured prediction model, which relies on manually designed feature functions, a… ▽ More Pre-aspiration is defined as the period of glottal friction occurring in sequences of vocalic/consonantal sonorants and phonetically voiceless obstruents. We propose two machine learning methods for automatic measurement of pre-aspiration duration: a feedforward neural network, which works at the frame level; and a structured prediction model, which relies on manually designed feature functions, and works at the segment level. The input for both algorithms is a speech signal of an arbitrary length containing a single obstruent, and the output is a pair of times which constitutes the pre-aspiration boundaries. We train both models on a set of manually annotated examples. Results suggest that the structured model is superior to the frame-based model as it yields higher accuracy in predicting the boundaries and generalizes to new speakers and new languages. Finally, we demonstrate the applicability of our structured prediction algorithm by replicating linguistic analysis of pre-aspiration in Aberystwyth English with high correlation. △ Less

Submitted 15 June, 2017; v1 submitted 5 April, 2017; originally announced April 2017.

arXiv:1703.09817 [pdf, other]

Learning Similarity Functions for Pronunciation Variations

Authors: Einat Naaman, Yossi Adi, Joseph Keshet

Abstract: A significant source of errors in Automatic Speech Recognition (ASR) systems is due to pronunciation variations which occur in spontaneous and conversational speech. Usually ASR systems use a finite lexicon that provides one or more pronunciations for each word. In this paper, we focus on learning a similarity function between two pronunciations. The pronunciations can be the canonical and the sur… ▽ More A significant source of errors in Automatic Speech Recognition (ASR) systems is due to pronunciation variations which occur in spontaneous and conversational speech. Usually ASR systems use a finite lexicon that provides one or more pronunciations for each word. In this paper, we focus on learning a similarity function between two pronunciations. The pronunciations can be the canonical and the surface pronunciations of the same word or they can be two surface pronunciations of different words. This task generalizes problems such as lexical access (the problem of learning the map** between words and their possible pronunciations), and defining word neighborhoods. It can also be used to dynamically increase the size of the pronunciation lexicon, or in predicting ASR errors. We propose two methods, which are based on recurrent neural networks, to learn the similarity function. The first is based on binary classification, and the second is based on learning the ranking of the pronunciations. We demonstrate the efficiency of our approach on the task of lexical access using a subset of the Switchboard conversational speech corpus. Results suggest that on this task our methods are superior to previous methods which are based on graphical Bayesian methods. △ Less

Submitted 18 June, 2017; v1 submitted 28 March, 2017; originally announced March 2017.

arXiv:1611.01783 [pdf, other]

Domain Adaptation For Formant Estimation Using Deep Learning

Authors: Yehoshua Dissen, Joseph Keshet, Jacob Goldberger, Cynthia Clopper

Abstract: In this paper we present a domain adaptation technique for formant estimation using a deep network. We first train a deep learning network on a small read speech dataset. We then freeze the parameters of the trained network and use several different datasets to train an adaptation layer that makes the obtained network universal in the sense that it works well for a variety of speakers and speech d… ▽ More In this paper we present a domain adaptation technique for formant estimation using a deep network. We first train a deep learning network on a small read speech dataset. We then freeze the parameters of the trained network and use several different datasets to train an adaptation layer that makes the obtained network universal in the sense that it works well for a variety of speakers and speech domains with very different characteristics. We evaluated our adapted network on three datasets, each of which has different speaker characteristics and speech styles. The performance of our method compares favorably with alternative methods for formant estimation. △ Less

Submitted 6 November, 2016; originally announced November 2016.

arXiv:1610.08166 [pdf, other]

doi 10.1121/1.4972527

Automatic measurement of vowel duration via structured prediction

Authors: Yossi Adi, Joseph Keshet, Emily Cibelli, Erin Gustafson, Cynthia Clopper, Matthew Goldrick

Abstract: A key barrier to making phonetic studies scalable and replicable is the need to rely on subjective, manual annotation. To help meet this challenge, a machine learning algorithm was developed for automatic measurement of a widely used phonetic measure: vowel duration. Manually-annotated data were used to train a model that takes as input an arbitrary length segment of the acoustic signal containing… ▽ More A key barrier to making phonetic studies scalable and replicable is the need to rely on subjective, manual annotation. To help meet this challenge, a machine learning algorithm was developed for automatic measurement of a widely used phonetic measure: vowel duration. Manually-annotated data were used to train a model that takes as input an arbitrary length segment of the acoustic signal containing a single vowel that is preceded and followed by consonants and outputs the duration of the vowel. The model is based on the structured prediction framework. The input signal and a hypothesized set of a vowel's onset and offset are mapped to an abstract vector space by a set of acoustic feature functions. The learning algorithm is trained in this space to minimize the difference in expectations between predicted and manually-measured vowel durations. The trained model can then automatically estimate vowel durations without phonetic or orthographic transcription. Results comparing the model to three sets of manually annotated data suggest it out-performed the current gold standard for duration measurement, an HMM-based forced aligner (which requires orthographic or phonetic transcription as an input). △ Less

Submitted 26 October, 2016; originally announced October 2016.

arXiv:1610.07918 [pdf, other]

Sequence Segmentation Using Joint RNN and Structured Prediction Models

Authors: Yossi Adi, Joseph Keshet, Emily Cibelli, Matthew Goldrick

Abstract: We describe and analyze a simple and effective algorithm for sequence segmentation applied to speech processing tasks. We propose a neural architecture that is composed of two modules trained jointly: a recurrent neural network (RNN) module and a structured prediction model. The RNN outputs are considered as feature functions to the structured model. The overall model is trained with a structured… ▽ More We describe and analyze a simple and effective algorithm for sequence segmentation applied to speech processing tasks. We propose a neural architecture that is composed of two modules trained jointly: a recurrent neural network (RNN) module and a structured prediction model. The RNN outputs are considered as feature functions to the structured model. The overall model is trained with a structured loss function which can be designed to the given segmentation task. We demonstrate the effectiveness of our method by applying it to two simple tasks commonly used in phonetic studies: word segmentation and voice onset time segmentation. Results sug- gest the proposed model is superior to previous methods, ob- taining state-of-the-art results on the tested datasets. △ Less

Submitted 25 October, 2016; originally announced October 2016.

Comments: under review

arXiv:1512.07851 [pdf, other]

Context-Based Prediction of App Usage

Authors: Joseph Keshet, Adam Kariv, Arnon Dagan, Dvir Volk, Joey Simhon

Abstract: There are around a hundred installed apps on an average smartphone. The high number of apps and the limited number of app icons that can be displayed on the device's screen requires a new paradigm to address their visibility to the user. In this paper we propose a new online algorithm for dynamically predicting a set of apps that the user is likely to use. The algorithm runs on the user's device a… ▽ More There are around a hundred installed apps on an average smartphone. The high number of apps and the limited number of app icons that can be displayed on the device's screen requires a new paradigm to address their visibility to the user. In this paper we propose a new online algorithm for dynamically predicting a set of apps that the user is likely to use. The algorithm runs on the user's device and constantly learns the user's habits at a given time, location, and device state. It is designed to actively help the user to navigate to the desired app as well as to provide a personalized feeling, and hence is aimed at maximizing the AUC. We show both theoretically and empirically that the algorithm maximizes the AUC, and yields good results on a set of 1,000 devices. △ Less

Submitted 25 January, 2016; v1 submitted 24 December, 2015; originally announced December 2015.

arXiv:1512.02033 [pdf, ps, other]

Risk Minimization in Structured Prediction using Orbit Loss

Authors: Danny Karmon, Joseph Keshet

Abstract: We introduce a new surrogate loss function called orbit loss in the structured prediction framework, which has good theoretical and practical advantages. While the orbit loss is not convex, it has a simple analytical gradient and a simple perceptron-like learning rule. We analyze the new loss theoretically and state a PAC-Bayesian generalization bound. We also prove that the new loss is consistent… ▽ More We introduce a new surrogate loss function called orbit loss in the structured prediction framework, which has good theoretical and practical advantages. While the orbit loss is not convex, it has a simple analytical gradient and a simple perceptron-like learning rule. We analyze the new loss theoretically and state a PAC-Bayesian generalization bound. We also prove that the new loss is consistent in the strong sense; namely, the risk achieved by the set of the trained parameters approaches the infimum risk achievable by any linear decoder over the given features. Methods that are aimed at risk minimization, such as the structured ramp loss, the structured probit loss and the direct loss minimization require at least two inference operations per training iteration. In this sense, the orbit loss is more efficient as it requires only one inference operation per training iteration, while yields similar performance. We conclude the paper with an empirical comparison of the proposed loss function to the structured hinge loss, the structured ramp loss, the structured probit loss and the direct loss minimization method on several benchmark datasets and tasks. △ Less

Submitted 9 December, 2015; v1 submitted 7 December, 2015; originally announced December 2015.

arXiv:1109.4603 [pdf, other]

Explicit Approximations of the Gaussian Kernel

Authors: Andrew Cotter, Joseph Keshet, Nathan Srebro

Abstract: We investigate training and using Gaussian kernel SVMs by approximating the kernel with an explicit finite- dimensional polynomial feature representation based on the Taylor expansion of the exponential. Although not as efficient as the recently-proposed random Fourier features [Rahimi and Recht, 2007] in terms of the number of features, we show how this polynomial representation can provide a bet… ▽ More We investigate training and using Gaussian kernel SVMs by approximating the kernel with an explicit finite- dimensional polynomial feature representation based on the Taylor expansion of the exponential. Although not as efficient as the recently-proposed random Fourier features [Rahimi and Recht, 2007] in terms of the number of features, we show how this polynomial representation can provide a better approximation in terms of the computational cost involved. This makes our "Taylor features" especially attractive for use on very large data sets, in conjunction with online or stochastic training. △ Less

Submitted 21 September, 2011; originally announced September 2011.

Comments: 11 pages, 2 tables, 2 figures

Showing 1–35 of 35 results for author: Keshet, J