
Showing 1–17 of 17 results for author: Keshet, J

Searching in archive eess.
  1. arXiv:2406.19363  [pdf, ps, other]

    eess.AS

    Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment

    Authors: Rotem Rousso, Eyal Cohen, Joseph Keshet, Eleanor Chodroff

    Abstract: Forced alignment (FA) plays a key role in speech research through the automatic time alignment of speech signals with corresponding text transcriptions. Despite the move towards end-to-end architectures for speech technology, FA is still dominantly achieved through a classic GMM-HMM acoustic model. This work directly compares alignment performance from leading automatic speech recognition (ASR) me…

    Submitted 27 June, 2024; originally announced June 2024.

    Journal ref: Interspeech 2024

  2. arXiv:2406.18928  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network

    Authors: Yehoshua Dissen, Shiry Yonash, Israel Cohen, Joseph Keshet

    Abstract: In the realm of automatic speech recognition (ASR), robustness in noisy environments remains a significant challenge. Recent ASR models, such as Whisper, have shown promise, but their efficacy in noisy conditions can be further enhanced. This study is focused on recovering from packet loss to improve the word error rate (WER) of ASR models. We propose using a front-end adaptation network connected…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: Accepted for publication at INTERSPEECH 2024

  3. arXiv:2406.02649  [pdf, other]

    eess.AS cs.LG cs.SD

    Keyword-Guided Adaptation of Automatic Speech Recognition

    Authors: Aviv Shamsian, Aviv Navon, Neta Glazer, Gill Hetz, Joseph Keshet

    Abstract: Automatic Speech Recognition (ASR) technology has made significant progress in recent years, providing accurate transcription across various domains. However, some challenges remain, especially in noisy environments and specialized jargon. In this paper, we propose a novel approach for improved jargon word recognition by contextual biasing Whisper-based models. We employ a keyword spotting model t…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted to InterSpeech 2024

  4. arXiv:2310.01381  [pdf, other]

    cs.SD cs.CL eess.AS

    DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation

    Authors: Roi Benita, Michael Elad, Joseph Keshet

    Abstract: Diffusion models have recently been shown to be relevant for high-quality speech generation. Most work has been focused on generating spectrograms, and as such, they further require a subsequent model to convert the spectrogram to a waveform (i.e., a vocoder). This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform. The proposed model is autoregressive, g…

    Submitted 10 March, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

  5. arXiv:2309.08561  [pdf, other]

    eess.AS cs.LG cs.SD

    Open-vocabulary Keyword-spotting with Adaptive Instance Normalization

    Authors: Aviv Navon, Aviv Shamsian, Neta Glazer, Gill Hetz, Joseph Keshet

    Abstract: Open vocabulary keyword spotting is a crucial and challenging task in automatic speech recognition (ASR) that focuses on detecting user-defined keywords within a spoken utterance. Keyword spotting methods commonly map the audio utterance and keyword into a joint embedding space to obtain some affinity score. In this work, we propose AdaKWS, a novel method for keyword spotting in which a text encod…

    Submitted 13 September, 2023; originally announced September 2023.

    Comments: Under Review

  6. arXiv:2206.14639  [pdf, other]

    eess.AS cs.LG cs.SD

    DDKtor: Automatic Diadochokinetic Speech Analysis

    Authors: Yael Segal, Kasia Hitczenko, Matthew Goldrick, Adam Buchwald, Angela Roberts, Joseph Keshet

    Abstract: Diadochokinetic speech tasks (DDK), in which participants repeatedly produce syllables, are commonly used as part of the assessment of speech motor impairments. These studies rely on manual analyses that are time-intensive, subjective, and provide only a coarse-grained picture of speech. This paper presents two deep neural network models that automatically segment consonants and vowels from unanno…

    Submitted 29 June, 2022; originally announced June 2022.

    Comments: Accepted to Interspeech 2022

  7. arXiv:2206.11632  [pdf, other]

    cs.SD eess.AS

    Formant Estimation and Tracking using Probabilistic Heat-Maps

    Authors: Yosi Shrem, Felix Kreuk, Joseph Keshet

    Abstract: Formants are the spectral maxima that result from acoustic resonances of the human vocal tract, and their accurate estimation is among the most fundamental speech processing problems. Recent work has shown that those frequencies can accurately be estimated using deep learning techniques. However, when presented with speech from a different domain than that in which they have been trained on…

    Submitted 23 June, 2022; originally announced June 2022.

    Comments: Interspeech 2022

  8. arXiv:2204.13094  [pdf, other]

    cs.SD eess.AS

    Unsupervised Word Segmentation using K Nearest Neighbors

    Authors: Tzeviya Sylvia Fuchs, Yedid Hoshen, Joseph Keshet

    Abstract: In this paper, we propose an unsupervised kNN-based approach for word segmentation in speech utterances. Our method relies on self-supervised pre-trained speech representations, and compares each audio segment of a given utterance to its K nearest neighbors within the training set. Our main assumption is that a segment containing more than one word would occur less often than a segment containing…

    Submitted 27 April, 2022; originally announced April 2022.

    Comments: Submitted to Interspeech 2022

  9. arXiv:2204.04166  [pdf, other]

    cs.SD cs.LG eess.AS

    Self-supervised Speaker Diarization

    Authors: Yehoshua Dissen, Felix Kreuk, Joseph Keshet

    Abstract: Over the last few years, deep learning has grown in popularity for speaker verification, identification, and diarization. Inarguably, a significant part of this success is due to the demonstrated effectiveness of their speaker representations. These, however, are heavily dependent on large amounts of annotated data and can be sensitive to new domains. This study proposes an entirely unsupervised d…

    Submitted 8 April, 2022; originally announced April 2022.

    Comments: Submitted to Interspeech 2022

  10. arXiv:2204.03379  [pdf, ps, other]

    eess.AS cs.LG

    Correcting Mispronunciations in Speech using Spectrogram Inpainting

    Authors: Talia Ben-Simon, Felix Kreuk, Faten Awwad, Jacob T. Cohen, Joseph Keshet

    Abstract: Learning a new language involves constantly comparing speech productions with reference productions from the environment. Early in speech acquisition, children make articulatory adjustments to match their caregivers' speech. Grownup learners of a language tweak their speech to match the tutor reference. This paper proposes a method to synthetically generate correct pronunciation feedback given inc…

    Submitted 30 June, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: Accepted for publication at Interspeech 2022

  11. arXiv:2203.17019  [pdf, other]

    eess.AS cs.LG cs.SD

    DeepFry: Identifying Vocal Fry Using Deep Neural Networks

    Authors: Bronya R. Chernyak, Talia Ben Simon, Yael Segal, Jeremy Steffman, Eleanor Chodroff, Jennifer S. Cole, Joseph Keshet

    Abstract: Vocal fry or creaky voice refers to a voice quality characterized by irregular glottal opening and low pitch. It occurs in diverse languages and is prevalent in American English, where it is used not only to mark phrase finality, but also sociolinguistic factors and affect. Due to its irregular periodicity, creaky voice challenges automatic speech processing and recognition systems, particularly f…

    Submitted 26 June, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

    Comments: Accepted to Interspeech 2022

  12. arXiv:2103.05468  [pdf, other]

    eess.AS cs.LG cs.SD

    CNN-based Spoken Term Detection and Localization without Dynamic Programming

    Authors: Tzeviya Sylvia Fuchs, Yael Segal, Joseph Keshet

    Abstract: In this paper, we propose a spoken term detection algorithm for simultaneous prediction and localization of in-vocabulary and out-of-vocabulary terms within an audio segment. The proposed algorithm infers whether a term was uttered within a given speech signal or not by predicting the word embeddings of various parts of the speech signal and comparing them to the word embedding of the desired term…

    Submitted 7 March, 2021; originally announced March 2021.

    Journal ref: ICASSP 2021

  13. arXiv:2007.13465  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation

    Authors: Felix Kreuk, Joseph Keshet, Yossi Adi

    Abstract: We propose a self-supervised representation learning model for the task of unsupervised phoneme boundary detection. The model is a convolutional neural network that operates directly on the raw waveform. It is optimized to identify spectral changes in the signal using the Noise-Contrastive Estimation principle. At test time, a peak detection algorithm is applied over the model outputs to produce t…

    Submitted 6 August, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

    Comments: Interspeech 2020 paper

  14. arXiv:2002.04992  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Phoneme Boundary Detection using Learnable Segmental Features

    Authors: Felix Kreuk, Yaniv Sheena, Joseph Keshet, Yossi Adi

    Abstract: Phoneme boundary detection is an essential first step for a variety of speech processing applications such as speaker diarization, speech science, keyword spotting, etc. In this work, we propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection. First, we evaluated our model when the spoken p…

    Submitted 16 February, 2020; v1 submitted 11 February, 2020; originally announced February 2020.

  15. arXiv:1910.13255  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Dr.VOT : Measuring Positive and Negative Voice Onset Time in the Wild

    Authors: Yosi Shrem, Matthew Goldrick, Joseph Keshet

    Abstract: Voice Onset Time (VOT), a key measurement of speech for basic research and applied medical studies, is the time between the onset of a stop burst and the onset of voicing. When the voicing onset precedes burst onset the VOT is negative; if voicing onset follows the burst, it is positive. In this work, we present a deep-learning model for accurate and reliable measurement of VOT in naturalistic spe…

    Submitted 27 October, 2019; originally announced October 2019.

    Comments: Interspeech 2019

    Journal ref: Interspeech 2019

  16. arXiv:1904.07704  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    SpeechYOLO: Detection and Localization of Speech Objects

    Authors: Yael Segal, Tzeviya Sylvia Fuchs, Joseph Keshet

    Abstract: In this paper, we propose to apply object detection methods from the vision domain to the speech recognition domain, by treating audio fragments as objects. More specifically, we present SpeechYOLO, which is inspired by the YOLO algorithm for object detection in images. The goal of SpeechYOLO is to localize boundaries of utterances within the input signal, and to correctly classify them. Our syste…

    Submitted 30 June, 2019; v1 submitted 14 April, 2019; originally announced April 2019.

    Journal ref: Interspeech 2019, pp. 4210-4214

  17. arXiv:1902.03083  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS stat.ML

    Hide and Speak: Towards Deep Neural Networks for Speech Steganography

    Authors: Felix Kreuk, Yossi Adi, Bhiksha Raj, Rita Singh, Joseph Keshet

    Abstract: Steganography is the science of hiding a secret message within an ordinary public message, which is referred to as Carrier. Traditionally, digital signal processing techniques, such as least significant bit encoding, were used for hiding messages. In this paper, we explore the use of deep neural networks as steganographic functions for speech data. We showed that steganography models proposed for…

    Submitted 27 July, 2020; v1 submitted 7 February, 2019; originally announced February 2019.