Showing 1–48 of 48 results for author: Kamper, H

Searching in archive eess.
  1. arXiv:2406.07133  [pdf, other]

    eess.AS cs.CL cs.SD

    Translating speech with just images

    Authors: Dan Oneata, Herman Kamper

    Abstract: Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for speech translation with just images by having the audio in a different language from the generated captions. We investigate such a system on a real low…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  2. arXiv:2403.13922  [pdf, other]

    cs.CL eess.AS

    Visually Grounded Speech Models have a Mutual Exclusivity Bias

    Authors: Leanne Nortje, Dan Oneaţă, Yevgen Matusevych, Herman Kamper

    Abstract: When children learn new words, they employ constraints such as the mutual exclusivity (ME) bias: a novel word is mapped to a novel object rather than a familiar one. This bias has been studied computationally, but only in models that use discrete word representations as input, ignoring the high variability of spoken words. We investigate the ME bias in the context of visually grounded speech model…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Accepted to TACL, pre-MIT Press publication version

  3. arXiv:2401.17902  [pdf, other]

    eess.AS cs.CL cs.SD

    Revisiting speech segmentation and lexicon learning with better features

    Authors: Herman Kamper, Benjamin van Niekerk

    Abstract: We revisit a self-supervised method that segments unlabelled speech into word-like segments. We start from the two-stage duration-penalised dynamic programming method that performs zero-resource segmentation without learning an explicit lexicon. In the first acoustic unit discovery stage, we replace contrastive predictive coding features with HuBERT. After word segmentation in the second stage, we…

    Submitted 31 January, 2024; originally announced January 2024.

    Comments: 2 pages

  4. arXiv:2310.08104  [pdf, other]

    eess.AS cs.CL cs.SD

    Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices

    Authors: Matthew Baas, Herman Kamper

    Abstract: Voice conversion aims to convert source speech into a target voice using recordings of the target speaker as a reference. Newer models are producing increasingly realistic output. But what happens when models are fed with non-standard data, such as speech from a user with a speech impairment? We investigate how a recent voice conversion model performs on non-standard downstream voice conversion ta…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: 11 pages, 1 figure, 5 tables. Accepted at SACAIR 2023

  5. arXiv:2307.06040  [pdf, other]

    eess.AS cs.LG cs.SD

    Rhythm Modeling for Voice Conversion

    Authors: Benjamin van Niekerk, Marc-André Carbonneau, Herman Kamper

    Abstract: Voice conversion aims to transform source speech into a different target voice. However, typical voice conversion systems do not account for rhythm, which is an important factor in the perception of speaker identity. To bridge this gap, we introduce Urhythmic, an unsupervised method for rhythm conversion that does not require parallel data or text transcriptions. Using self-supervised representatio…

    Submitted 12 July, 2023; originally announced July 2023.

    Comments: 5 pages, 4 figures, 4 tables, submitted to IEEE Signal Processing Letters

  6. arXiv:2307.02083  [pdf, other]

    eess.AS cs.CL

    Leveraging multilingual transfer for unsupervised semantic acoustic word embeddings

    Authors: Christiaan Jacobs, Herman Kamper

    Abstract: Acoustic word embeddings (AWEs) are fixed-dimensional vector representations of speech segments that encode phonetic content so that different realisations of the same word have similar embeddings. In this paper we explore semantic AWE modelling. These AWEs should not only capture phonetics but also the meaning of a word (similar to textual word embeddings). We consider the scenario where we only…

    Submitted 5 July, 2023; originally announced July 2023.

    Comments: Submitted to IEEE SPL

  7. arXiv:2307.01673  [pdf, other]

    eess.AS cs.CL cs.SD

    Disentanglement in a GAN for Unconditional Speech Synthesis

    Authors: Matthew Baas, Herman Kamper

    Abstract: Can we develop a model that can synthesize realistic speech directly from a latent space, without explicit conditioning? Despite several efforts over the last decade, previous adversarial and diffusion-based approaches still struggle to achieve this, even on small-vocabulary datasets. To address this, we propose AudioStyleGAN (ASGAN) -- a generative adversarial network for unconditional speech syn…

    Submitted 25 January, 2024; v1 submitted 4 July, 2023; originally announced July 2023.

    Comments: 12 pages, 5 tables, 4 figures. Accepted to IEEE TASLP. arXiv admin note: substantial text overlap with arXiv:2210.05271

  8. arXiv:2306.11371  [pdf, other]

    eess.AS cs.CL

    Visually grounded few-shot word learning in low-resource settings

    Authors: Leanne Nortje, Dan Oneata, Herman Kamper

    Abstract: We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work has simplified this few-shot learning problem by either using an artificial setting with digit word-image pairs or by using a large number of examples…

    Submitted 18 April, 2024; v1 submitted 20 June, 2023; originally announced June 2023.

    Comments: Accepted to TASLP. arXiv admin note: substantial text overlap with arXiv:2305.15937

  9. arXiv:2306.00410  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards hate speech detection in low-resource languages: Comparing ASR to acoustic word embeddings on Wolof and Swahili

    Authors: Christiaan Jacobs, Nathanaël Carraz Rakotonirina, Everlyn Asiko Chimoto, Bruce A. Bassett, Herman Kamper

    Abstract: We consider hate speech detection through keyword spotting on radio broadcasts. One approach is to build an automatic speech recognition (ASR) system for the target low-resource language. We compare this to using acoustic word embedding (AWE) models that map speech segments to a space where matching words have similar vectors. We specifically use a multilingual AWE model trained on labelled data f…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech 2023

  10. arXiv:2305.18975  [pdf, other]

    eess.AS cs.CL cs.SD

    Voice Conversion With Just Nearest Neighbors

    Authors: Matthew Baas, Benjamin van Niekerk, Herman Kamper

    Abstract: Any-to-any voice conversion aims to transform source speech into a target voice with just a few examples of the target speaker as a reference. Recent methods produce convincing conversions, but at the cost of increased complexity -- making results difficult to reproduce and build on. Instead, we keep it simple. We propose k-nearest neighbors voice conversion (kNN-VC): a straightforward yet effecti…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: 5 pages, 1 table, 2 figures. Accepted at Interspeech 2023

  11. arXiv:2305.15937  [pdf, other]

    cs.CL cs.AI eess.AS

    Visually grounded few-shot word acquisition with fewer shots

    Authors: Leanne Nortje, Benjamin van Niekerk, Herman Kamper

    Abstract: We propose a visually grounded speech model that acquires new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work has simplified this problem by either using an artificial setting with digit word-image pairs or by using a large number of examples per class. We p…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

  12. arXiv:2305.13080  [pdf, other]

    cs.CL cs.AI eess.AS

    Mitigating Catastrophic Forgetting for Few-Shot Spoken Word Classification Through Meta-Learning

    Authors: Ruan van der Merwe, Herman Kamper

    Abstract: We consider the problem of few-shot spoken word classification in a setting where a model is incrementally introduced to new word classes. This would occur in a user-defined keyword system where new words can be added as the system is used. In such a continual learning scenario, a model might start to misclassify earlier words as newer classes are added, i.e. catastrophic forgetting. To address th…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: 5 pages, 3 figures, Accepted to Interspeech 2023

    ACM Class: I.2.7; I.2.6

  13. arXiv:2210.07677  [pdf, other]

    eess.AS cs.AI cs.SD

    TransFusion: Transcribing Speech with Multinomial Diffusion

    Authors: Matthew Baas, Kevin Eloff, Herman Kamper

    Abstract: Diffusion models have shown exceptional scaling properties in the image synthesis domain, and initial attempts have shown similar benefits for applying diffusion to unconditional text synthesis. Denoising diffusion models attempt to iteratively refine a sampled noise signal until it resembles a coherent signal (such as an image or written sentence). In this work we aim to see whether the benefits…

    Submitted 14 October, 2022; originally announced October 2022.

    Comments: 12 pages, 4 figures, 1 table. Accepted at SACAIR 2022

  14. arXiv:2210.06229  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards visually prompted keyword localisation for zero-resource spoken languages

    Authors: Leanne Nortje, Herman Kamper

    Abstract: Imagine being able to show a system a visual depiction of a keyword and finding spoken utterances that contain this keyword from a zero-resource speech corpus. We formalise this task and call it visually prompted keyword localisation (VPKL): given an image of a keyword, detect and predict where in an utterance the keyword occurs. To do VPKL, we propose a speech-vision model with a novel localising…

    Submitted 12 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  15. arXiv:2210.05271  [pdf, other]

    cs.SD cs.AI eess.AS

    GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models

    Authors: Matthew Baas, Herman Kamper

    Abstract: We propose AudioStyleGAN (ASGAN), a new generative adversarial network (GAN) for unconditional speech synthesis. As in the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques,…

    Submitted 11 October, 2022; originally announced October 2022.

    Comments: 6 pages, 2 figures, 2 tables. Accepted at IEEE SLT 2022

  16. arXiv:2210.04600  [pdf, other]

    cs.CL eess.AS

    YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding

    Authors: Kayode Olaleye, Dan Oneata, Herman Kamper

    Abstract: Visually grounded speech (VGS) models are trained on images paired with unlabelled spoken captions. Such models could be used to build speech systems in settings where it is impossible to get labelled data, e.g. for documenting unwritten languages. However, most VGS studies are in English or other high-resource languages. This paper attempts to address this shortcoming. We collect and release a ne…

    Submitted 12 October, 2022; v1 submitted 10 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  17. arXiv:2206.11706  [pdf, other]

    eess.AS cs.CL cs.LG stat.ML

    A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit Discovery

    Authors: Werner van der Merwe, Herman Kamper, Johan du Preez

    Abstract: Latent Dirichlet allocation (LDA) is widely used for unsupervised topic modelling on sets of documents. No temporal information is used in the model. However, there is often a relationship between the corresponding topics of consecutive tokens. In this paper, we present an extension to LDA that uses a Markov chain to model temporal information. We use this new model for acoustic unit discovery fro…

    Submitted 29 June, 2022; v1 submitted 23 June, 2022; originally announced June 2022.

  18. arXiv:2202.11929  [pdf, other]

    cs.CL cs.SD eess.AS

    Word Segmentation on Discovered Phone Units with Dynamic Programming and Self-Supervised Scoring

    Authors: Herman Kamper

    Abstract: Recent work on unsupervised speech segmentation has used self-supervised models with phone and word segmentation modules that are trained jointly. This paper instead revisits an older approach to word segmentation: bottom-up phone-like unit discovery is performed first, and symbolic word segmentation is then performed on top of the discovered units (without influencing the lower level). To do this…

    Submitted 9 January, 2023; v1 submitted 24 February, 2022; originally announced February 2022.

    Comments: 11 pages, 5 figures, 5 tables

    Journal ref: IEEE/ACM Transactions on Audio, Speech and Language Processing 31 (2023) 684-694

  19. arXiv:2202.01107  [pdf, other]

    cs.CL cs.SD eess.AS

    Keyword localisation in untranscribed speech using visually grounded speech models

    Authors: Kayode Olaleye, Dan Oneata, Herman Kamper

    Abstract: Keyword localisation is the task of finding where in a speech utterance a given query keyword occurs. We investigate to what extent keyword localisation is possible using a visually grounded speech (VGS) model. VGS models are trained on unlabelled images paired with spoken captions. These models are therefore self-supervised -- trained without any explicit textual label or location information. To…

    Submitted 2 February, 2022; originally announced February 2022.

    Comments: 10 figures, 5 tables

  20. arXiv:2111.02674  [pdf, other]

    eess.AS cs.CL cs.SD

    Voice Conversion Can Improve ASR in Very Low-Resource Settings

    Authors: Matthew Baas, Herman Kamper

    Abstract: Voice conversion (VC) could be used to improve speech recognition systems in low-resource languages by using it to augment limited training data. However, VC has not been widely used for this purpose because of practical issues such as compute speed and limitations when converting to and from unseen speakers. Moreover, it is still unclear whether a VC model trained on one well-resourced language c…

    Submitted 21 June, 2022; v1 submitted 4 November, 2021; originally announced November 2021.

    Comments: 5 pages, 4 tables, 2 figures. Accepted at Interspeech 2022

  21. A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion

    Authors: Benjamin van Niekerk, Marc-André Carbonneau, Julian Zaïdi, Mathew Baas, Hugo Seuté, Herman Kamper

    Abstract: The goal of voice conversion is to transform source speech into a target voice, keeping the content unchanged. In this paper, we focus on self-supervised representation learning for voice conversion. Specifically, we compare discrete and soft speech units as input features. We find that discrete representations effectively remove speaker information but discard some linguistic content - leading to…

    Submitted 8 June, 2022; v1 submitted 3 November, 2021; originally announced November 2021.

    Comments: 5 pages, 2 figures, 2 tables. Accepted at ICASSP 2022

  22. Feature learning for efficient ASR-free keyword spotting in low-resource languages

    Authors: Ewald van der Westhuizen, Herman Kamper, Raghav Menon, John Quinn, Thomas Niesler

    Abstract: We consider feature learning for efficient keyword spotting that can be applied in severely under-resourced settings. The objective is to support humanitarian relief programmes by the United Nations in parts of Africa in which almost no language resources are available. For rapid development in such languages, we rely on a small, easily-compiled set of isolated keywords. These keyword templates ar…

    Submitted 13 August, 2021; originally announced August 2021.

    Comments: 37 pages, 14 figures, Preprint accepted for publication in Computer Speech and Language

  23. arXiv:2108.00917  [pdf, other]

    eess.AS cs.SD

    Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing

    Authors: Benjamin van Niekerk, Leanne Nortje, Matthew Baas, Herman Kamper

    Abstract: Contrastive predictive coding (CPC) aims to learn representations of speech by distinguishing future observations from a set of negative examples. Previous work has shown that linear classifiers trained on CPC features can accurately predict speaker and phone labels. However, it is unclear how the features actually capture speaker and phonetic information, and whether it is possible to normalize o…

    Submitted 2 August, 2021; originally announced August 2021.

    Comments: Accepted to Interspeech 2021

  24. arXiv:2106.12834  [pdf, other]

    cs.CL cs.SD eess.AS

    Multilingual transfer of acoustic word embeddings improves when training on languages related to the target zero-resource language

    Authors: Christiaan Jacobs, Herman Kamper

    Abstract: Acoustic word embedding models map variable duration speech segments to fixed dimensional vectors, enabling efficient speech search and discovery. Previous work explored how embeddings can be obtained in zero-resource settings where no labelled data is available in the target language. The current best approach uses transfer learning: a single supervised multilingual model is trained using labelle…

    Submitted 24 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021

  25. arXiv:2106.08859  [pdf, other]

    cs.CL cs.SD eess.AS

    Attention-Based Keyword Localisation in Speech using Visual Grounding

    Authors: Kayode Olaleye, Herman Kamper

    Abstract: Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model that can detect whether a particular text keyword occurs in speech utterances or not. Here we investigate whether visually grounded speech models can also do key…

    Submitted 23 June, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021

  26. arXiv:2106.00043  [pdf, other]

    eess.AS cs.CL cs.SD

    StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts

    Authors: Matthew Baas, Herman Kamper

    Abstract: Voice conversion is the task of converting a spoken utterance from a source speaker so that it appears to be said by a different target speaker while retaining the linguistic content of the utterance. Recent advances have led to major improvements in the quality of voice conversion systems. However, to be useful in a wider range of contexts, voice conversion systems would need to be (i) trainable…

    Submitted 31 May, 2021; originally announced June 2021.

    Comments: 16 pages, 3 figures. Published in Springer Communications in Computer and Information Science, Artificial Intelligence Research (SACAIR 2021), vol. 1342, pp. 69-84, 2020

    Journal ref: In: Springer Communications in Computer and Information Science, Artificial Intelligence Research (SACAIR 2021), vol. 1342, pp. 69-84, 2020

  27. arXiv:2103.10731  [pdf, other]

    cs.CL eess.AS

    Acoustic word embeddings for zero-resource languages using self-supervised contrastive learning and multilingual adaptation

    Authors: Christiaan Jacobs, Yevgen Matusevych, Herman Kamper

    Abstract: Acoustic word embeddings (AWEs) are fixed-dimensional representations of variable-length speech segments. For zero-resource languages where labelled data is not available, one AWE approach is to use unsupervised autoencoder-based recurrent models. Another recent approach is to use multilingual transfer: a supervised AWE model is trained on several well-resourced languages and then applied to an un…

    Submitted 19 March, 2021; originally announced March 2021.

    Comments: Accepted to SLT 2021

  28. arXiv:2012.07551  [pdf, other]

    cs.CL eess.AS

    Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks

    Authors: Herman Kamper, Benjamin van Niekerk

    Abstract: We investigate segmenting and clustering speech into low-bitrate phone-like sequences without supervision. We specifically constrain pretrained self-supervised vector-quantized (VQ) neural networks so that blocks of contiguous feature vectors are assigned to the same code, thereby giving a variable-rate segmentation of the speech into discrete units. Two segmentation methods are considered. In the…

    Submitted 11 June, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

    Comments: Accepted to Interspeech 2021

  29. arXiv:2012.07396  [pdf, other]

    cs.CL eess.AS

    Towards localisation of keywords in speech using weak supervision

    Authors: Kayode Olaleye, Benjamin van Niekerk, Herman Kamper

    Abstract: Developments in weakly supervised and self-supervised models could enable speech technology in low-resource settings where full transcriptions are not available. We consider whether keyword localisation is possible using two forms of weak supervision where location information is not provided explicitly. In the first, only the presence or absence of a word is indicated, i.e. a bag-of-words (BoW) l…

    Submitted 14 December, 2020; originally announced December 2020.

    Comments: Accepted to NeurIPS-SAS

  30. arXiv:2012.07387  [pdf, other]

    cs.CL eess.AS

    A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings

    Authors: Lisa van Staden, Herman Kamper

    Abstract: Many speech processing tasks involve measuring the acoustic similarity between speech segments. Acoustic word embeddings (AWE) allow for efficient comparisons by mapping speech segments of arbitrary duration to fixed-dimensional vectors. For zero-resource speech processing, where unlabelled speech is the only available resource, some of the best AWE approaches rely on weak top-down constraints in…

    Submitted 14 December, 2020; originally announced December 2020.

    Comments: Accepted to SLT 2021

  31. arXiv:2012.05680  [pdf, other]

    cs.CL cs.SD eess.AS

    Direct multimodal few-shot learning of speech and images

    Authors: Leanne Nortje, Herman Kamper

    Abstract: We propose direct multimodal few-shot models that learn a shared embedding space of spoken words and images from only a few paired examples. Imagine an agent is shown an image along with a spoken word describing the object in the picture, e.g. pen, book and eraser. After observing a few paired examples of each class, the model is asked to identify the "book" in a set of unseen pictures. Previous w…

    Submitted 29 July, 2021; v1 submitted 10 December, 2020; originally announced December 2020.

    Comments: Accepted to Interspeech 2021

  32. arXiv:2012.02221  [pdf, other]

    eess.AS cs.CL cs.SD

    A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings

    Authors: Puyuan Peng, Herman Kamper, Karen Livescu

    Abstract: We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation. The resulting acoustic word embeddings can form the basis of search, discovery, and indexing systems for low- and zero-resource languages. Our model, which we refer to as a maximal sampling correspondence variational autoencoder (MCVAE), is a recurrent neural network (RNN) trai…

    Submitted 3 December, 2020; originally announced December 2020.

    Comments: 10 pages, 6 figures, NeurIPS 2020 Workshop Self-Supervised Learning for Speech and Audio Processing

  33. arXiv:2008.06258  [pdf, other]

    cs.CL cs.CV cs.SD eess.AS

    Unsupervised vs. transfer learning for multimodal one-shot matching of speech and images

    Authors: Leanne Nortje, Herman Kamper

    Abstract: We consider the task of multimodal one-shot speech-image matching. An agent is shown a picture along with a spoken word describing the object in the picture, e.g. cookie, broccoli and ice-cream. After observing one paired speech-image example per class, it is shown a new set of unseen pictures, and asked to pick the "ice-cream". Previous work attempted to tackle this problem using transfer learnin…

    Submitted 14 August, 2020; originally announced August 2020.

    Comments: Accepted at Interspeech 2020

  34. arXiv:2008.02888  [pdf, other]

    cs.CL cs.SD eess.AS

    Evaluating computational models of infant phonetic learning across languages

    Authors: Yevgen Matusevych, Thomas Schatz, Herman Kamper, Naomi H. Feldman, Sharon Goldwater

    Abstract: In the first year of life, infants' speech perception becomes attuned to the sounds of their native language. Many accounts of this early phonetic learning exist, but computational models predicting the attunement patterns observed in infants from the speech input they hear have been lacking. A recent study presented the first such model, drawing on algorithms proposed for unsupervised learning fr…

    Submitted 6 August, 2020; originally announced August 2020.

    Comments: 7 pages, 1 figure

    Journal ref: 2020. In S. Denison, M. Mack, Y. Xu, and B. Armstrong (Eds.), Proceedings of the 42nd Annual Conference of the Cognitive Science Society (pp. 571-577). Austin, TX: Cognitive Science Society

  35. arXiv:2006.02295  [pdf, other]

    cs.CL cs.SD eess.AS

    Improved acoustic word embeddings for zero-resource languages using multilingual transfer

    Authors: Herman Kamper, Yevgen Matusevych, Sharon Goldwater

    Abstract: Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. Such embeddings can form the basis for speech search, indexing and discovery systems when conventional speech recognition is not possible. In zero-resource settings where unlabelled speech is the only available resource, we need a method that gives robust embeddings on an arbitrary language. Here we…

    Submitted 5 February, 2021; v1 submitted 2 June, 2020; originally announced June 2020.

    Comments: 11 pages, 7 figures, 8 tables. arXiv admin note: text overlap with arXiv:2002.02109. Submitted to the IEEE Transactions on Audio, Speech and Language Processing

  36. arXiv:2005.09409  [pdf, other]

    eess.AS cs.CL

    Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge

    Authors: Benjamin van Niekerk, Leanne Nortje, Herman Kamper

    Abstract: In this paper, we explore vector quantization for acoustic unit discovery. Leveraging unlabelled data, we aim to learn discrete representations of speech that separate phonetic content from speaker-specific details. We propose two neural models to tackle this challenge - both use vector quantization to map continuous features to a finite set of codes. The first model is a type of vector-quantized…

    Submitted 19 August, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: 5 pages, 3 figures, 2 tables, accepted to Interspeech 2020

  37. Unsupervised feature learning for speech using correspondence and Siamese networks

    Authors: Petri-Johan Last, Herman A. Engelbrecht, Herman Kamper

    Abstract: In zero-resource settings where transcribed speech audio is unavailable, unsupervised feature learning is essential for downstream speech processing tasks. Here we compare two recent methods for frame-level acoustic feature learning. For both methods, unsupervised term discovery is used to find pairs of word examples of the same unknown type. Dynamic programming is then used to align the feature f…

    Submitted 28 March, 2020; originally announced March 2020.

    Comments: 5 pages, 3 figures, 2 tables; accepted to the IEEE Signal Processing Letters, (c) 2020 IEEE

    Journal ref: IEEE Signal Processing Letters 27 (2020) 421-425

  38. arXiv:2002.02109  [pdf, other]

    cs.CL eess.AS

    Multilingual acoustic word embedding models for processing zero-resource languages

    Authors: Herman Kamper, Yevgen Matusevych, Sharon Goldwater

    Abstract: Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. In settings where unlabelled speech is the only available resource, such embeddings can be used in "zero-resource" speech search, indexing and discovery systems. Here we propose to train a single supervised embedding model on labelled data from multiple well-resourced languages and then apply it to u…

    Submitted 21 February, 2020; v1 submitted 6 February, 2020; originally announced February 2020.

    Comments: 5 pages, 4 figures, 1 table; accepted to ICASSP 2020. arXiv admin note: text overlap with arXiv:1811.00403

  39. arXiv:1912.05193  [pdf, other]

    eess.IV cs.CV

    Deep motion estimation for parallel inter-frame prediction in video compression

    Authors: André Nortje, Herman A. Engelbrecht, Herman Kamper

    Abstract: Standard video codecs rely on optical flow to guide inter-frame prediction: pixels from reference frames are moved via motion vectors to predict target video frames. We propose to learn binary motion codes that are encoded based on an input video sequence. These codes are not limited to 2D translations, but can capture complex motion (warping, rotation and occlusion). Our motion codes are learned…

    Submitted 11 December, 2019; originally announced December 2019.

    Comments: 25 pages, 11 figures, 5 tables

  40. BINet: a binary inpainting network for deep patch-based image compression

    Authors: André Nortje, Willie Brink, Herman A. Engelbrecht, Herman Kamper

    Abstract: Recent deep learning models outperform standard lossy image compression codecs. However, applying these models on a patch-by-patch basis requires that each image patch be encoded and decoded independently. The influence from adjacent patches is therefore lost, leading to block artefacts at low bitrates. We propose the Binary Inpainting Network (BINet), an autoencoder framework which incorporates b…

    Submitted 13 January, 2021; v1 submitted 11 December, 2019; originally announced December 2019.

    Comments: Signal Processing: Image Communication

    Journal ref: Signal Processing: Image Communication 92C (2021) 116119

  41. arXiv:1904.10947  [pdf, other]

    cs.CL cs.SD eess.AS

    On the Contributions of Visual and Textual Supervision in Low-Resource Semantic Speech Retrieval

    Authors: Ankita Pasad, Bowen Shi, Herman Kamper, Karen Livescu

    Abstract: Recent work has shown that speech paired with images can be used to learn semantically meaningful speech representations even without any textual supervision. In real-world low-resource settings, however, we often have access to some transcribed speech. We study whether and how visual grounding is useful in the presence of varying amounts of textual supervision. In particular, we consider the task…

    Submitted 30 August, 2019; v1 submitted 24 April, 2019; originally announced April 2019.

  42. arXiv:1904.07556  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks

    Authors: Ryan Eloff, André Nortje, Benjamin van Niekerk, Avashna Govender, Leanne Nortje, Arnu Pretorius, Elan van Biljon, Ewald van der Westhuizen, Lisa van Staden, Herman Kamper

    Abstract: For our submission to the ZeroSpeech 2019 challenge, we apply discrete latent-variable neural networks to unlabelled speech and use the discovered units for speech synthesis. Unsupervised discrete subword modelling could be useful for studies of phonetic category learning in infants or in low-resource speech technology requiring symbolic input. We use an autoencoder (AE) architecture with intermed…

    Submitted 28 June, 2019; v1 submitted 16 April, 2019; originally announced April 2019.

    Comments: Interspeech 2019

  43. arXiv:1904.07078  [pdf, other]

    cs.CL cs.SD eess.AS

    Semantic query-by-example speech search using visual grounding

    Authors: Herman Kamper, Aristotelis Anastassiou, Karen Livescu

    Abstract: A number of recent studies have started to investigate how speech systems can be trained on untranscribed speech by leveraging accompanying images at training time. Examples of tasks include keyword prediction and within- and across-mode retrieval. Here we consider how such models can be used for query-by-example (QbE) search, the task of retrieving utterances relevant to a given spoken query. We…

    Submitted 15 April, 2019; originally announced April 2019.

    Comments: Accepted to ICASSP 2019

  44. arXiv:1811.08284  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Feature exploration for almost zero-resource ASR-free keyword spotting using a multilingual bottleneck extractor and correspondence autoencoders

    Authors: Raghav Menon, Herman Kamper, Ewald van der Westhuizen, John Quinn, Thomas Niesler

    Abstract: We compare features for dynamic time warping (DTW) when used to bootstrap keyword spotting (KWS) in an almost zero-resource setting. Such quickly-deployable systems aim to support United Nations (UN) humanitarian relief efforts in parts of Africa with severely under-resourced languages. Our objective is to identify acoustic features that provide acceptable KWS performance in such environments. As…

    Submitted 12 July, 2019; v1 submitted 14 November, 2018; originally announced November 2018.

    Comments: 5 pages, 2 figures, 2 tables, 38 references, Accepted at Interspeech 2019

  45. Multilingual and Unsupervised Subword Modeling for Zero-Resource Languages

    Authors: Enno Hermann, Herman Kamper, Sharon Goldwater

    Abstract: Subword modeling for zero-resource languages aims to learn low-level representations of speech audio without using transcriptions or other resources from the target language (such as text corpora or pronunciation dictionaries). A good representation should capture phonetic content and abstract away from other types of variability, such as speaker differences and channel noise. Previous work in thi…

    Submitted 7 April, 2020; v1 submitted 9 November, 2018; originally announced November 2018.

    Comments: 17 pages, 6 figures, 7 tables. Accepted for publication in Computer Speech and Language. arXiv admin note: text overlap with arXiv:1803.08863

  46. arXiv:1811.03875  [pdf, other]

    cs.CL cs.CV cs.LG eess.AS

    Multimodal One-Shot Learning of Speech and Images

    Authors: Ryan Eloff, Herman A. Engelbrecht, Herman Kamper

    Abstract: Imagine a robot is shown new concepts visually together with spoken tags, e.g. "milk", "eggs", "butter". After seeing one paired audio-visual example per class, it is shown a new set of unseen instances of these objects, and asked to pick the "milk". Without receiving any hard labels, could it learn to match the new continuous speech input to the correct visual instance? Although unimodal one-shot…

    Submitted 15 April, 2019; v1 submitted 9 November, 2018; originally announced November 2018.

    Comments: 5 pages, 1 figure, 3 tables; accepted to ICASSP 2019

  47. arXiv:1811.00403  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models

    Authors: Herman Kamper

    Abstract: We investigate unsupervised models that can map a variable-duration speech segment to a fixed-dimensional representation. In settings where unlabelled speech is the only available resource, such acoustic word embeddings can form the basis for "zero-resource" speech search, discovery and indexing systems. Most existing unsupervised embedding methods still use some supervision, such as word or phone…

    Submitted 15 April, 2019; v1 submitted 1 November, 2018; originally announced November 2018.

    Comments: 5 pages, 3 figures, 2 tables; accepted to ICASSP 2019

  48. arXiv:1710.01949  [pdf, other]

    cs.CL cs.CV eess.AS

    Semantic speech retrieval with a visually grounded model of untranscribed speech

    Authors: Herman Kamper, Gregory Shakhnarovich, Karen Livescu

    Abstract: There is growing interest in models that can learn from unlabelled speech paired with visual context. This setting is relevant for low-resource speech processing, robotics, and human language acquisition research. Here we study how a visually grounded speech model, trained on images of scenes paired with spoken captions, captures aspects of semantics. We use an external image tagger to generate so…

    Submitted 31 October, 2018; v1 submitted 5 October, 2017; originally announced October 2017.

    Comments: 10 pages, 3 figures, 5 tables; accepted to the IEEE/ACM Transactions on Audio, Speech and Language Processing

    Journal ref: IEEE/ACM Transactions on Audio, Speech and Language Processing 27 (2019) 89-98