Search | arXiv e-print repository

Discrete representations in neural models of spoken language

Authors: Bertrand Higy, Lieke Gelderloos, Afra Alishahi, Grzegorz Chrupała

Abstract: The distributed and continuous representations used by neural networks are at odds with representations employed in linguistics, which are typically symbolic. Vector quantization has been proposed as a way to induce discrete neural representations that are closer in nature to their linguistic counterparts. However, it is not clear which metrics are the best-suited to analyze such discrete representations. We compare the merits of four commonly used metrics in the context of weakly supervised models of spoken language. We compare the results they show when applied to two different models, while systematically studying the effect of the placement and size of the discretization layer. We find that different evaluation regimes can give inconsistent results. While we can attribute them to the properties of the different metrics in most cases, one point of concern remains: the use of minimal pairs of phoneme triples as stimuli disadvantages larger discrete unit inventories, unlike metrics applied to complete utterances. Furthermore, while in general vector quantization induces representations that correlate with units posited in linguistics, the strength of this correlation is only moderate.

Submitted 16 September, 2021; v1 submitted 12 May, 2021; originally announced May 2021.

Comments: Accepted for publication at BlackboxNLP 2021

arXiv:2005.02721

Learning to Understand Child-directed and Adult-directed Speech

Authors: Lieke Gelderloos, Grzegorz Chrupała, Afra Alishahi

Abstract: Speech directed to children differs from adult-directed speech in linguistic aspects such as repetition, word choice, and sentence length, as well as in aspects of the speech signal itself, such as prosodic and phonemic variation. Human language acquisition research indicates that child-directed speech helps language learners. This study explores the effect of child-directed speech when learning to extract semantic information from speech directly. We compare the task performance of models trained on adult-directed speech (ADS) and child-directed speech (CDS). We find indications that CDS helps in the initial stages of learning, but eventually, models trained on ADS reach comparable task performance, and generalize better. The results suggest that this is at least partially due to linguistic rather than acoustic properties of the two registers, as we see the same pattern when looking at models trained on acoustically comparable synthetic speech.

Submitted 16 July, 2021; v1 submitted 6 May, 2020; originally announced May 2020.

Comments: Authors found an error in preprocessing of transcriptions before they were fed to SBERT. After correction, the experiments were rerun. The updated results can be found in this version. Importantly, most scores were affected to a small degree (performance was slightly worse), and the effect was consistent across conditions. Therefore, the general patterns remain the same.

arXiv:1906.01530

The PhotoBook Dataset: Building Common Ground through Visually-Grounded Dialogue

Authors: Janosch Haber, Tim Baumgärtner, Ece Takmaz, Lieke Gelderloos, Elia Bruni, Raquel Fernández

Abstract: This paper introduces the PhotoBook dataset, a large-scale collection of visually-grounded, task-oriented dialogues in English designed to investigate shared dialogue history accumulating during conversation. Taking inspiration from seminal work on dialogue analysis, we propose a data-collection task formulated as a collaborative game prompting two online participants to refer to images utilising both their visual context as well as previously established referring expressions. We provide a detailed description of the task setup and a thorough analysis of the 2,500 dialogues collected. To further illustrate the novel features of the dataset, we propose a baseline model for reference resolution which uses a simple method to take into account shared information accumulated in a reference chain. Our results show that this information is particularly important to resolve later descriptions and underline the need to develop more sophisticated models of common ground in dialogue interaction.

Submitted 26 June, 2019; v1 submitted 4 June, 2019; originally announced June 2019.

Comments: Updates 26-06-2019: Changed caption sizes to comply with the ACL style guidelines and corrected some references

arXiv:1803.08869

On the difficulty of a distributional semantics of spoken language

Authors: Grzegorz Chrupała, Lieke Gelderloos, Ákos Kádár, Afra Alishahi

Abstract: In the domain of unsupervised learning most work on speech has focused on discovering low-level constructs such as phoneme inventories or word-like units. In contrast, for written language there is a large body of work on unsupervised induction of semantic representations of words, whole sentences and longer texts. In this study we examine the challenges of adapting these approaches from written to spoken language. We conjecture that unsupervised learning of the semantics of spoken language becomes feasible if we abstract from the surface variability. We simulate this setting with a dataset of utterances spoken by a realistic but uniform synthetic voice. We evaluate two simple unsupervised models which, to varying degrees of success, learn semantic representations of speech fragments. Finally we present inconclusive results on human speech, and discuss the challenges inherent in learning distributional semantic representations on unrestricted natural spoken language.

Submitted 26 October, 2018; v1 submitted 23 March, 2018; originally announced March 2018.

Comments: Proceedings of the Society for Computation in Linguistics 2019

arXiv:1702.01991

doi: 10.18653/v1/P17-1057

Representations of language in a model of visually grounded speech signal

Authors: Grzegorz Chrupała, Lieke Gelderloos, Afra Alishahi

Abstract: We present a visually grounded model of speech perception which projects spoken utterances and images to a joint semantic space. We use a multi-layer recurrent highway network to model the temporal nature of spoken speech, and show that it learns to extract both form and meaning-based linguistic knowledge from the input signal. We carry out an in-depth analysis of the representations used by different components of the trained model and show that encoding of semantic aspects tends to become richer as we go up the hierarchy of layers, whereas encoding of form-related aspects of the language input tends to initially increase and then plateau or decrease.

Submitted 30 June, 2017; v1 submitted 7 February, 2017; originally announced February 2017.

Comments: Accepted at ACL 2017

arXiv:1610.03342

From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning

Authors: Lieke Gelderloos, Grzegorz Chrupała

Abstract: We present a model of visually-grounded language learning based on stacked gated recurrent neural networks which learns to predict visual features given an image description in the form of a sequence of phonemes. The learning task resembles that faced by human language learners who need to discover both structure and meaning from noisy and ambiguous data across modalities. We show that our model indeed learns to predict features of the visual context given phonetically transcribed image descriptions, and show that it represents linguistic information in a hierarchy of levels: lower layers in the stack are comparatively more sensitive to form, whereas higher layers are more sensitive to meaning.

Submitted 11 October, 2016; originally announced October 2016.

Comments: Accepted at COLING 2016

Showing 1–6 of 6 results for author: Gelderloos, L