Search | arXiv e-print repository

Efficient Multi-Modal Embeddings from Structured Data

Abstract: Multi-modal word semantics aims to enhance embeddings with perceptual input, assuming that human meaning representation is grounded in sensory experience. Most research focuses on evaluation involving direct visual input, however, visual grounding can contribute to linguistic applications as well. Another motivation for this paper is the growing need for more interpretable models and for evaluatin… ▽ More Multi-modal word semantics aims to enhance embeddings with perceptual input, assuming that human meaning representation is grounded in sensory experience. Most research focuses on evaluation involving direct visual input, however, visual grounding can contribute to linguistic applications as well. Another motivation for this paper is the growing need for more interpretable models and for evaluating model efficiency regarding size and performance. This work explores the impact of visual information for semantics when the evaluation involves no direct visual input, specifically semantic similarity and relatedness. We investigate a new embedding type in-between linguistic and visual modalities, based on the structured annotations of Visual Genome. We compare uni- and multi-modal models including structured, linguistic and image based representations. We measure the efficiency of each model with regard to data and model size, modality / data distribution and information gain. The analysis includes an interpretation of embedding structures. We found that this new embedding conveys complementary information for text based embeddings. It achieves comparable performance in an economic way, using orders of magnitude less resources than visual models. △ Less

Submitted 6 October, 2021; originally announced October 2021.

Comments: 5 pages, 5 pages of appendix, 7 figures

arXiv:2109.04562 [pdf, other]

TIAGE: A Benchmark for Topic-Shift Aware Dialog Modeling

Authors: Huiyuan Xie, Zhenghao Liu, Chenyan Xiong, Zhiyuan Liu, Ann Copestake

Abstract: Human conversations naturally evolve around different topics and fluently move between them. In research on dialog systems, the ability to actively and smoothly transition to new topics is often ignored. In this paper we introduce TIAGE, a new topic-shift aware dialog benchmark constructed utilizing human annotations on topic shifts. Based on TIAGE, we introduce three tasks to investigate differen… ▽ More Human conversations naturally evolve around different topics and fluently move between them. In research on dialog systems, the ability to actively and smoothly transition to new topics is often ignored. In this paper we introduce TIAGE, a new topic-shift aware dialog benchmark constructed utilizing human annotations on topic shifts. Based on TIAGE, we introduce three tasks to investigate different scenarios of topic-shift modeling in dialog settings: topic-shift detection, topic-shift triggered response generation and topic-aware dialog generation. Experiments on these tasks show that the topic-shift signals in TIAGE are useful for topic-shift response generation. On the other hand, dialog systems still struggle to decide when to change topic. This indicates further research is needed in topic-shift aware dialog modeling. △ Less

Submitted 9 September, 2021; originally announced September 2021.

Comments: Accepted to appear in Findings of EMNLP 2021

arXiv:2011.07593 [pdf, other]

Morphologically Aware Word-Level Translation

Authors: Paula Czarnowska, Sebastian Ruder, Ryan Cotterell, Ann Copestake

Abstract: We propose a novel morphologically aware probability model for bilingual lexicon induction, which jointly models lexeme translation and inflectional morphology in a structured way. Our model exploits the basic linguistic intuition that the lexeme is the key lexical unit of meaning, while inflectional morphology provides additional syntactic information. This approach leads to substantial performan… ▽ More We propose a novel morphologically aware probability model for bilingual lexicon induction, which jointly models lexeme translation and inflectional morphology in a structured way. Our model exploits the basic linguistic intuition that the lexeme is the key lexical unit of meaning, while inflectional morphology provides additional syntactic information. This approach leads to substantial performance improvements - 19% average improvement in accuracy across 6 language pairs over the state of the art in the supervised setting and 16% in the weakly supervised setting. As another contribution, we highlight issues associated with modern BLI that stem from ignoring inflectional morphology, and propose three suggestions for improving the task. △ Less

Submitted 15 November, 2020; originally announced November 2020.

Comments: COLING 2020

arXiv:1912.08960 [pdf, other]

Going Beneath the Surface: Evaluating Image Captioning for Grammaticality, Truthfulness and Diversity

Authors: Huiyuan Xie, Tom Sherborne, Alexander Kuhnle, Ann Copestake

Abstract: Image captioning as a multimodal task has drawn much interest in recent years. However, evaluation for this task remains a challenging problem. Existing evaluation metrics focus on surface similarity between a candidate caption and a set of reference captions, and do not check the actual relation between a caption and the underlying visual content. We introduce a new diagnostic evaluation framewor… ▽ More Image captioning as a multimodal task has drawn much interest in recent years. However, evaluation for this task remains a challenging problem. Existing evaluation metrics focus on surface similarity between a candidate caption and a set of reference captions, and do not check the actual relation between a caption and the underlying visual content. We introduce a new diagnostic evaluation framework for the task of image captioning, with the goal of directly assessing models for grammaticality, truthfulness and diversity (GTD) of generated captions. We demonstrate the potential of our evaluation framework by evaluating existing image captioning models on a wide ranging set of synthetic datasets that we construct for diagnostic evaluation. We empirically show how the GTD evaluation framework, in combination with diagnostic datasets, can provide insights into model capabilities and limitations to supplement standard evaluations. △ Less

Submitted 18 December, 2019; originally announced December 2019.

arXiv:1909.02855 [pdf, other]

Don't Forget the Long Tail! A Comprehensive Analysis of Morphological Generalization in Bilingual Lexicon Induction

Authors: Paula Czarnowska, Sebastian Ruder, Edouard Grave, Ryan Cotterell, Ann Copestake

Abstract: Human translators routinely have to translate rare inflections of words - due to the Zipfian distribution of words in a language. When translating from Spanish, a good translator would have no problem identifying the proper translation of a statistically rare inflection such as habláramos. Note the lexeme itself, hablar, is relatively common. In this work, we investigate whether state-of-the-art b… ▽ More Human translators routinely have to translate rare inflections of words - due to the Zipfian distribution of words in a language. When translating from Spanish, a good translator would have no problem identifying the proper translation of a statistically rare inflection such as habláramos. Note the lexeme itself, hablar, is relatively common. In this work, we investigate whether state-of-the-art bilingual lexicon inducers are capable of learning this kind of generalization. We introduce 40 morphologically complete dictionaries in 10 languages and evaluate three of the state-of-the-art models on the task of translation of less frequent morphological forms. We demonstrate that the performance of state-of-the-art models drops considerably when evaluated on infrequent morphological inflections and then show that adding a simple morphological constraint at training time improves the performance, proving that the bilingual lexicon inducers can benefit from better encoding of morphology. △ Less

Submitted 22 October, 2019; v1 submitted 6 September, 2019; originally announced September 2019.

Comments: EMNLP 2019

arXiv:1908.06336 [pdf, other]

What is needed for simple spatial language capabilities in VQA?

Authors: Alexander Kuhnle, Ann Copestake

Abstract: Visual question answering (VQA) comprises a variety of language capabilities. The diagnostic benchmark dataset CLEVR has fueled progress by hel** to better assess and distinguish models in basic abilities like counting, comparing and spatial reasoning in vitro. Following this approach, we focus on spatial language capabilities and investigate the question: what are the key ingredients to handle… ▽ More Visual question answering (VQA) comprises a variety of language capabilities. The diagnostic benchmark dataset CLEVR has fueled progress by hel** to better assess and distinguish models in basic abilities like counting, comparing and spatial reasoning in vitro. Following this approach, we focus on spatial language capabilities and investigate the question: what are the key ingredients to handle simple visual-spatial relations? We look at the SAN, RelNet, FiLM and MC models and evaluate their learning behavior on diagnostic data which is solely focused on spatial relations. Via comparative analysis and targeted model modification we identify what really is required to substantially improve upon the CNN-LSTM baseline. △ Less

Submitted 22 October, 2019; v1 submitted 17 August, 2019; originally announced August 2019.

arXiv:1812.11737 [pdf, other]

The meaning of "most" for visual question answering models

Authors: Alexander Kuhnle, Ann Copestake

Abstract: The correct interpretation of quantifier statements in the context of a visual scene requires non-trivial inference mechanisms. For the example of "most", we discuss two strategies which rely on fundamentally different cognitive concepts. Our aim is to identify what strategy deep learning models for visual question answering learn when trained on such questions. To this end, we carefully design da… ▽ More The correct interpretation of quantifier statements in the context of a visual scene requires non-trivial inference mechanisms. For the example of "most", we discuss two strategies which rely on fundamentally different cognitive concepts. Our aim is to identify what strategy deep learning models for visual question answering learn when trained on such questions. To this end, we carefully design data to replicate experiments from psycholinguistics where the same question was investigated for humans. Focusing on the FiLM visual question answering model, our experiments indicate that a form of approximate number system emerges whose performance declines with more difficult scenes as predicted by Weber's law. Moreover, we identify confounding factors, like spatial arrangement of the scene, which impede the effectiveness of this system. △ Less

Submitted 4 June, 2019; v1 submitted 31 December, 2018; originally announced December 2018.

arXiv:1809.03044 [pdf, other]

How clever is the FiLM model, and how clever can it be?

Authors: Alexander Kuhnle, Huiyuan Xie, Ann Copestake

Abstract: The FiLM model achieves close-to-perfect performance on the diagnostic CLEVR dataset and is distinguished from other such models by having a comparatively simple and easily transferable architecture. In this paper, we investigate in more detail the ability of FiLM to learn various linguistic constructions. Our main results show that (a) FiLM is not able to learn relational statements straight away… ▽ More The FiLM model achieves close-to-perfect performance on the diagnostic CLEVR dataset and is distinguished from other such models by having a comparatively simple and easily transferable architecture. In this paper, we investigate in more detail the ability of FiLM to learn various linguistic constructions. Our main results show that (a) FiLM is not able to learn relational statements straight away except for very simple instances, (b) training on a broader set of instances as well as pretraining on simpler instance types can help alleviate these learning difficulties, (c) mixing is less robust than pretraining and very sensitive to the compositional structure of the dataset. Overall, our results suggest that the approach of big all-encompassing datasets and the paradigm of "the effectiveness of data" may have fundamental limitations. △ Less

Submitted 9 September, 2018; originally announced September 2018.

arXiv:1709.00226 [pdf, other]

Semantic Composition via Probabilistic Model Theory

Authors: Guy Emerson, Ann Copestake

Abstract: Semantic composition remains an open problem for vector space models of semantics. In this paper, we explain how the probabilistic graphical model used in the framework of Functional Distributional Semantics can be interpreted as a probabilistic version of model theory. Building on this, we explain how various semantic phenomena can be recast in terms of conditional probabilities in the graphical… ▽ More Semantic composition remains an open problem for vector space models of semantics. In this paper, we explain how the probabilistic graphical model used in the framework of Functional Distributional Semantics can be interpreted as a probabilistic version of model theory. Building on this, we explain how various semantic phenomena can be recast in terms of conditional probabilities in the graphical model. This connection between formal semantics and machine learning is helpful in both directions: it gives us an explicit mechanism for modelling context-dependent meanings (a challenge for formal semantics), and also gives us well-motivated techniques for composing distributed representations (a challenge for distributional semantics). We present results on two datasets that go beyond word similarity, showing how these semantically-motivated techniques improve on the performance of vector models. △ Less

Submitted 1 September, 2017; originally announced September 2017.

Comments: International Conference on Computational Semantics (IWCS)

arXiv:1709.00224 [pdf, other]

Variational Inference for Logical Inference

Authors: Guy Emerson, Ann Copestake

Abstract: Functional Distributional Semantics is a framework that aims to learn, from text, semantic representations which can be interpreted in terms of truth. Here we make two contributions to this framework. The first is to show how a type of logical inference can be performed by evaluating conditional probabilities. The second is to make these calculations tractable by means of a variational approximati… ▽ More Functional Distributional Semantics is a framework that aims to learn, from text, semantic representations which can be interpreted in terms of truth. Here we make two contributions to this framework. The first is to show how a type of logical inference can be performed by evaluating conditional probabilities. The second is to make these calculations tractable by means of a variational approximation. This approximation also enables faster convergence during training, allowing us to close the gap with state-of-the-art vector space models when evaluating on semantic similarity. We demonstrate promising performance on two tasks. △ Less

Submitted 1 September, 2017; originally announced September 2017.

Comments: Conference on Logic and Machine Learning in Natural Language (LaML)

arXiv:1706.01322 [pdf, other]

Deep learning evaluation using deep linguistic processing

Authors: Alexander Kuhnle, Ann Copestake

Abstract: We discuss problems with the standard approaches to evaluation for tasks like visual question answering, and argue that artificial data can be used to address these as a complement to current practice. We demonstrate that with the help of existing 'deep' linguistic processing technology we are able to create challenging abstract datasets, which enable us to investigate the language understanding a… ▽ More We discuss problems with the standard approaches to evaluation for tasks like visual question answering, and argue that artificial data can be used to address these as a complement to current practice. We demonstrate that with the help of existing 'deep' linguistic processing technology we are able to create challenging abstract datasets, which enable us to investigate the language understanding abilities of multimodal deep learning models in detail, as compared to a single performance value on a static and monolithic dataset. △ Less

Submitted 12 May, 2018; v1 submitted 5 June, 2017; originally announced June 2017.

arXiv:1704.04517 [pdf, other]

ShapeWorld - A new test methodology for multimodal language understanding

Authors: Alexander Kuhnle, Ann Copestake

Abstract: We introduce a novel framework for evaluating multimodal deep learning models with respect to their language understanding and generalization abilities. In this approach, artificial data is automatically generated according to the experimenter's specifications. The content of the data, both during training and evaluation, can be controlled in detail, which enables tasks to be created that require… ▽ More We introduce a novel framework for evaluating multimodal deep learning models with respect to their language understanding and generalization abilities. In this approach, artificial data is automatically generated according to the experimenter's specifications. The content of the data, both during training and evaluation, can be controlled in detail, which enables tasks to be created that require true generalization abilities, in particular the combination of previously introduced concepts in novel ways. We demonstrate the potential of our methodology by evaluating various visual question answering models on four different tasks, and show how our framework gives us detailed insights into their capabilities and limitations. By open-sourcing our framework, we hope to stimulate progress in the field of multimodal language understanding. △ Less

Submitted 14 April, 2017; originally announced April 2017.

arXiv:1606.08003 [pdf, other]

Functional Distributional Semantics

Authors: Guy Emerson, Ann Copestake

Abstract: Vector space models have become popular in distributional semantics, despite the challenges they face in capturing various semantic phenomena. We propose a novel probabilistic framework which draws on both formal semantics and recent advances in machine learning. In particular, we separate predicates from the entities they refer to, allowing us to perform Bayesian inference based on logical forms.… ▽ More Vector space models have become popular in distributional semantics, despite the challenges they face in capturing various semantic phenomena. We propose a novel probabilistic framework which draws on both formal semantics and recent advances in machine learning. In particular, we separate predicates from the entities they refer to, allowing us to perform Bayesian inference based on logical forms. We describe an implementation of this framework using a combination of Restricted Boltzmann Machines and feedforward neural networks. Finally, we demonstrate the feasibility of this approach by training it on a parsed corpus and evaluating it on established similarity datasets. △ Less

Submitted 26 June, 2016; originally announced June 2016.

Comments: Published at Representation Learning for NLP workshop at ACL 2016, https://sites.google.com/site/repl4nlp2016/

Showing 1–13 of 13 results for author: Copestake, A