Search | arXiv e-print repository

Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks

Authors: Amit Parekh, Nikolas Vitsakis, Alessandro Suglia, Ioannis Konstas

Abstract: Evaluating the generalisation capabilities of multimodal models based solely on their performance on out-of-distribution data fails to capture their true robustness. This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models, considering architectural design, input perturbations across la… ▽ More Evaluating the generalisation capabilities of multimodal models based solely on their performance on out-of-distribution data fails to capture their true robustness. This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models, considering architectural design, input perturbations across language and vision modalities, and increased task complexity. The proposed framework uncovers the resilience of multimodal models to extreme instruction perturbations and their vulnerability to observational changes, raising concerns about overfitting to spurious correlations. By employing this evaluation framework on current Transformer-based multimodal models for robotic manipulation tasks, we uncover limitations and suggest future advancements should focus on architectural and training innovations that better integrate multimodal inputs, enhancing a model's generalisation prowess by prioritising sensitivity to input content over incidental correlations. △ Less

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2406.19297 [pdf, other]

Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation

Authors: Malvina Nikandrou, Georgios Pantazopoulos, Ioannis Konstas, Alessandro Suglia

Abstract: Continual learning focuses on incrementally training a model on a sequence of tasks with the aim of learning new tasks while minimizing performance drop on previous tasks. Existing approaches at the intersection of Continual Learning and Visual Question Answering (VQA) do not study how the multimodal nature of the input affects the learning dynamics of a model. In this paper, we demonstrate that e… ▽ More Continual learning focuses on incrementally training a model on a sequence of tasks with the aim of learning new tasks while minimizing performance drop on previous tasks. Existing approaches at the intersection of Continual Learning and Visual Question Answering (VQA) do not study how the multimodal nature of the input affects the learning dynamics of a model. In this paper, we demonstrate that each modality evolves at different rates across a continuum of tasks and that this behavior occurs in established encoder-only models as well as modern recipes for develo** Vision & Language (VL) models. Motivated by this observation, we propose a modality-aware feature distillation (MAFED) approach which outperforms existing baselines across models of varying scale in three multimodal continual learning settings. Furthermore, we provide ablations showcasing that modality-aware distillation complements experience replay. Overall, our results emphasize the importance of addressing modality-specific dynamics to prevent forgetting in multimodal continual learning. △ Less

Submitted 27 June, 2024; originally announced June 2024.

arXiv:2406.13807 [pdf, other]

AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding

Authors: Alessandro Suglia, Claudio Greco, Katie Baker, Jose L. Part, Ioannis Papaioannou, Arash Eshghi, Ioannis Konstas, Oliver Lemon

Abstract: AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset… ▽ More AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate AlanaVLM's capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%. Additionally, we outperform Claude 3 and Gemini Pro Vision 1.0 and showcase competitive results compared to Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning. This research paves the way for building efficient VLMs that can be deployed in robots or wearables, leveraging embodied video understanding to collaborate seamlessly with humans in everyday tasks, contributing to the next generation of Embodied AI. △ Less

Submitted 21 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

Comments: Code available https://github.com/alanaai/EVUD

arXiv:2312.02431 [pdf, other]

Visually Grounded Language Learning: a review of language games, datasets, tasks, and models

Authors: Alessandro Suglia, Ioannis Konstas, Oliver Lemon

Abstract: In recent years, several machine learning models have been proposed. They are trained with a language modelling objective on large-scale text-only data. With such pretraining, they can achieve impressive results on many Natural Language Understanding and Generation tasks. However, many facets of meaning cannot be learned by ``listening to the radio" only. In the literature, many Vision+Language (V… ▽ More In recent years, several machine learning models have been proposed. They are trained with a language modelling objective on large-scale text-only data. With such pretraining, they can achieve impressive results on many Natural Language Understanding and Generation tasks. However, many facets of meaning cannot be learned by ``listening to the radio" only. In the literature, many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality. In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field. We rely on Wittgenstein's idea of `language games' to categorise such tasks into 3 different families: 1) discriminative games, 2) generative games, and 3) interactive games. Our analysis of the literature provides evidence that future work should be focusing on interactive games where communication in Natural Language is important to resolve ambiguities about object referents and action plans and that physical embodiment is essential to understand the semantics of situations and events. Overall, these represent key requirements for develo** grounded meanings in neural models. △ Less

Submitted 4 December, 2023; originally announced December 2023.

Comments: Preprint for JAIR before copyediting

arXiv:2311.04067 [pdf, other]

Multitask Multimodal Prompted Training for Interactive Embodied Task Completion

Authors: Georgios Pantazopoulos, Malvina Nikandrou, Amit Parekh, Bhathiya Hemanthage, Arash Eshghi, Ioannis Konstas, Verena Rieser, Oliver Lemon, Alessandro Suglia

Abstract: Interactive and embodied tasks pose at least two fundamental challenges to existing Vision & Language (VL) models, including 1) grounding language in trajectories of actions and observations, and 2) referential disambiguation. To tackle these challenges, we propose an Embodied MultiModal Agent (EMMA): a unified encoder-decoder model that reasons over images and trajectories, and casts action predi… ▽ More Interactive and embodied tasks pose at least two fundamental challenges to existing Vision & Language (VL) models, including 1) grounding language in trajectories of actions and observations, and 2) referential disambiguation. To tackle these challenges, we propose an Embodied MultiModal Agent (EMMA): a unified encoder-decoder model that reasons over images and trajectories, and casts action prediction as multimodal text generation. By unifying all tasks as text generation, EMMA learns a language of actions which facilitates transfer across tasks. Different to previous modular approaches with independently trained components, we use a single multitask model where each task contributes to goal completion. EMMA performs on par with similar models on several VL benchmarks and sets a new state-of-the-art performance (36.81% success rate) on the Dialog-guided Task Completion (DTC), a benchmark to evaluate dialog-guided agents in the Alexa Arena △ Less

Submitted 7 November, 2023; originally announced November 2023.

Comments: EMNLP 2023

arXiv:2307.16689 [pdf, other]

No that's not what I meant: Handling Third Position Repair in Conversational Question Answering

Authors: Vevake Balaraman, Arash Eshghi, Ioannis Konstas, Ioannis Papaioannou

Abstract: The ability to handle miscommunication is crucial to robust and faithful conversational AI. People usually deal with miscommunication immediately as they detect it, using highly systematic interactional mechanisms called repair. One important type of repair is Third Position Repair (TPR) whereby a speaker is initially misunderstood but then corrects the misunderstanding as it becomes apparent afte… ▽ More The ability to handle miscommunication is crucial to robust and faithful conversational AI. People usually deal with miscommunication immediately as they detect it, using highly systematic interactional mechanisms called repair. One important type of repair is Third Position Repair (TPR) whereby a speaker is initially misunderstood but then corrects the misunderstanding as it becomes apparent after the addressee's erroneous response. Here, we collect and publicly release Repair-QA, the first large dataset of TPRs in a conversational question answering (QA) setting. The data is comprised of the TPR turns, corresponding dialogue contexts, and candidate repairs of the original turn for execution of TPRs. We demonstrate the usefulness of the data by training and evaluating strong baseline models for executing TPRs. For stand-alone TPR execution, we perform both automatic and human evaluations on a fine-tuned T5 model, as well as OpenAI's GPT-3 LLMs. Additionally, we extrinsically evaluate the LLMs' TPR processing capabilities in the downstream conversational QA task. The results indicate poor out-of-the-box performance on TPR's by the GPT-3 models, which then significantly improves when exposed to Repair-QA. △ Less

Submitted 31 July, 2023; originally announced July 2023.

Comments: Accepted at SIGDIAL'23

arXiv:2305.19911 [pdf, other]

Neuron to Graph: Interpreting Language Model Neurons at Scale

Authors: Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez

Abstract: Advances in Large Language Models (LLMs) have led to remarkable capabilities, yet their inner mechanisms remain largely unknown. To understand these models, we need to unravel the functions of individual neurons and their contribution to the network. This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within LLMs, to make th… ▽ More Advances in Large Language Models (LLMs) have led to remarkable capabilities, yet their inner mechanisms remain largely unknown. To understand these models, we need to unravel the functions of individual neurons and their contribution to the network. This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within LLMs, to make them more interpretable and ultimately safe. Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to. We propose Neuron to Graph (N2G), an innovative tool that automatically extracts a neuron's behaviour from the dataset it was trained on and translates it into an interpretable graph. N2G uses truncation and saliency methods to emphasise only the most pertinent tokens to a neuron while enriching dataset examples with diverse samples to better encompass the full spectrum of neuron behaviour. These graphs can be visualised to aid researchers' manual interpretation, and can generate token activations on text for automatic validation by comparison with the neuron's ground truth activations, which we use to show that the model is better at predicting neuron activation than two baseline methods. We also demonstrate how the generated graph representations can be flexibly used to facilitate further automation of interpretability research, by searching for neurons with particular properties, or programmatically comparing neurons to each other to identify similar neurons. Our method easily scales to build graph representations for all neurons in a 6-layer Transformer model using a single Tesla T4 GPU, allowing for wide usability. We release the code and instructions for use at https://github.com/alexjfoote/Neuron2Graph. △ Less

Submitted 31 May, 2023; originally announced May 2023.

arXiv:2305.17553 [pdf, other]

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark

Authors: Jason Hoelscher-Obermaier, Julia Persson, Esben Kran, Ioannis Konstas, Fazl Barez

Abstract: Recent model editing techniques promise to mitigate the problem of memorizing false or outdated associations during LLM training. However, we show that these techniques can introduce large unwanted side effects which are not detected by existing specificity benchmarks. We extend the existing CounterFact benchmark to include a dynamic component and dub our benchmark CounterFact+. Additionally, we e… ▽ More Recent model editing techniques promise to mitigate the problem of memorizing false or outdated associations during LLM training. However, we show that these techniques can introduce large unwanted side effects which are not detected by existing specificity benchmarks. We extend the existing CounterFact benchmark to include a dynamic component and dub our benchmark CounterFact+. Additionally, we extend the metrics used for measuring specificity by a principled KL divergence-based metric. We use this improved benchmark to evaluate recent model editing techniques and find that they suffer from low specificity. Our findings highlight the need for improved specificity benchmarks that identify and prevent unwanted side effects. △ Less

Submitted 3 June, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

Comments: To be published in ACL Findings 2023; for code see https://github.com/apartresearch/specificityplus; for a homepage see https://specificityplus.apartresearch.com/; updated Figures to uniform style

ACM Class: I.2.7

arXiv:2305.16519 [pdf, other]

The Dangers of trusting Stochastic Parrots: Faithfulness and Trust in Open-domain Conversational Question Answering

Authors: Sabrina Chiesurin, Dimitris Dimakopoulos, Marco Antonio Sobrevilla Cabezudo, Arash Eshghi, Ioannis Papaioannou, Verena Rieser, Ioannis Konstas

Abstract: Large language models are known to produce output which sounds fluent and convincing, but is also often wrong, e.g. "unfaithful" with respect to a rationale as retrieved from a knowledge base. In this paper, we show that task-based systems which exhibit certain advanced linguistic dialog behaviors, such as lexical alignment (repeating what the user said), are in fact preferred and trusted more, wh… ▽ More Large language models are known to produce output which sounds fluent and convincing, but is also often wrong, e.g. "unfaithful" with respect to a rationale as retrieved from a knowledge base. In this paper, we show that task-based systems which exhibit certain advanced linguistic dialog behaviors, such as lexical alignment (repeating what the user said), are in fact preferred and trusted more, whereas other phenomena, such as pronouns and ellipsis are dis-preferred. We use open-domain question answering systems as our test-bed for task based dialog generation and compare several open- and closed-book models. Our results highlight the danger of systems that appear to be trustworthy by parroting user input while providing an unfaithful response. △ Less

Submitted 25 May, 2023; originally announced May 2023.

Comments: 5 pages, ACL Findings 2023

arXiv:2305.15507 [pdf, other]

The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python

Authors: Antonio Valerio Miceli-Barone, Fazl Barez, Ioannis Konstas, Shay B. Cohen

Abstract: Large Language Models (LLMs) have successfully been applied to code generation tasks, raising the question of how well these models understand programming. Typical programming languages have invariances and equivariances in their semantics that human programmers intuitively understand and exploit, such as the (near) invariance to the renaming of identifiers. We show that LLMs not only fail to prop… ▽ More Large Language Models (LLMs) have successfully been applied to code generation tasks, raising the question of how well these models understand programming. Typical programming languages have invariances and equivariances in their semantics that human programmers intuitively understand and exploit, such as the (near) invariance to the renaming of identifiers. We show that LLMs not only fail to properly generate correct Python code when default function names are swapped, but some of them even become more confident in their incorrect predictions as the model size increases, an instance of the recently discovered phenomenon of Inverse Scaling, which runs contrary to the commonly observed trend of increasing prediction quality with increasing model size. Our findings indicate that, despite their astonishing typical-case performance, LLMs still lack a deep, abstract understanding of the content they manipulate, making them unsuitable for tasks that statistically deviate from their training data, and that mere scaling is not enough to achieve such capability. △ Less

Submitted 24 May, 2023; originally announced May 2023.

Comments: 17 pages, 5 figure, ACL 2023

arXiv:2305.06074 [pdf, other]

iLab at SemEval-2023 Task 11 Le-Wi-Di: Modelling Disagreement or Modelling Perspectives?

Authors: Nikolas Vitsakis, Amit Parekh, Tanvi Dinkar, Gavin Abercrombie, Ioannis Konstas, Verena Rieser

Abstract: There are two competing approaches for modelling annotator disagreement: distributional soft-labelling approaches (which aim to capture the level of disagreement) or modelling perspectives of individual annotators or groups thereof. We adapt a multi-task architecture -- which has previously shown success in modelling perspectives -- to evaluate its performance on the SEMEVAL Task 11. We do so by c… ▽ More There are two competing approaches for modelling annotator disagreement: distributional soft-labelling approaches (which aim to capture the level of disagreement) or modelling perspectives of individual annotators or groups thereof. We adapt a multi-task architecture -- which has previously shown success in modelling perspectives -- to evaluate its performance on the SEMEVAL Task 11. We do so by combining both approaches, i.e. predicting individual annotator perspectives as an interim step towards predicting annotator disagreement. Despite its previous success, we found that a multi-task approach performed poorly on datasets which contained distinct annotator opinions, suggesting that this approach may not always be suitable when modelling perspectives. Furthermore, our results explain that while strongly perspectivist approaches might not achieve state-of-the-art performance according to evaluation metrics used by distributional approaches, our approach allows for a more nuanced understanding of individual perspectives present in the data. We argue that perspectivist approaches are preferable because they enable decision makers to amplify minority views, and that it is important to re-evaluate metrics to reflect this goal. △ Less

Submitted 10 May, 2023; originally announced May 2023.

Comments: To appear in the Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023). Association for Computational Linguistics, 2023

arXiv:2304.12918 [pdf, other]

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models

Authors: Alex Foote, Neel Nanda, Esben Kran, Ionnis Konstas, Fazl Barez

Abstract: Understanding the function of individual neurons within language models is essential for mechanistic interpretability research. We propose $\textbf{Neuron to Graph (N2G)}$, a tool which takes a neuron and its dataset examples, and automatically distills the neuron's behaviour on those examples to an interpretable graph. This presents a less labour intensive approach to interpreting neurons than cu… ▽ More Understanding the function of individual neurons within language models is essential for mechanistic interpretability research. We propose $\textbf{Neuron to Graph (N2G)}$, a tool which takes a neuron and its dataset examples, and automatically distills the neuron's behaviour on those examples to an interpretable graph. This presents a less labour intensive approach to interpreting neurons than current manual methods, that will better scale these methods to Large Language Models (LLMs). We use truncation and saliency methods to only present the important tokens, and augment the dataset examples with more diverse samples to better capture the extent of neuron behaviour. These graphs can be visualised to aid manual interpretation by researchers, but can also output token activations on text to compare to the neuron's ground truth activations for automatic validation. N2G represents a step towards scalable interpretability methods by allowing us to convert neurons in an LLM to interpretable representations of measurable quality. △ Less

Submitted 22 April, 2023; originally announced April 2023.

Comments: To be published at ICLR 2023 Workshop on Trustworthy and Reliable Large-Scale Machine Learning Models

arXiv:2211.04534 [pdf, other]

Going for GOAL: A Resource for Grounded Football Commentaries

Authors: Alessandro Suglia, José Lopes, Emanuele Bastianelli, Andrea Vanzo, Shubham Agarwal, Malvina Nikandrou, Lu Yu, Ioannis Konstas, Verena Rieser

Abstract: Recent video+language datasets cover domains where the interaction is highly structured, such as instructional videos, or where the interaction is scripted, such as TV shows. Both of these properties can lead to spurious cues to be exploited by models rather than learning to ground language. In this paper, we present GrOunded footbAlL commentaries (GOAL), a novel dataset of football (or `soccer')… ▽ More Recent video+language datasets cover domains where the interaction is highly structured, such as instructional videos, or where the interaction is scripted, such as TV shows. Both of these properties can lead to spurious cues to be exploited by models rather than learning to ground language. In this paper, we present GrOunded footbAlL commentaries (GOAL), a novel dataset of football (or `soccer') highlights videos with transcribed live commentaries in English. As the course of a game is unpredictable, so are commentaries, which makes them a unique resource to investigate dynamic language grounding. We also provide state-of-the-art baselines for the following tasks: frame reordering, moment retrieval, live commentary retrieval and play-by-play live commentary generation. Results show that SOTA models perform reasonably well in most tasks. We discuss the implications of these results and suggest new tasks for which GOAL can be used. Our codebase is available at: https://gitlab.com/grounded-sport-convai/goal-baselines. △ Less

Submitted 8 November, 2022; originally announced November 2022.

Comments: Preprint formatted using the ACM Multimedia template (8 pages + appendix)

arXiv:2210.07373 [pdf, other]

doi 10.18653/v1/2023.eacl-main.176

Mind the Labels: Describing Relations in Knowledge Graphs With Pretrained Models

Authors: Zdeněk Kasner, Ioannis Konstas, Ondřej Dušek

Abstract: Pretrained language models (PLMs) for data-to-text (D2T) generation can use human-readable data labels such as column headings, keys, or relation names to generalize to out-of-domain examples. However, the models are well-known in producing semantically inaccurate outputs if these labels are ambiguous or incomplete, which is often the case in D2T datasets. In this paper, we expose this issue on th… ▽ More Pretrained language models (PLMs) for data-to-text (D2T) generation can use human-readable data labels such as column headings, keys, or relation names to generalize to out-of-domain examples. However, the models are well-known in producing semantically inaccurate outputs if these labels are ambiguous or incomplete, which is often the case in D2T datasets. In this paper, we expose this issue on the task of descibing a relation between two entities. For our experiments, we collect a novel dataset for verbalizing a diverse set of 1,522 unique relations from three large-scale knowledge graphs (Wikidata, DBPedia, YAGO). We find that although PLMs for D2T generation expectedly fail on unclear cases, models trained with a large variety of relation labels are surprisingly robust in verbalizing novel, unseen relations. We argue that using data with a diverse set of clear and meaningful labels is key to training D2T generation systems capable of generalizing to novel domains. △ Less

Submitted 16 October, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

Comments: Long paper at EACL '23. Code and data: https://github.com/kasnerz/rel2text

ACM Class: I.2.7

arXiv:2210.00044 [pdf, other]

Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering

Authors: Mavina Nikandrou, Lu Yu, Alessandro Suglia, Ioannis Konstas, Verena Rieser

Abstract: Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge. Although continual learning has been widely studied in computer vision, its application to Vision+Language tasks is not that straightforward, as settings can be parameterized in multiple ways according to their input modalities. In this paper, we present a detailed study of how diff… ▽ More Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge. Although continual learning has been widely studied in computer vision, its application to Vision+Language tasks is not that straightforward, as settings can be parameterized in multiple ways according to their input modalities. In this paper, we present a detailed study of how different settings affect performance for Visual Question Answering. We first propose three plausible task formulations and demonstrate their impact on the performance of continual learning algorithms. We break down several factors of task similarity, showing that performance and sensitivity to task order highly depend on the shift of the output distribution. We also investigate the potential of pretrained models and compare the robustness of transformer models with different visual embeddings. Finally, we provide an analysis interpreting model representations and their impact on forgetting. Our results highlight the importance of stabilizing visual representations in deeper layers. △ Less

Submitted 20 January, 2024; v1 submitted 30 September, 2022; originally announced October 2022.

arXiv:2110.01295 [pdf, other]

SPaR.txt, a cheap Shallow Parsing approach for Regulatory texts

Authors: Ruben Kruiper, Ioannis Konstas, Alasdair Gray, Farhad Sadeghineko, Richard Watson, Bimal Kumar

Abstract: Automated Compliance Checking (ACC) systems aim to semantically parse building regulations to a set of rules. However, semantic parsing is known to be hard and requires large amounts of training data. The complexity of creating such training data has led to research that focuses on small sub-tasks, such as shallow parsing or the extraction of a limited subset of rules. This study introduces a shal… ▽ More Automated Compliance Checking (ACC) systems aim to semantically parse building regulations to a set of rules. However, semantic parsing is known to be hard and requires large amounts of training data. The complexity of creating such training data has led to research that focuses on small sub-tasks, such as shallow parsing or the extraction of a limited subset of rules. This study introduces a shallow parsing task for which training data is relatively cheap to create, with the aim of learning a lexicon for ACC. We annotate a small domain-specific dataset of 200 sentences, SPaR.txt, and train a sequence tagger that achieves 79,93 F1-score on the test set. We then show through manual evaluation that the model identifies most (89,84%) defined terms in a set of building regulation documents, and that both contiguous and discontiguous Multi-Word Expressions (MWE) are discovered with reasonable accuracy (70,3%). △ Less

Submitted 4 October, 2021; originally announced October 2021.

Comments: To be published in the NLLP workshop at EMNLP 2021, 9 pages (15 including reference and appendices). For the ScotReg corpus, SPaR.txt dataset and code see: http://github.com/rubenkruiper/SPaR.txt

arXiv:2109.10650 [pdf, other]

MiRANews: Dataset and Benchmarks for Multi-Resource-Assisted News Summarization

Authors: Xinnuo Xu, Ondřej Dušek, Shashi Narayan, Verena Rieser, Ioannis Konstas

Abstract: One of the most challenging aspects of current single-document news summarization is that the summary often contains 'extrinsic hallucinations', i.e., facts that are not present in the source document, which are often derived via world knowledge. This causes summarization systems to act more like open-ended language models tending to hallucinate facts that are erroneous. In this paper, we mitigate… ▽ More One of the most challenging aspects of current single-document news summarization is that the summary often contains 'extrinsic hallucinations', i.e., facts that are not present in the source document, which are often derived via world knowledge. This causes summarization systems to act more like open-ended language models tending to hallucinate facts that are erroneous. In this paper, we mitigate this problem with the help of multiple supplementary resource documents assisting the task. We present a new dataset MiRANews and benchmark existing summarization models. In contrast to multi-document summarization, which addresses multiple events from several source documents, we still aim at generating a summary for a single document. We show via data analysis that it's not only the models which are to blame: more than 27% of facts mentioned in the gold summaries of MiRANews are better grounded on assisting documents than in the main source articles. An error analysis of generated summaries from pretrained models fine-tuned on MiRANews reveals that this has an even bigger effects on models: assisted summarization reduces 55% of hallucinations when compared to single-document summarization models trained on the main article only. Our code and data are available at https://github.com/XinnuoXu/MiRANews. △ Less

Submitted 22 September, 2021; originally announced September 2021.

Journal ref: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing Findings (EMNLP2021 Findings)

arXiv:2106.05580 [pdf, other]

AGGGEN: Ordering and Aggregating while Generating

Authors: Xinnuo Xu, Ondřej Dušek, Verena Rieser, Ioannis Konstas

Abstract: We present AGGGEN (pronounced 'again'), a data-to-text model which re-introduces two explicit sentence planning stages into neural data-to-text systems: input ordering and input aggregation. In contrast to previous work using sentence planning, our model is still end-to-end: AGGGEN performs sentence planning at the same time as generating text by learning latent alignments (via semantic facts) bet… ▽ More We present AGGGEN (pronounced 'again'), a data-to-text model which re-introduces two explicit sentence planning stages into neural data-to-text systems: input ordering and input aggregation. In contrast to previous work using sentence planning, our model is still end-to-end: AGGGEN performs sentence planning at the same time as generating text by learning latent alignments (via semantic facts) between input representation and target text. Experiments on the WebNLG and E2E challenge data show that by using fact-based alignments our approach is more interpretable, expressive, robust to noise, and easier to control, while retaining the advantages of end-to-end systems in terms of fluency. Our code is available at https://github.com/XinnuoXu/AggGen. △ Less

Submitted 17 June, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

Comments: Correct the first citation in the Zero-shot Few-shot scenarios paragraph in Section 7

Journal ref: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL2021)

arXiv:2105.13710 [pdf, other]

OTTers: One-turn Topic Transitions for Open-Domain Dialogue

Authors: Karin Sevegnani, David M. Howcroft, Ioannis Konstas, Verena Rieser

Abstract: Mixed initiative in open-domain dialogue requires a system to pro-actively introduce new topics. The one-turn topic transition task explores how a system connects two topics in a cooperative and coherent manner. The goal of the task is to generate a "bridging" utterance connecting the new topic to the topic of the previous conversation turn. We are especially interested in commonsense explanations… ▽ More Mixed initiative in open-domain dialogue requires a system to pro-actively introduce new topics. The one-turn topic transition task explores how a system connects two topics in a cooperative and coherent manner. The goal of the task is to generate a "bridging" utterance connecting the new topic to the topic of the previous conversation turn. We are especially interested in commonsense explanations of how a new topic relates to what has been mentioned before. We first collect a new dataset of human one-turn topic transitions, which we call OTTers. We then explore different strategies used by humans when asked to complete such a task, and notice that the use of a bridging utterance to connect the two topics is the approach used the most. We finally show how existing state-of-the-art text generation models can be adapted to this task and examine the performance of these baselines on different splits of the OTTers data. △ Less

Submitted 28 May, 2021; originally announced May 2021.

Journal ref: ACL2021

arXiv:2102.00424 [pdf, other]

An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games

Authors: Alessandro Suglia, Yonatan Bisk, Ioannis Konstas, Antonio Vergari, Emanuele Bastianelli, Andrea Vanzo, Oliver Lemon

Abstract: Guessing games are a prototypical instance of the "learning by interacting" paradigm. This work investigates how well an artificial agent can benefit from playing guessing games when later asked to perform on novel NLP downstream tasks such as Visual Question Answering (VQA). We propose two ways to exploit playing guessing games: 1) a supervised learning scenario in which the agent learns to mimic… ▽ More Guessing games are a prototypical instance of the "learning by interacting" paradigm. This work investigates how well an artificial agent can benefit from playing guessing games when later asked to perform on novel NLP downstream tasks such as Visual Question Answering (VQA). We propose two ways to exploit playing guessing games: 1) a supervised learning scenario in which the agent learns to mimic successful guessing games and 2) a novel way for an agent to play by itself, called Self-play via Iterated Experience Learning (SPIEL). We evaluate the ability of both procedures to generalize: an in-domain evaluation shows an increased accuracy (+7.79) compared with competitors on the evaluation suite CompGuessWhat?!; a transfer evaluation shows improved performance for VQA on the TDIUC dataset in terms of harmonic average accuracy (+5.31) thanks to more fine-grained object representations learned via SPIEL. △ Less

Submitted 31 January, 2021; originally announced February 2021.

Comments: Accepted paper for the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021)

arXiv:2011.02917 [pdf, other]

Imagining Grounded Conceptual Representations from Perceptual Information in Situated Guessing Games

Authors: Alessandro Suglia, Antonio Vergari, Ioannis Konstas, Yonatan Bisk, Emanuele Bastianelli, Andrea Vanzo, Oliver Lemon

Abstract: In visual guessing games, a Guesser has to identify a target object in a scene by asking questions to an Oracle. An effective strategy for the players is to learn conceptual representations of objects that are both discriminative and expressive enough to ask questions and guess correctly. However, as shown by Suglia et al. (2020), existing models fail to learn truly multi-modal representations, re… ▽ More In visual guessing games, a Guesser has to identify a target object in a scene by asking questions to an Oracle. An effective strategy for the players is to learn conceptual representations of objects that are both discriminative and expressive enough to ask questions and guess correctly. However, as shown by Suglia et al. (2020), existing models fail to learn truly multi-modal representations, relying instead on gold category labels for objects in the scene both at training and inference time. This provides an unnatural performance advantage when categories at inference time match those at training time, and it causes models to fail in more realistic "zero-shot" scenarios where out-of-domain object categories are involved. To overcome this issue, we introduce a novel "imagination" module based on Regularized Auto-Encoders, that learns context-aware and category-aware latent embeddings without relying on category labels at inference time. Our imagination module outperforms state-of-the-art competitors by 8.26% gameplay accuracy in the CompGuessWhat?! zero-shot scenario (Suglia et al., 2020), and it improves the Oracle and Guesser accuracy by 2.08% and 12.86% in the GuessWhat?! benchmark, when no gold categories are available at inference time. The imagination module also boosts reasoning about object properties and attributes. △ Less

Submitted 5 November, 2020; originally announced November 2020.

Comments: Accepted to the International Conference on Computational Linguistics (COLING) 2020

arXiv:2006.02174 [pdf, other]

CompGuessWhat?!: A Multi-task Evaluation Framework for Grounded Language Learning

Authors: Alessandro Suglia, Ioannis Konstas, Andrea Vanzo, Emanuele Bastianelli, Desmond Elliott, Stella Frank, Oliver Lemon

Abstract: Approaches to Grounded Language Learning typically focus on a single task-based final performance measure that may not depend on desirable properties of the learned hidden representations, such as their ability to predict salient attributes or to generalise to unseen situations. To remedy this, we present GROLLA, an evaluation framework for Grounded Language Learning with Attributes with three sub… ▽ More Approaches to Grounded Language Learning typically focus on a single task-based final performance measure that may not depend on desirable properties of the learned hidden representations, such as their ability to predict salient attributes or to generalise to unseen situations. To remedy this, we present GROLLA, an evaluation framework for Grounded Language Learning with Attributes with three sub-tasks: 1) Goal-oriented evaluation; 2) Object attribute prediction evaluation; and 3) Zero-shot evaluation. We also propose a new dataset CompGuessWhat?! as an instance of this framework for evaluating the quality of learned neural representations, in particular concerning attribute grounding. To this end, we extend the original GuessWhat?! dataset by including a semantic layer on top of the perceptual one. Specifically, we enrich the VisualGenome scene graphs associated with the GuessWhat?! images with abstract and situated attributes. By using diagnostic classifiers, we show that current models learn representations that are not expressive enough to encode object attributes (average F1 of 44.27). In addition, they do not learn strategies nor representations that are robust enough to perform well when novel scenes or objects are involved in gameplay (zero-shot best accuracy 50.06%). △ Less

Submitted 3 June, 2020; originally announced June 2020.

Comments: Accepted to the Annual Conference of the Association for Computational Linguistics (ACL) 2020

arXiv:2005.07753 [pdf, other]

A Scientific Information Extraction Dataset for Nature Inspired Engineering

Authors: Ruben Kruiper, Julian F. V. Vincent, Jessica Chen-Burger, Marc P. Y. Desmulliez, Ioannis Konstas

Abstract: Nature has inspired various ground-breaking technological developments in applications ranging from robotics to aerospace engineering and the manufacturing of medical devices. However, accessing the information captured in scientific biology texts is a time-consuming and hard task that requires domain-specific knowledge. Improving access for outsiders can help interdisciplinary research like Natur… ▽ More Nature has inspired various ground-breaking technological developments in applications ranging from robotics to aerospace engineering and the manufacturing of medical devices. However, accessing the information captured in scientific biology texts is a time-consuming and hard task that requires domain-specific knowledge. Improving access for outsiders can help interdisciplinary research like Nature Inspired Engineering. This paper describes a dataset of 1,500 manually-annotated sentences that express domain-independent relations between central concepts in a scientific biology text, such as trade-offs and correlations. The arguments of these relations can be Multi Word Expressions and have been annotated with modifying phrases to form non-projective graphs. The dataset allows for training and evaluating Relation Extraction algorithms that aim for coarse-grained ty** of scientific biological documents, enabling a high-level filter for engineers. △ Less

Submitted 26 May, 2020; v1 submitted 15 May, 2020; originally announced May 2020.

Comments: Published in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020). Updated dataset statistics, results unchanged

arXiv:2005.07751 [pdf, other]

In Layman's Terms: Semi-Open Relation Extraction from Scientific Texts

Authors: Ruben Kruiper, Julian F. V. Vincent, Jessica Chen-Burger, Marc P. Y. Desmulliez, Ioannis Konstas

Abstract: Information Extraction (IE) from scientific texts can be used to guide readers to the central information in scientific documents. But narrow IE systems extract only a fraction of the information captured, and Open IE systems do not perform well on the long and complex sentences encountered in scientific texts. In this work we combine the output of both types of systems to achieve Semi-Open Relati… ▽ More Information Extraction (IE) from scientific texts can be used to guide readers to the central information in scientific documents. But narrow IE systems extract only a fraction of the information captured, and Open IE systems do not perform well on the long and complex sentences encountered in scientific texts. In this work we combine the output of both types of systems to achieve Semi-Open Relation Extraction, a new task that we explore in the Biology domain. First, we present the Focused Open Biological Information Extraction (FOBIE) dataset and use FOBIE to train a state-of-the-art narrow scientific IE system to extract trade-off relations and arguments that are central to biology texts. We then run both the narrow IE system and a state-of-the-art Open IE system on a corpus of 10k open-access scientific biological texts. We show that a significant amount (65%) of erroneous and uninformative Open IE extractions can be filtered using narrow IE extractions. Furthermore, we show that the retained extractions are significantly more often informative to a reader. △ Less

Submitted 26 May, 2020; v1 submitted 15 May, 2020; originally announced May 2020.

Comments: To be published in ACL 2020 conference proceedings. Updated dataset statistics, results unchanged

arXiv:2005.07493 [pdf, other]

History for Visual Dialog: Do we really need it?

Authors: Shubham Agarwal, Trung Bui, Joon-Young Lee, Ioannis Konstas, Verena Rieser

Abstract: Visual Dialog involves "understanding" the dialog history (what has been discussed previously) and the current question (what is asked), in addition to grounding information in the image, to generate the correct response. In this paper, we show that co-attention models which explicitly encode dialog history outperform models that don't, achieving state-of-the-art performance (72 % NDCG on val set)… ▽ More Visual Dialog involves "understanding" the dialog history (what has been discussed previously) and the current question (what is asked), in addition to grounding information in the image, to generate the correct response. In this paper, we show that co-attention models which explicitly encode dialog history outperform models that don't, achieving state-of-the-art performance (72 % NDCG on val set). However, we also expose shortcomings of the crowd-sourcing dataset collection procedure by showing that history is indeed only required for a small amount of the data and that the current evaluation metric encourages generic replies. To that end, we propose a challenging subset (VisDialConv) of the VisDial val set and provide a benchmark of 63% NDCG. △ Less

Submitted 8 May, 2020; originally announced May 2020.

Comments: ACL'20

arXiv:1910.13299 [pdf, other]

Findings of the Third Workshop on Neural Generation and Translation

Authors: Hiroaki Hayashi, Yusuke Oda, Alexandra Birch, Ioannis Konstas, Andrew Finch, Minh-Thang Luong, Graham Neubig, Katsuhito Sudoh

Abstract: This document describes the findings of the Third Workshop on Neural Generation and Translation, held in concert with the annual conference of the Empirical Methods in Natural Language Processing (EMNLP 2019). First, we summarize the research trends of papers presented in the proceedings. Second, we describe the results of the two shared tasks 1) efficient neural machine translation (NMT) where pa… ▽ More This document describes the findings of the Third Workshop on Neural Generation and Translation, held in concert with the annual conference of the Empirical Methods in Natural Language Processing (EMNLP 2019). First, we summarize the research trends of papers presented in the proceedings. Second, we describe the results of the two shared tasks 1) efficient neural machine translation (NMT) where participants were tasked with creating NMT systems that are both accurate and efficient, and 2) document-level generation and translation (DGT) where participants were tasked with develo** systems that generate summaries from structured data, potentially with assistance from text in another language. △ Less

Submitted 29 October, 2019; v1 submitted 29 October, 2019; originally announced October 2019.

Comments: Fixed the metadata (author list)

arXiv:1910.04731 [pdf, other]

Automatic Quality Estimation for Natural Language Generation: Ranting (Jointly Rating and Ranking)

Authors: Ondřej Dušek, Karin Sevegnani, Ioannis Konstas, Verena Rieser

Abstract: We present a recurrent neural network based system for automatic quality estimation of natural language generation (NLG) outputs, which jointly learns to assign numerical ratings to individual outputs and to provide pairwise rankings of two different outputs. The latter is trained using pairwise hinge loss over scores from two copies of the rating network. We use learning to rank and synthetic d… ▽ More We present a recurrent neural network based system for automatic quality estimation of natural language generation (NLG) outputs, which jointly learns to assign numerical ratings to individual outputs and to provide pairwise rankings of two different outputs. The latter is trained using pairwise hinge loss over scores from two copies of the rating network. We use learning to rank and synthetic data to improve the quality of ratings assigned by our system: we synthesise training pairs of distorted system outputs and train the system to rank the less distorted one higher. This leads to a 12% increase in correlation with human ratings over the previous benchmark. We also establish the state of the art on the dataset of relative rankings from the E2E NLG Challenge (Dušek et al., 2019), where synthetic data lead to a 4% accuracy increase over the base model. △ Less

Submitted 10 October, 2019; originally announced October 2019.

Comments: Accepted as a short paper at INLG 2019

ACM Class: I.2.7

arXiv:1909.06644 [pdf, ps, other]

Current Challenges in Spoken Dialogue Systems and Why They Are Critical for Those Living with Dementia

Authors: Angus Addlesee, Arash Eshghi, Ioannis Konstas

Abstract: Dialogue technologies such as Amazon's Alexa have the potential to transform the healthcare industry. However, current systems are not yet naturally interactive: they are often turn-based, have naive end-of-turn detection and completely ignore many types of verbal and visual feedback - such as backchannels, hesitation markers, filled pauses, gaze, brow furrows and disfluencies - that are crucial i… ▽ More Dialogue technologies such as Amazon's Alexa have the potential to transform the healthcare industry. However, current systems are not yet naturally interactive: they are often turn-based, have naive end-of-turn detection and completely ignore many types of verbal and visual feedback - such as backchannels, hesitation markers, filled pauses, gaze, brow furrows and disfluencies - that are crucial in guiding and managing the conversational process. This is especially important in the healthcare industry as target users of Spoken Dialogue Systems (SDSs) are likely to be frail, older, distracted or suffer from cognitive decline which impacts their ability to make effective use of current systems. In this paper, we outline some of the challenges that are in urgent need of further research, including Incremental Speech Recognition and a systematic study of the interactional patterns in conversation that are potentially diagnostic of dementia, and how these might inform research on and the design of the next generation of SDSs. △ Less

Submitted 14 September, 2019; originally announced September 2019.

Comments: Published at Dialog for Good 2019 - Workshop on Speech and Language Technology Serving Society

Journal ref: Dialog for Good (2019)

arXiv:1904.03651 [pdf, other]

SEQ^3: Differentiable Sequence-to-Sequence-to-Sequence Autoencoder for Unsupervised Abstractive Sentence Compression

Authors: Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, Alexandros Potamianos

Abstract: Neural sequence-to-sequence models are currently the dominant approach in several natural language processing tasks, but require large parallel corpora. We present a sequence-to-sequence-to-sequence autoencoder (SEQ^3), consisting of two chained encoder-decoder pairs, with words used as a sequence of discrete latent variables. We apply the proposed model to unsupervised abstractive sentence compre… ▽ More Neural sequence-to-sequence models are currently the dominant approach in several natural language processing tasks, but require large parallel corpora. We present a sequence-to-sequence-to-sequence autoencoder (SEQ^3), consisting of two chained encoder-decoder pairs, with words used as a sequence of discrete latent variables. We apply the proposed model to unsupervised abstractive sentence compression, where the first and last sequences are the input and reconstructed sentences, respectively, while the middle sequence is the compressed sentence. Constraining the length of the latent word sequences forces the model to distill important information from the input. A pretrained language model, acting as a prior over the latent sequences, encourages the compressed sentences to be human-readable. Continuous relaxations enable us to sample from categorical distributions, allowing gradient-based optimization, unlike alternatives that rely on reinforcement learning. The proposed model does not require parallel text-summary pairs, achieving promising results in unsupervised sentence compression on benchmark datasets. △ Less

Submitted 9 June, 2019; v1 submitted 7 April, 2019; originally announced April 2019.

Comments: Accepted to NAACL 2019

arXiv:1810.11955 [pdf, other]

Improving Context Modelling in Multimodal Dialogue Generation

Authors: Shubham Agarwal, Ondrej Dusek, Ioannis Konstas, Verena Rieser

Abstract: In this work, we investigate the task of textual response generation in a multimodal task-oriented dialogue system. Our work is based on the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017) in the fashion domain. We introduce a multimodal extension to the Hierarchical Recurrent Encoder-Decoder (HRED) model and show that this extension outperforms strong baselines in terms of… ▽ More In this work, we investigate the task of textual response generation in a multimodal task-oriented dialogue system. Our work is based on the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017) in the fashion domain. We introduce a multimodal extension to the Hierarchical Recurrent Encoder-Decoder (HRED) model and show that this extension outperforms strong baselines in terms of text-based similarity metrics. We also showcase the shortcomings of current vision and language models by performing an error analysis on our system's output. △ Less

Submitted 20 October, 2018; originally announced October 2018.

Journal ref: Proceedings of the 11th International Conference on Natural Language Generation, pages 129-134, Tilburg, The Netherlands, 2018

arXiv:1810.11954 [pdf, other]

A Knowledge-Grounded Multimodal Search-Based Conversational Agent

Authors: Shubham Agarwal, Ondrej Dusek, Ioannis Konstas, Verena Rieser

Abstract: Multimodal search-based dialogue is a challenging new task: It extends visually grounded question answering systems into multi-turn conversations with access to an external database. We address this new challenge by learning a neural response generation system from the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017). We introduce a knowledge-grounded multimodal conversation… ▽ More Multimodal search-based dialogue is a challenging new task: It extends visually grounded question answering systems into multi-turn conversations with access to an external database. We address this new challenge by learning a neural response generation system from the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017). We introduce a knowledge-grounded multimodal conversational model where an encoded knowledge base (KB) representation is appended to the decoder input. Our model substantially outperforms strong baselines in terms of text-based similarity measures (over 9 BLEU points, 3 of which are solely due to the use of additional information from the KB. △ Less

Submitted 20 October, 2018; originally announced October 2018.

Journal ref: Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pages 59-66, Brussels, Belgium, October 2018

arXiv:1809.06873 [pdf, other]

Better Conversations by Modeling,Filtering,and Optimizing for Coherence and Diversity

Authors: Xinnuo Xu, Ondřej Dušek, Ioannis Konstas, Verena Rieser

Abstract: We present three enhancements to existing encoder-decoder models for open-domain conversational agents, aimed at effectively modeling coherence and promoting output diversity: (1) We introduce a measure of coherence as the GloVe embedding similarity between the dialogue context and the generated response, (2) we filter our training corpora based on the measure of coherence to obtain topically cohe… ▽ More We present three enhancements to existing encoder-decoder models for open-domain conversational agents, aimed at effectively modeling coherence and promoting output diversity: (1) We introduce a measure of coherence as the GloVe embedding similarity between the dialogue context and the generated response, (2) we filter our training corpora based on the measure of coherence to obtain topically coherent and lexically diverse context-response pairs, (3) we then train a response generator using a conditional variational autoencoder model that incorporates the measure of coherence as a latent variable and uses a context gate to guarantee topical consistency with the context and promote lexical diversity. Experiments on the OpenSubtitles corpus show a substantial improvement over competitive neural models in terms of BLEU score as well as metrics of coherence and diversity. △ Less

Submitted 18 September, 2018; originally announced September 2018.

Journal ref: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3981-3991, Brussels, Belgium, November 2018

arXiv:1808.09588 [pdf, ps, other]

Map** Language to Code in Programmatic Context

Authors: Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Luke Zettlemoyer

Abstract: Source code is rarely written in isolation. It depends significantly on the programmatic context, such as the class that the code would reside in. To study this phenomenon, we introduce the task of generating class member functions given English documentation and the programmatic context provided by the rest of the class. This task is challenging because the desired code can vary greatly depending… ▽ More Source code is rarely written in isolation. It depends significantly on the programmatic context, such as the class that the code would reside in. To study this phenomenon, we introduce the task of generating class member functions given English documentation and the programmatic context provided by the rest of the class. This task is challenging because the desired code can vary greatly depending on the functionality the class provides (e.g., a sort function may or may not be available when we are asked to "return the smallest element" in a particular member variable list). We introduce CONCODE, a new large dataset with over 100,000 examples consisting of Java classes from online code repositories, and develop a new encoder-decoder architecture that models the interaction between the method documentation and the class environment. We also present a detailed error analysis suggesting that there is significant room for future work on this task. △ Less

Submitted 28 August, 2018; originally announced August 2018.

Comments: Accepted at EMNLP 2018

arXiv:1704.08760 [pdf, other]

Learning a Neural Semantic Parser from User Feedback

Authors: Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, Luke Zettlemoyer

Abstract: We present an approach to rapidly and easily build natural language interfaces to databases for new domains, whose performance improves over time based on user feedback, and requires minimal intervention. To achieve this, we adapt neural sequence models to map utterances directly to SQL with its full expressivity, bypassing any intermediate meaning representations. These models are immediately dep… ▽ More We present an approach to rapidly and easily build natural language interfaces to databases for new domains, whose performance improves over time based on user feedback, and requires minimal intervention. To achieve this, we adapt neural sequence models to map utterances directly to SQL with its full expressivity, bypassing any intermediate meaning representations. These models are immediately deployed online to solicit feedback from real users to flag incorrect queries. Finally, the popularity of SQL facilitates gathering annotations for incorrect predictions using the crowd, which is directly used to improve our models. This complete feedback loop, without intermediate representations or database specific engineering, opens up new ways of building high quality semantic parsers. Experiments suggest that this approach can be deployed quickly for any new target domain, as we show by learning a semantic parser for an online academic database from scratch. △ Less

Submitted 27 April, 2017; originally announced April 2017.

Comments: Accepted at ACL 2017

arXiv:1704.08381 [pdf, other]

Neural AMR: Sequence-to-Sequence Models for Parsing and Generation

Authors: Ioannis Konstas, Srinivasan Iyer, Mark Yatskar, Ye** Choi, Luke Zettlemoyer

Abstract: Sequence-to-sequence models have shown strong performance across a broad range of applications. However, their application to parsing and generating text usingAbstract Meaning Representation (AMR)has been limited, due to the relatively limited amount of labeled data and the non-sequential nature of the AMR graphs. We present a novel training procedure that can lift this limitation using millions o… ▽ More Sequence-to-sequence models have shown strong performance across a broad range of applications. However, their application to parsing and generating text usingAbstract Meaning Representation (AMR)has been limited, due to the relatively limited amount of labeled data and the non-sequential nature of the AMR graphs. We present a novel training procedure that can lift this limitation using millions of unlabeled sentences and careful preprocessing of the AMR graphs. For AMR parsing, our model achieves competitive results of 62.1SMATCH, the current best score reported without significant use of external semantic resources. For AMR generation, our model establishes a new state-of-the-art performance of BLEU 33.8. We present extensive ablative and qualitative analysis including strong evidence that sequence-based AMR models are robust against ordering variations of graph-to-sequence conversions. △ Less

Submitted 18 August, 2017; v1 submitted 26 April, 2017; originally announced April 2017.

Comments: Accepted in ACL 2017

arXiv:1702.01841 [pdf, ps, other]

The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task

Authors: Roy Schwartz, Maarten Sap, Ioannis Konstas, Li Zilles, Ye** Choi, Noah A. Smith

Abstract: A writer's style depends not just on personal traits but also on her intent and mental state. In this paper, we show how variants of the same writing task can lead to measurable differences in writing style. We present a case study based on the story cloze task (Mostafazadeh et al., 2016a), where annotators were assigned similar writing tasks with different constraints: (1) writing an entire story… ▽ More A writer's style depends not just on personal traits but also on her intent and mental state. In this paper, we show how variants of the same writing task can lead to measurable differences in writing style. We present a case study based on the story cloze task (Mostafazadeh et al., 2016a), where annotators were assigned similar writing tasks with different constraints: (1) writing an entire story, (2) adding a story ending for a given story context, and (3) adding an incoherent ending to a story. We show that a simple linear classifier informed by stylistic features is able to successfully distinguish among the three cases, without even looking at the story context. In addition, combining our stylistic features with language model predictions reaches state of the art performance on the story cloze challenge. Our results demonstrate that different task framings can dramatically affect the way people write. △ Less

Submitted 13 July, 2017; v1 submitted 6 February, 2017; originally announced February 2017.

Comments: 11 pages, CoNLL 2017

arXiv:1610.06210 [pdf, other]

A Theme-Rewriting Approach for Generating Algebra Word Problems

Authors: Rik Koncel-Kedziorski, Ioannis Konstas, Luke Zettlemoyer, Hannaneh Hajishirzi

Abstract: Texts present coherent stories that have a particular theme or overall setting, for example science fiction or western. In this paper, we present a text generation method called {\it rewriting} that edits existing human-authored narratives to change their theme without changing the underlying story. We apply the approach to math word problems, where it might help students stay more engaged by quic… ▽ More Texts present coherent stories that have a particular theme or overall setting, for example science fiction or western. In this paper, we present a text generation method called {\it rewriting} that edits existing human-authored narratives to change their theme without changing the underlying story. We apply the approach to math word problems, where it might help students stay more engaged by quickly transforming all of their homework assignments to the theme of their favorite movie without changing the math concepts that are being taught. Our rewriting method uses a two-stage decoding process, which proposes new words from the target theme and scores the resulting stories according to a number of factors defining aspects of syntactic, semantic, and thematic coherence. Experiments demonstrate that the final stories typically represent the new theme well while still testing the original math concepts, outperforming a number of baselines. We also release a new dataset of human-authored rewrites of math word problems in several themes. △ Less

Submitted 19 October, 2016; originally announced October 2016.

Comments: To appear EMNLP 2016

Showing 1–37 of 37 results for author: Konstas, I