-
Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?
Authors:
Letitia Parcalabescu,
Anette Frank
Abstract:
Vision and language model (VLM) decoders are currently the best-performing architectures on multimodal tasks. Next to predictions, they can also produce explanations, either in post-hoc or CoT settings. However, it is not clear how much they use the vision and text modalities when generating predictions or explanations. In this work, we investigate if VLMs rely on modalities differently when they…
▽ More
Vision and language model (VLM) decoders are currently the best-performing architectures on multimodal tasks. Next to predictions, they can also produce explanations, either in post-hoc or CoT settings. However, it is not clear how much they use the vision and text modalities when generating predictions or explanations. In this work, we investigate if VLMs rely on modalities differently when they produce explanations as opposed to providing answers. We also evaluate the self-consistency of VLM decoders in both post-hoc and CoT explanation settings, by extending existing unimodal tests and measures to VLM decoders. We find that VLMs are less self-consistent than LLMs. Text contributions in VL decoders are more important than image contributions in all examined tasks. Moreover, the contributions of images are significantly stronger for explanation generation compared to answer generation. This difference is even larger in CoT compared to post-hoc explanations. Lastly, we provide an up-to-date benchmarking of state-of-the-art VL decoders on the VALSE benchmark, which before only covered VL encoders. We find that VL decoders still struggle with most phenomena tested by VALSE.
△ Less
Submitted 10 June, 2024; v1 submitted 29 April, 2024;
originally announced April 2024.
-
PAKT: Perspectivized Argumentation Knowledge Graph and Tool for Deliberation Analysis (with Supplementary Materials)
Authors:
Moritz Plenz,
Philipp Heinisch,
Anette Frank,
Philipp Cimiano
Abstract:
Deliberative processes play a vital role in sha** opinions, decisions and policies in our society. In contrast to persuasive debates, deliberation aims to foster understanding of conflicting perspectives among interested parties. The exchange of arguments in deliberation serves to elucidate viewpoints, to raise awareness of conflicting interests, and to finally converge on a resolution. To bette…
▽ More
Deliberative processes play a vital role in sha** opinions, decisions and policies in our society. In contrast to persuasive debates, deliberation aims to foster understanding of conflicting perspectives among interested parties. The exchange of arguments in deliberation serves to elucidate viewpoints, to raise awareness of conflicting interests, and to finally converge on a resolution. To better understand and analyze the underlying processes of deliberation, we propose PAKT, a Perspectivized Argumentation Knowledge Graph and Tool. The graph structures the argumentative space across diverse topics, where arguments i) are divided into premises and conclusions, ii) are annotated for stances, framings and their underlying values and iii) are connected to background knowledge. We show how to construct PAKT and conduct case studies on the obtained multifaceted argumentation graph. Our findings show the analytical potential offered by our framework, highlighting the capability to go beyond individual arguments and to reveal structural patterns in the way participants and stakeholders argue in a debate. The overarching goal of our work is to facilitate constructive discourse and informed decision making as a special form of argumentation. We offer public access to PAKT and its rich capabilities to support analytics, visualizaton, navigation and efficient search, for diverse forms of argumentation.
△ Less
Submitted 16 April, 2024;
originally announced April 2024.
-
Clinical information extraction for Low-resource languages with Few-shot learning using Pre-trained language models and Prompting
Authors:
Phillip Richter-Pechanski,
Philipp Wiesenbach,
Dominic M. Schwab,
Christina Kiriakou,
Nicolas Geis,
Christoph Dieterich,
Anette Frank
Abstract:
Automatic extraction of medical information from clinical documents poses several challenges: high costs of required clinical expertise, limited interpretability of model predictions, restricted computational resources and privacy regulations. Recent advances in domain-adaptation and prompting methods showed promising results with minimal training data using lightweight masked language models, whi…
▽ More
Automatic extraction of medical information from clinical documents poses several challenges: high costs of required clinical expertise, limited interpretability of model predictions, restricted computational resources and privacy regulations. Recent advances in domain-adaptation and prompting methods showed promising results with minimal training data using lightweight masked language models, which are suited for well-established interpretability methods. We are first to present a systematic evaluation of these methods in a low-resource setting, by performing multi-class section classification on German doctor's letters. We conduct extensive class-wise evaluations supported by Shapley values, to validate the quality of our small training data set and to ensure the interpretability of model predictions. We demonstrate that a lightweight, domain-adapted pretrained model, prompted with just 20 shots, outperforms a traditional classification model by 30.5% accuracy. Our results serve as a process-oriented guideline for clinical information extraction projects working with low-resource.
△ Less
Submitted 20 March, 2024;
originally announced March 2024.
-
Exploring Continual Learning of Compositional Generalization in NLI
Authors:
Xiyan Fu,
Anette Frank
Abstract:
Compositional Natural Language Inference has been explored to assess the true abilities of neural models to perform NLI. Yet, current evaluations assume models to have full access to all primitive inferences in advance, in contrast to humans that continuously acquire inference knowledge. In this paper, we introduce the Continual Compositional Generalization in Inference (C2Gen NLI) challenge, wher…
▽ More
Compositional Natural Language Inference has been explored to assess the true abilities of neural models to perform NLI. Yet, current evaluations assume models to have full access to all primitive inferences in advance, in contrast to humans that continuously acquire inference knowledge. In this paper, we introduce the Continual Compositional Generalization in Inference (C2Gen NLI) challenge, where a model continuously acquires knowledge of constituting primitive inference tasks as a basis for compositional inferences. We explore how continual learning affects compositional generalization in NLI, by designing a continual learning setup for compositional NLI inference tasks. Our experiments demonstrate that models fail to compositionally generalize in a continual scenario. To address this problem, we first benchmark various continual learning algorithms and verify their efficacy. We then further analyze C2Gen, focusing on how to order primitives and compositional inference types and examining correlations between subtasks. Our analyses show that by learning subtasks continuously while observing their dependencies and increasing degrees of difficulty, continual learning can enhance composition generalization ability.
△ Less
Submitted 7 March, 2024;
originally announced March 2024.
-
Graph Language Models
Authors:
Moritz Plenz,
Anette Frank
Abstract:
While Language Models (LMs) are the workhorses of NLP, their interplay with structured knowledge graphs (KGs) is still actively researched. Current methods for encoding such graphs typically either (i) linearize them for embedding with LMs -- which underutilize structural information, or (ii) use Graph Neural Networks (GNNs) to preserve the graph structure -- but GNNs cannot represent text feature…
▽ More
While Language Models (LMs) are the workhorses of NLP, their interplay with structured knowledge graphs (KGs) is still actively researched. Current methods for encoding such graphs typically either (i) linearize them for embedding with LMs -- which underutilize structural information, or (ii) use Graph Neural Networks (GNNs) to preserve the graph structure -- but GNNs cannot represent text features as well as pretrained LMs. In our work we introduce a novel LM type, the Graph Language Model (GLM), that integrates the strengths of both approaches and mitigates their weaknesses. The GLM parameters are initialized from a pretrained LM to enhance understanding of individual graph concepts and triplets. Simultaneously, we design the GLM's architecture to incorporate graph biases, thereby promoting effective knowledge distribution within the graph. This enables GLMs to process graphs, texts, and interleaved inputs of both. Empirical evaluations on relation classification tasks show that GLM embeddings surpass both LM- and GNN-based baselines in supervised and zero-shot setting, demonstrating their versatility.
△ Less
Submitted 3 June, 2024; v1 submitted 13 January, 2024;
originally announced January 2024.
-
On Measuring Faithfulness or Self-consistency of Natural Language Explanations
Authors:
Letitia Parcalabescu,
Anette Frank
Abstract:
Large language models (LLMs) can explain their predictions through post-hoc or Chain-of-Thought (CoT) explanations. But an LLM could make up reasonably sounding explanations that are unfaithful to its underlying reasoning. Recent work has designed tests that aim to judge the faithfulness of post-hoc or CoT explanations. In this work we argue that these faithfulness tests do not measure faithfulnes…
▽ More
Large language models (LLMs) can explain their predictions through post-hoc or Chain-of-Thought (CoT) explanations. But an LLM could make up reasonably sounding explanations that are unfaithful to its underlying reasoning. Recent work has designed tests that aim to judge the faithfulness of post-hoc or CoT explanations. In this work we argue that these faithfulness tests do not measure faithfulness to the models' inner workings -- but rather their self-consistency at output level. Our contributions are three-fold: i) We clarify the status of faithfulness tests in view of model explainability, characterising them as self-consistency tests instead. This assessment we underline by ii) constructing a Comparative Consistency Bank for self-consistency tests that for the first time compares existing tests on a common suite of 11 open LLMs and 5 tasks -- including iii) our new self-consistency measure CC-SHAP. CC-SHAP is a fine-grained measure (not a test) of LLM self-consistency. It compares how a model's input contributes to the predicted answer and to generating the explanation. Our fine-grained CC-SHAP metric allows us iii) to compare LLM behaviour when making predictions and to analyse the effect of other consistency tests at a deeper level, which takes us one step further towards measuring faithfulness by bringing us closer to the internals of the model than strictly surface output-oriented tests. Our code is available at \url{https://github.com/Heidelberg-NLP/CC-SHAP}
△ Less
Submitted 1 June, 2024; v1 submitted 13 November, 2023;
originally announced November 2023.
-
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
Authors:
Ilker Kesen,
Andrea Pedrotti,
Mustafa Dogan,
Michele Cafagna,
Emre Can Acikgoz,
Letitia Parcalabescu,
Iacer Calixto,
Anette Frank,
Albert Gatt,
Aykut Erdem,
Erkut Erdem
Abstract:
With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm foo…
▽ More
With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm footing. Task-based evaluations, while valuable, fail to capture the complexities and specific temporal aspects of moving images that VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. ViLMA also includes proficiency tests, which assess basic capabilities deemed essential to solving the main counterfactual tests. We show that current VidLMs' grounding abilities are no better than those of vision-language models which use static images. This is especially striking once the performance on proficiency tests is factored in. Our benchmark serves as a catalyst for future research on VidLMs, hel** to highlight areas that still need to be explored.
△ Less
Submitted 12 November, 2023;
originally announced November 2023.
-
Dynamic MOdularized Reasoning for Compositional Structured Explanation Generation
Authors:
Xiyan Fu,
Anette Frank
Abstract:
Despite the success of neural models in solving reasoning tasks, their compositional generalization capabilities remain unclear. In this work, we propose a new setting of the structured explanation generation task to facilitate compositional reasoning research. Previous works found that symbolic methods achieve superior compositionality by using pre-defined inference rules for iterative reasoning.…
▽ More
Despite the success of neural models in solving reasoning tasks, their compositional generalization capabilities remain unclear. In this work, we propose a new setting of the structured explanation generation task to facilitate compositional reasoning research. Previous works found that symbolic methods achieve superior compositionality by using pre-defined inference rules for iterative reasoning. But these approaches rely on brittle symbolic transfers and are restricted to well-defined tasks. Hence, we propose a dynamic modularized reasoning model, MORSE, to improve the compositional generalization of neural models. MORSE factorizes the inference process into a combination of modules, where each module represents a functional unit. Specifically, we adopt modularized self-attention to dynamically select and route inputs to dedicated heads, which specializes them to specific functions. We conduct experiments for increasing lengths and shapes of reasoning trees on two benchmarks to test MORSE's compositional generalization abilities, and find it outperforms competitive baselines. Model ablation and deeper analyses show the effectiveness of dynamic reasoning modules and their generalization abilities.
△ Less
Submitted 14 September, 2023;
originally announced September 2023.
-
Graecia capta ferum victorem cepit. Detecting Latin Allusions to Ancient Greek Literature
Authors:
Frederick Riemenschneider,
Anette Frank
Abstract:
Intertextual allusions hold a pivotal role in Classical Philology, with Latin authors frequently referencing Ancient Greek texts. Until now, the automatic identification of these intertextual references has been constrained to monolingual approaches, seeking parallels solely within Latin or Greek texts. In this study, we introduce SPhilBERTa, a trilingual Sentence-RoBERTa model tailored for Classi…
▽ More
Intertextual allusions hold a pivotal role in Classical Philology, with Latin authors frequently referencing Ancient Greek texts. Until now, the automatic identification of these intertextual references has been constrained to monolingual approaches, seeking parallels solely within Latin or Greek texts. In this study, we introduce SPhilBERTa, a trilingual Sentence-RoBERTa model tailored for Classical Philology, which excels at cross-lingual semantic comprehension and identification of identical sentences across Ancient Greek, Latin, and English. We generate new training data by automatically translating English texts into Ancient Greek. Further, we present a case study, demonstrating SPhilBERTa's capability to facilitate automated detection of intertextual parallels. Our models and resources are available at https://github.com/Heidelberg-NLP/ancient-language-models.
△ Less
Submitted 23 August, 2023;
originally announced August 2023.
-
AMR4NLI: Interpretable and robust NLI measures from semantic graphs
Authors:
Juri Opitz,
Shira Wein,
Julius Steen,
Anette Frank,
Nathan Schneider
Abstract:
The task of natural language inference (NLI) asks whether a given premise (expressed in NL) entails a given NL hypothesis. NLI benchmarks contain human ratings of entailment, but the meaning relationships driving these ratings are not formalized. Can the underlying sentence pair relationships be made more explicit in an interpretable yet robust fashion? We compare semantic structures to represent…
▽ More
The task of natural language inference (NLI) asks whether a given premise (expressed in NL) entails a given NL hypothesis. NLI benchmarks contain human ratings of entailment, but the meaning relationships driving these ratings are not formalized. Can the underlying sentence pair relationships be made more explicit in an interpretable yet robust fashion? We compare semantic structures to represent premise and hypothesis, including sets of contextualized embeddings and semantic graphs (Abstract Meaning Representations), and measure whether the hypothesis is a semantic substructure of the premise, utilizing interpretable metrics. Our evaluation on three English benchmarks finds value in both contextualized embeddings and semantic graphs; moreover, they provide complementary signals, and can be leveraged together in a hybrid model.
△ Less
Submitted 5 September, 2023; v1 submitted 1 June, 2023;
originally announced June 2023.
-
With a Little Push, NLI Models can Robustly and Efficiently Predict Faithfulness
Authors:
Julius Steen,
Juri Opitz,
Anette Frank,
Katja Markert
Abstract:
Conditional language models still generate unfaithful output that is not supported by their input. These unfaithful generations jeopardize trust in real-world applications such as summarization or human-machine interaction, motivating a need for automatic faithfulness metrics. To implement such metrics, NLI models seem attractive, since they solve a strongly related task that comes with a wealth o…
▽ More
Conditional language models still generate unfaithful output that is not supported by their input. These unfaithful generations jeopardize trust in real-world applications such as summarization or human-machine interaction, motivating a need for automatic faithfulness metrics. To implement such metrics, NLI models seem attractive, since they solve a strongly related task that comes with a wealth of prior research and data. But recent research suggests that NLI models require costly additional machinery to perform reliably across datasets, e.g., by running inference on a cartesian product of input and generated sentences, or supporting them with a question-generation/answering step.
In this work we show that pure NLI models _can_ outperform more complex metrics when combining task-adaptive data augmentation with robust inference procedures. We propose: (1) Augmenting NLI training data to adapt NL inferences to the specificities of faithfulness prediction in dialogue; (2) Making use of both entailment and contradiction probabilities in NLI, and (3) Using Monte-Carlo dropout during inference. Applied to the TRUE benchmark, which combines faithfulness datasets across diverse domains and tasks, our approach strongly improves a vanilla NLI model and significantly outperforms previous work, while showing favourable computational cost.
△ Less
Submitted 26 May, 2023;
originally announced May 2023.
-
SETI: Systematicity Evaluation of Textual Inference
Authors:
Xiyan Fu,
Anette Frank
Abstract:
We propose SETI (Systematicity Evaluation of Textual Inference), a novel and comprehensive benchmark designed for evaluating pre-trained language models (PLMs) for their systematicity capabilities in the domain of textual inference. Specifically, SETI offers three different NLI tasks and corresponding datasets to evaluate various types of systematicity in reasoning processes. In order to solve the…
▽ More
We propose SETI (Systematicity Evaluation of Textual Inference), a novel and comprehensive benchmark designed for evaluating pre-trained language models (PLMs) for their systematicity capabilities in the domain of textual inference. Specifically, SETI offers three different NLI tasks and corresponding datasets to evaluate various types of systematicity in reasoning processes. In order to solve these tasks, models are required to perform compositional inference based on known primitive constituents. We conduct experiments of SETI on six widely used PLMs. Results show that various PLMs are able to solve unseen compositional inferences when having encountered the knowledge of how to combine primitives, with good performance. However, they are considerably limited when this knowledge is unknown to the model (40-100% points decrease). Furthermore, we find that PLMs can improve drastically once exposed to crucial compositional knowledge in minimalistic shots. These findings position SETI as the first benchmark for measuring the future progress of PLMs in achieving systematicity generalization in the textual inference.
△ Less
Submitted 24 May, 2023;
originally announced May 2023.
-
Exploring Large Language Models for Classical Philology
Authors:
Frederick Riemenschneider,
Anette Frank
Abstract:
Recent advances in NLP have led to the creation of powerful language models for many languages including Ancient Greek and Latin. While prior work on Classical languages unanimously uses BERT, in this work we create four language models for Ancient Greek that vary along two dimensions to study their versatility for tasks of interest for Classical languages: we explore (i) encoder-only and encoder-…
▽ More
Recent advances in NLP have led to the creation of powerful language models for many languages including Ancient Greek and Latin. While prior work on Classical languages unanimously uses BERT, in this work we create four language models for Ancient Greek that vary along two dimensions to study their versatility for tasks of interest for Classical languages: we explore (i) encoder-only and encoder-decoder architectures using RoBERTa and T5 as strong model types, and create for each of them (ii) a monolingual Ancient Greek and a multilingual instance that includes Latin and English. We evaluate all models on morphological and syntactic tasks, including lemmatization, which demonstrates the added value of T5's decoding abilities. We further define two probing tasks to investigate the knowledge acquired by models pre-trained on Classical texts. Our experiments provide the first benchmarking analysis of existing models of Ancient Greek. Results show that our models provide significant improvements over the SoTA. The systematic analysis of model types can inform future research in designing language models for Classical languages, including the development of novel generative tasks. We make all our models available as community resources, along with a large curated pre-training corpus for Ancient Greek, to support the creation of a larger, comparable model zoo for Classical Philology. Our models and resources are available at https://github.com/Heidelberg-NLP/ancient-language-models.
△ Less
Submitted 23 May, 2023;
originally announced May 2023.
-
Similarity-weighted Construction of Contextualized Commonsense Knowledge Graphs for Knowledge-intense Argumentation Tasks
Authors:
Moritz Plenz,
Juri Opitz,
Philipp Heinisch,
Philipp Cimiano,
Anette Frank
Abstract:
Arguments often do not make explicit how a conclusion follows from its premises. To compensate for this lack, we enrich arguments with structured background knowledge to support knowledge-intense argumentation tasks. We present a new unsupervised method for constructing Contextualized Commonsense Knowledge Graphs (CCKGs) that selects contextually relevant knowledge from large knowledge graphs (KGs…
▽ More
Arguments often do not make explicit how a conclusion follows from its premises. To compensate for this lack, we enrich arguments with structured background knowledge to support knowledge-intense argumentation tasks. We present a new unsupervised method for constructing Contextualized Commonsense Knowledge Graphs (CCKGs) that selects contextually relevant knowledge from large knowledge graphs (KGs) efficiently and at high quality. Our work goes beyond context-insensitive knowledge extraction heuristics by computing semantic similarity between KG triplets and textual arguments. Using these triplet similarities as weights, we extract contextualized knowledge paths that connect a conclusion to its premise, while maximizing similarity to the argument. We combine multiple paths into a CCKG that we optionally prune to reduce noise and raise precision. Intrinsic evaluation of the quality of our graphs shows that our method is effective for (re)constructing human explanation graphs. Manual evaluations in a large-scale knowledge selection setup confirm high recall and precision of implicit CSK in the CCKGs. Finally, we demonstrate the effectiveness of CCKGs in a knowledge-insensitive argument quality rating task, outperforming strong baselines and rivaling a GPT-3 based system.
△ Less
Submitted 15 May, 2023;
originally announced May 2023.
-
Semantic Information in a model of Resource Gathering Agents
Authors:
Damian R Sowinski,
Jonathan Carroll-Nellenback,
Robert N Markwick,
Jordi Piñero,
Marcelo Gleiser,
Artemy Kolchinsky,
Gourab Ghoshal,
Adam Frank
Abstract:
We explore the application of a new theory of Semantic Information to the well-motivated problem of a resource foraging agent. Semantic information is defined as the subset of correlations, measured via the transfer entropy, between agent $A$ and environment $E$ that is necessary for the agent to maintain its viability $V$. Viability, in turn, is endogenously defined as opposed to the use of exoge…
▽ More
We explore the application of a new theory of Semantic Information to the well-motivated problem of a resource foraging agent. Semantic information is defined as the subset of correlations, measured via the transfer entropy, between agent $A$ and environment $E$ that is necessary for the agent to maintain its viability $V$. Viability, in turn, is endogenously defined as opposed to the use of exogenous quantities like utility functions. In our model, the forager's movements are determined by its ability to measure, via a sensor, the presence of an individual unit of resource, while the viability function is its expected lifetime. Through counterfactual interventions -- scrambling the correlations between agent and environment via noising the sensor -- we demonstrate the presence of a critical value of the noise parameter, $η_c$, above which the forager's expected lifetime is dramatically reduced. On the other hand, for $η< η_c$ there is little-to-no effect on its ability to survive. We refer to this boundary as the semantic threshold, quantifying the subset of agent-environment correlations that the agent actually needs to maintain its desired state of staying alive. Each bit of information affects the agent's ability to persist both above and below the semantic threshold. Modeling the viability curve and its semantic threshold via forager/environment parameters, we show how the correlations are instantiated. Our work provides a useful model for studies of established agents in terms of semantic information. It also shows that such semantic thresholds may prove useful for understanding the role information plays in allowing systems to become autonomous agents.
△ Less
Submitted 17 October, 2023; v1 submitted 6 April, 2023;
originally announced April 2023.
-
MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks
Authors:
Letitia Parcalabescu,
Anette Frank
Abstract:
Vision and language models (VL) are known to exploit unrobust indicators in individual modalities (e.g., introduced by distributional biases) instead of focusing on relevant information in each modality. That a unimodal model achieves similar accuracy on a VL task to a multimodal one, indicates that so-called unimodal collapse occurred. However, accuracy-based tests fail to detect e.g., when the m…
▽ More
Vision and language models (VL) are known to exploit unrobust indicators in individual modalities (e.g., introduced by distributional biases) instead of focusing on relevant information in each modality. That a unimodal model achieves similar accuracy on a VL task to a multimodal one, indicates that so-called unimodal collapse occurred. However, accuracy-based tests fail to detect e.g., when the model prediction is wrong, while the model used relevant information from a modality. Instead, we propose MM-SHAP, a performance-agnostic multimodality score based on Shapley values that reliably quantifies in which proportions a multimodal model uses individual modalities. We apply MM-SHAP in two ways: (1) to compare models for their average degree of multimodality, and (2) to measure for individual models the contribution of individual modalities for different tasks and datasets. Experiments with six VL models -- LXMERT, CLIP and four ALBEF variants -- on four VL tasks highlight that unimodal collapse can occur to different degrees and in different directions, contradicting the wide-spread assumption that unimodal collapse is one-sided. Based on our results, we recommend MM-SHAP for analysing multimodal tasks, to diagnose and guide progress towards multimodal integration. Code available at \url{https://github.com/Heidelberg-NLP/MM-SHAP}.
△ Less
Submitted 23 May, 2023; v1 submitted 15 December, 2022;
originally announced December 2022.
-
What Pronouns for Pepper? A Critical Review of Gender/ing in Research
Authors:
Katie Seaborn,
Alexa Frank
Abstract:
Gender/ing guides how we view ourselves, the world around us, and each other--including non-humans. Critical voices have raised the alarm about stereotyped gendering in the design of socially embodied artificial agents like voice assistants, conversational agents, and robots. Yet, little is known about how this plays out in research and to what extent. As a first step, we critically reviewed the c…
▽ More
Gender/ing guides how we view ourselves, the world around us, and each other--including non-humans. Critical voices have raised the alarm about stereotyped gendering in the design of socially embodied artificial agents like voice assistants, conversational agents, and robots. Yet, little is known about how this plays out in research and to what extent. As a first step, we critically reviewed the case of Pepper, a gender-ambiguous humanoid robot. We conducted a systematic review (n=75) involving meta-synthesis and content analysis, examining how participants and researchers gendered Pepper through stated and unstated signifiers and pronoun usage. We found that ascriptions of Pepper's gender were inconsistent, limited, and at times discordant, with little evidence of conscious gendering and some indication of researcher influence on participant gendering. We offer six challenges driving the state of affairs and a practical framework coupled with a critical checklist for centering gender in research on artificial agents.
△ Less
Submitted 8 December, 2022;
originally announced December 2022.
-
Better Smatch = Better Parser? AMR evaluation is not so simple anymore
Authors:
Juri Opitz,
Anette Frank
Abstract:
Recently, astonishing advances have been observed in AMR parsing, as measured by the structural Smatch metric. In fact, today's systems achieve performance levels that seem to surpass estimates of human inter annotator agreement (IAA). Therefore, it is unclear how well Smatch (still) relates to human estimates of parse quality, as in this situation potentially fine-grained errors of similar weight…
▽ More
Recently, astonishing advances have been observed in AMR parsing, as measured by the structural Smatch metric. In fact, today's systems achieve performance levels that seem to surpass estimates of human inter annotator agreement (IAA). Therefore, it is unclear how well Smatch (still) relates to human estimates of parse quality, as in this situation potentially fine-grained errors of similar weight may impact the AMR's meaning to different degrees.
We conduct an analysis of two popular and strong AMR parsers that -- according to Smatch -- reach quality levels on par with human IAA, and assess how human quality ratings relate to Smatch and other AMR metrics. Our main findings are: i) While high Smatch scores indicate otherwise, we find that AMR parsing is far from being solved: we frequently find structurally small, but semantically unacceptable errors that substantially distort sentence meaning. ii) Considering high-performance parsers, better Smatch scores may not necessarily indicate consistently better parsing quality. To obtain a meaningful and comprehensive assessment of quality differences of parse(r)s, we recommend augmenting evaluations with macro statistics, use of additional metrics, and more human analysis.
△ Less
Submitted 12 October, 2022;
originally announced October 2022.
-
Automatic differentiation and the optimization of differential equation models in biology
Authors:
Steven A. Frank
Abstract:
A computational revolution unleashed the power of artificial neural networks. At the heart of that revolution is automatic differentiation, which calculates the derivative of a performance measure relative to a large number of parameters. Differentiation enhances the discovery of improved performance in large models, an achievement that was previously difficult or impossible. Recently, a second co…
▽ More
A computational revolution unleashed the power of artificial neural networks. At the heart of that revolution is automatic differentiation, which calculates the derivative of a performance measure relative to a large number of parameters. Differentiation enhances the discovery of improved performance in large models, an achievement that was previously difficult or impossible. Recently, a second computational advance optimizes the temporal trajectories traced by differential equations. Optimization requires differentiating a measure of performance over a trajectory, such as the closeness of tracking the environment, with respect to the parameters of the differential equations. Because model trajectories are usually calculated numerically by multistep algorithms, such as Runge-Kutta, the automatic differentiation must be passed through the numerical algorithm. This article explains how such automatic differentiation of trajectories is achieved. It also discusses why such computational breakthroughs are likely to advance theoretical and statistical studies of biological problems, in which one can consider variables as dynamic paths over time and space. Many common problems arise between improving success in computational learning models over performance landscapes, improving evolutionary fitness over adaptive landscapes, and improving statistical fits to data over information landscapes.
△ Less
Submitted 11 October, 2022; v1 submitted 10 July, 2022;
originally announced July 2022.
-
SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features
Authors:
Juri Opitz,
Anette Frank
Abstract:
Models based on large-pretrained language models, such as S(entence)BERT, provide effective and efficient sentence embeddings that show high correlation to human similarity ratings, but lack interpretability. On the other hand, graph metrics for graph-based meaning representations (e.g., Abstract Meaning Representation, AMR) can make explicit the semantic aspects in which two sentences are similar…
▽ More
Models based on large-pretrained language models, such as S(entence)BERT, provide effective and efficient sentence embeddings that show high correlation to human similarity ratings, but lack interpretability. On the other hand, graph metrics for graph-based meaning representations (e.g., Abstract Meaning Representation, AMR) can make explicit the semantic aspects in which two sentences are similar. However, such metrics tend to be slow, rely on parsers, and do not reach state-of-the-art performance when rating sentence similarity.
In this work, we aim at the best of both worlds, by learning to induce $S$emantically $S$tructured $S$entence BERT embeddings (S$^3$BERT). Our S$^3$BERT embeddings are composed of explainable sub-embeddings that emphasize various semantic sentence features (e.g., semantic roles, negation, or quantification). We show how to i) learn a decomposition of the sentence embeddings into semantic features, through approximation of a suite of interpretable AMR graph metrics, and how to ii) preserve the overall power of the neural embeddings by controlling the decomposition learning process with a second objective that enforces consistency with the similarity ratings of an SBERT teacher model. In our experimental studies, we show that our approach offers interpretability -- while fully preserving the effectiveness and efficiency of the neural sentence embeddings.
△ Less
Submitted 28 October, 2022; v1 submitted 14 June, 2022;
originally announced June 2022.
-
A Dynamic, Interpreted CheckList for Meaning-oriented NLG Metric Evaluation -- through the Lens of Semantic Similarity Rating
Authors:
Laura Zeidler,
Juri Opitz,
Anette Frank
Abstract:
Evaluating the quality of generated text is difficult, since traditional NLG evaluation metrics, focusing more on surface form than meaning, often fail to assign appropriate scores. This is especially problematic for AMR-to-text evaluation, given the abstract nature of AMR. Our work aims to support the development and improvement of NLG evaluation metrics that focus on meaning, by develo** a dyn…
▽ More
Evaluating the quality of generated text is difficult, since traditional NLG evaluation metrics, focusing more on surface form than meaning, often fail to assign appropriate scores. This is especially problematic for AMR-to-text evaluation, given the abstract nature of AMR. Our work aims to support the development and improvement of NLG evaluation metrics that focus on meaning, by develo** a dynamic CheckList for NLG metrics that is interpreted by being organized around meaning-relevant linguistic phenomena. Each test instance consists of a pair of sentences with their AMR graphs and a human-produced textual semantic similarity or relatedness score. Our CheckList facilitates comparative evaluation of metrics and reveals strengths and weaknesses of novel and traditional metrics. We demonstrate the usefulness of CheckList by designing a new metric GraCo that computes lexical cohesion graphs over AMR concepts. Our analysis suggests that GraCo presents an interesting NLG metric worth future investigation and that meaning-oriented NLG metrics can profit from graph-based metric components using AMR.
△ Less
Submitted 24 May, 2022;
originally announced May 2022.
-
Optimizing differential equations to fit data and predict outcomes
Authors:
Steven A. Frank
Abstract:
Many scientific problems focus on observed patterns of change or on how to design a system to achieve particular dynamics. Those problems often require fitting differential equation models to target trajectories. Fitting such models can be difficult because each evaluation of the fit must calculate the distance between the model and target patterns at numerous points along a trajectory. The gradie…
▽ More
Many scientific problems focus on observed patterns of change or on how to design a system to achieve particular dynamics. Those problems often require fitting differential equation models to target trajectories. Fitting such models can be difficult because each evaluation of the fit must calculate the distance between the model and target patterns at numerous points along a trajectory. The gradient of the fit with respect to the model parameters can be challenging. Recent technical advances in automatic differentiation through numerical differential equation solvers potentially change the fitting process into a relatively easy problem, opening up new possibilities to study dynamics. However, application of the new tools to real data may fail to achieve a good fit. This article illustrates how to overcome a variety of common challenges, using the classic ecological data for oscillations in hare and lynx populations. Models include simple ordinary differential equations (ODEs) and neural ordinary differential equations (NODEs), which use artificial neural networks to estimate the derivatives of differential equation systems. Comparing the fits obtained with ODEs versus NODEs, representing small and large parameter spaces, and changing the number of variable dimensions provide insight into the geometry of the observed and model trajectories. To analyze the quality of the models for predicting future observations, a Bayesian-inspired preconditioned stochastic gradient Langevin dynamics (pSGLD) calculation of the posterior distribution of predicted model trajectories clarifies the tendency for various models to underfit or overfit the data. Coupling fitted differential equation systems with pSGLD sampling provides a powerful way to study the properties of optimization surfaces, raising an analogy with mutation-selection dynamics on fitness landscapes.
△ Less
Submitted 16 April, 2022;
originally announced April 2022.
-
SMARAGD: Learning SMatch for Accurate and Rapid Approximate Graph Distance
Authors:
Juri Opitz,
Philipp Meier,
Anette Frank
Abstract:
The similarity of graph structures, such as Meaning Representations (MRs), is often assessed via structural matching algorithms, such as Smatch (Cai and Knight, 2013). However, Smatch involves a combinatorial problem that suffers from NP-completeness, making large-scale applications, e.g., graph clustering or search, infeasible. To alleviate this issue, we learn SMARAGD: Semantic Match for Accurat…
▽ More
The similarity of graph structures, such as Meaning Representations (MRs), is often assessed via structural matching algorithms, such as Smatch (Cai and Knight, 2013). However, Smatch involves a combinatorial problem that suffers from NP-completeness, making large-scale applications, e.g., graph clustering or search, infeasible. To alleviate this issue, we learn SMARAGD: Semantic Match for Accurate and Rapid Approximate Graph Distance. We show the potential of neural networks to approximate Smatch scores, i) in linear time using a machine translation framework to predict alignments, or ii) in constant time using a Siamese CNN to directly predict Smatch scores. We show that the approximation error can be substantially reduced through data augmentation and graph anonymization.
△ Less
Submitted 1 June, 2023; v1 submitted 24 March, 2022;
originally announced March 2022.
-
The Case for Technosignatures: Why They May Be Abundant, Long-lived, Highly Detectable, and Unambiguous
Authors:
Jason T. Wright,
Jacob Haqq-Misra,
Adam Frank,
Ravi Kopparapu,
Manasvi Lingam,
Sofia Z. Sheikh
Abstract:
The intuition suggested by the Drake equation implies that technology should be less prevalent than biology in the galaxy. However, it has been appreciated for decades in the SETI community that technosignatures could be more abundant, longer-lived, more detectable, and less ambiguous than biosignatures. We collect the arguments for and against technosignatures' ubiquity and discuss the implicatio…
▽ More
The intuition suggested by the Drake equation implies that technology should be less prevalent than biology in the galaxy. However, it has been appreciated for decades in the SETI community that technosignatures could be more abundant, longer-lived, more detectable, and less ambiguous than biosignatures. We collect the arguments for and against technosignatures' ubiquity and discuss the implications of some properties of technological life that fundamentally differ from nontechnological life in the context of modern astrobiology: It can spread among the stars to many sites, it can be more easily detected at large distances, and it can produce signs that are unambiguously technological. As an illustration in terms of the Drake equation, we consider two Drake-like equations, for technosignatures (calculating N(tech)) and biosignatures (calculating N(bio)). We argue that Earth and humanity may be poor guides to the longevity term L and that its maximum value could be very large, in that technology can outlive its creators and even its host star. We conclude that while the Drake equation implies that N(bio)>>N(tech), it is also plausible that N(tech)>>N(bio). As a consequence, as we seek possible indicators of extraterrestrial life, for instance, via characterization of the atmospheres of habitable exoplanets, we should search for both biosignatures and technosignatures. This exercise also illustrates ways in which biosignature and technosignature searches can complement and supplement each other and how methods of technosignature search, including old ideas from SETI, can inform the search for biosignatures and life generally.
△ Less
Submitted 21 March, 2022;
originally announced March 2022.
-
Active Learning Over Multiple Domains in Natural Language Tasks
Authors:
Shayne Longpre,
Julia Reisler,
Edward Greg Huang,
Yi Lu,
Andrew Frank,
Nikhil Ramesh,
Chris DuBois
Abstract:
Studies of active learning traditionally assume the target and source data stem from a single domain. However, in realistic applications, practitioners often require active learning with multiple sources of out-of-distribution data, where it is unclear a priori which data sources will help or hurt the target domain. We survey a wide variety of techniques in active learning (AL), domain shift detec…
▽ More
Studies of active learning traditionally assume the target and source data stem from a single domain. However, in realistic applications, practitioners often require active learning with multiple sources of out-of-distribution data, where it is unclear a priori which data sources will help or hurt the target domain. We survey a wide variety of techniques in active learning (AL), domain shift detection (DS), and multi-domain sampling to examine this challenging setting for question answering and sentiment analysis. We ask (1) what family of methods are effective for this task? And, (2) what properties of selected examples and domains achieve strong results? Among 18 acquisition functions from 4 families of methods, we find H-Divergence methods, and particularly our proposed variant DAL-E, yield effective results, averaging 2-3% improvements over the random baseline. We also show the importance of a diverse allocation of domains, as well as room-for-improvement of existing methods on both domain and example selection. Our findings yield the first comprehensive analysis of both existing and novel methods for practitioners faced with multi-domain active learning for natural language tasks.
△ Less
Submitted 8 February, 2022; v1 submitted 1 February, 2022;
originally announced February 2022.
-
Experiences with managing data parallel computational workflows for High-throughput Fragment Molecular Orbital (FMO) Calculations
Authors:
Dimuthu Wannipurage,
Indrajit Deb,
Eroma Abeysinghe,
Sudhakar Pamidighantam,
Suresh Marru,
Marlon Pierce,
Aaron T. Frank
Abstract:
Fragment Molecular Orbital (FMO) calculations provide a framework to speed up quantum mechanical calculations and so can be used to explore structure-energy relationships in large and complex biomolecular systems. These calculations are still onerous, especially when applied to large sets of molecules. Therefore, cyberinfrastructure that provides mechanisms and user interfaces that manage job subm…
▽ More
Fragment Molecular Orbital (FMO) calculations provide a framework to speed up quantum mechanical calculations and so can be used to explore structure-energy relationships in large and complex biomolecular systems. These calculations are still onerous, especially when applied to large sets of molecules. Therefore, cyberinfrastructure that provides mechanisms and user interfaces that manage job submissions, failed job resubmissions, data retrieval, and data storage for these calculations are needed. Motivated by the need to rapidly identify drugs that are likely to bind to targets implicated in SARS-CoV-2, the virus that causes COVID-19, we developed a static parameter swee** framework with Apache Airavata middleware to apply to complexes formed between SARS-CoV-2 M-pro (the main protease in SARS-CoV-2) and 2820 small-molecules in a drug-repurposing library. Here we describe the implementation of our framework for managing the executions of the high-throughput FMO calculations. The approach is general and so should find utility in large-scale FMO calculations on biomolecular systems.
△ Less
Submitted 28 January, 2022;
originally announced January 2022.
-
Consensus between Epistemic Agents is Difficult
Authors:
Damian R. Sowinski,
Jonathan Carroll-Nellenback,
Jeremy M. DeSilva,
Adam Frank,
Gourab Ghoshal,
Marcelo Gleiser,
Hari Seldon
Abstract:
We introduce an epistemic information measure between two data streams, that we term $influence$. Closely related to transfer entropy, the measure must be estimated by epistemic agents with finite memory resources via sampling accessible data streams. We show that even under ideal conditions, epistemic agents using slightly different sampling strategies might not achieve consensus in their conclus…
▽ More
We introduce an epistemic information measure between two data streams, that we term $influence$. Closely related to transfer entropy, the measure must be estimated by epistemic agents with finite memory resources via sampling accessible data streams. We show that even under ideal conditions, epistemic agents using slightly different sampling strategies might not achieve consensus in their conclusions about which data stream is influencing which. As an illustration, we examine a real world data stream where different sampling strategies result in contradictory conclusions, explaining why some politically charged topics might exist due to purely epistemic reasons irrespective of the actual ontology of the world.
△ Less
Submitted 12 January, 2022;
originally announced January 2022.
-
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
Authors:
Letitia Parcalabescu,
Michele Cafagna,
Lilitta Muradjan,
Anette Frank,
Iacer Calixto,
Albert Gatt
Abstract:
We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modali…
▽ More
We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.
△ Less
Submitted 14 March, 2022; v1 submitted 14 December, 2021;
originally announced December 2021.
-
MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning
Authors:
Constantin Eichenberg,
Sidney Black,
Samuel Weinbach,
Letitia Parcalabescu,
Anette Frank
Abstract:
Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA - a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen, we train a series of…
▽ More
Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA - a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen, we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. The pretraining is entirely end-to-end using a single language modeling objective, simplifying optimization compared to previous approaches. Importantly, the language model weights remain unchanged during training, allowing for transfer of encyclopedic knowledge and in-context learning abilities from language pretraining. MAGMA outperforms Frozen on open-ended generative tasks, achieving state of the art results on the OKVQA benchmark and competitive results on a range of other popular VL benchmarks, while pretraining on 0.2% of the number of samples used to train SimVLM.
△ Less
Submitted 24 October, 2022; v1 submitted 9 December, 2021;
originally announced December 2021.
-
Weisfeiler-Leman in the BAMBOO: Novel AMR Graph Metrics and a Benchmark for AMR Graph Similarity
Authors:
Juri Opitz,
Angel Daza,
Anette Frank
Abstract:
Several metrics have been proposed for assessing the similarity of (abstract) meaning representations (AMRs), but little is known about how they relate to human similarity ratings. Moreover, the current metrics have complementary strengths and weaknesses: some emphasize speed, while others make the alignment of graph structures explicit, at the price of a costly alignment step.
In this work we p…
▽ More
Several metrics have been proposed for assessing the similarity of (abstract) meaning representations (AMRs), but little is known about how they relate to human similarity ratings. Moreover, the current metrics have complementary strengths and weaknesses: some emphasize speed, while others make the alignment of graph structures explicit, at the price of a costly alignment step.
In this work we propose new Weisfeiler-Leman AMR similarity metrics that unify the strengths of previous metrics, while mitigating their weaknesses. Specifically, our new metrics are able to match contextualized substructures and induce n:m alignments between their nodes. Furthermore, we introduce a Benchmark for AMR Metrics based on Overt Objectives (BAMBOO), the first benchmark to support empirical assessment of graph-based MR similarity metrics. BAMBOO maximizes the interpretability of results by defining multiple overt objectives that range from sentence similarity objectives to stress tests that probe a metric's robustness against meaning-altering and meaning-preserving graph transformations. We show the benefits of BAMBOO by profiling previous metrics and our own metrics. Results indicate that our novel metrics may serve as a strong baseline for future work.
△ Less
Submitted 26 August, 2021;
originally announced August 2021.
-
Translate, then Parse! A strong baseline for Cross-Lingual AMR Parsing
Authors:
Sarah Uhrig,
Yoalli Rezepka Garcia,
Juri Opitz,
Anette Frank
Abstract:
In cross-lingual Abstract Meaning Representation (AMR) parsing, researchers develop models that project sentences from various languages onto their AMRs to capture their essential semantic structures: given a sentence in any language, we aim to capture its core semantic content through concepts connected by manifold types of semantic relations. Methods typically leverage large silver training data…
▽ More
In cross-lingual Abstract Meaning Representation (AMR) parsing, researchers develop models that project sentences from various languages onto their AMRs to capture their essential semantic structures: given a sentence in any language, we aim to capture its core semantic content through concepts connected by manifold types of semantic relations. Methods typically leverage large silver training data to learn a single model that is able to project non-English sentences to AMRs. However, we find that a simple baseline tends to be over-looked: translating the sentences to English and projecting their AMR with a monolingual AMR parser (translate+parse,T+P). In this paper, we revisit this simple two-step base-line, and enhance it with a strong NMT system and a strong AMR parser. Our experiments show that T+P outperforms a recent state-of-the-art system across all tested languages: German, Italian, Spanish and Mandarin with +14.6, +12.6, +14.3 and +16.0 Smatch points.
△ Less
Submitted 8 June, 2021;
originally announced June 2021.
-
Generating Hypothetical Events for Abductive Inference
Authors:
Debjit Paul,
Anette Frank
Abstract:
Abductive reasoning starts from some observations and aims at finding the most plausible explanation for these observations. To perform abduction, humans often make use of temporal and causal inferences, and knowledge about how some hypothetical situation can result in different outcomes. This work offers the first study of how such knowledge impacts the Abductive NLI task -- which consists in cho…
▽ More
Abductive reasoning starts from some observations and aims at finding the most plausible explanation for these observations. To perform abduction, humans often make use of temporal and causal inferences, and knowledge about how some hypothetical situation can result in different outcomes. This work offers the first study of how such knowledge impacts the Abductive NLI task -- which consists in choosing the more likely explanation for given observations. We train a specialized language model LMI that is tasked to generate what could happen next from a hypothetical scenario that evolves from a given event. We then propose a multi-task model MTL to solve the Abductive NLI task, which predicts a plausible explanation by a) considering different possible events emerging from candidate hypotheses -- events generated by LMI -- and b) selecting the one that is most similar to the observed outcome. We show that our MTL model improves over prior vanilla pre-trained LMs fine-tuned on Abductive NLI. Our manual evaluation and analysis suggest that learning about possible next events from different hypothetical scenarios supports abductive inference.
△ Less
Submitted 7 June, 2021;
originally announced June 2021.
-
COINS: Dynamically Generating COntextualized Inference Rules for Narrative Story Completion
Authors:
Debjit Paul,
Anette Frank
Abstract:
Despite recent successes of large pre-trained language models in solving reasoning tasks, their inference capabilities remain opaque. We posit that such models can be made more interpretable by explicitly generating interim inference rules, and using them to guide the generation of task-specific textual outputs. In this paper we present COINS, a recursive inference framework that i) iteratively re…
▽ More
Despite recent successes of large pre-trained language models in solving reasoning tasks, their inference capabilities remain opaque. We posit that such models can be made more interpretable by explicitly generating interim inference rules, and using them to guide the generation of task-specific textual outputs. In this paper we present COINS, a recursive inference framework that i) iteratively reads context sentences, ii) dynamically generates contextualized inference rules, encodes them, and iii) uses them to guide task-specific output generation. We apply COINS to a Narrative Story Completion task that asks a model to complete a story with missing sentences, to produce a coherent story with plausible logical connections, causal relationships, and temporal dependencies. By modularizing inference and sentence generation steps in a recurrent model, we aim to make reasoning steps and their effects on next sentence generation transparent. Our automatic and manual evaluations show that the model generates better story sentences than SOTA baselines, especially in terms of coherence. We further demonstrate improved performance over strong pre-trained LMs in generating commonsense inference rules. The recursive nature of COINS holds the potential for controlled generation of longer sequences.
△ Less
Submitted 4 June, 2021;
originally announced June 2021.
-
CO-NNECT: A Framework for Revealing Commonsense Knowledge Paths as Explicitations of Implicit Knowledge in Texts
Authors:
Maria Becker,
Katharina Korfhage,
Debjit Paul,
Anette Frank
Abstract:
In this work we leverage commonsense knowledge in form of knowledge paths to establish connections between sentences, as a form of explicitation of implicit knowledge. Such connections can be direct (singlehop paths) or require intermediate concepts (multihop paths). To construct such paths we combine two model types in a joint framework we call Co-nnect: a relation classifier that predicts direct…
▽ More
In this work we leverage commonsense knowledge in form of knowledge paths to establish connections between sentences, as a form of explicitation of implicit knowledge. Such connections can be direct (singlehop paths) or require intermediate concepts (multihop paths). To construct such paths we combine two model types in a joint framework we call Co-nnect: a relation classifier that predicts direct connections between concepts; and a target prediction model that generates target or intermediate concepts given a source concept and a relation, which we use to construct multihop paths. Unlike prior work that relies exclusively on static knowledge sources, we leverage language models finetuned on knowledge stored in ConceptNet, to dynamically generate knowledge paths, as explanations of implicit knowledge that connects sentences in texts. As a central contribution we design manual and automatic evaluation settings for assessing the quality of the generated paths. We conduct evaluations on two argumentative datasets and show that a combination of the two model types generates meaningful, high-quality knowledge paths between sentences that reveal implicit knowledge conveyed in text.
△ Less
Submitted 7 May, 2021;
originally announced May 2021.
-
What is Multimodality?
Authors:
Letitia Parcalabescu,
Nils Trost,
Anette Frank
Abstract:
The last years have shown rapid developments in the field of multimodal machine learning, combining e.g., vision, text or speech. In this position paper we explain how the field uses outdated definitions of multimodality that prove unfit for the machine learning era. We propose a new task-relative definition of (multi)modality in the context of multimodal machine learning that focuses on represent…
▽ More
The last years have shown rapid developments in the field of multimodal machine learning, combining e.g., vision, text or speech. In this position paper we explain how the field uses outdated definitions of multimodality that prove unfit for the machine learning era. We propose a new task-relative definition of (multi)modality in the context of multimodal machine learning that focuses on representations and information that are relevant for a given machine learning task. With our new definition of multimodality we aim to provide a missing foundation for multimodal research, an important component of language grounding and a crucial milestone towards NLU.
△ Less
Submitted 10 June, 2021; v1 submitted 10 March, 2021;
originally announced March 2021.
-
Pivot Through English: Reliably Answering Multilingual Questions without Document Retrieval
Authors:
Ivan Montero,
Shayne Longpre,
Ni Lao,
Andrew J. Frank,
Christopher DuBois
Abstract:
Existing methods for open-retrieval question answering in lower resource languages (LRLs) lag significantly behind English. They not only suffer from the shortcomings of non-English document retrieval, but are reliant on language-specific supervision for either the task or translation. We formulate a task setup more realistic to available resources, that circumvents document retrieval to reliably…
▽ More
Existing methods for open-retrieval question answering in lower resource languages (LRLs) lag significantly behind English. They not only suffer from the shortcomings of non-English document retrieval, but are reliant on language-specific supervision for either the task or translation. We formulate a task setup more realistic to available resources, that circumvents document retrieval to reliably transfer knowledge from English to lower resource languages. Assuming a strong English question answering model or database, we compare and analyze methods that pivot through English: to map foreign queries to English and then English answers back to target language answers. Within this task setup we propose Reranked Multilingual Maximal Inner Product Search (RM-MIPS), akin to semantic similarity retrieval over the English training set with reranking, which outperforms the strongest baselines by 2.7% on XQuAD and 6.2% on MKQA. Analysis demonstrates the particular efficacy of this strategy over state-of-the-art alternatives in challenging settings: low-resource languages, with extensive distractor data and query distribution misalignment. Circumventing retrieval, our analysis shows this approach offers rapid answer generation to almost any language off-the-shelf, without the need for any additional training data in the target language.
△ Less
Submitted 15 July, 2021; v1 submitted 27 December, 2020;
originally announced December 2020.
-
Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks
Authors:
Letitia Parcalabescu,
Albert Gatt,
Anette Frank,
Iacer Calixto
Abstract:
We investigate the reasoning ability of pretrained vision and language (V&L) models in two tasks that require multimodal integration: (1) discriminating a correct image-sentence pair from an incorrect one, and (2) counting entities in an image. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that mod…
▽ More
We investigate the reasoning ability of pretrained vision and language (V&L) models in two tasks that require multimodal integration: (1) discriminating a correct image-sentence pair from an incorrect one, and (2) counting entities in an image. We evaluate three pretrained V&L models on these tasks: ViLBERT, ViLBERT 12-in-1 and LXMERT, in zero-shot and finetuned settings. Our results show that models solve task (1) very well, as expected, since all models are pretrained on task (1). However, none of the pretrained V&L models is able to adequately solve task (2), our counting probe, and they cannot generalise to out-of-distribution quantities. We propose a number of explanations for these findings: LXMERT (and to some extent ViLBERT 12-in-1) show some evidence of catastrophic forgetting on task (1). Concerning our results on the counting probe, we find evidence that all models are impacted by dataset bias, and also fail to individuate entities in the visual input. While a selling point of pretrained V&L models is their ability to solve complex tasks, our findings suggest that understanding their reasoning and grounding capabilities requires more targeted investigations on specific phenomena.
△ Less
Submitted 17 June, 2021; v1 submitted 22 December, 2020;
originally announced December 2020.
-
The fundamental equations of change in statistical ensembles and biological populations
Authors:
Steven A. Frank,
Frank J. Bruggeman
Abstract:
A recent article in Nature Physics unified key results from thermodynamics, statistics, and information theory. The unification arose from a general equation for the rate of change in the information content of a system. The general equation describes the change in the moments of an observable quantity over a probability distribution. One term in the equation describes the change in the probabilit…
▽ More
A recent article in Nature Physics unified key results from thermodynamics, statistics, and information theory. The unification arose from a general equation for the rate of change in the information content of a system. The general equation describes the change in the moments of an observable quantity over a probability distribution. One term in the equation describes the change in the probability distribution. The other term describes the change in the observable values for a given state. We show the equivalence of this general equation for moment dynamics with the widely known Price equation from evolutionary theory, named after George Price. We introduce the Price equation from its biological roots, review a mathematically abstract form of the equation, and discuss the potential for this equation to unify diverse mathematical theories from different disciplines. The new work in Nature Physics and many applications in biology show that this equation also provides the basis for deriving many novel theoretical results within each discipline.
△ Less
Submitted 27 October, 2020;
originally announced October 2020.
-
Social Commonsense Reasoning with Multi-Head Knowledge Attention
Authors:
Debjit Paul,
Anette Frank
Abstract:
Social Commonsense Reasoning requires understanding of text, knowledge about social events and their pragmatic implications, as well as commonsense reasoning skills. In this work we propose a novel multi-head knowledge attention model that encodes semi-structured commonsense inference rules and learns to incorporate them in a transformer-based reasoning cell. We assess the model's performance on t…
▽ More
Social Commonsense Reasoning requires understanding of text, knowledge about social events and their pragmatic implications, as well as commonsense reasoning skills. In this work we propose a novel multi-head knowledge attention model that encodes semi-structured commonsense inference rules and learns to incorporate them in a transformer-based reasoning cell. We assess the model's performance on two tasks that require different reasoning skills: Abductive Natural Language Inference and Counterfactual Invariance Prediction as a new task. We show that our proposed model improves performance over strong state-of-the-art models (i.e., RoBERTa) across both reasoning tasks. Notably we are, to the best of our knowledge, the first to demonstrate that a model that learns to perform counterfactual reasoning helps predicting the best explanation in an abductive reasoning task. We validate the robustness of the model's reasoning capabilities by perturbing the knowledge and provide qualitative analysis on the model's knowledge incorporation capabilities.
△ Less
Submitted 12 October, 2020;
originally announced October 2020.
-
X-SRL: A Parallel Cross-Lingual Semantic Role Labeling Dataset
Authors:
Angel Daza,
Anette Frank
Abstract:
Even though SRL is researched for many languages, major improvements have mostly been obtained for English, for which more resources are available. In fact, existing multilingual SRL datasets contain disparate annotation styles or come from different domains, hampering generalization in multilingual learning. In this work, we propose a method to automatically construct an SRL corpus that is parall…
▽ More
Even though SRL is researched for many languages, major improvements have mostly been obtained for English, for which more resources are available. In fact, existing multilingual SRL datasets contain disparate annotation styles or come from different domains, hampering generalization in multilingual learning. In this work, we propose a method to automatically construct an SRL corpus that is parallel in four languages: English, French, German, Spanish, with unified predicate and role annotations that are fully comparable across languages. We apply high-quality machine translation to the English CoNLL-09 dataset and use multilingual BERT to project its high-quality annotations to the target languages. We include human-validated test sets that we use to measure the projection quality, and show that projection is denser and more precise than a strong baseline. Finally, we train different SOTA models on our novel corpus for mono- and multilingual SRL, showing that the multilingual annotations improve performance especially for the weaker languages.
△ Less
Submitted 5 October, 2020;
originally announced October 2020.
-
Towards a Decomposable Metric for Explainable Evaluation of Text Generation from AMR
Authors:
Juri Opitz,
Anette Frank
Abstract:
Systems that generate natural language text from abstract meaning representations such as AMR are typically evaluated using automatic surface matching metrics that compare the generated texts to reference texts from which the input meaning representations were constructed. We show that besides well-known issues from which such metrics suffer, an additional problem arises when applying these metric…
▽ More
Systems that generate natural language text from abstract meaning representations such as AMR are typically evaluated using automatic surface matching metrics that compare the generated texts to reference texts from which the input meaning representations were constructed. We show that besides well-known issues from which such metrics suffer, an additional problem arises when applying these metrics for AMR-to-text evaluation, since an abstract meaning representation allows for numerous surface realizations. In this work we aim to alleviate these issues by proposing $\mathcal{M}\mathcal{F}_β$, a decomposable metric that builds on two pillars. The first is the principle of meaning preservation $\mathcal{M}$: it measures to what extent a given AMR can be reconstructed from the generated sentence using SOTA AMR parsers and applying (fine-grained) AMR evaluation metrics to measure the distance between the original and the reconstructed AMR. The second pillar builds on a principle of (grammatical) form $\mathcal{F}$ that measures the linguistic quality of the generated text, which we implement using SOTA language models. In two extensive pilot studies we show that fulfillment of both principles offers benefits for AMR-to-text evaluation, including explainability of scores. Since $\mathcal{M}\mathcal{F}_β$ does not necessarily rely on gold AMRs, it may extend to other text generation tasks.
△ Less
Submitted 26 January, 2021; v1 submitted 20 August, 2020;
originally announced August 2020.
-
A Neural Network Looks at Leonardo's(?) Salvator Mundi
Authors:
Steven J. Frank,
Andrea M. Frank
Abstract:
We use convolutional neural networks (CNNs) to analyze authorship questions surrounding the works of Leonardo da Vinci -- in particular, Salvator Mundi, the world's most expensive painting and among the most controversial. Trained on the works of an artist under study and visually comparable works of other artists, our system can identify likely forgeries and shed light on attribution controversie…
▽ More
We use convolutional neural networks (CNNs) to analyze authorship questions surrounding the works of Leonardo da Vinci -- in particular, Salvator Mundi, the world's most expensive painting and among the most controversial. Trained on the works of an artist under study and visually comparable works of other artists, our system can identify likely forgeries and shed light on attribution controversies. Leonardo's few extant paintings test the limits of our system and require corroborative techniques of testing and analysis.
△ Less
Submitted 21 May, 2020;
originally announced May 2020.
-
Asteroid: the PyTorch-based audio source separation toolkit for researchers
Authors:
Manuel Pariente,
Samuele Cornell,
Joris Cosentino,
Sunit Sivasankaran,
Efthymios Tzinis,
Jens Heitkaemper,
Michel Olvera,
Fabian-Robert Stöter,
Mathieu Hu,
Juan M. Martín-Doñas,
David Ditter,
Ariel Frank,
Antoine Deleforge,
Emmanuel Vincent
Abstract:
This paper describes Asteroid, the PyTorch-based audio source separation toolkit for researchers. Inspired by the most successful neural source separation systems, it provides all neural building blocks required to build such a system. To improve reproducibility, Kaldi-style recipes on common audio source separation datasets are also provided. This paper describes the software architecture of Aste…
▽ More
This paper describes Asteroid, the PyTorch-based audio source separation toolkit for researchers. Inspired by the most successful neural source separation systems, it provides all neural building blocks required to build such a system. To improve reproducibility, Kaldi-style recipes on common audio source separation datasets are also provided. This paper describes the software architecture of Asteroid and its most important features. By showing experimental results obtained with Asteroid's recipes, we show that our implementations are at least on par with most results reported in reference papers. The toolkit is publicly available at https://github.com/mpariente/asteroid .
△ Less
Submitted 8 May, 2020;
originally announced May 2020.
-
Analysis of Dutch Master Paintings with Convolutional Neural Networks
Authors:
Steven J. Frank,
Andrea M. Frank
Abstract:
Trained on the works of an artist under study and visually comparable works of other artists, convolutional neural networks can identify forgeries and provide attributions. They can also assign classification probabilities within a painting, revealing mixed authorship and identifying regions painted by different hands.
Trained on the works of an artist under study and visually comparable works of other artists, convolutional neural networks can identify forgeries and provide attributions. They can also assign classification probabilities within a painting, revealing mixed authorship and identifying regions painted by different hands.
△ Less
Submitted 16 August, 2020; v1 submitted 12 February, 2020;
originally announced February 2020.
-
AMR Similarity Metrics from Principles
Authors:
Juri Opitz,
Letitia Parcalabescu,
Anette Frank
Abstract:
Different metrics have been proposed to compare Abstract Meaning Representation (AMR) graphs. The canonical Smatch metric (Cai and Knight, 2013) aligns the variables of two graphs and assesses triple matches. The recent SemBleu metric (Song and Gildea, 2019) is based on the machine-translation metric Bleu (Papineni et al., 2002) and increases computational efficiency by ablating the variable-align…
▽ More
Different metrics have been proposed to compare Abstract Meaning Representation (AMR) graphs. The canonical Smatch metric (Cai and Knight, 2013) aligns the variables of two graphs and assesses triple matches. The recent SemBleu metric (Song and Gildea, 2019) is based on the machine-translation metric Bleu (Papineni et al., 2002) and increases computational efficiency by ablating the variable-alignment.
In this paper, i) we establish criteria that enable researchers to perform a principled assessment of metrics comparing meaning representations like AMR; ii) we undertake a thorough analysis of Smatch and SemBleu where we show that the latter exhibits some undesirable properties. For example, it does not conform to the identity of indiscernibles rule and introduces biases that are hard to control; iii) we propose a novel metric S$^2$match that is more benevolent to only very slight meaning deviations and targets the fulfilment of all established criteria. We assess its suitability and show its advantages over Smatch and SemBleu.
△ Less
Submitted 17 September, 2020; v1 submitted 29 January, 2020;
originally announced January 2020.
-
Socially intelligent task and motion planning for human-robot interaction
Authors:
Andrea Frank,
Laurel Riek
Abstract:
As social beings, much human behavior is predicated on social context - the ambient social state that includes cultural norms, social signals, individual preferences, etc. In this paper, we propose a socially-aware task and motion planning algorithm that considers social context to generate appropriate and effective plans in human social environments (HSEs). The key strength of our proposed approa…
▽ More
As social beings, much human behavior is predicated on social context - the ambient social state that includes cultural norms, social signals, individual preferences, etc. In this paper, we propose a socially-aware task and motion planning algorithm that considers social context to generate appropriate and effective plans in human social environments (HSEs). The key strength of our proposed approach is that it explicitly models how potential actions not only affect objective cost, but also transform the social context in which it plans and acts. We investigate strategies to limit the complexity of our algorithm, so that our planner will remain tractable for mobile platforms in complex HSEs like hospitals and factories. The planner will also consider the relative importance and urgency of its tasks, which it uses to determine when it is and is not appropriate to violate social expectations to achieve its objective. This social awareness will allow robots to understand a fundamental rule of society: just because something makes your job easier, does not make it the right thing to do!
To our knowledge, the proposed work is the first task and motion planning approach that supports socially intelligent robot policy for HSEs. Through this ongoing work, robots will be able to understand, respect, and leverage social context accomplish tasks both acceptably and effectively in HSEs.
△ Less
Submitted 23 January, 2020;
originally announced January 2020.
-
Implicit Knowledge in Argumentative Texts: An Annotated Corpus
Authors:
Maria Becker,
Katharina Korfhage,
Anette Frank
Abstract:
When speaking or writing, people omit information that seems clear and evident, such that only part of the message is expressed in words. Especially in argumentative texts it is very common that (important) parts of the argument are implied and omitted. We hypothesize that for argument analysis it will be beneficial to reconstruct this implied information. As a starting point for filling such know…
▽ More
When speaking or writing, people omit information that seems clear and evident, such that only part of the message is expressed in words. Especially in argumentative texts it is very common that (important) parts of the argument are implied and omitted. We hypothesize that for argument analysis it will be beneficial to reconstruct this implied information. As a starting point for filling such knowledge gaps, we build a corpus consisting of high-quality human annotations of missing and implied information in argumentative texts. To learn more about the characteristics of both the argumentative texts and the added information, we further annotate the data with semantic clause types and commonsense knowledge relations. The outcome of our work is a carefully de-signed and richly annotated dataset, for which we then provide an in-depth analysis by investigating characteristic distributions and correlations of the assigned labels. We reveal interesting patterns and intersections between the annotation categories and properties of our dataset, which enable insights into the characteristics of both argumentative texts and implicit knowledge in terms of structural features and semantic information. The results of our analysis can help to assist automated argument analysis and can guide the process of revealing implicit information in argumentative texts automatically.
△ Less
Submitted 4 December, 2019;
originally announced December 2019.
-
Translate and Label! An Encoder-Decoder Approach for Cross-lingual Semantic Role Labeling
Authors:
Angel Daza,
Anette Frank
Abstract:
We propose a Cross-lingual Encoder-Decoder model that simultaneously translates and generates sentences with Semantic Role Labeling annotations in a resource-poor target language. Unlike annotation projection techniques, our model does not need parallel data during inference time. Our approach can be applied in monolingual, multilingual and cross-lingual settings and is able to produce dependency-…
▽ More
We propose a Cross-lingual Encoder-Decoder model that simultaneously translates and generates sentences with Semantic Role Labeling annotations in a resource-poor target language. Unlike annotation projection techniques, our model does not need parallel data during inference time. Our approach can be applied in monolingual, multilingual and cross-lingual settings and is able to produce dependency-based and span-based SRL annotations. We benchmark the labeling performance of our model in different monolingual and multilingual settings using well-known SRL datasets. We then train our model in a cross-lingual setting to generate new SRL labeled data. Finally, we measure the effectiveness of our method by using the generated data to augment the training basis for resource-poor languages and perform manual evaluation to show that it produces high-quality sentences and assigns accurate semantic role annotations. Our proposed architecture offers a flexible method for leveraging SRL data in multiple languages.
△ Less
Submitted 29 August, 2019;
originally announced August 2019.
-
Discourse-Aware Semantic Self-Attention for Narrative Reading Comprehension
Authors:
Todor Mihaylov,
Anette Frank
Abstract:
In this work, we propose to use linguistic annotations as a basis for a \textit{Discourse-Aware Semantic Self-Attention} encoder that we employ for reading comprehension on long narrative texts. We extract relations between discourse units, events and their arguments as well as coreferring mentions, using available annotation tools. Our empirical evaluation shows that the investigated structures i…
▽ More
In this work, we propose to use linguistic annotations as a basis for a \textit{Discourse-Aware Semantic Self-Attention} encoder that we employ for reading comprehension on long narrative texts. We extract relations between discourse units, events and their arguments as well as coreferring mentions, using available annotation tools. Our empirical evaluation shows that the investigated structures improve the overall performance, especially intra-sentential and cross-sentential discourse relations, sentence-internal semantic role relations, and long-distance coreference relations. We show that dedicating self-attention heads to intra-sentential relations and relations connecting neighboring sentences is beneficial for finding answers to questions in longer contexts. Our findings encourage the use of discourse-semantic annotations to enhance the generalization capacity of self-attention models for reading comprehension.
△ Less
Submitted 28 August, 2019;
originally announced August 2019.
-
Salient Slices: Improved Neural Network Training and Performance with Image Entropy
Authors:
Steven J. Frank,
Andrea M. Frank
Abstract:
As a training and analysis strategy for convolutional neural networks (CNNs), we slice images into tiled segments and use, for training and prediction, segments that both satisfy a criterion of information diversity and contain sufficient content to support classification. In particular, we utilize image entropy as the diversity criterion. This ensures that each tile carries as much information di…
▽ More
As a training and analysis strategy for convolutional neural networks (CNNs), we slice images into tiled segments and use, for training and prediction, segments that both satisfy a criterion of information diversity and contain sufficient content to support classification. In particular, we utilize image entropy as the diversity criterion. This ensures that each tile carries as much information diversity as the original image, and for many applications serves as an indicator of usefulness in classification. To make predictions, a probability aggregation framework is applied to probabilities assigned by the CNN to the input image tiles. This technique facilitates the use of large, high-resolution images that would be impractical to analyze unmodified; provides data augmentation for training, which is particularly valuable when image availability is limited; and the ensemble nature of the input for prediction enhances its accuracy.
△ Less
Submitted 4 May, 2020; v1 submitted 29 July, 2019;
originally announced July 2019.