-
Can Few-shot Work in Long-Context? Recycling the Context to Generate Demonstrations
Authors:
Arie Cattan,
Alon Jacovi,
Alex Fabrikant,
Jonathan Herzig,
Roee Aharoni,
Hannah Rashkin,
Dror Marcus,
Avinatan Hassidim,
Yossi Matias,
Idan Szpektor,
Avi Caciularu
Abstract:
Despite recent advancements in Large Language Models (LLMs), their performance on tasks involving long contexts remains sub-optimal. In-Context Learning (ICL) with few-shot examples may be an appealing solution to enhance LLM performance in this scenario; however, naively adding ICL examples with long context introduces challenges, including substantial token overhead added for each few-shot example and context mismatch between the demonstrations and the target query. In this work, we propose to automatically generate few-shot examples for long context QA tasks by recycling contexts. Specifically, given a long input context (1-3k tokens) and a query, we generate additional query-output pairs from the given context as few-shot examples, while introducing the context only once. This ensures that the demonstrations are leveraging the same context as the target query while only adding a small number of tokens to the prompt. We further enhance each demonstration by instructing the model to explicitly identify the relevant paragraphs before the answer, which improves performance while providing fine-grained attribution to the answer source. We apply our method on multiple LLMs and obtain substantial improvements (+23% on average across models) on various QA datasets with long context, especially when the answer lies within the middle of the context. Surprisingly, despite introducing only single-hop ICL examples, LLMs also successfully generalize to multi-hop long-context QA using our approach.
Submitted 23 June, 2024; v1 submitted 19 June, 2024;
originally announced June 2024.
-
Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?
Authors:
Gal Yona,
Roee Aharoni,
Mor Geva
Abstract:
We posit that large language models (LLMs) should be capable of expressing their intrinsic uncertainty in natural language. For example, if the LLM is equally likely to output two contradicting answers to the same question, then its generated response should reflect this uncertainty by hedging its answer (e.g., "I'm not sure, but I think..."). We formalize faithful response uncertainty based on the gap between the model's intrinsic confidence in the assertions it makes and the decisiveness by which they are conveyed. This example-level metric reliably indicates whether the model reflects its uncertainty, as it penalizes both excessive and insufficient hedging. We evaluate a variety of aligned LLMs at faithfully communicating uncertainty on several knowledge-intensive question answering tasks. Our results provide strong evidence that modern LLMs are poor at faithfully conveying their uncertainty, and that better alignment is necessary to improve their trustworthiness.
Submitted 27 May, 2024;
originally announced May 2024.
-
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
Authors:
Zorik Gekhman,
Gal Yona,
Roee Aharoni,
Matan Eyal,
Amir Feder,
Roi Reichart,
Jonathan Herzig
Abstract:
When large language models are aligned via supervised fine-tuning, they may encounter new factual information that was not acquired through pre-training. It is often conjectured that this can teach the model the behavior of hallucinating factually incorrect responses, as the model is trained to generate facts that are not grounded in its pre-existing knowledge. In this work, we study the impact of such exposure to new knowledge on the capability of the fine-tuned model to utilize its pre-existing knowledge. To this end, we design a controlled setup, focused on closed-book QA, where we vary the proportion of the fine-tuning examples that introduce new knowledge. We demonstrate that large language models struggle to acquire new factual knowledge through fine-tuning, as fine-tuning examples that introduce new knowledge are learned significantly slower than those consistent with the model's knowledge. However, we also find that as the examples with new knowledge are eventually learned, they linearly increase the model's tendency to hallucinate. Taken together, our results highlight the risk in introducing new factual knowledge through fine-tuning, and support the view that large language models mostly acquire factual knowledge through pre-training, whereas fine-tuning teaches them to use it more efficiently.
Submitted 13 May, 2024; v1 submitted 9 May, 2024;
originally announced May 2024.
-
Representation Surgery: Theory and Practice of Affine Steering
Authors:
Shashwat Singh,
Shauli Ravfogel,
Jonathan Herzig,
Roee Aharoni,
Ryan Cotterell,
Ponnurangam Kumaraguru
Abstract:
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. In the case of neural language models, an encoding of the undesirable behavior is often present in the model's representations. Thus, one natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations in a manner that reduces the probability of it generating undesirable text. This paper investigates the formal and empirical properties of steering functions, i.e., transformation of the neural language model's representations that alter its behavior. First, we derive two optimal, in the least-squares sense, affine steering functions under different constraints. Our theory provides justification for existing approaches and offers a novel, improved steering approach. Second, we offer a series of experiments that demonstrate the empirical effectiveness of the methods in mitigating bias and reducing toxic generation.
Submitted 5 July, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
-
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
Authors:
Alon Jacovi,
Yonatan Bitton,
Bernd Bohnet,
Jonathan Herzig,
Or Honovich,
Michael Tseng,
Michael Collins,
Roee Aharoni,
Mor Geva
Abstract:
Prompting language models to provide step-by-step answers (e.g., "Chain-of-Thought") is the prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literature discusses automatic methods to verify reasoning to evaluate and improve their correctness. However, no fine-grained step-level datasets are available to enable thorough evaluation of such verification methods, hindering progress in this direction. We introduce REVEAL: Reasoning Verification Evaluation, a dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning in open-domain question-answering settings. REVEAL includes comprehensive labels for the relevance, attribution to evidence passages, and logical correctness of each reasoning step in a language model's answer, across a variety of datasets and state-of-the-art language models. Evaluation on REVEAL shows that verifiers struggle at verifying reasoning chains - in particular, verifying logical correctness and detecting contradictions. Available at https://reveal-dataset.github.io/ .
Submitted 21 May, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers
Authors:
Gal Yona,
Roee Aharoni,
Mor Geva
Abstract:
Factual questions typically can be answered correctly at different levels of granularity. For example, both "August 4, 1961" and "1961" are correct answers to the question "When was Barack Obama born?". Standard question answering (QA) evaluation protocols, however, do not explicitly take this into account and compare a predicted answer against answers of a single granularity level. In this work, we propose GRANOLA QA, a novel evaluation setting where a predicted answer is evaluated in terms of accuracy and informativeness against a set of multi-granularity answers. We present a simple methodology for enriching existing datasets with multi-granularity answers, and create GRANOLA-EQ, a multi-granularity version of the EntityQuestions dataset. We evaluate a range of decoding methods on GRANOLA-EQ, including a new algorithm, called Decoding with Response Aggregation (DRAG), that is geared towards aligning the response granularity with the model's uncertainty. Our experiments show that large language models with standard decoding tend to generate specific answers, which are often incorrect. In contrast, when evaluated on multi-granularity answers, DRAG yields a nearly 20 point increase in accuracy on average, which further increases for rare entities. Overall, this reveals that standard evaluation and decoding schemes may significantly underestimate the knowledge encapsulated in LMs.
Submitted 9 January, 2024;
originally announced January 2024.
-
Multilingual Instruction Tuning With Just a Pinch of Multilinguality
Authors:
Uri Shaham,
Jonathan Herzig,
Roee Aharoni,
Idan Szpektor,
Reut Tsarfaty,
Matan Eyal
Abstract:
As instruction-tuned large language models (LLMs) gain global adoption, their ability to follow instructions in multiple languages becomes increasingly crucial. In this work, we investigate how multilinguality during instruction tuning of a multilingual LLM affects instruction-following across languages from the pre-training corpus. We first show that many languages transfer some instruction-following capabilities to other languages from even monolingual tuning. Furthermore, we find that only 40 multilingual examples integrated in an English tuning set substantially improve multilingual instruction-following, both in seen and unseen languages during tuning. In general, we observe that models tuned on multilingual mixtures exhibit comparable or superior performance in multiple languages compared to monolingually tuned models, despite training on 10x fewer examples in those languages. Finally, we find that diversifying the instruction tuning set with even just 2-4 languages significantly improves cross-lingual generalization. Our results suggest that building massively multilingual instruction-tuned models can be done with only a very small set of multilingual instruction-responses.
Submitted 21 May, 2024; v1 submitted 3 January, 2024;
originally announced January 2024.
-
2-covers of wide Young diagrams
Authors:
Ron Aharoni,
Eli Berger,
He Guo,
Daniel Kotlar
Abstract:
A Young diagram $Y$ is called wide if every sub-diagram $Z$ formed by a subset of the rows of $Y$ dominates $Z'$, the conjugate of $Z$. A Young diagram $Y$ is called Latin if its squares can be assigned numbers so that for each $i$, the $i$th row is filled injectively with the numbers $1, \ldots ,a_i$, where $a_i$ is the length of $i$th row of $Y$, and every column is also filled injectively. A conjecture of Chow and Taylor, publicized by Chow, Fan, Goemans, and Vondrak is that a wide Young diagram is Latin. We prove a dual version of the conjecture.
Submitted 11 December, 2023; v1 submitted 29 November, 2023;
originally announced November 2023.
-
A Comprehensive Evaluation of Tool-Assisted Generation Strategies
Authors:
Alon Jacovi,
Avi Caciularu,
Jonathan Herzig,
Roee Aharoni,
Bernd Bohnet,
Mor Geva
Abstract:
A growing area of research investigates augmenting language models with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-usage strategies have been proposed. However, there is no systematic and fair comparison across different strategies, or between these strategies and strong baselines that do not leverage tools. We conduct an extensive empirical analysis, finding that (1) across various datasets, example difficulty levels, and models, strong no-tool baselines are competitive to tool-assisted strategies, implying that effectively using tools with in-context demonstrations is a difficult unsolved problem; (2) for knowledge-retrieval tasks, strategies that *refine* incorrect outputs with tools outperform strategies that retrieve relevant information *ahead of* or *during generation*; (3) tool-assisted strategies are expensive in the number of tokens they require to work -- incurring additional costs by orders of magnitude -- which does not translate into significant improvement in performance. Overall, our findings suggest that few-shot tool integration is still an open challenge, emphasizing the need for comprehensive evaluations of future strategies to accurately assess their *benefits* and *costs*.
Submitted 28 December, 2023; v1 submitted 16 October, 2023;
originally announced October 2023.
-
Looms
Authors:
Ron Aharoni,
Eli Berger,
Joseph Briggs,
He Guo
Abstract:
A pair $(A,B)$ of hypergraphs is called orthogonal if $|a \cap b|=1$ for every pair of edges $a \in A$ and $b \in B$. An orthogonal pair of hypergraphs is called a loom if each of its two members is the set of minimum covers of the other. Looms appear naturally in the context of a conjecture of Gyárfás and Lehel on the covering number of cross-intersecting hypergraphs. We study their properties and ways of construction, and prove special cases of a conjecture that if true would imply the Gyárfás--Lehel conjecture.
Submitted 7 September, 2023;
originally announced September 2023.
-
Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback
Authors:
Paul Roit,
Johan Ferret,
Lior Shani,
Roee Aharoni,
Geoffrey Cideron,
Robert Dadashi,
Matthieu Geist,
Sertan Girgin,
Léonard Hussenot,
Orgad Keller,
Nikola Momchev,
Sabela Ramos,
Piotr Stanczyk,
Nino Vieillard,
Olivier Bachem,
Gal Elidan,
Avinatan Hassidim,
Olivier Pietquin,
Idan Szpektor
Abstract:
Despite the seeming success of contemporary grounded text generation systems, they often tend to generate factually inconsistent text with respect to their input. This phenomenon is emphasized in tasks like summarization, in which the generated summaries should be corroborated by their source article. In this work, we leverage recent progress on textual entailment models to directly address this problem for abstractive summarization systems. We use reinforcement learning with reference-free, textual entailment rewards to optimize for factual consistency and explore the ensuing trade-offs, as improved consistency may come at the cost of less informative or more extractive summaries. Our results, according to both automatic metrics and human evaluation, show that our method considerably improves the faithfulness, salience, and conciseness of the generated summaries.
Submitted 31 May, 2023;
originally announced June 2023.
-
Evaluating and Modeling Attribution for Cross-Lingual Question Answering
Authors:
Benjamin Muller,
John Wieting,
Jonathan H. Clark,
Tom Kwiatkowski,
Sebastian Ruder,
Livio Baldini Soares,
Roee Aharoni,
Jonathan Herzig,
Xinyi Wang
Abstract:
Trustworthy answer content is abundant in many high-resource languages and is instantly accessible through question answering systems, yet this content can be hard to access for those that do not speak these languages. The leap forward in cross-lingual modeling quality offered by generative language models offers much promise, yet their raw generations often fall short in factuality. To improve trustworthiness in these systems, a promising direction is to attribute the answer to a retrieved source, possibly in a content-rich language different from the query. Our work is the first to study attribution for cross-lingual question answering. First, we collect data in 5 languages to assess the attribution level of a state-of-the-art cross-lingual QA system. To our surprise, we find that a substantial portion of the answers is not attributable to any retrieved passages (up to 50% of answers exactly matching a gold reference) despite the system being able to attend directly to the retrieved text. Second, to address this poor attribution level, we experiment with a wide range of attribution detection techniques. We find that Natural Language Inference models and PaLM 2 fine-tuned on a very small amount of attribution data can accurately detect attribution. Based on these models, we improve the attribution level of a cross-lingual question-answering system. Overall, we show that current academic generative cross-lingual QA systems have substantial shortcomings in attribution and we build tooling to mitigate these issues.
Submitted 15 November, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation
Authors:
Elizabeth Clark,
Shruti Rijhwani,
Sebastian Gehrmann,
Joshua Maynez,
Roee Aharoni,
Vitaly Nikolaev,
Thibault Sellam,
Aditya Siddhant,
Dipanjan Das,
Ankur P. Parikh
Abstract:
Reliable automatic evaluation of summarization systems is challenging due to the multifaceted and subjective nature of the task. This is especially the case for languages other than English, where human evaluations are scarce. In this work, we introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation. SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality: comprehensibility, repetition, grammar, attribution, main ideas, and conciseness, covering 6 languages, 9 systems and 4 datasets. As a result of its size and scope, SEAHORSE can serve both as a benchmark to evaluate learnt metrics, as well as a large-scale resource for training such metrics. We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE (Honovich et al., 2022) and mFACE (Aharoni et al., 2022). We make the SEAHORSE dataset and metrics publicly available for future research on multilingual and multifaceted summarization evaluation.
Submitted 1 November, 2023; v1 submitted 22 May, 2023;
originally announced May 2023.
-
TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models
Authors:
Zorik Gekhman,
Jonathan Herzig,
Roee Aharoni,
Chen Elkind,
Idan Szpektor
Abstract:
Factual consistency evaluation is often conducted using Natural Language Inference (NLI) models, yet these models exhibit limited success in evaluating summaries. Previous work improved such models with synthetic training data. However, the data is typically based on perturbed human-written summaries, which often differ in their characteristics from real model-generated summaries and have limited coverage of possible factual errors. Alternatively, large language models (LLMs) have recently shown promising results in directly evaluating generative tasks, but are too computationally expensive for practical use. Motivated by these limitations, we introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries using an LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries, and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained using our data substantially outperforms both the state-of-the-art model with similar capacity and the LLM teacher. In a systematic study, we compare TrueTeacher to existing synthetic data generation methods and demonstrate its superiority and robustness to domain-shift. We also show that our method generalizes to multilingual scenarios. Lastly, we release our large-scale synthetic dataset (1.4M examples), generated using TrueTeacher, and a checkpoint trained on this data.
Submitted 18 October, 2023; v1 submitted 18 May, 2023;
originally announced May 2023.
-
What You See is What You Read? Improving Text-Image Alignment Evaluation
Authors:
Michal Yarom,
Yonatan Bitton,
Soravit Changpinyo,
Roee Aharoni,
Jonathan Herzig,
Oran Lang,
Eran Ofek,
Idan Szpektor
Abstract:
Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
Submitted 26 December, 2023; v1 submitted 17 May, 2023;
originally announced May 2023.
-
Surfacing Biases in Large Language Models using Contrastive Input Decoding
Authors:
Gal Yona,
Or Honovich,
Itay Laish,
Roee Aharoni
Abstract:
Ensuring that large language models (LMs) are fair, robust and useful requires an understanding of how different modifications to their inputs impact the model's behaviour. In the context of open-text generation tasks, however, such an evaluation is not trivial. For example, when introducing a model with an input text and a perturbed, "contrastive" version of it, meaningful differences in the next-token predictions may not be revealed with standard decoding strategies. With this motivation in mind, we propose Contrastive Input Decoding (CID): a decoding algorithm to generate text given two inputs, where the generated text is likely given one input but unlikely given the other. In this way, the contrastive generations can highlight potentially subtle differences in how the LM output differs for the two inputs in a simple and interpretable manner. We use CID to highlight context-specific biases that are hard to detect with standard decoding strategies and quantify the effect of different input perturbations.
Submitted 12 May, 2023;
originally announced May 2023.
-
q2d: Turning Questions into Dialogs to Teach Models How to Search
Authors:
Yonatan Bitton,
Shlomi Cohen-Ganor,
Ido Hakimi,
Yoad Lewenberg,
Roee Aharoni,
Enav Weinreb
Abstract:
One of the exciting capabilities of recent language models for dialog is their ability to independently search for relevant information to ground a given dialog response. However, obtaining training data to teach models how to issue search queries is time and resource consuming. In this work, we propose q2d: an automatic data generation pipeline that generates information-seeking dialogs from questions. We prompt a large language model (PaLM) to create conversational versions of question answering datasets, and use it to improve query generation models that communicate with external search APIs to ground dialog responses. Unlike previous approaches which relied on human-written dialogs with search queries, our method allows us to automatically generate query-based grounded dialogs with better control and scale. Our experiments demonstrate that: (1) for query generation on the QReCC dataset, models trained on our synthetically-generated data achieve 90%--97% of the performance of models trained on the human-generated data; (2) we can successfully generate data for training dialog models in new domains without any existing dialog data, as demonstrated on the multi-hop MuSiQue and Bamboogle QA datasets; (3) a thorough analysis of the generated dialogs shows that humans find them of high quality and struggle to distinguish them from human-written dialogs.
Submitted 26 December, 2023; v1 submitted 27 April, 2023;
originally announced April 2023.
-
Tight infinite matrices
Authors:
Ron Aharoni,
He Guo
Abstract:
We give a simple proof of a recent result of Gollin and Joó: if a possibly infinite system of homogeneous linear equations $A\vec{x} = \vec{0}$, where $A = (a_{i, j})$ is an $I \times J$ matrix, has only the trivial solution, then there exists an injection $φ: J \to I$, such that $a_{φ(j), j} \neq 0$ for all $j \in J$.
Submitted 24 January, 2023;
originally announced January 2023.
-
mFACE: Multilingual Summarization with Factual Consistency Evaluation
Authors:
Roee Aharoni,
Shashi Narayan,
Joshua Maynez,
Jonathan Herzig,
Elizabeth Clark,
Mirella Lapata
Abstract:
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets. Despite promising results, current models still suffer from generating factually inconsistent summaries, reducing their utility for real-world application. Several recent efforts attempt to address this by devising models that automatically detect factual inconsistencies in machine generated summaries. However, they focus exclusively on English, a language with abundant resources. In this work, we leverage factual consistency evaluation models to improve multilingual summarization. We explore two intuitive approaches to mitigate hallucinations based on the signal provided by a multilingual NLI model, namely data filtering and controlled generation. Experimental results in the 45 languages from the XLSum dataset show gains over strong baselines in both automatic and human evaluation.
Submitted 5 January, 2024; v1 submitted 20 December, 2022;
originally announced December 2022.
-
Multilingual Sequence-to-Sequence Models for Hebrew NLP
Authors:
Matan Eyal,
Hila Noga,
Roee Aharoni,
Idan Szpektor,
Reut Tsarfaty
Abstract:
Recent work attributes progress in NLP to large language models (LMs) with increased model size and large quantities of pretraining data. Despite this, current state-of-the-art LMs for Hebrew are both under-parameterized and under-trained compared to LMs in other languages. Additionally, previous work on pretrained Hebrew LMs focused on encoder-only models. While the encoder-only architecture is beneficial for classification tasks, it does not cater well to sub-word prediction tasks, such as Named Entity Recognition, when considering the morphologically rich nature of Hebrew. In this paper we argue that sequence-to-sequence generative architectures are more suitable for LLMs in the case of morphologically rich languages (MRLs) such as Hebrew. We demonstrate that by casting tasks in the Hebrew NLP pipeline as text-to-text tasks, we can leverage powerful multilingual, pretrained sequence-to-sequence models such as mT5, eliminating the need for a specialized, morpheme-based, separately fine-tuned decoder. Using this approach, our experiments show substantial improvements over previously published results on existing Hebrew NLP benchmarks. These results suggest that multilingual sequence-to-sequence models present a promising building block for NLP for MRLs.
Submitted 19 December, 2022;
originally announced December 2022.
-
Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models
Authors:
Bernd Bohnet,
Vinh Q. Tran,
Pat Verga,
Roee Aharoni,
Daniel Andor,
Livio Baldini Soares,
Massimiliano Ciaramita,
Jacob Eisenstein,
Kuzman Ganchev,
Jonathan Herzig,
Kai Hui,
Tom Kwiatkowski,
Ji Ma,
Jianmo Ni,
Lierni Sestorain Saralegui,
Tal Schuster,
William W. Cohen,
Michael Collins,
Dipanjan Das,
Donald Metzler,
Slav Petrov,
Kellie Webster
Abstract:
Large language models (LLMs) have shown impressive results while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial in this setting. We formulate and study Attributed QA as a key first step in the development of attributed LLMs. We propose a reproducible evaluation framework for the task and benchmark a broad set of architectures. We take human annotations as a gold standard and show that a correlated automatic metric is suitable for development. Our experimental work gives concrete answers to two key questions (How to measure attribution? and How well do current state-of-the-art methods perform on attribution?), and gives some hints as to how to address a third (How to build LLMs with attribution?).
Submitted 10 February, 2023; v1 submitted 15 December, 2022;
originally announced December 2022.
-
DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering
Authors:
Ella Neeman,
Roee Aharoni,
Or Honovich,
Leshem Choshen,
Idan Szpektor,
Omri Abend
Abstract:
Question answering models commonly have access to two sources of "knowledge" during inference time: (1) parametric knowledge - the factual knowledge encoded in the model weights, and (2) contextual knowledge - external knowledge (e.g., a Wikipedia passage) given to the model to generate a grounded answer. Having these two sources of knowledge entangled together is a core issue for generative QA models as it is unclear whether the answer stems from the given non-parametric knowledge or not. This unclarity has implications on issues of trust, interpretability and factuality. In this work, we propose a new paradigm in which QA models are trained to disentangle the two sources of knowledge. Using counterfactual data augmentation, we introduce a model that predicts two answers for a given question: one based on given contextual knowledge and one based on parametric knowledge. Our experiments on the Natural Questions dataset show that this approach improves the performance of QA models by making them more robust to knowledge conflicts between the two knowledge sources, while generating useful disentangled answers.
Submitted 10 November, 2022;
originally announced November 2022.
-
Strongly maximal matchings and strongly minimal covers
Authors:
Ron Aharoni
Abstract:
This is a not-to-be-journal-published paper, aimed to serve as reference. It is a summary of the main ideas on the topic appearing in the title, and an opportunity to state correctly the main conjecture in the field.
Submitted 3 June, 2022;
originally announced June 2022.
-
TRUE: Re-evaluating Factual Consistency Evaluation
Authors:
Or Honovich,
Roee Aharoni,
Jonathan Herzig,
Hagai Taitelbaum,
Doron Kukliansy,
Vered Cohen,
Thomas Scialom,
Idan Szpektor,
Avinatan Hassidim,
Yossi Matias
Abstract:
Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs and augmenting training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in silo for a single task or dataset, slowing their adoption. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, which leave the example-level accuracy of such metrics unclear. In this work, we introduce TRUE: a comprehensive survey and assessment of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results. We recommend those methods as a starting point for model and metric developers, and hope TRUE will foster progress towards even better evaluation methods.
Submitted 3 May, 2022; v1 submitted 11 April, 2022;
originally announced April 2022.
-
Rainbow cycles for families of matchings
Authors:
Ron Aharoni,
He Guo
Abstract:
Given a graph $G$ and a coloring of its edges, a subgraph of $G$ is called rainbow if its edges have distinct colors. The rainbow girth of an edge coloring of $G$ is the minimum length of a rainbow cycle in $G$. A generalization of the famous Caccetta-Häggkvist conjecture (CHC), proposed by the first author, is that if $G$ has $n$ vertices, $G$ is $n$-edge-colored and the size of every color class is $k$, then the rainbow girth is at most $\lceil \frac{n}{k} \rceil$. In the only known example showing sharpness of this conjecture, that stems from an example for the sharpness of CHC, the color classes are stars. This suggests that in the antipodal case to stars, namely matchings, the result can be improved. Indeed, we show that the rainbow girth of $n$ matchings of size at least 2 is $O(\log n)$, as compared with the general bound of $\lceil \frac{n}{2} \rceil$.
Submitted 24 October, 2022; v1 submitted 27 October, 2021;
originally announced October 2021.
-
Non-uniform degrees and rainbow versions of the Caccetta-Häggkvist conjecture
Authors:
Ron Aharoni,
Eli Berger,
Maria Chudnovsky,
He Guo,
Shira Zerbib
Abstract:
The Caccetta-Häggkvist conjecture (denoted below CHC) states that the directed girth (the smallest length of a directed cycle) $dgirth(D)$ of a directed graph $D$ on $n$ vertices is at most $\lceil \frac{n}{δ^+(D)}\rceil$, where $δ^+(D)$ is the minimum out-degree of $D$. We consider a version involving all out-degrees, not merely the minimum one, and prove that if $D$ does not contain a sink, then $dgirth(D) \le 2 \sum_{v\in V(D)} \frac{1}{deg^+(v)+1}$. In the spirit of a generalization of the CHC to rainbow cycles in \cite{ADH2019}, this suggests the conjecture that given non-empty sets $F_1, \ldots,F_n$ of edges of $K_n$, there exists a rainbow cycle of length at most $2\sum_{1\le i \le n}\frac{1}{|F_i|+1}$. We prove a slightly stronger result when $1\le |F_i|\le 2$, thereby strengthening a result of DeVos et al. \cite{DDFGGHMM2021}. We prove a logarithmic bound on the rainbow girth in the case that the sets $F_i$ are triangles.
Submitted 7 October, 2022; v1 submitted 21 October, 2021;
originally announced October 2021.
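The out-degree bound stated in this abstract can be checked numerically on small examples. The sketch below computes the directed girth of a sink-free digraph by breadth-first search and evaluates $2 \sum_{v} \frac{1}{deg^+(v)+1}$ exactly. The adjacency-dictionary representation, the function names, and the two test digraphs are assumptions made for this illustration and are not taken from the paper.

```python
from collections import deque
from fractions import Fraction


def directed_girth(adj):
    """Length of a shortest directed cycle, or None if the digraph is acyclic."""
    best = None
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w == start:
                    # The edge u -> start closes a cycle of length dist[u] + 1.
                    length = dist[u] + 1
                    if best is None or length < best:
                        best = length
                elif w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
    return best


def out_degree_bound(adj):
    """The bound 2 * sum over v of 1 / (deg^+(v) + 1), as an exact fraction."""
    return 2 * sum(Fraction(1, len(set(nbrs)) + 1) for nbrs in adj.values())


# Directed 5-cycle: every out-degree is 1, so the bound is 2 * 5 * (1/2) = 5.
cycle5 = {i: [(i + 1) % 5] for i in range(5)}
# Circulant digraph on 7 vertices with edges i -> i+1 and i -> i+2 (mod 7).
circulant7 = {i: [(i + 1) % 7, (i + 2) % 7] for i in range(7)}

for name, graph in [("cycle5", cycle5), ("circulant7", circulant7)]:
    print(name, directed_girth(graph), out_degree_bound(graph))
```

On the directed 5-cycle the girth meets the bound with equality, which gives a quick sanity check of both functions.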
-
Choice Functions
Authors:
Ron Aharoni,
Joseph Briggs
Abstract:
This is a survey paper on rainbow sets (another name for ``choice functions''). The main theme is the distinction between two types of choice functions: those having a large (in the sense of belonging to some specified filter, namely closed up set of sets) image, and those that have a large domain and small image, where ``smallness'' means belonging to some specified complex (a closed-down set). The paper contains some new results: (1) theorems on scrambled versions, in which the sets are re-shuffled before choosing the rainbow set, and (2) results on weighted and cooperative versions - to be defined below.
Submitted 27 July, 2021;
originally announced July 2021.
-
$Q^{2}$: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering
Authors:
Or Honovich,
Leshem Choshen,
Roee Aharoni,
Ella Neeman,
Idan Szpektor,
Omri Abend
Abstract:
Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the knowledge they rely on, making them unreliable and limiting their applicability. Inspired by recent work on evaluating factual consistency in abstractive summarization, we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue using automatic question generation and question answering. Our metric, denoted $Q^2$, compares answer spans using natural language inference (NLI), instead of token-based matching as done in previous work. To foster proper evaluation, we curate a novel dataset of dialogue system outputs for the Wizard-of-Wikipedia dataset, manually annotated for factual consistency. We perform a thorough meta-evaluation of $Q^2$ against other metrics using this dataset and two others, where it consistently shows higher correlation with human judgements.
Submitted 9 September, 2021; v1 submitted 16 April, 2021;
originally announced April 2021.
-
Rainbow paths and large rainbow matchings
Authors:
Ron Aharoni,
Eli Berger,
Maria Chudnovsky,
Shira Zerbib
Abstract:
A conjecture of the first two authors is that $n$ matchings of size $n$ in any graph have a rainbow matching of size $n-1$. We prove a lower bound of $\frac{2}{3}n-1$, improving on the trivial $\frac{1}{2}n$, and an analogous result for hypergraphs. For $\{C_3,C_5\}$-free graphs and for disjoint matchings we obtain a lower bound of $\frac{3n}{4}-O(1)$. We also discuss a conjecture on rainbow alternating paths, that if true would yield a lower bound of $n-\sqrt{2n}$. We prove the non-alternating (ordinary paths) version of this conjecture.
Submitted 7 October, 2021; v1 submitted 29 December, 2020;
originally announced December 2020.
-
Fractionally balanced hypergraphs and rainbow KKM theorems
Authors:
Ron Aharoni,
Eli Berger,
Joseph Briggs,
Erel Segal-Halevi,
Shira Zerbib
Abstract:
A d-partite hypergraph is called *fractionally balanced* if there exists a non-negative, not identically zero, function on its edge set that has constant degrees in each vertex side. Using a topological version of Hall's theorem we prove lower bounds on the matching number of such hypergraphs. These bounds yield rainbow versions of the KKM theorem for products of simplices, which in turn are used to obtain some results on multiple-cake division, and on rainbow matchings in families of d-intervals.
Submitted 14 August, 2022; v1 submitted 2 November, 2020;
originally announced November 2020.
-
KoBE: Knowledge-Based Machine Translation Evaluation
Authors:
Zorik Gekhman,
Roee Aharoni,
Genady Beryozkin,
Markus Freitag,
Wolfgang Macherey
Abstract:
We propose a simple and effective method for machine translation evaluation which does not require reference translations. Our approach is based on (1) grounding the entity mentions found in each source sentence and candidate translation against a large-scale multilingual knowledge base, and (2) measuring the recall of the grounded entities found in the candidate vs. those found in the source. Our approach achieves the highest correlation with human judgements on 9 out of the 18 language pairs from the WMT19 benchmark for evaluation without references, which is the largest number of wins for a single evaluation method on this task. On 4 language pairs, we also achieve higher correlation with human judgements than BLEU. To foster further research, we release a dataset containing 1.8 million grounded entity mentions across 18 language pairs from the WMT19 metrics track data.
Submitted 23 September, 2020;
originally announced September 2020.
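To make the recall step in this abstract concrete, here is a minimal sketch that scores a candidate translation by the share of grounded source entities it preserves. It assumes an external entity linker has already mapped the mentions in both sentences to knowledge-base IDs; the function name, the multiset counting, the empty-source convention, and the example IDs are all hypothetical, and the paper's exact scoring may differ.

```python
from collections import Counter


def entity_recall(source_entities, candidate_entities):
    """Fraction of grounded source entities that also appear in the candidate.

    Both arguments are lists of knowledge-base IDs (hypothetical linker output).
    Repeated mentions are counted with multiplicity.
    """
    source = Counter(source_entities)
    candidate = Counter(candidate_entities)
    total = sum(source.values())
    if total == 0:
        return 1.0  # convention for this sketch: nothing to recall
    matched = sum(min(count, candidate[eid]) for eid, count in source.items())
    return matched / total


# Hypothetical knowledge-base IDs for entities linked in a source sentence
# and in a candidate translation.
source_ids = ["Q64", "Q183", "Q183", "Q567"]
candidate_ids = ["Q183", "Q64", "Q1055"]
score = entity_recall(source_ids, candidate_ids)
print(score)
```

Because the comparison is made on knowledge-base IDs rather than surface strings, the same entity can be matched across the source and target languages without a reference translation.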
-
Real-Time Sign Language Detection using Human Pose Estimation
Authors:
Amit Moryossef,
Ioannis Tsochantaridis,
Roee Aharoni,
Sarah Ebling,
Srini Narayanan
Abstract:
We propose a lightweight real-time sign language detection model, as we identify the need for such a case in videoconferencing. We extract optical flow features based on human pose estimation and, using a linear classifier, show these features are meaningful with an accuracy of 80%, evaluated on the DGS Corpus. Using a recurrent model directly on the input, we see improvements of up to 91% accuracy, while still working under 4ms. We describe a demo application to sign language detection in the browser in order to demonstrate its usage possibility in videoconferencing applications.
Submitted 13 September, 2020; v1 submitted 11 August, 2020;
originally announced August 2020.
-
Rainbow odd cycles
Authors:
Ron Aharoni,
Joseph Briggs,
Ron Holzman,
Zilin Jiang
Abstract:
We prove that every family of (not necessarily distinct) odd cycles $O_1, \dots, O_{2\lceil n/2 \rceil-1}$ in the complete graph $K_n$ on $n$ vertices has a rainbow odd cycle (that is, a set of edges from distinct $O_i$'s, forming an odd cycle). As part of the proof, we characterize those families of $n$ odd cycles in $K_{n+1}$ that do not have any rainbow odd cycle. We also characterize those families of $n$ cycles in $K_{n+1}$, as well as those of $n$ edge-disjoint nonempty subgraphs of $K_{n+1}$, without any rainbow cycle.
Submitted 20 September, 2021; v1 submitted 19 July, 2020;
originally announced July 2020.
-
Badges and rainbow matchings
Authors:
Ron Aharoni,
Joseph Briggs,
Jinha Kim,
Minki Kim
Abstract:
Drisko proved that $2n-1$ matchings of size $n$ in a bipartite graph have a rainbow matching of size $n$. For general graphs it is conjectured that $2n$ matchings suffice for this purpose (and that $2n-1$ matchings suffice when $n$ is even). The known graphs showing sharpness of this conjecture for $n$ even are called badges. We improve the previously best known bound from $3n-2$ to $3n-3$, using a new line of proof that involves analysis of the appearance of badges. We also prove a "cooperative" generalization: for $t>0$ and $n \geq 3$, any $3n-4+t$ sets of edges, the union of every $t$ of which contains a matching of size $n$, have a rainbow matching of size $n$.
Submitted 15 February, 2021; v1 submitted 16 April, 2020;
originally announced April 2020.
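Drisko's theorem, quoted at the start of this abstract, can be illustrated on the smallest nontrivial case. The sketch below finds the size of a largest rainbow matching by brute force and applies it to the two perfect matchings of a 4-cycle, with $n = 2$. The function name and the edge-list representation are assumptions made for this example, and the exhaustive search is only practical for very small families.

```python
def max_rainbow_matching(family):
    """Size of a largest rainbow matching: pairwise disjoint edges,
    each taken from a different member of `family` (brute-force search)."""

    def search(index, used_vertices):
        if index == len(family):
            return 0
        # Option 1: take no edge from family[index].
        best = search(index + 1, used_vertices)
        # Option 2: take one edge that avoids all vertices used so far.
        for u, v in family[index]:
            if u != v and u not in used_vertices and v not in used_vertices:
                best = max(best, 1 + search(index + 1, used_vertices | {u, v}))
        return best

    return search(0, frozenset())


# The two perfect matchings of the 4-cycle 0-1-2-3-0 (a bipartite graph), n = 2.
m_a = [(0, 1), (2, 3)]
m_b = [(1, 2), (3, 0)]

two_matchings = max_rainbow_matching([m_a, m_b])         # 2n - 2 matchings
three_matchings = max_rainbow_matching([m_a, m_b, m_a])  # 2n - 1 matchings
print(two_matchings, three_matchings)
```

With $2n-2 = 2$ matchings every edge of one matching meets every edge of the other, so no rainbow matching of size 2 exists, while adding a third matching brings the family to $2n-1$ and a rainbow matching of size $n$ appears, in line with the theorem.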
-
Unsupervised Domain Clusters in Pretrained Language Models
Authors:
Roee Aharoni,
Yoav Goldberg
Abstract:
The notion of "in-domain data" in NLP is often over-simplistic and vague, as textual data varies in many nuanced linguistic aspects such as topic, style or level of formality. In addition, domain labels are often unavailable, making it challenging to build domain-specific systems. We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision -- suggesting a simple data-driven definition of domains in textual data. We harness this property and propose domain data selection methods based on such models, which require only a small set of in-domain monolingual data. We evaluate our data selection methods for neural machine translation across five diverse domains, where they outperform an established approach as measured by both BLEU and by precision and recall of sentence selection with respect to an oracle.
Submitted 1 May, 2020; v1 submitted 5 April, 2020;
originally announced April 2020.
-
Cooperative conditions for the existence of rainbow matchings
Authors:
Ron Aharoni,
Joseph Briggs,
Minho Cho,
Jinha Kim
Abstract:
Let $k>1$, and let $\mathcal{F}$ be a family of $2n+k-3$ non-empty sets of edges in a bipartite graph. If the union of every $k$ members of $\mathcal{F}$ contains a matching of size $n$, then there exists an $\mathcal{F}$-rainbow matching of size $n$. Replacing $2n+k-3$ by $2n+k-2$, the result is true also for $k=1$, and it can be proved (for all $k$) both topologically and by a relatively simple combinatorial argument. The main effort is in gaining the last $1$, which makes the result sharp.
Submitted 28 December, 2021; v1 submitted 18 March, 2020;
originally announced March 2020.
-
Diversify Your Datasets: Analyzing Generalization via Controlled Variance in Adversarial Datasets
Authors:
Ohad Rozen,
Vered Shwartz,
Roee Aharoni,
Ido Dagan
Abstract:
Phenomenon-specific "adversarial" datasets have been recently designed to perform targeted stress-tests for particular inference types. Recent work (Liu et al., 2019a) proposed that such datasets can be utilized for training NLI and other types of models, often allowing them to learn the phenomenon in focus and improve on the challenge dataset, indicating a "blind spot" in the original training data. Yet, although a model can improve in such a training process, it might still be vulnerable to other challenge datasets targeting the same phenomenon but drawn from a different distribution, such as having a different syntactic complexity level. In this work, we extend this method to draw conclusions about a model's ability to learn and generalize a target phenomenon rather than to "learn" a dataset, by controlling additional aspects in the adversarial datasets. We demonstrate our approach on two inference phenomena - dative alternation and numerical reasoning, elaborating, and in some cases contradicting, the results of Liu et al. Our methodology enables building better challenge datasets for creating more robust models, and may yield better model understanding and subsequent overarching improvements.
Submitted 21 October, 2019;
originally announced October 2019.
-
Rainbow independent sets in certain classes of graphs
Authors:
Ron Aharoni,
Joseph Briggs,
Jinha Kim,
Minki Kim
Abstract:
For a given class $\mathcal{C}$ of graphs and given integers $m \leq n$, let $f_\mathcal{C}(n,m)$ be the minimal number $k$ such that every $k$ independent $n$-sets in any graph belonging to $\mathcal{C}$ have a (possibly partial) rainbow independent $m$-set. Motivated by known results on the finiteness and actual value of $f_\mathcal{C}(n,m)$ when $\mathcal{C}$ is the class of line graphs of graphs, we study this function for various other classes.
Submitted 28 September, 2019;
originally announced September 2019.
-
The Missing Ingredient in Zero-Shot Neural Machine Translation
Authors:
Naveen Arivazhagan,
Ankur Bapna,
Orhan Firat,
Roee Aharoni,
Melvin Johnson,
Wolfgang Macherey
Abstract:
Multilingual Neural Machine Translation (NMT) models are capable of translating between multiple source and target languages. Despite various approaches to train such models, they have difficulty with zero-shot translation: translating between language pairs that were not seen together during training. In this paper we first diagnose why state-of-the-art multilingual NMT models that rely purely on parameter sharing fail to generalize to unseen language pairs. We then propose auxiliary losses on the NMT encoder that impose representational invariance across languages. Our simple approach vastly improves zero-shot translation quality without regressing on supervised directions. For the first time, on WMT14 English-French/German, we achieve zero-shot performance that is on par with pivoting. We also demonstrate the easy scalability of our approach to multiple languages on the IWSLT 2017 shared task.
Submitted 17 March, 2019;
originally announced March 2019.
-
Filling Gender & Number Gaps in Neural Machine Translation with Black-box Context Injection
Authors:
Amit Moryossef,
Roee Aharoni,
Yoav Goldberg
Abstract:
When translating from a language that does not morphologically mark information such as gender and number into a language that does, translation systems must "guess" this missing information, often leading to incorrect translations in the given context. We propose a black-box approach for injecting the missing information into a pre-trained neural machine translation system, allowing control over the morphological variations in the generated translations without changing the underlying model or training data. We evaluate our method on an English to Hebrew translation task, and show that it is effective in injecting the gender and number information and that supplying the correct information improves the translation accuracy by up to 2.3 BLEU on a female-speaker test set for a state-of-the-art online black-box system. Finally, we perform a fine-grained syntactic analysis of the generated translations that shows the effectiveness of our method.
Submitted 8 March, 2019;
originally announced March 2019.
-
Massively Multilingual Neural Machine Translation
Authors:
Roee Aharoni,
Melvin Johnson,
Orhan Firat
Abstract:
Multilingual neural machine translation (NMT) enables training a single model that supports translation from multiple source languages into multiple target languages. In this paper, we push the limits of multilingual NMT in terms of number of languages being used. We perform extensive experiments in training massively multilingual NMT models, translating up to 102 languages to and from English within a single model. We explore different setups for training such models and analyze the trade-offs between translation quality and various modeling decisions. We report results on the publicly available TED talks multilingual corpus where we show that massively multilingual many-to-many models are effective in low resource settings, outperforming the previous state-of-the-art while supporting up to 59 languages. Our experiments on a large-scale dataset with 102 languages to and from English and up to one million examples per direction also show promising results, surpassing strong bilingual baselines and encouraging future work on massively multilingual NMT.
Submitted 2 July, 2019; v1 submitted 28 February, 2019;
originally announced March 2019.
-
A rainbow version of Mantel's Theorem
Authors:
Ron Aharoni,
Matt DeVos,
Sebastián González Hermosillo de la Maza,
Amanda Montejano,
Robert Šámal
Abstract:
Mantel's Theorem asserts that a simple $n$ vertex graph with more than $\frac{1}{4}n^2$ edges has a triangle (three mutually adjacent vertices). Here we consider a rainbow variant of this problem. We prove that whenever $G_1, G_2, G_3$ are simple graphs on a common set of $n$ vertices and $|E(G_i)| > ( \frac{ 26 - 2 \sqrt{7} }{81})n^2 \approx 0.2557 n^2$ for $1 \le i \le 3$, then there exist distinct vertices $v_1,v_2,v_3$ so that (working with the indices modulo 3) we have $v_i v_{i+1} \in E(G_i)$ for $1 \le i \le 3$. We provide an example to show this bound is best possible. This also answers a question of Diwan and Mubayi. We include a new short proof of Mantel's Theorem we obtained as a byproduct.
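The classical statement quoted at the start of this abstract (more than $\frac{1}{4}n^2$ edges forces a triangle) can be checked exhaustively for very small $n$. The following is only an illustrative brute-force sketch, not taken from the paper; the function names `has_triangle` and `mantel_holds` are hypothetical, and it does not address the rainbow variant that the paper actually proves.

```python
from itertools import combinations

def has_triangle(n, edges):
    """Return True if the graph on vertices 0..n-1 with the given edges has a triangle."""
    edge_set = set(edges)
    # combinations yields a < b < c, so each pair below is already a sorted tuple.
    return any(
        {(a, b), (a, c), (b, c)} <= edge_set
        for a, b, c in combinations(range(n), 3)
    )

def mantel_holds(n):
    """Brute-force check: every simple n-vertex graph with more than n^2/4 edges has a triangle."""
    all_edges = list(combinations(range(n), 2))
    for k in range(len(all_edges) + 1):
        if 4 * k <= n * n:
            continue  # only graphs with strictly more than n^2/4 edges are relevant
        for edges in combinations(all_edges, k):
            if not has_triangle(n, edges):
                return False
    return True

# The complete bipartite graph K_{2,2} has exactly n^2/4 = 4 edges and no triangle,
# so the threshold cannot be lowered for n = 4.
k22 = [(0, 2), (0, 3), (1, 2), (1, 3)]
print(has_triangle(4, k22))
print(all(mantel_holds(n) for n in range(1, 6)))
```

The enumeration grows exponentially in the number of vertex pairs, so this is only practical for a handful of vertices.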
Submitted 25 February, 2020; v1 submitted 31 December, 2018;
originally announced December 2018.
-
Cooperative colorings of trees and of bipartite graphs
Authors:
Ron Aharoni,
Eli Berger,
Maria Chudnovsky,
Frédéric Havet,
Zilin Jiang
Abstract:
Given a system $(G_1, \ldots ,G_m)$ of graphs on the same vertex set $V$, a cooperative coloring is a choice of vertex sets $I_1, \ldots ,I_m$, such that $I_j$ is independent in $G_j$ and $\bigcup_{j=1}^{m}I_j = V$. For a class $\mathcal{G}$ of graphs, let $m_{\mathcal{G}}(d)$ be the minimal $m$ such that every $m$ graphs from $\mathcal{G}$ with maximum degree $d$ have a cooperative coloring. We prove that $\Omega(\log\log d) \le m_\mathcal{T}(d) \le O(\log d)$ and $\Omega(\log d)\le m_\mathcal{B}(d) \le O(d/\log d)$, where $\mathcal{T}$ is the class of trees and $\mathcal{B}$ is the class of bipartite graphs.
Submitted 23 January, 2020; v1 submitted 16 June, 2018;
originally announced June 2018.
-
Rainbow fractional matchings
Authors:
Ron Aharoni,
Ron Holzman,
Zilin Jiang
Abstract:
We prove that any family $E_1, \ldots , E_{\lceil rn \rceil}$ of (not necessarily distinct) sets of edges in an $r$-uniform hypergraph, each having a fractional matching of size $n$, has a rainbow fractional matching of size $n$ (that is, a set of edges from distinct $E_i$'s which supports such a fractional matching). When the hypergraph is $r$-partite and $n$ is an integer, the number of sets needed goes down from $rn$ to $rn-r+1$. The problem solved here is a fractional version of the corresponding problem about rainbow matchings, which was solved by Drisko and by Aharoni and Berger in the case of bipartite graphs, but is open for general graphs as well as for $r$-partite hypergraphs with $r>2$. Our topological proof is based on a result of Kalai and Meshulam about a simplicial complex and a matroid on the same vertex set.
Submitted 6 May, 2019; v1 submitted 24 May, 2018;
originally announced May 2018.
-
Split and Rephrase: Better Evaluation and a Stronger Baseline
Authors:
Roee Aharoni,
Yoav Goldberg
Abstract:
Splitting and rephrasing a complex sentence into several shorter sentences that convey the same meaning is a challenging problem in NLP. We show that while vanilla seq2seq models can reach high scores on the proposed benchmark (Narayan et al., 2017), they suffer from memorization of the training set, which contains more than 89% of the unique simple sentences from the validation and test sets. To remedy this, we present a new train-development-test data split and neural models augmented with a copy-mechanism, outperforming the best reported baseline by 8.68 BLEU and fostering further progress on the task.
Submitted 2 May, 2018;
originally announced May 2018.
-
Rainbow triangles and the Caccetta-Häggkvist conjecture
Authors:
Ron Aharoni,
Ron Holzman,
Matthew DeVos
Abstract:
A famous conjecture of Caccetta and Häggkvist is that in a digraph on $n$ vertices and minimum out-degree at least $\frac{n}{r}$ there is a directed cycle of length $r$ or less. We consider the following generalization: in an undirected graph on $n$ vertices, any collection of $n$ disjoint sets of edges, each of size at least $\frac{n}{r}$, has a rainbow cycle of length $r$ or less. We focus on the case $r=3$, and prove the existence of a rainbow triangle under somewhat stronger conditions than in the conjecture. For any fixed $k$ and large enough $n$, we determine the maximum number of edges in an $n$-vertex edge-coloured graph where all colour classes have size at most $k$ and there is no rainbow triangle. Moreover, we characterize the extremal graphs for this problem.
Submitted 4 April, 2018;
originally announced April 2018.
-
Weighted domination of independent sets
Authors:
Ron Aharoni,
Irina Gorelik
Abstract:
The {\em independent domination number} $\gamma^i(G)$ of a graph $G$ is the maximum, over all independent sets $I$, of the minimal number of vertices needed to dominate $I$. It is known \cite{abz} that in chordal graphs $\gamma^i$ is equal to $\gamma$, the ordinary domination number. The weighted version of this result is not true, but we show that it does hold for interval graphs, and for the intersection (that is, line) graphs of subtrees of a given tree, where each subtree is a single edge.
Submitted 28 September, 2017;
originally announced September 2017.
-
Ramsey-nice families of graphs
Authors:
Ron Aharoni,
Noga Alon,
Michal Amir,
Penny Haxell,
Dan Hefetz,
Zilin Jiang,
Gal Kronenberg,
Alon Naor
Abstract:
For a finite family $\mathcal{F}$ of fixed graphs let $R_k(\mathcal{F})$ be the smallest integer $n$ for which every $k$-coloring of the edges of the complete graph $K_n$ yields a monochromatic copy of some $F\in\mathcal{F}$. We say that $\mathcal{F}$ is $k$-nice if for every graph $G$ with $\chi(G)=R_k(\mathcal{F})$ and for every $k$-coloring of $E(G)$ there exists a monochromatic copy of some $F\in\mathcal{F}$. It is easy to see that if $\mathcal{F}$ contains no forest, then it is not $k$-nice for any $k$. It seems plausible to conjecture that a (weak) converse holds, namely, for any finite family of graphs $\mathcal{F}$ that contains at least one forest, and for all $k\geq k_0(\mathcal{F})$ (or at least for infinitely many values of $k$), $\mathcal{F}$ is $k$-nice. We prove several (modest) results in support of this conjecture, showing, in particular, that it holds for each of the three families consisting of two connected graphs with 3 edges each and observing that it holds for any family $\mathcal{F}$ containing a forest with at most 2 edges. We also study some related problems and disprove a conjecture by Aharoni, Charbit and Howard regarding the size of matchings in regular 3-partite 3-uniform hypergraphs.
Submitted 16 April, 2018; v1 submitted 24 August, 2017;
originally announced August 2017.
-
Finding a best approximation pair of points for two polyhedra
Authors:
Ron Aharoni,
Yair Censor,
Zilin Jiang
Abstract:
Given two disjoint convex polyhedra, we look for a best approximation pair relative to them, i.e., a pair of points, one in each polyhedron, attaining the minimum distance between the sets. Cheney and Goldstein showed that alternating projections onto the two sets, starting from an arbitrary point, generate a sequence whose two interlaced subsequences converge to a best approximation pair. We propose a process based on projections onto the half-spaces defining the two polyhedra, which are more negotiable than projections on the polyhedra themselves. A central component in the proposed process is the Halpern--Lions--Wittmann--Bauschke algorithm for approaching the projection of a given point onto a convex set.
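The classical alternating-projection scheme attributed above to Cheney and Goldstein can be sketched for the simplest kind of polyhedra, axis-aligned boxes, where the projection is a coordinate-wise clip. This is a minimal illustration under that assumption only; it is not the half-space-based process the paper proposes, and the names `project_box` and `alternating_projections` are hypothetical.

```python
def project_box(point, lower, upper):
    """Euclidean projection of a point onto the axis-aligned box [lower, upper]."""
    return tuple(min(max(x, lo), hi) for x, lo, hi in zip(point, lower, upper))

def alternating_projections(start, box_a, box_b, max_iters=100, tol=1e-12):
    """Alternate projections onto two boxes; return the final pair (a, b).

    box_a and box_b are (lower, upper) tuples of coordinate tuples.
    """
    a = project_box(start, *box_a)
    b = project_box(a, *box_b)
    for _ in range(max_iters):
        a_next = project_box(b, *box_a)
        b_next = project_box(a_next, *box_b)
        # Largest coordinate change across both points in this sweep.
        moved = max(abs(x - y) for x, y in zip(a + b, a_next + b_next))
        a, b = a_next, b_next
        if moved <= tol:
            break
    return a, b

# Two disjoint boxes in the plane and an arbitrary starting point.
box_a = ((0.0, 0.0), (1.0, 1.0))
box_b = ((3.0, 0.5), (4.0, 2.0))
a, b = alternating_projections((5.0, 5.0), box_a, box_b)
print(a, b)  # the pair (1.0, 1.0) and (3.0, 1.0), at distance 2
```

For boxes the projection is available in closed form; for general polyhedra it is not, which is the difficulty that motivates working with the defining half-spaces instead.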
Submitted 22 June, 2018; v1 submitted 30 July, 2017;
originally announced July 2017.
-
Towards String-to-Tree Neural Machine Translation
Authors:
Roee Aharoni,
Yoav Goldberg
Abstract:
We present a simple method to incorporate syntactic information about the target language in a neural machine translation system by translating into linearized, lexicalized constituency trees. An experiment on the WMT16 German-English news translation task resulted in an improved BLEU score when compared to a syntax-agnostic NMT baseline trained on the same dataset. An analysis of the translations from the syntax-aware system shows that it performs more reordering during translation in comparison to the baseline. A small-scale human evaluation also showed an advantage to the syntax-aware system.
Submitted 6 May, 2017; v1 submitted 16 April, 2017;
originally announced April 2017.