-
InspectorRAGet: An Introspection Platform for RAG Evaluation
Authors:
Kshitij Fadnis,
Siva Sankalp Patel,
Odellia Boni,
Yannis Katsis,
Sara Rosenthal,
Benjamin Sznajder,
Marina Danilevsky
Abstract:
Large Language Models (LLMs) have become a popular approach for implementing Retrieval Augmented Generation (RAG) systems, and a significant amount of effort has been spent on building good models and metrics. In spite of increased recognition of the need for rigorous evaluation of RAG systems, few tools exist that go beyond the creation of model output and automatic calculation. We present InspectorRAGet, an introspection platform for RAG evaluation. InspectorRAGet allows the user to analyze aggregate and instance-level performance of RAG systems, using both human and algorithmic metrics as well as annotator quality. InspectorRAGet is suitable for multiple use cases and is available publicly to the community. The demo video is available at https://youtu.be/MJhe8QIXcEc
Submitted 26 April, 2024;
originally announced April 2024.
-
Label-Efficient Model Selection for Text Generation
Authors:
Shir Ashury-Tahan,
Ariel Gera,
Benjamin Sznajder,
Leshem Choshen,
Liat Ein-Dor,
Eyal Shnarch
Abstract:
Model selection for a given target task can be costly, as it may entail extensive annotation of the quality of outputs of different models. We introduce DiffUse, an efficient method to make an informed decision between candidate text generation models based on preference annotations. DiffUse reduces the required amount of annotations, thus saving valuable time and resources in performing evaluation. DiffUse intelligently selects instances by clustering embeddings that represent the semantic differences between model outputs. Thus, it is able to identify a subset of examples that are more informative for preference decisions. Our method is model-agnostic, and can be applied to any text generation model for selecting between models, prompts and configurations. Moreover, we propose a practical iterative approach for dynamically determining how many instances to annotate. In a series of experiments over hundreds of model pairs, we demonstrate that DiffUse can dramatically reduce the required number of annotations -- by up to 75% -- while maintaining high evaluation reliability.
Submitted 6 June, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
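The instance-selection idea in the abstract above (cluster vectors that represent the difference between two models' outputs, then annotate one example per cluster) can be illustrated with a minimal sketch. This is not the authors' implementation: the plain k-means routine, the element-wise subtraction used to form difference vectors, the nearest-to-centroid selection rule, and every function name below are assumptions made for illustration only.

```python
import random


def difference_vectors(emb_a, emb_b):
    """Element-wise difference between paired output embeddings (assumed representation)."""
    return [[a - b for a, b in zip(va, vb)] for va, vb in zip(emb_a, emb_b)]


def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def nearest_centroid(point, centroids):
    return min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means with seeded random initialization (a stand-in clustering method)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        assignment = [nearest_centroid(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment against the final centroids.
    assignment = [nearest_centroid(p, centroids) for p in points]
    return centroids, assignment


def select_representatives(emb_a, emb_b, k, seed=0):
    """Return one instance index per cluster: the member closest to its centroid."""
    diffs = difference_vectors(emb_a, emb_b)
    centroids, assignment = kmeans(diffs, k, seed=seed)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            chosen.append(min(members, key=lambda i: squared_distance(diffs[i], centroids[c])))
    return sorted(chosen)


if __name__ == "__main__":
    # Toy 2-D "embeddings": the two models agree on the first three
    # instances and disagree strongly on the last three.
    emb_a = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
    emb_b = [[0.0, 0.0] for _ in emb_a]
    print(select_representatives(emb_a, emb_b, k=2))
```

In this toy setup, only the selected indices would be sent to annotators for preference labels. How the embeddings are produced and how many clusters to use are left open here.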
-
The Benefits of Bad Advice: Autocontrastive Decoding across Model Layers
Authors:
Ariel Gera,
Roni Friedman,
Ofir Arviv,
Chulaka Gunasekara,
Benjamin Sznajder,
Noam Slonim,
Eyal Shnarch
Abstract:
Applying language models to natural language processing tasks typically relies on the representations in the final model layer, as intermediate hidden layer representations are presumed to be less informative. In this work, we argue that due to the gradual improvement across model layers, additional information can be gleaned from the contrast between higher and lower layers during inference. Specifically, in choosing between the probable next token predictions of a generative model, the predictions of lower layers can be used to highlight which candidates are best avoided. We propose a novel approach that utilizes the contrast between layers to improve text generation outputs, and show that it mitigates degenerative behaviors of the model in open-ended generation, significantly improving the quality of generated texts. Furthermore, our results indicate that contrasting between model layers at inference time can yield substantial benefits to certain aspects of general language model capabilities, more effectively extracting knowledge during inference from a given set of model parameters.
Submitted 2 May, 2023;
originally announced May 2023.
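The idea in the abstract above, using a lower layer's predictions to flag next-token candidates that are best avoided, can be illustrated with a generic contrastive scoring rule. This sketch does not reproduce the paper's formulation: the log-ratio score, the plausibility threshold `alpha`, and the function name `contrastive_pick` are assumptions borrowed from the general contrastive-decoding pattern, and the probability tables are invented toy values.

```python
import math


def contrastive_pick(p_upper, p_lower, alpha=0.1):
    """Pick the next token by contrasting upper-layer and lower-layer probabilities.

    p_upper / p_lower: dicts mapping token -> probability (toy inputs).
    alpha: hypothetical plausibility cutoff, relative to the best upper-layer token.
    """
    # Only consider tokens the upper (final) layer already finds plausible.
    threshold = alpha * max(p_upper.values())
    candidates = [tok for tok, p in p_upper.items() if p >= threshold]

    def score(tok):
        # Reward tokens the upper layer prefers more than the lower layer does.
        lower = max(p_lower.get(tok, 0.0), 1e-12)  # guard against log(0)
        return math.log(p_upper[tok]) - math.log(lower)

    return max(candidates, key=score)


if __name__ == "__main__":
    p_upper = {"the": 0.40, "cat": 0.35, "a": 0.24, "zzz": 0.01}
    p_lower = {"the": 0.60, "cat": 0.10, "a": 0.2999, "zzz": 0.0001}
    print(contrastive_pick(p_upper, p_lower))
```

The plausibility cutoff matters in this sketch: without it, a token that both layers consider very unlikely can still receive a large log-ratio and be selected.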
-
Heuristic-based Inter-training to Improve Few-shot Multi-perspective Dialog Summarization
Authors:
Benjamin Sznajder,
Chulaka Gunasekara,
Guy Lev,
Sachin Joshi,
Eyal Shnarch,
Noam Slonim
Abstract:
Many organizations require their customer-care agents to manually summarize their conversations with customers. These summaries are vital for the organizations' decision making. The perspective required of a summary depends on how the summary will be used. In this work, we study the multi-perspective summarization of customer-care conversations between support agents and customers. We observe that different heuristics are associated with summaries of different perspectives, and explore these heuristics to create weakly labeled data for intermediate training of the models before fine-tuning with scarce human-annotated summaries. Most importantly, we show that our approach enables models to generate multi-perspective summaries with a very small amount of annotated data. For example, our approach achieves 94% of the performance (Rouge-2) of a model trained with the original data, by training with only 7% of the original data.
Submitted 30 March, 2022; v1 submitted 29 March, 2022;
originally announced March 2022.
-
TWEETSUMM -- A Dialog Summarization Dataset for Customer Service
Authors:
Guy Feigenblat,
Chulaka Gunasekara,
Benjamin Sznajder,
Sachindra Joshi,
David Konopnicki,
Ranit Aharonov
Abstract:
In a typical customer service chat scenario, customers contact a support center to ask for help or raise complaints, and human agents try to solve the issues. In most cases, at the end of the conversation, agents are asked to write a short summary emphasizing the problem and the proposed solution, usually for the benefit of other agents that may have to deal with the same customer or issue. The goal of the present article is advancing the automation of this task. We introduce the first large scale, high quality, customer care dialog summarization dataset with close to 6500 human annotated summaries. The data is based on real-world customer support dialogs and includes both extractive and abstractive summaries. We also introduce a new unsupervised, extractive summarization method specific to dialogs.
Submitted 23 November, 2021;
originally announced November 2021.
-
HowSumm: A Multi-Document Summarization Dataset Derived from WikiHow Articles
Authors:
Odellia Boni,
Guy Feigenblat,
Guy Lev,
Michal Shmueli-Scheuer,
Benjamin Sznajder,
David Konopnicki
Abstract:
We present HowSumm, a novel large-scale dataset for the task of query-focused multi-document summarization (qMDS), which targets the use-case of generating actionable instructions from a set of sources. This use-case is different from the use-cases covered in existing multi-document summarization (MDS) datasets and is applicable to educational and industrial scenarios. We employed automatic methods, and leveraged statistics from existing human-crafted qMDS datasets, to create HowSumm from wikiHow website articles and the sources they cite. We describe the creation of the dataset and discuss the unique features that distinguish it from other summarization corpora. Automatic and human evaluations of both extractive and abstractive summarization models on the dataset reveal that there is room for improvement.
Submitted 8 October, 2021; v1 submitted 7 October, 2021;
originally announced October 2021.
-
Summary Grounded Conversation Generation
Authors:
Chulaka Gunasekara,
Guy Feigenblat,
Benjamin Sznajder,
Sachindra Joshi,
David Konopnicki
Abstract:
Many conversation datasets have been constructed in recent years using crowdsourcing. However, the data collection process can be time-consuming and presents many challenges to ensuring data quality. Since language generation has improved immensely in recent years with the advancement of pre-trained language models, we investigate how such models can be utilized to generate entire conversations, given only a summary of a conversation as the input. We explore three approaches to generate summary grounded conversations, and evaluate the generated conversations using automatic measures and human judgements. We also show that the accuracy of conversation summarization can be improved by augmenting a conversation summarization dataset with generated conversations.
Submitted 7 June, 2021;
originally announced June 2021.
-
Financial Event Extraction Using Wikipedia-Based Weak Supervision
Authors:
Liat Ein-Dor,
Ariel Gera,
Orith Toledo-Ronen,
Alon Halfon,
Benjamin Sznajder,
Lena Dankin,
Yonatan Bilu,
Yoav Katz,
Noam Slonim
Abstract:
Extraction of financial and economic events from text has previously been done mostly using rule-based methods, with more recent works employing machine learning techniques. This work is in line with this latter approach, leveraging relevant Wikipedia sections to extract weak labels for sentences describing economic events. Whereas previous weakly supervised approaches required a knowledge-base of such events, or corresponding financial figures, our approach requires no such additional data, and can be employed to extract economic events related to companies which are not even mentioned in the training data.
Submitted 28 November, 2022; v1 submitted 25 November, 2019;
originally announced November 2019.
-
Corpus Wide Argument Mining -- a Working Solution
Authors:
Liat Ein-Dor,
Eyal Shnarch,
Lena Dankin,
Alon Halfon,
Benjamin Sznajder,
Ariel Gera,
Carlos Alzate,
Martin Gleize,
Leshem Choshen,
Yufang Hou,
Yonatan Bilu,
Ranit Aharonov,
Noam Slonim
Abstract:
One of the main tasks in argument mining is the retrieval of argumentative content pertaining to a given topic. Most previous work addressed this task by retrieving a relatively small number of relevant documents as the initial source for such content. This line of research yielded moderate success, which is of limited use in a real-world system. Furthermore, for such a system to yield a comprehensive set of relevant arguments, over a wide range of topics, it requires leveraging a large and diverse corpus in an appropriate manner. Here we present a first end-to-end high-precision, corpus-wide argument mining system. This is made possible by combining sentence-level queries over an appropriate indexing of a very large corpus of newspaper articles, with an iterative annotation scheme. This scheme addresses the inherent label bias in the data and pinpoints the regions of the sample space whose manual labeling is required to obtain high-precision among top-ranked candidates.
Submitted 25 November, 2019;
originally announced November 2019.
-
Argument Invention from First Principles
Authors:
Yonatan Bilu,
Ariel Gera,
Daniel Hershcovich,
Benjamin Sznajder,
Dan Lahav,
Guy Moshkowich,
Anael Malet,
Assaf Gavron,
Noam Slonim
Abstract:
Competitive debaters often find themselves facing a challenging task -- how to debate a topic they know very little about, with only minutes to prepare, and without access to books or the Internet? What they often do is rely on "first principles", commonplace arguments which are relevant to many topics, and which they have refined in past debates.
In this work we aim to explicitly define a taxonomy of such principled recurring arguments, and, given a controversial topic, to automatically identify which of these arguments are relevant to the topic.
As far as we know, this is the first time that this approach to argument invention is formalized and made explicit in the context of NLP.
The main goal of this work is to show that it is possible to define such a taxonomy. While the taxonomy suggested here should be thought of as a "first attempt", it is nonetheless coherent, covers the relevant topics well, coincides with what professional debaters actually argue in their speeches, and facilitates automatic argument invention for new topics.
Submitted 22 August, 2019;
originally announced August 2019.
-
Controversy in Context
Authors:
Benjamin Sznajder,
Ariel Gera,
Yonatan Bilu,
Dafna Sheinwald,
Ella Rabinovich,
Ranit Aharonov,
David Konopnicki,
Noam Slonim
Abstract:
With the growing interest in social applications of Natural Language Processing and Computational Argumentation, a natural question is how controversial a given concept is. Prior works relied on Wikipedia's metadata and on content analysis of the articles pertaining to a concept in question. Here we show that the immediate textual context of a concept is strongly indicative of this property, and, using simple and language-independent machine-learning tools, we leverage this observation to achieve state-of-the-art results in controversiality prediction. In addition, we analyze and make available a new dataset of concepts labeled for controversiality. It is significantly larger than existing datasets, and grades concepts on a 0-10 scale, rather than treating controversiality as a binary label.
Submitted 20 August, 2019;
originally announced August 2019.
-
Fast End-to-End Wikification
Authors:
Ilya Shnayderman,
Liat Ein-Dor,
Yosi Mass,
Alon Halfon,
Benjamin Sznajder,
Artem Spector,
Yoav Katz,
Dafna Sheinwald,
Ranit Aharonov,
Noam Slonim
Abstract:
Wikification of large corpora is beneficial for various NLP applications. Existing methods focus on quality performance rather than run-time, and are therefore infeasible for large data. Here, we introduce RedW, a run-time oriented Wikification solution, based on Wikipedia redirects, that can Wikify massive corpora with competitive performance. We further propose an efficient method for estimating RedW confidence, opening the door for applying more demanding methods only on top of RedW's lower-confidence results. Our experimental results support the validity of the proposed approach.
Submitted 19 August, 2019;
originally announced August 2019.
-
Learning Concept Abstractness Using Weak Supervision
Authors:
Ella Rabinovich,
Benjamin Sznajder,
Artem Spector,
Ilya Shnayderman,
Ranit Aharonov,
David Konopnicki,
Noam Slonim
Abstract:
We introduce a weakly supervised approach for inferring the property of abstractness of words and expressions in the complete absence of labeled data. Exploiting only minimal linguistic clues and the contextual usage of a concept as manifested in textual data, we train sufficiently powerful classifiers, obtaining high correlation with human labels. The results imply the applicability of this approach to additional properties of concepts, additional languages, and resource-scarce scenarios.
Submitted 4 September, 2018;
originally announced September 2018.