-
Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI
Authors:
Elron Bandel,
Yotam Perlitz,
Elad Venezian,
Roni Friedman-Melamed,
Ofir Arviv,
Matan Orbach,
Shachar Don-Yehyia,
Dafna Sheinwald,
Ariel Gera,
Leshem Choshen,
Michal Shmueli-Scheuer,
Yoav Katz
Abstract:
In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution. Addressing this need, we p…
▽ More
In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution. Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt-Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively. Join the Unitxt community at https://github.com/IBM/unitxt!
△ Less
Submitted 25 January, 2024;
originally announced January 2024.
-
Knowledge is a Region in Weight Space for Fine-tuned Language Models
Authors:
Almog Gueta,
Elad Venezian,
Colin Raffel,
Noam Slonim,
Yoav Katz,
Leshem Choshen
Abstract:
Research on neural networks has focused on understanding a single model trained on a single dataset. However, relatively little is known about the relationships between different models, particularly those trained or tested on different datasets. We address this by studying how the weight space and the underlying loss landscape of different models are interconnected.
Specifically, we demonstrate…
▽ More
Research on neural networks has focused on understanding a single model trained on a single dataset. However, relatively little is known about the relationships between different models, particularly those trained or tested on different datasets. We address this by studying how the weight space and the underlying loss landscape of different models are interconnected.
Specifically, we demonstrate that finetuned models that were optimized for high performance, reside in well-defined regions in weight space, and vice versa -- that any model that resides anywhere in those regions also exhibits high performance. Notably, we show that language models that have been finetuned on the same dataset form a tight cluster in the weight space, while models finetuned on different datasets from the same underlying task form a looser cluster. Moreover, traversing around the region between the models leads to new models that perform comparably or even better than models obtained via finetuning, even on tasks that the original models were not finetuned on.
Our findings provide insight into the relationships between models, demonstrating that a model positioned between two similar models can acquire the knowledge of both. We leverage this and design a method for selecting a better model for efficient finetuning. Specifically, we show that starting from the center of the region is as effective, if not more, than using the pretrained model in 11 out of 12 datasets, resulting in an average accuracy improvement of 3.06.
△ Less
Submitted 12 October, 2023; v1 submitted 9 February, 2023;
originally announced February 2023.
-
SimpleStyle: An Adaptable Style Transfer Approach
Authors:
Elron Bandel,
Yoav Katz,
Noam Slonim,
Liat Ein-Dor
Abstract:
Attribute-controlled text rewriting, also known as text style-transfer, has a crucial role in regulating attributes and biases of textual training data and a machine generated text. In this work we present SimpleStyle, a minimalist yet effective approach for style-transfer composed of two simple ingredients: controlled denoising and output filtering. Despite the simplicity of our approach, which c…
▽ More
Attribute-controlled text rewriting, also known as text style-transfer, has a crucial role in regulating attributes and biases of textual training data and a machine generated text. In this work we present SimpleStyle, a minimalist yet effective approach for style-transfer composed of two simple ingredients: controlled denoising and output filtering. Despite the simplicity of our approach, which can be succinctly described with a few lines of code, it is competitive with previous state-of-the-art methods both in automatic and in human evaluation. To demonstrate the adaptability and practical value of our system beyond academic data, we apply SimpleStyle to transfer a wide range of text attributes appearing in real-world textual data from social networks. Additionally, we introduce a novel "soft noising" technique that further improves the performance of our system. We also show that teaching a student model to generate the output of SimpleStyle can result in a system that performs style transfer of equivalent quality with only a single greedy-decoded sample. Finally, we suggest our method as a remedy for the fundamental incompatible baseline issue that holds progress in the field. We offer our protocol as a simple yet strong baseline for works that wish to make incremental advancements in the field of attribute controlled text rewriting.
△ Less
Submitted 22 December, 2022; v1 submitted 20 December, 2022;
originally announced December 2022.
-
ColD Fusion: Collaborative Descent for Distributed Multitask Finetuning
Authors:
Shachar Don-Yehiya,
Elad Venezian,
Colin Raffel,
Noam Slonim,
Yoav Katz,
Leshem Choshen
Abstract:
We propose a new paradigm to continually evolve pretrained models, denoted ColD Fusion. It provides the benefits of multitask learning but leverages distributed computation with limited communication and eliminates the need for shared data. Consequentially, ColD Fusion can give rise to a synergistic loop, where finetuned models can be recycled to continually improve the pretrained model they are b…
▽ More
We propose a new paradigm to continually evolve pretrained models, denoted ColD Fusion. It provides the benefits of multitask learning but leverages distributed computation with limited communication and eliminates the need for shared data. Consequentially, ColD Fusion can give rise to a synergistic loop, where finetuned models can be recycled to continually improve the pretrained model they are based upon. We show that ColD Fusion yields comparable benefits to multitask training by producing a model that (a) attains strong performance on all of the datasets it was trained on; and (b) is a better starting point for finetuning on unseen datasets. We show that ColD Fusion outperforms RoBERTa and even previous multitask models. Specifically, when training and testing on 35 diverse datasets, ColD Fusion-based model outperforms RoBERTa by 2.33 points on average without any changes to the architecture.
△ Less
Submitted 13 September, 2023; v1 submitted 2 December, 2022;
originally announced December 2022.
-
Where to start? Analyzing the potential value of intermediate models
Authors:
Leshem Choshen,
Elad Venezian,
Shachar Don-Yehia,
Noam Slonim,
Yoav Katz
Abstract:
Previous studies observed that finetuned models may be better base models than the vanilla pretrained model. Such a model, finetuned on some source dataset, may provide a better starting point for a new finetuning process on a desired target dataset. Here, we perform a systematic analysis of this intertraining scheme, over a wide range of English classification tasks. Surprisingly, our analysis su…
▽ More
Previous studies observed that finetuned models may be better base models than the vanilla pretrained model. Such a model, finetuned on some source dataset, may provide a better starting point for a new finetuning process on a desired target dataset. Here, we perform a systematic analysis of this intertraining scheme, over a wide range of English classification tasks. Surprisingly, our analysis suggests that the potential intertraining gain can be analyzed independently for the target dataset under consideration, and for a base model being considered as a starting point. This is in contrast to current perception that the alignment between the target dataset and the source dataset used to generate the base model is a major factor in determining intertraining success. We analyze different aspects that contribute to each. Furthermore, we leverage our analysis to propose a practical and efficient approach to determine if and how to select a base model in real-world settings. Last, we release an updating ranking of best models in the HuggingFace hub per architecture https://ibm.github.io/model-recycling/.
△ Less
Submitted 10 November, 2022; v1 submitted 31 October, 2022;
originally announced November 2022.
-
Label Sleuth: From Unlabeled Text to a Classifier in a Few Hours
Authors:
Eyal Shnarch,
Alon Halfon,
Ariel Gera,
Marina Danilevsky,
Yannis Katsis,
Leshem Choshen,
Martin Santillan Cooper,
Dina Epelboim,
Zheng Zhang,
Dakuo Wang,
Lucy Yip,
Liat Ein-Dor,
Lena Dankin,
Ilya Shnayderman,
Ranit Aharonov,
Yunyao Li,
Naftali Liberman,
Philip Levin Slesarev,
Gwilym Newton,
Shila Ofek-Koifman,
Noam Slonim,
Yoav Katz
Abstract:
Text classification can be useful in many real-world scenarios, saving a lot of time for end users. However, building a custom classifier typically requires coding skills and ML knowledge, which poses a significant barrier for many potential users. To lift this barrier, we introduce Label Sleuth, a free open source system for labeling and creating text classifiers. This system is unique for (a) be…
▽ More
Text classification can be useful in many real-world scenarios, saving a lot of time for end users. However, building a custom classifier typically requires coding skills and ML knowledge, which poses a significant barrier for many potential users. To lift this barrier, we introduce Label Sleuth, a free open source system for labeling and creating text classifiers. This system is unique for (a) being a no-code system, making NLP accessible to non-experts, (b) guiding users through the entire labeling process until they obtain a custom classifier, making the process efficient -- from cold start to classifier in a few hours, and (c) being open for configuration and extension by developers. By open sourcing Label Sleuth we hope to build a community of users and developers that will broaden the utilization of NLP models.
△ Less
Submitted 31 October, 2022; v1 submitted 2 August, 2022;
originally announced August 2022.
-
VIRATrustData: A Trust-Annotated Corpus of Human-Chatbot Conversations About COVID-19 Vaccines
Authors:
Roni Friedman,
João Sedoc,
Shai Gretz,
Assaf Toledo,
Rose Weeks,
Naor Bar-Zeev,
Yoav Katz,
Noam Slonim
Abstract:
Public trust in medical information is crucial for successful application of public health policies such as vaccine uptake. This is especially true when the information is offered remotely, by chatbots, which have become increasingly popular in recent years. Here, we explore the challenging task of human-bot turn-level trust classification. We rely on a recently released data of observationally-co…
▽ More
Public trust in medical information is crucial for successful application of public health policies such as vaccine uptake. This is especially true when the information is offered remotely, by chatbots, which have become increasingly popular in recent years. Here, we explore the challenging task of human-bot turn-level trust classification. We rely on a recently released data of observationally-collected (rather than crowdsourced) dialogs with VIRA chatbot, a COVID-19 Vaccine Information Resource Assistant. These dialogs are centered around questions and concerns about COVID-19 vaccines, where trust is particularly acute. We annotated $3k$ VIRA system-user conversational turns for Low Institutional Trust or Low Agent Trust vs. Neutral or High Trust. We release the labeled dataset, VIRATrustData, the first of its kind to the best of our knowledge. We demonstrate how this task is non-trivial and compare several models that predict the different levels of trust.
△ Less
Submitted 24 May, 2022;
originally announced May 2022.
-
Benchmark Data and Evaluation Framework for Intent Discovery Around COVID-19 Vaccine Hesitancy
Authors:
Shai Gretz,
Assaf Toledo,
Roni Friedman,
Dan Lahav,
Rose Weeks,
Naor Bar-Zeev,
João Sedoc,
Pooja Sangha,
Yoav Katz,
Noam Slonim
Abstract:
The COVID-19 pandemic has made a huge global impact and cost millions of lives. As COVID-19 vaccines were rolled out, they were quickly met with widespread hesitancy. To address the concerns of hesitant people, we launched VIRA, a public dialogue system aimed at addressing questions and concerns surrounding the COVID-19 vaccines. Here, we release VIRADialogs, a dataset of over 8k dialogues conduct…
▽ More
The COVID-19 pandemic has made a huge global impact and cost millions of lives. As COVID-19 vaccines were rolled out, they were quickly met with widespread hesitancy. To address the concerns of hesitant people, we launched VIRA, a public dialogue system aimed at addressing questions and concerns surrounding the COVID-19 vaccines. Here, we release VIRADialogs, a dataset of over 8k dialogues conducted by actual users with VIRA, providing a unique real-world conversational dataset. In light of rapid changes in users' intents, due to updates in guidelines or in response to new information, we highlight the important task of intent discovery in this use-case. We introduce a novel automatic evaluation framework for intent discovery, leveraging the existing intent classifier of VIRA. We use this framework to report baseline intent discovery results over VIRADialogs, that highlight the difficulty of this task.
△ Less
Submitted 11 October, 2022; v1 submitted 24 May, 2022;
originally announced May 2022.
-
Multi-Domain Targeted Sentiment Analysis
Authors:
Orith Toledo-Ronen,
Matan Orbach,
Yoav Katz,
Noam Slonim
Abstract:
Targeted Sentiment Analysis (TSA) is a central task for generating insights from consumer reviews. Such content is extremely diverse, with sites like Amazon or Yelp containing reviews on products and businesses from many different domains. A real-world TSA system should gracefully handle that diversity. This can be achieved by a multi-domain model -- one that is robust to the domain of the analyze…
▽ More
Targeted Sentiment Analysis (TSA) is a central task for generating insights from consumer reviews. Such content is extremely diverse, with sites like Amazon or Yelp containing reviews on products and businesses from many different domains. A real-world TSA system should gracefully handle that diversity. This can be achieved by a multi-domain model -- one that is robust to the domain of the analyzed texts, and performs well on various domains. To address this scenario, we present a multi-domain TSA system based on augmenting a given training set with diverse weak labels from assorted domains. These are obtained through self-training on the Yelp reviews corpus. Extensive experiments with our approach on three evaluation datasets across different domains demonstrate the effectiveness of our solution. We further analyze how restrictions imposed on the available labeled data affect the performance, and compare the proposed method to the costly alternative of manually gathering diverse TSA labeled data. Our results and analysis show that our approach is a promising step towards a practical domain-robust TSA system.
△ Less
Submitted 8 May, 2022;
originally announced May 2022.
-
Fusing finetuned models for better pretraining
Authors:
Leshem Choshen,
Elad Venezian,
Noam Slonim,
Yoav Katz
Abstract:
Pretrained models are the standard starting point for training. This approach consistently outperforms the use of a random initialization. However, pretraining is a costly endeavour that few can undertake.
In this paper, we create better base models at hardly any cost, by fusing multiple existing fine tuned models into one. Specifically, we fuse by averaging the weights of these models. We show…
▽ More
Pretrained models are the standard starting point for training. This approach consistently outperforms the use of a random initialization. However, pretraining is a costly endeavour that few can undertake.
In this paper, we create better base models at hardly any cost, by fusing multiple existing fine tuned models into one. Specifically, we fuse by averaging the weights of these models. We show that the fused model results surpass the pretrained model ones. We also show that fusing is often better than intertraining.
We find that fusing is less dependent on the target task. Furthermore, weight decay nullifies intertraining effects but not those of fusing.
△ Less
Submitted 6 April, 2022;
originally announced April 2022.
-
Overview of the 2021 Key Point Analysis Shared Task
Authors:
Roni Friedman,
Lena Dankin,
Yufang Hou,
Ranit Aharonov,
Yoav Katz,
Noam Slonim
Abstract:
We describe the 2021 Key Point Analysis (KPA-2021) shared task on key point analysis that we organized as a part of the 8th Workshop on Argument Mining (ArgMining 2021) at EMNLP 2021. We outline various approaches and discuss the results of the shared task. We expect the task and the findings reported in this paper to be relevant for researchers working on text summarization and argument mining.
We describe the 2021 Key Point Analysis (KPA-2021) shared task on key point analysis that we organized as a part of the 8th Workshop on Argument Mining (ArgMining 2021) at EMNLP 2021. We outline various approaches and discuss the results of the shared task. We expect the task and the findings reported in this paper to be relevant for researchers working on text summarization and argument mining.
△ Less
Submitted 20 October, 2021;
originally announced October 2021.
-
Project Debater APIs: Decomposing the AI Grand Challenge
Authors:
Roy Bar-Haim,
Yoav Kantor,
Elad Venezian,
Yoav Katz,
Noam Slonim
Abstract:
Project Debater was revealed in 2019 as the first AI system that can debate human experts on complex topics. Engaging in a live debate requires a diverse set of skills, and Project Debater has been developed accordingly as a collection of components, each designed to perform a specific subtask. Project Debater APIs provide access to many of these capabilities, as well as to more recently developed…
▽ More
Project Debater was revealed in 2019 as the first AI system that can debate human experts on complex topics. Engaging in a live debate requires a diverse set of skills, and Project Debater has been developed accordingly as a collection of components, each designed to perform a specific subtask. Project Debater APIs provide access to many of these capabilities, as well as to more recently developed ones. This diverse set of web services, publicly available for academic use, includes core NLP services, argument mining and analysis capabilities, and higher-level services for content summarization. We describe these APIs and their performance, and demonstrate how they can be used for building practical solutions. In particular, we will focus on Key Point Analysis, a novel technology that identifies the main points and their prevalence in a collection of texts such as survey responses and user reviews.
△ Less
Submitted 3 October, 2021;
originally announced October 2021.
-
YASO: A Targeted Sentiment Analysis Evaluation Dataset for Open-Domain Reviews
Authors:
Matan Orbach,
Orith Toledo-Ronen,
Artem Spector,
Ranit Aharonov,
Yoav Katz,
Noam Slonim
Abstract:
Current TSA evaluation in a cross-domain setup is restricted to the small set of review domains available in existing datasets. Such an evaluation is limited, and may not reflect true performance on sites like Amazon or Yelp that host diverse reviews from many domains. To address this gap, we present YASO - a new TSA evaluation dataset of open-domain user reviews. YASO contains 2,215 English sente…
▽ More
Current TSA evaluation in a cross-domain setup is restricted to the small set of review domains available in existing datasets. Such an evaluation is limited, and may not reflect true performance on sites like Amazon or Yelp that host diverse reviews from many domains. To address this gap, we present YASO - a new TSA evaluation dataset of open-domain user reviews. YASO contains 2,215 English sentences from dozens of review domains, annotated with target terms and their sentiment. Our analysis verifies the reliability of these annotations, and explores the characteristics of the collected data. Benchmark results using five contemporary TSA systems show there is ample room for improvement on this challenging new dataset. YASO is available at https://github.com/IBM/yaso-tsa.
△ Less
Submitted 13 September, 2021; v1 submitted 28 December, 2020;
originally announced December 2020.
-
Financial Event Extraction Using Wikipedia-Based Weak Supervision
Authors:
Liat Ein-Dor,
Ariel Gera,
Orith Toledo-Ronen,
Alon Halfon,
Benjamin Sznajder,
Lena Dankin,
Yonatan Bilu,
Yoav Katz,
Noam Slonim
Abstract:
Extraction of financial and economic events from text has previously been done mostly using rule-based methods, with more recent works employing machine learning techniques. This work is in line with this latter approach, leveraging relevant Wikipedia sections to extract weak labels for sentences describing economic events. Whereas previous weakly supervised approaches required a knowledge-base of…
▽ More
Extraction of financial and economic events from text has previously been done mostly using rule-based methods, with more recent works employing machine learning techniques. This work is in line with this latter approach, leveraging relevant Wikipedia sections to extract weak labels for sentences describing economic events. Whereas previous weakly supervised approaches required a knowledge-base of such events, or corresponding financial figures, our approach requires no such additional data, and can be employed to extract economic events related to companies which are not even mentioned in the training data.
△ Less
Submitted 28 November, 2022; v1 submitted 25 November, 2019;
originally announced November 2019.
-
Fast End-to-End Wikification
Authors:
Ilya Shnayderman,
Liat Ein-Dor,
Yosi Mass,
Alon Halfon,
Benjamin Sznajder,
Artem Spector,
Yoav Katz,
Dafna Sheinwald,
Ranit Aharonov,
Noam Slonim
Abstract:
Wikification of large corpora is beneficial for various NLP applications. Existing methods focus on quality performance rather than run-time, and are therefore non-feasible for large data. Here, we introduce RedW, a run-time oriented Wikification solution, based on Wikipedia redirects, that can Wikify massive corpora with competitive performance. We further propose an efficient method for estimati…
▽ More
Wikification of large corpora is beneficial for various NLP applications. Existing methods focus on quality performance rather than run-time, and are therefore non-feasible for large data. Here, we introduce RedW, a run-time oriented Wikification solution, based on Wikipedia redirects, that can Wikify massive corpora with competitive performance. We further propose an efficient method for estimating RedW confidence, opening the door for applying more demanding methods only on top of RedW lower-confidence results. Our experimental results support the validity of the proposed approach.
△ Less
Submitted 19 August, 2019;
originally announced August 2019.
-
Learning to combine Grammatical Error Corrections
Authors:
Yoav Kantor,
Yoav Katz,
Leshem Choshen,
Edo Cohen-Karlik,
Naftali Liberman,
Assaf Toledo,
Amir Menczel,
Noam Slonim
Abstract:
The field of Grammatical Error Correction (GEC) has produced various systems to deal with focused phenomena or general text editing. We propose an automatic way to combine black-box systems. Our method automatically detects the strength of a system or the combination of several systems per error type, improving precision and recall while optimizing $F$ score directly. We show consistent improvemen…
▽ More
The field of Grammatical Error Correction (GEC) has produced various systems to deal with focused phenomena or general text editing. We propose an automatic way to combine black-box systems. Our method automatically detects the strength of a system or the combination of several systems per error type, improving precision and recall while optimizing $F$ score directly. We show consistent improvement over the best standalone system in all the configurations tested. This approach also outperforms average ensembling of different RNN models with random initializations.
In addition, we analyze the use of BERT for GEC - reporting promising results on this end. We also present a spellchecker created for this task which outperforms standard spellcheckers tested on the task of spellchecking.
This paper describes a system submission to Building Educational Applications 2019 Shared Task: Grammatical Error Correction.
Combining the output of top BEA 2019 shared task systems using our approach, currently holds the highest reported score in the open phase of the BEA 2019 shared task, improving F0.5 by 3.7 points over the best result reported.
△ Less
Submitted 10 June, 2019;
originally announced June 2019.