-
Statute-enhanced lexical retrieval of court cases for COLIEE 2022
Authors:
Tobias Fink,
Gabor Recski,
Wojciech Kusa,
Allan Hanbury
Abstract:
We discuss our experiments for COLIEE Task 1, a court case retrieval competition using cases from the Federal Court of Canada. During experiments on the training data we observe that passage level retrieval with rank fusion outperforms document level retrieval. By explicitly adding extracted statute information to the queries and documents we can further improve the results. We submit two passage…
▽ More
We discuss our experiments for COLIEE Task 1, a court case retrieval competition using cases from the Federal Court of Canada. During experiments on the training data we observe that passage level retrieval with rank fusion outperforms document level retrieval. By explicitly adding extracted statute information to the queries and documents we can further improve the results. We submit two passage level runs to the competition, which achieve high recall but low precision.
△ Less
Submitted 17 April, 2023;
originally announced April 2023.
-
POTATO: exPlainable infOrmation exTrAcTion framewOrk
Authors:
Ádám Kovács,
Kinga Gémes,
Eszter Iklódi,
Gábor Recski
Abstract:
We present POTATO, a task- and languageindependent framework for human-in-the-loop (HITL) learning of rule-based text classifiers using graph-based features. POTATO handles any type of directed graph and supports parsing text into Abstract Meaning Representations (AMR), Universal Dependencies (UD), and 4lang semantic graphs. A streamlit-based user interface allows users to build rule systems from…
▽ More
We present POTATO, a task- and languageindependent framework for human-in-the-loop (HITL) learning of rule-based text classifiers using graph-based features. POTATO handles any type of directed graph and supports parsing text into Abstract Meaning Representations (AMR), Universal Dependencies (UD), and 4lang semantic graphs. A streamlit-based user interface allows users to build rule systems from graph patterns, provides real-time evaluation based on ground truth data, and suggests rules by ranking graph features using interpretable machine learning models. Users can also provide patterns over graphs using regular expressions, and POTATO can recommend refinements of such rules. POTATO is applied in projects across domains and languages, including classification tasks on German legal text and English social media data. All components of our system are written in Python, can be installed via pip, and are released under an MIT License on GitHub.
△ Less
Submitted 16 October, 2022; v1 submitted 31 January, 2022;
originally announced January 2022.
-
The Gutenberg Dialogue Dataset
Authors:
Richard Csaky,
Gabor Recski
Abstract:
Large datasets are essential for neural modeling of many NLP tasks. Current publicly available open-domain dialogue datasets offer a trade-off between quality (e.g., DailyDialog) and size (e.g., Opensubtitles). We narrow this gap by building a high-quality dataset of 14.8M utterances in English, and smaller datasets in German, Dutch, Spanish, Portuguese, Italian, and Hungarian. We extract and proc…
▽ More
Large datasets are essential for neural modeling of many NLP tasks. Current publicly available open-domain dialogue datasets offer a trade-off between quality (e.g., DailyDialog) and size (e.g., Opensubtitles). We narrow this gap by building a high-quality dataset of 14.8M utterances in English, and smaller datasets in German, Dutch, Spanish, Portuguese, Italian, and Hungarian. We extract and process dialogues from public-domain books made available by Project Gutenberg. We describe our dialogue extraction pipeline, analyze the effects of the various heuristics used, and present an error analysis of extracted dialogues. Finally, we conduct experiments showing that better response quality can be achieved in zero-shot and finetuning settings by training on our data than on the larger but much noisier Opensubtitles dataset. Our open-source pipeline (https://github.com/ricsinaruto/gutenberg-dialog) can be extended to further languages with little additional effort. Researchers can also build their versions of existing datasets by adjusting various trade-off parameters. We also built a web demo for interacting with our models: https://ricsinaruto.github.io/chatbot.html.
△ Less
Submitted 22 January, 2021; v1 submitted 27 April, 2020;
originally announced April 2020.
-
Improving Neural Conversational Models with Entropy-Based Data Filtering
Authors:
Richard Csaky,
Patrik Purgai,
Gabor Recski
Abstract:
Current neural network-based conversational models lack diversity and generate boring responses to open-ended utterances. Priors such as persona, emotion, or topic provide additional information to dialog models to aid response generation, but annotating a dataset with priors is expensive and such annotations are rarely available. While previous methods for improving the quality of open-domain res…
▽ More
Current neural network-based conversational models lack diversity and generate boring responses to open-ended utterances. Priors such as persona, emotion, or topic provide additional information to dialog models to aid response generation, but annotating a dataset with priors is expensive and such annotations are rarely available. While previous methods for improving the quality of open-domain response generation focused on either the underlying model or the training objective, we present a method of filtering dialog datasets by removing generic utterances from training data using a simple entropy-based approach that does not require human supervision. We conduct extensive experiments with different variations of our method, and compare dialog models across 17 evaluation metrics to show that training on datasets filtered this way results in better conversational quality as chatbots learn to output more diverse responses.
△ Less
Submitted 2 August, 2019; v1 submitted 14 May, 2019;
originally announced May 2019.