
Statute-enhanced lexical retrieval of court cases for COLIEE 2022

Authors: Tobias Fink, Gabor Recski, Wojciech Kusa, Allan Hanbury

Abstract: We discuss our experiments for COLIEE Task 1, a court case retrieval competition using cases from the Federal Court of Canada. During experiments on the training data we observe that passage level retrieval with rank fusion outperforms document level retrieval. By explicitly adding extracted statute information to the queries and documents we can further improve the results. We submit two passage level runs to the competition, which achieve high recall but low precision.

Submitted 17 April, 2023; originally announced April 2023.

Comments: Sixteenth International Workshop on Juris-informatics (JURISIN). 2022

arXiv:2201.13230

doi 10.1145/3511808.3557196

POTATO: exPlainable infOrmation exTrAcTion framewOrk

Authors: Ádám Kovács, Kinga Gémes, Eszter Iklódi, Gábor Recski

Abstract: We present POTATO, a task- and language-independent framework for human-in-the-loop (HITL) learning of rule-based text classifiers using graph-based features. POTATO handles any type of directed graph and supports parsing text into Abstract Meaning Representations (AMR), Universal Dependencies (UD), and 4lang semantic graphs. A streamlit-based user interface allows users to build rule systems from graph patterns, provides real-time evaluation based on ground truth data, and suggests rules by ranking graph features using interpretable machine learning models. Users can also provide patterns over graphs using regular expressions, and POTATO can recommend refinements of such rules. POTATO is applied in projects across domains and languages, including classification tasks on German legal text and English social media data. All components of our system are written in Python, can be installed via pip, and are released under an MIT License on GitHub.

Submitted 16 October, 2022; v1 submitted 31 January, 2022; originally announced January 2022.

Comments: 4 pages

arXiv:2004.12752

The Gutenberg Dialogue Dataset

Authors: Richard Csaky, Gabor Recski

Abstract: Large datasets are essential for neural modeling of many NLP tasks. Current publicly available open-domain dialogue datasets offer a trade-off between quality (e.g., DailyDialog) and size (e.g., Opensubtitles). We narrow this gap by building a high-quality dataset of 14.8M utterances in English, and smaller datasets in German, Dutch, Spanish, Portuguese, Italian, and Hungarian. We extract and process dialogues from public-domain books made available by Project Gutenberg. We describe our dialogue extraction pipeline, analyze the effects of the various heuristics used, and present an error analysis of extracted dialogues. Finally, we conduct experiments showing that better response quality can be achieved in zero-shot and finetuning settings by training on our data than on the larger but much noisier Opensubtitles dataset. Our open-source pipeline (https://github.com/ricsinaruto/gutenberg-dialog) can be extended to further languages with little additional effort. Researchers can also build their versions of existing datasets by adjusting various trade-off parameters. We also built a web demo for interacting with our models: https://ricsinaruto.github.io/chatbot.html.

Submitted 22 January, 2021; v1 submitted 27 April, 2020; originally announced April 2020.

Comments: Accepted at EACL 2021

arXiv:1905.05471

Improving Neural Conversational Models with Entropy-Based Data Filtering

Authors: Richard Csaky, Patrik Purgai, Gabor Recski

Abstract: Current neural network-based conversational models lack diversity and generate boring responses to open-ended utterances. Priors such as persona, emotion, or topic provide additional information to dialog models to aid response generation, but annotating a dataset with priors is expensive and such annotations are rarely available. While previous methods for improving the quality of open-domain response generation focused on either the underlying model or the training objective, we present a method of filtering dialog datasets by removing generic utterances from training data using a simple entropy-based approach that does not require human supervision. We conduct extensive experiments with different variations of our method, and compare dialog models across 17 evaluation metrics to show that training on datasets filtered this way results in better conversational quality as chatbots learn to output more diverse responses.

Submitted 2 August, 2019; v1 submitted 14 May, 2019; originally announced May 2019.

Comments: 20 pages. Same as ACL version: https://www.aclweb.org/anthology/P19-1567

Journal ref: Proceedings of the 57th Conference of the ACL (2019) 5650-5669

Showing 1–4 of 4 results for author: Recski, G