Are We Done with MMLU?

Authors: Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini

Abstract: Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error taxonomy. Then, we create MMLU-Redux, which is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU's error-ridden questions to enhance its future utility and reliability as a benchmark. Therefore, we open up MMLU-Redux for additional annotation https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux.

Submitted 7 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

arXiv:2310.13092

Do Language Models Learn about Legal Entity Types during Pretraining?

Authors: Claire Barale, Michael Rovatsos, Nehal Bhuta

Abstract: Language Models (LMs) have proven their ability to acquire diverse linguistic knowledge during the pretraining phase, potentially serving as a valuable source of incidental supervision for downstream tasks. However, there has been limited research conducted on the retrieval of domain-specific knowledge, and specifically legal knowledge. We propose to explore the task of Entity Typing, serving as a proxy for evaluating legal knowledge as an essential aspect of text comprehension, and a foundational task to numerous downstream legal NLP applications. Through systematic evaluation and analysis and two types of prompting (cloze sentences and QA-based templates) and to clarify the nature of these acquired cues, we compare diverse types and lengths of entities both general and domain-specific entities, semantics or syntax signals, and different LM pretraining corpus (generic and legal-oriented) and architectures (encoder BERT-based and decoder-only with Llama2). We show that (1) Llama2 performs well on certain entities and exhibits potential for substantial improvement with optimized prompt templates, (2) law-oriented LMs show inconsistent performance, possibly due to variations in their training corpus, (3) LMs demonstrate the ability to type entities even in the case of multi-token entities, (4) all models struggle with entities belonging to sub-domains of the law (5) Llama2 appears to frequently overlook syntactic cues, a shortcoming less present in BERT-based architectures.

Submitted 19 October, 2023; originally announced October 2023.

Comments: Accepted for publication at the 5th Natural Legal Language Processing Workshop (NLLP) hosted at EMNLP2023

arXiv:2308.11541

Refugee status determination: how cooperation with machine learning tools can lead to more justice

Authors: Claire Barale

Abstract: Previous research on refugee status adjudications has shown that prediction of the outcome of an application can be derived from very few features with satisfactory accuracy. Recent research work has achieved between 70 and 90% accuracy using text analytics on various legal fields among which refugee status determination. Some studies report predictions derived from the judge identity only. Additionally most features used for prediction are non-substantive and external features ranging from news reports, date and time of the hearing or weather. On the other hand, literature shows that noise is ubiquitous in human judgments and significantly affects the outcome of decisions. It has been demonstrated that noise is a significant factor impacting legal decisions. We use the term "noise" in the sense described by D. Kahneman, as a measure of how human beings are unavoidably influenced by external factors when making a decision. In the context of refugee status determination, it means for instance that two judges would take different decisions when presented with the same application. This article explores ways that machine learning can help reduce noise in refugee law decision making. We are not suggesting that this proposed methodology should be exclusive from other approaches to improve decisions such as training of decision makers, skills acquisition or judgment aggregation, but rather that it is a path worth exploring.
We investigate how artificial intelligence and specifically data-driven applications can be used to benefit all parties involved in refugee status adjudications. We specifically look at decisions taken in Canada and in the United States. Our research aims at reducing arbitrariness and unfairness that derive from noisy decisions, based on the assumption that if two cases or applications are alike they should be treated in the same way and induce the same outcome.

Submitted 22 August, 2023; originally announced August 2023.

Comments: Scottish Law and Innovation Network (SCOTLIN) 2022, Early Career Scholars Symposium

arXiv:2308.11531

Empowering Refugee Claimants and their Lawyers: Using Machine Learning to Examine Decision-Making in Refugee Law

Authors: Claire Barale

Abstract: Our project aims at helping and supporting stakeholders in refugee status adjudications, such as lawyers, judges, governing bodies, and claimants, in order to make better decisions through data-driven intelligence and increase the understanding and transparency of the refugee application process for all involved parties. This PhD project has two primary objectives: (1) to retrieve past cases, and (2) to analyze legal decision-making processes on a dataset of Canadian cases. In this paper, we present the current state of our work, which includes a completed experiment on part (1) and ongoing efforts related to part (2). We believe that NLP-based solutions are well-suited to address these challenges, and we investigate the feasibility of automating all steps involved. In addition, we introduce a novel benchmark for future NLP research in refugee law. Our methodology aims to be inclusive to all end-users and stakeholders, with expected benefits including reduced time-to-decision, fairer and more transparent outcomes, and improved decision quality.

Submitted 21 September, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

Comments: 19th International Conference on Artificial Intelligence and Law - ICAIL 2023, Doctoral Consortium (Best Paper Award)

arXiv:2305.15533

doi: 10.18653/v1/2023.findings-acl.187

Automated Refugee Case Analysis: An NLP Pipeline for Supporting Legal Practitioners

Authors: Claire Barale, Michael Rovatsos, Nehal Bhuta

Abstract: In this paper, we introduce an end-to-end pipeline for retrieving, processing, and extracting targeted information from legal cases. We investigate an under-studied legal domain with a case study on refugee law in Canada. Searching case law for past similar cases is a key part of legal work for both lawyers and judges, the potential end-users of our prototype. While traditional named-entity recognition labels such as dates provide meaningful information in legal work, we propose to extend existing models and retrieve a total of 19 useful categories of items from refugee cases. After creating a novel data set of cases, we perform information extraction based on state-of-the-art neural named-entity recognition (NER). We test different architectures including two transformer models, using contextual and non-contextual embeddings, and compare general purpose versus domain-specific pre-training. The results demonstrate that models pre-trained on legal data perform best despite their smaller size, suggesting that domain matching had a larger effect than network architecture. We achieve a F1 score above 90% on five of the targeted categories and over 80% on four further categories.

Submitted 24 May, 2023; originally announced May 2023.

Comments: 9 pages, preprint of long paper accepted to Findings of the Annual Meeting of the Association for Computational Linguistics (ACL) 2023

Showing 1–5 of 5 results for author: Barale, C