Search | arXiv e-print repository

Are We Done with MMLU?

Authors: Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini

Abstract: Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error taxonomy. Then, we create MMLU-Redux, which is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU's error-ridden questions to enhance its future utility and reliability as a benchmark. Therefore, we open up MMLU-Redux for additional annotation https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux.

Submitted 7 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

arXiv:2404.09841

Anatomy of Industrial Scale Multilingual ASR

Authors: Francis McCann Ramirez, Luka Chkhetiani, Andrew Ehrenberg, Robert McHardy, Rami Botros, Yash Khare, Andrea Vanzo, Taufiquzzaman Peyash, Gabriel Oexle, Michael Liang, Ilya Sklyar, Enver Fakhan, Ahmed Etefy, Daniel McCrystal, Sam Flamini, Domenic Donato, Takuya Yoshioka

Abstract: This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in ambient noise compared to Whisper, along with significantly improved time-stamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.

Submitted 16 April, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

arXiv:2307.10169

Challenges and Applications of Large Language Models

Authors: Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, Robert McHardy

Abstract: Large Language Models (LLMs) went from non-existent to ubiquitous in the machine learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify the remaining challenges and already fruitful application areas. In this paper, we aim to establish a systematic set of open problems and application successes so that ML researchers can comprehend the field's current state more quickly and become productive.

Submitted 19 July, 2023; originally announced July 2023.

Comments: 72 pages. v01. Work in progress. Feedback and comments are highly appreciated!

arXiv:1902.11145

Adversarial Training for Satire Detection: Controlling for Confounding Variables

Authors: Robert McHardy, Heike Adel, Roman Klinger

Abstract: The automatic detection of satire vs. regular news is relevant for downstream applications (for instance, knowledge base population) and to improve the understanding of linguistic characteristics of satire. Recent approaches build upon corpora which have been labeled automatically based on article sources. We hypothesize that this encourages the models to learn characteristics for different publication sources (e.g., "The Onion" vs. "The Guardian") rather than characteristics of satire, leading to poor generalization performance to unseen publication sources. We therefore propose a novel model for satire detection with an adversarial component to control for the confounding variable of publication source. On a large novel data set collected from German news (which we make available to the research community), we observe comparable satire classification performance and, as desired, a considerable drop in publication classification performance with adversarial training. Our analysis shows that the adversarial component is crucial for the model to learn to pay attention to linguistic properties of satire.

Submitted 1 March, 2019; v1 submitted 28 February, 2019; originally announced February 2019.

Comments: Accepted for publication at NAACL 2019

arXiv:1307.2189

On the Topology of the Facebook Page Network

Authors: R. E. Slattery, R. R. McHardy, R. Bairathi

Abstract: The Facebook Page Network (FPN) is a platform for Businesses, Public Figures and Organizations (BPOs) to connect with individuals and other BPOs in the digital space. For over a decade scale-free networks have most appropriately described a variety of seemingly disparate physical, biological and social real-world systems unified by similar network properties such as scale-invariance, growth via a preferential attachment mechanism, and a power law degree distribution P(k) = ck^-λ where typically 2<λ<3. In this paper we show that both the Facebook Page Network and its BPO-BPO subnetwork suggest power law and scale-free characteristics. We argue that social media analysts must consider the logarithmic and non-linear properties of social media audiences of scale.

Submitted 8 July, 2013; originally announced July 2013.

Comments: 3 pages, 1 figure

Showing 1–5 of 5 results for author: McHardy, R