
When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

Authors: Norah Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yusef Almushaykeh, Faisal Mirza, Nouf Alotaibi, Nora Altwairesh, Areeb Alowisheq, M Saiful Bari, Haidar Khan

Abstract: Large Language Model (LLM) leaderboards based on benchmark rankings are regularly used to guide practitioners in model selection. Often, the published leaderboard rankings are taken at face value - we show this is a (potentially costly) mistake. Under existing leaderboards, the relative performance of LLMs is highly sensitive to (often minute) details. We show that for popular multiple-choice question benchmarks (e.g., MMLU), minor perturbations to the benchmark, such as changing the order of choices or the method of answer selection, result in changes in rankings up to 8 positions. We explain this phenomenon by conducting systematic experiments over three broad categories of benchmark perturbations and identifying the sources of this behavior. Our analysis results in several best-practice recommendations, including the advantage of a hybrid scoring method for answer selection. Our study highlights the dangers of relying on simple benchmark evaluations and charts the path for more robust evaluation schemes on the existing benchmarks. The code for this paper is available at https://github.com/National-Center-for-AI-Saudi-Arabia/lm-evaluation-harness.

Submitted 3 July, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

Comments: updated with ACL 2024 camera ready version
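
The choice-order perturbation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code (that lives in the repository linked above); the function name `shuffle_choices` and the sample question are hypothetical, and the only point is that the correct-answer index must be remapped when the options are reordered.

```python
import random


def shuffle_choices(choices, answer_index, seed):
    """Reorder the answer options and return (new_choices, new_answer_index).

    Hypothetical helper for illustration only; not taken from the paper's code.
    """
    rng = random.Random(seed)
    order = list(range(len(choices)))   # original positions 0..n-1
    rng.shuffle(order)                  # order[k] = original position now shown at slot k
    shuffled = [choices[i] for i in order]
    # The correct answer moves to whichever slot now holds its original position.
    return shuffled, order.index(answer_index)


choices = ["Paris", "Rome", "Madrid", "Berlin"]
shuffled, new_answer = shuffle_choices(choices, answer_index=0, seed=1)
```

Evaluating a model on both the original and the shuffled version of each question, and comparing the resulting rankings, is one way to probe the sensitivity the abstract describes.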

arXiv:2308.09843 [pdf]

Large thermo-spin effects in Heusler alloy based spin-gapless semiconductor thin films

Authors: Amit Chanda, Deepika Rani, Derick DeTellem, Noha Alzahrani, Dario A. Arena, Sarath Witanachchi, Ratnamala Chatterjee, Manh-Huong Phan, Hariharan Srikanth

Abstract: Recently, Heusler alloys-based spin gapless semiconductors (SGSs) with high Curie temperature (TC) and sizeable spin polarization have emerged as potential candidates for tunable spintronic applications. We report a comprehensive investigation of the temperature-dependent anomalous Nernst effect (ANE) and intrinsic longitudinal spin Seebeck effect (LSSE) in CoFeCrGa thin films grown on MgO substrates. Our findings show that the anomalous Nernst coefficient for the MgO/CoFeCrGa (95 nm) film is $\cong 1.86$ $μ$V/K at room temperature, which is nearly two orders of magnitude higher than that of the bulk polycrystalline sample of CoFeCrGa (= 0.018 $μ$V/K) but comparable to that of the magnetic Weyl semimetal Co2MnGa thin film (2-3 $μ$V/K). Furthermore, the LSSE coefficient for our MgO/CoFeCrGa(95nm)/Pt(5nm) heterostructure is $\cong 20.5$ $μ$V/K/$Ω$ at room temperature which is twice larger than that of the half-metallic ferromagnetic La$_{0.7}$Sr$_{0.3}$MnO$_3$ thin films ($\cong$ 20.5 $μ$V/K/$Ω$). We show that both ANE and LSSE coefficients follow identical temperature dependences and exhibit a maximum at $\cong$ 225 K, which is understood as the combined effects of inelastic magnon scatterings and reduced magnon population at low temperatures. Our analyses not only indicate that the extrinsic skew scattering is the dominating mechanism for ANE in these films but also provide critical insights into the functional form of the observed temperature-dependent LSSE at low temperatures. Furthermore, by employing radio frequency transverse susceptibility and broadband ferromagnetic resonance in combination with the LSSE measurements, we establish a correlation among the observed LSSE signal, magnetic anisotropy and Gilbert damping of the CoFeCrGa thin films, which will be beneficial for fabricating tunable and highly efficient Heusler alloys based spincaloritronic nanodevices.

Submitted 18 August, 2023; originally announced August 2023.

arXiv:2211.12003 [pdf, other]

Application of property-based testing tools for metamorphic testing

Authors: Nasser Alzahrani, Maria Spichkova, James Harland

Abstract: Metamorphic testing (MT) is a general approach for testing a specific kind of software system, the so-called "non-testable" systems, where the "classical" testing approaches are difficult to apply. MT is an effective approach for addressing the test oracle problem and the test case generation problem. The test oracle problem arises when it is difficult to determine the correct expected output of a particular test case or to determine whether the actual outputs agree with the expected outcomes. The core concept in MT is metamorphic relations (MRs), which provide a formal specification of the system under test. One of the challenges in MT is effective test generation. Property-based testing (PBT) is a testing methodology in which test cases are generated according to desired properties of the software. In some sense, MT can be seen as a very specific kind of PBT. In this paper, we show how to use PBT tools to automate test generation and verification of MT. In addition to the automation benefit, the proposed method shows how to combine general PBT with MT under the same testing framework.

Submitted 21 November, 2022; originally announced November 2022.

Comments: Preprint. Accepted to the 17th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2022). Final version published by SCITEPRESS, http://www.scitepress.org
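
The idea of checking a metamorphic relation with randomly generated inputs can be sketched as follows. The abstract does not say which PBT tool or which relations the paper uses, so everything here is an assumption for illustration: a hand-rolled random generator stands in for a real PBT library's generators, and `system_under_test` and `check_permutation_relation` are hypothetical names.

```python
import random


def system_under_test(xs):
    """Stand-in for a program whose exact expected output is hard to state."""
    return sum(x * x for x in xs)


def check_permutation_relation(func, trials=200, seed=0):
    """Metamorphic relation: reordering the input must not change the output.

    Returns None if the relation held on every trial, otherwise the first
    source input for which it failed (a counterexample).
    """
    rng = random.Random(seed)
    for _ in range(trials):
        # Source test case: a random list of integers (possibly empty).
        source = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 20))]
        # Follow-up test case: the same elements in a different order.
        follow_up = source[:]
        rng.shuffle(follow_up)
        if func(source) != func(follow_up):
            return source
    return None


counterexample = check_permutation_relation(system_under_test)
```

No expected output is ever computed: the check compares two runs of the same program, which is how MT sidesteps the test oracle problem. A dedicated PBT tool would add features this sketch lacks, such as shrinking a failing input to a minimal counterexample.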

arXiv:1705.10032 [pdf, other]

From Temporal Models to Property-Based Testing

Authors: Nasser Alzahrani, Maria Spichkova, Jan Olaf Blech

Abstract: This paper presents a framework to apply property-based testing (PBT) on top of temporal formal models. The aim of this work is to help software engineers to understand temporal models that are presented formally and to make use of the advantages of formal methods: the core time-based constructs of a formal method are schematically translated to the BeSpaceD extension of the Scala programming language. This allows us to have executable Scala code that corresponds to the formal model, as well as to perform PBT of the model's functionality. To model temporal properties of the systems, in the current work we focus on two formal languages, TLA+ and FocusST.

Submitted 28 May, 2017; originally announced May 2017.

Comments: Preprint. Accepted to the 12th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2017). Final version published by SCITEPRESS, http://www.scitepress.org

arXiv:1612.01686 [pdf, other]

Spatio-temporal Models for Formal Analysis and Property-based Testing

Authors: Nasser Alzahrani, Maria Spichkova, Jan Olaf Blech

Abstract: This paper presents our ongoing work on spatio-temporal models for formal analysis and property-based testing. Our proposed framework aims at reducing the impedance mismatch between formal methods and practitioners. We introduce a set of formal methods and explain their interplay and benefits in terms of usability.

Submitted 9 December, 2016; v1 submitted 6 December, 2016; originally announced December 2016.

Comments: Preprint. Accepted to the Software Technologies: Applications and Foundations (STAF 2016). Final version published by Springer International Publishing AG

arXiv:1512.04743 [pdf, ps, other]

Model comparison with missing data using MCMC and importance sampling

Authors: Panayiota Touloupou, Naif Alzahrani, Peter Neal, Simon E. F. Spencer, Trevelyan J. McKinley

Abstract: Selecting between competing statistical models is a challenging problem especially when the competing models are non-nested. In this paper we offer a simple solution by devising an algorithm which combines MCMC and importance sampling to obtain computationally efficient estimates of the marginal likelihood which can then be used to compare the models. The algorithm is successfully applied to longitudinal epidemic and time series data sets and shown to outperform existing methods for computing the marginal likelihood.

Submitted 15 December, 2015; originally announced December 2015.

Comments: 34 pages
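
To make the role of importance sampling concrete, here is a minimal sketch of estimating a marginal likelihood on a toy model. It is not the paper's algorithm, whose details the abstract does not give: the model (prior θ ~ N(0, 1), likelihood y | θ ~ N(θ, 1)) is chosen only because its marginal likelihood N(y; 0, 2) is known exactly, and the proposal is hand-picked as a widened version of the exact posterior N(y/2, 1/2) rather than fitted to MCMC output.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def marginal_likelihood_is(y, n=20000, seed=0):
    """Importance-sampling estimate of p(y) = integral of p(y | theta) p(theta).

    Assumed toy model: theta ~ N(0, 1), y | theta ~ N(theta, 1).
    """
    rng = random.Random(seed)
    # Proposal centred on the posterior mean, with variance 1.0 (wider than
    # the posterior variance 0.5) so the importance weights stay bounded.
    q_mean, q_var = y / 2, 1.0
    total = 0.0
    for _ in range(n):
        theta = rng.gauss(q_mean, math.sqrt(q_var))
        weight = (normal_pdf(y, theta, 1.0) * normal_pdf(theta, 0.0, 1.0)
                  / normal_pdf(theta, q_mean, q_var))
        total += weight
    return total / n


y = 1.3
estimate = marginal_likelihood_is(y)
exact = normal_pdf(y, 0.0, 2.0)  # closed-form marginal for this toy model
```

Two competing models can then be compared through the ratio of their estimated marginal likelihoods (the Bayes factor). In realistic settings the proposal would typically be built from posterior samples, which is presumably where the MCMC step of the abstract comes in.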

Showing 1–6 of 6 results for author: Alzahrani, N