Showing 1–1 of 1 results for author: Siska, C

Search v0.5.6 released 2020-02-24

arXiv:2404.16966

cs.CL

Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks

Authors: Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, James Bono

Abstract: Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance. This is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest. We note that this is generally not the case; instead, we hold that the distribution of interest varies according to the specific use case. We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.

Submitted 5 June, 2024; v1 submitted 25 April, 2024; originally announced April 2024.
