Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks
Authors: Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, James Bono
Abstract:
Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance. This is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest. We note that this is generally not the case; instead, we hold that the distribution of interest varies according to the specific use case. We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, and (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.
Submitted 5 June, 2024; v1 submitted 25 April, 2024; originally announced April 2024.
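
The abstract's point that a plain average treats test prompts as a random sample, and that accounting for correlated prompts can change rankings, can be illustrated with a toy calculation. This is only a sketch and not the authors' method: the abstract does not say how correlations are accounted for, so the cluster-averaging scheme, the function names, the cluster labels, and the pass/fail data below are all hypothetical.

```python
from statistics import mean

def uniform_score(results):
    """Plain benchmark average: every prompt counts equally."""
    return mean(results)

def cluster_weighted_score(results, clusters):
    """Average of per-cluster averages, so each group of related
    prompts contributes once regardless of how many prompts it holds.
    (An assumed reweighting scheme, chosen only for illustration.)"""
    by_cluster = {}
    for result, label in zip(results, clusters):
        by_cluster.setdefault(label, []).append(result)
    return mean(mean(group) for group in by_cluster.values())

# Hypothetical benchmark of six prompts; the first four are
# near-duplicates (cluster "a"), the last two are distinct.
clusters = ["a", "a", "a", "a", "b", "c"]

# Hypothetical per-prompt results (1 = correct, 0 = incorrect).
model_x = [1, 1, 1, 1, 0, 0]  # strong only on the redundant cluster
model_y = [1, 0, 0, 0, 1, 1]  # weaker on cluster "a", strong elsewhere

for name, results in [("model_x", model_x), ("model_y", model_y)]:
    print(name,
          "uniform:", round(uniform_score(results), 3),
          "cluster-weighted:", round(cluster_weighted_score(results, clusters), 3))
```

With this made-up data, model_x ranks first under the uniform average and second once each cluster counts once, which is the kind of ranking change the abstract describes.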