Showing 1–1 of 1 results for author: Siska, C

  1. arXiv:2404.16966  [pdf, other]

    cs.CL

    Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks

    Authors: Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, James Bono

    Abstract: Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance. This is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest. We note that…

    Submitted 5 June, 2024; v1 submitted 25 April, 2024; originally announced April 2024.