Skip to main content

Showing 1–2 of 2 results for author: Boubdir, M

Searching in archive cs. Search in all archives.
.
  1. arXiv:2311.17295  [pdf, other

    cs.CL cs.AI

    Elo Uncovered: Robustness and Best Practices in Language Model Evaluation

    Authors: Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker, Marzieh Fadaee

    Abstract: In Natural Language Processing (NLP), the Elo rating system, originally designed for ranking players in dynamic games such as chess, is increasingly being used to evaluate Large Language Models (LLMs) through "A vs B" paired comparisons. However, while popular, the system's suitability for assessing entities with constant skill levels, such as LLMs, remains relatively unexplored. We study two fund… ▽ More

    Submitted 28 November, 2023; originally announced November 2023.

    Comments: 22 pages, 7 figures, 2 tables. Revised version of the paper accepted at GEM Workshop, EMNLP 2023

  2. arXiv:2310.14424  [pdf, other

    cs.CL cs.AI

    Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation

    Authors: Meriem Boubdir, Edward Kim, Beyza Ermis, Marzieh Fadaee, Sara Hooker

    Abstract: Human evaluation is increasingly critical for assessing large language models, capturing linguistic nuances, and reflecting user preferences more accurately than traditional automated metrics. However, the resource-intensive nature of this type of annotation process poses significant challenges. The key question driving our work: "is it feasible to minimize human-in-the-loop feedback by prioritizi… ▽ More

    Submitted 22 October, 2023; originally announced October 2023.

    Comments: 37 pages, 8 figures