Showing 1–4 of 4 results for author: Mondorf, P

  1. arXiv:2406.18403  [pdf, other]

    cs.CL

    LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

    Authors: Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, André F. T. Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, Alberto Testoni

    Abstract: There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human anno…

    Submitted 26 June, 2024; originally announced June 2024.

  2. arXiv:2406.12546  [pdf, other]

    cs.CL

    Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models

    Authors: Philipp Mondorf, Barbara Plank

    Abstract: Knights and knaves problems represent a classic genre of logical puzzles where characters either tell the truth or lie. The objective is to logically deduce each character's identity based on their statements. The challenge arises from the truth-telling or lying behavior, which influences the logical implications of each statement. Solving these puzzles requires not only direct deductions from ind…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 22 pages, 19 figures

  3. arXiv:2404.01869  [pdf, other]

    cs.CL cs.AI

    Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey

    Authors: Philipp Mondorf, Barbara Plank

    Abstract: Large language models (LLMs) have recently shown impressive performance on tasks involving reasoning, leading to a lively debate on whether these models possess reasoning capabilities similar to humans. However, despite these successes, the depth of LLMs' reasoning abilities remains uncertain. This uncertainty partly stems from the predominant focus on task performance, measured through shallow ac…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: 26 pages, 2 figures

  4. arXiv:2402.14856  [pdf, other]

    cs.CL cs.AI

    Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning

    Authors: Philipp Mondorf, Barbara Plank

    Abstract: Deductive reasoning plays a pivotal role in the formulation of sound and cohesive arguments. It allows individuals to draw conclusions that logically follow, given the truth value of the information provided. Recent progress in the domain of large language models (LLMs) has showcased their capability in executing deductive reasoning tasks. Nonetheless, a significant portion of research primarily a…

    Submitted 3 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: ACL 2024 main, 31 pages, 19 figures