Showing 1–1 of 1 results for author: T, A S

  1. arXiv:2402.19450

    cs.AI cs.CL

    Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap

    Authors: Saurabh Srivastava, Annarose M B, Anto P V, Shashank Menon, Ajay Sukumar, Adwaith Samod T, Alan Philipose, Stevin Prince, Sooraj Thomas

    Abstract: We propose a framework for robust evaluation of reasoning capabilities of language models, using functional variants of benchmarks. Models that solve a reasoning test should exhibit no difference in performance over the static version of a problem compared to a snapshot of the functional variant. We have rewritten the relevant fragment of the MATH benchmark into its functional variant MATH(), with…

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: 37 pages, 10 figures