-
Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models
Authors:
Martin Riddell,
Ansong Ni,
Arman Cohan
Abstract:
While large language models have achieved remarkable performance on various code generation benchmarks, there have been growing concerns regarding potential contamination of these benchmarks, as they may be leaked into pretraining and finetuning data. While recent work has investigated contamination in natural language generation and understanding tasks, there has been less extensive research into how data contamination impacts the evaluation of code generation, which is critical for understanding the robustness and reliability of LLMs in programming contexts. In this work, we perform a comprehensive study of data contamination of popular code generation benchmarks, and precisely quantify their overlap with pretraining corpora through both surface-level and semantic-level matching. In our experiments, we show that there is substantial overlap between popular code generation benchmarks and open training corpora, and that models perform significantly better on the subset of the benchmarks where similar solutions are seen during training. We also conduct an extensive analysis of the factors that affect model memorization and generalization, such as model size, problem difficulty, and question length. We release all resulting files from our matching pipeline for future research.
Submitted 6 March, 2024;
originally announced March 2024.
-
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models
Authors:
Ansong Ni,
Pengcheng Yin,
Yilun Zhao,
Martin Riddell,
Troy Feng,
Rui Shen,
Stephen Yin,
Ye Liu,
Semih Yavuz,
Caiming Xiong,
Shafiq Joty,
Yingbo Zhou,
Dragomir Radev,
Arman Cohan
Abstract:
Recently, large language models (LLMs), especially those that are pretrained on code, have demonstrated strong capabilities in generating programs from natural language inputs in a few-shot or even zero-shot manner. Despite promising results, there is a notable lack of a comprehensive evaluation of these models' language-to-code generation capabilities. Existing studies often focus on specific tasks, model architectures, or learning paradigms, leading to a fragmented understanding of the overall landscape. In this work, we present L2CEval, a systematic evaluation of the language-to-code generation capabilities of LLMs on 7 tasks across the domain spectrum of semantic parsing, math reasoning and Python programming, analyzing the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods. In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs. This enables us to identify and analyze the typical failure modes across various tasks and models. L2CEval offers a comprehensive understanding of the capabilities and limitations of LLMs in language-to-code generation. We also release the evaluation framework and all model outputs, hoping to lay the groundwork for future research in this domain.
Submitted 2 October, 2023; v1 submitted 29 September, 2023;
originally announced September 2023.
-
FOLIO: Natural Language Reasoning with First-Order Logic
Authors:
Simeng Han,
Hailey Schoelkopf,
Yilun Zhao,
Zhenting Qi,
Martin Riddell,
Wenfei Zhou,
James Coady,
David Peng,
Yujie Qiao,
Luke Benson,
Lucy Sun,
Alex Wardle-Solano,
Hannah Szabo,
Ekaterina Zubova,
Matthew Burtell,
Jonathan Fan,
Yixin Liu,
Brian Wong,
Malcolm Sailor,
Ansong Ni,
Linyong Nan,
Jungo Kasai,
Tao Yu,
Rui Zhang,
Alexander R. Fabbri
, et al. (10 additional authors not shown)
Abstract:
Large language models (LLMs) have achieved remarkable performance on a variety of natural language understanding tasks. However, existing benchmarks are inadequate in measuring the complex logical reasoning capabilities of a model. We present FOLIO, a human-annotated, logically complex and diverse dataset for reasoning in natural language (NL), equipped with first-order logic (FOL) annotations. FOLIO consists of 1,430 examples (unique conclusions), each paired with one of 487 sets of premises used to deductively reason for the validity of each conclusion. The logical correctness of the premises and conclusions is ensured by their FOL annotations, which are automatically verified by an FOL inference engine. In addition to the main NL reasoning task, NL-FOL pairs in FOLIO constitute a new NL-FOL translation dataset. Our experiments on FOLIO systematically evaluate the FOL reasoning ability of supervised fine-tuning on medium-sized language models. For both NL reasoning and NL-FOL translation, we benchmark multiple state-of-the-art language models. Our results show that a subset of FOLIO presents a challenge for one of the most capable large language models (LLMs) publicly available, GPT-4.
Submitted 17 May, 2024; v1 submitted 2 September, 2022;
originally announced September 2022.
-
Maximum nullity and zero forcing of circulant graphs
Authors:
Linh Duong,
Brenda K. Kroschel,
Michael Riddell,
Kevin N. Vander Meulen,
Adam Van Tuyl
Abstract:
It is well known that the zero forcing number of a graph provides a lower bound on the minimum rank of a graph. In this paper we bound and characterize the zero forcing number of certain circulant graphs, including some bipartite circulants, cubic circulants, and circulants which are torus products, to obtain bounds on the minimum rank and the maximum nullity. We also evaluate when the zero forcing number will give equality.
Submitted 7 June, 2019;
originally announced June 2019.