HTML conversions sometimes display errors due to content that did not convert correctly from the source. This paper uses the following packages that are not yet supported by the HTML conversion tool. Feedback on these issues are not necessary; they are known and are being worked on.

  • failed: utfsym

Authors: achieve the best HTML results from your LaTeX submissions by following these best practices.

License: arXiv.org perpetual non-exclusive license
arXiv:2403.07747v1 [cs.CL] 12 Mar 2024

FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models

Abstract

To thoroughly assess the mathematical reasoning abilities of Large Language Models (LLMs), we need to carefully curate evaluation datasets covering diverse mathematical concepts and mathematical problems at different difficulty levels. In pursuit of this objective, we propose FineMath in this paper, a fine-grained mathematical evaluation benchmark dataset for assessing Chinese LLMs. FineMath is created to cover the major key mathematical concepts taught in elementary school math, which are further divided into 17 categories of math word problems, enabling in-depth analysis of mathematical reasoning abilities of LLMs. All the 17 categories of math word problems are manually annotated with their difficulty levels according to the number of reasoning steps required to solve these problems. We conduct extensive experiments on a wide range of LLMs on FineMath and find that there is still considerable room for improvements in terms of mathematical reasoning capability of Chinese LLMs. We also carry out an in-depth analysis on the evaluation process and methods that have been overlooked previously. These two factors significantly influence the model results and our understanding of their mathematical reasoning capabilities. The dataset will be publicly available soon.

Keywords: Large Language Models, Mathematical Reasoning Evaluation, Benchmark

\NAT@set@cites

FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models


Yan Liu1, Renren **1, Lin Shi2, Zheng Yao3, Deyi Xiong1thanks:  Corresponding author.
1Tian** University, Tian**, China,
2China University of Geosciences, Wuhan, China,
3The University of Queensland, QLD, Australia,
{yan_liu,rr**,dyxiong}@tju.edu.cn
{lingshi0265}@gmail.com
{zheng.yao1}@uq.net.au

Abstract content

1.   Introduction

Mathematics has always been an important part of the evaluation of LLMs (Wei et al., 2022), which not only assesses the ability of LLMs in understanding and solving mathematical problems, but also profoundly measures the essential capability of LLMs in abstract conceptualization, logical reasoning and so on. Therefore, a high-quality mathematical evaluation benchmark is of great importance to a comprehensive LLM evaluation.

Previous works (Hosseini et al., 2014; Roy and Roth, 2015) curate mathematical test sets in English, which serve as a repository for grade school math word problems with accuracy being used as the evaluation metric. Recent years have witnessed a substantial progress in Chinese LLMs. Hence, mathematical evaluation datasets in Chinese (Wei et al., 2023; Yang et al., 2023) have been created correspondingly. These two previous Chinese datasets categorize testing instances by grade levels, providing a preliminary evaluation of Chinese LLMs on these levels. Their evaluation results show that the accuracy of GPT-4 for any grade surpasses or is close to 60%. However, a simple accuracy does not help us understand which mathematical concepts or skills LLMs have mastered. There is an urgent need for a comprehensive test set that can provide fine-grained evaluation results.

Refer to caption
Figure 1: FineMath can evaluate LLMs’ mathematical ability from three aspects: accuracy of understanding abstract mathematical concepts, accuracy of reasoning, and overall accuracy.

Apart from arithmetic operations, mathematical ability involves diverse reasoning capabilities. We believe that the evaluation of the mathematical ability of LLMs should include two aspects:

  • Providing diverse abstract mathematical concepts.

  • Evaluating the logical and mathematic reasoning abilities of LLMs over mathematical problems at different difficulty levels.

In pursuit of these aspects, we propose FineMath, a benchmark composed of Math Word Problems (MWPs), designed to comprehensively assess LLMs’ mathematical capability in a fine-grained way. FineMath organizes MWPs according to key mathematical concepts taught in elementary school, and each type of MWPs contains three levels of difficulty, facilitating detailed reasoning ability analysis.

Specifically, FineMath consists of 17 types of MWPs. For defining and collecting these MWP types, we have referred to the mathematics curriculum standards established by China’s Ministry of Education and the principles and standards for school mathematics set by the American National Council of Teachers of Mathematics (NCTM). The key concepts and skills in grade school include Number & Operations, Algebra, Geometry, Measurement, Data Analysis & Probability, Problem Solving, and Reasoning. Different key concepts involve the use of different knowledge and abilities. We have also annotated the reasoning steps and process for each MWP, categorizing them into questions requiring one reasoning step, two reasoning steps, and three or more reasoning steps.

Based on the curated benchmark, we have conducted a thorough analysis of the evaluation process and methods. Evaluations in mathematics have always emphasized the accuracy of results. However, we have observed factors that greatly influence the model’s results, thus affecting our understanding of its capabilities:

  • The model is sensitive to the prompts used during the evaluation, and the results vary accordingly.

  • The methods of evaluation can also affect the model’s results. We have compared the model’s performance in selecting the final answers from the options of multi-choice questions, demonstrating that the form of evaluation tasks and options can influence the model’s results to a certain extent.

  • The length of the LLM-generated answers, to some degree, reflects the model’s “confidence” when handling questions.

The main contributions of our work are as follows:

  • We propose a fine-grained elementary school MWPs benchmark for Chinese LLMs, which can assess the mathematical capabilities of LLMs from three aspects: accuracy of understanding abstract mathematical concepts, accuracy of reasoning, and overall accuracy.

  • We conduct an in-depth analysis of the contamination in our dataset, enabling researchers of LLMs to conduct a credibility analysis on the evaluation results.

  • We evaluate GPT-4, GPT-3.5-Turbo, and 8 Chinese LLMs, revealing their mathematical reasoning capabilities, and provide detailed evaluation results in various aspects.

Test Sets Size Language
AddSub (Hosseini et al., 2014) 395 En
MultiArith (Roy and Roth, 2015) 600 En
SingleEq (Koncel-Kedziorski et al., 2015) 508 En
AQUA (Ling et al., 2017) 100K En
AsDiv (Miao et al., 2020) 2,305 En
GSM8K (Cobbe et al., 2021) 8.5k En
SVAMP (Patel et al., 2021) 8.5k En
CMATH (Wei et al., 2023) 1.7K Zh
K6 (Yang et al., 2023) 600 Zh
FineMath (ours) 1,584 Zh
Table 1: An overview of MWP datasets.

2.   Related Work

Traditional MWP datasets like AddSub (Hosseini et al., 2014) and MultiArith (Roy and Roth, 2015) are integrated into a MWP repository. Other similar datasets include SingleEq (Koncel-Kedziorski et al., 2015), AQUA (Ling et al., 2017) and AsDiv (Miao et al., 2020). GSM8K (Cobbe et al., 2021) and SVAMP (Patel et al., 2021) take advantage of detailed annotations and have prevailed in recent evaluations.

Our work is most inspired by the MATH (Hendrycks et al., 2021) dataset. MATH collects problems from American high school mathematics competitions and categorizes problems into seven subjects. These subjects are Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra and Precalculus. However, these problems are very challenging, even when humans answer them, the accuracy rate is only 40%. Considering that many LLMs are still in their early versions, overly difficult problems may have limited significance for testing these models.

CMATH proposed by (Wei et al., 2023) and K6 proposed by (Yang et al., 2023) are the two datasets that are relatively similar to ours developed concurrently. All these datasets focus on math word problems of elementary school, and organize instances by grade level. CMATH contains 1.7K problems collected from workbooks and exams on the Internet. K6 is composed of 600 problems collected from an educational institution. However, neither of the two datasets have been publicly released, precluding us from conducting an empirical comparison to them.

Ape210K proposed in (Zhao et al., 2020) is a slightly earlier dataset. It contains 210K enormous Chinese math word problems from elementary school. Test sets alone in Ape210K contain as many as 5,000 problems. However, the test sets do not provide annotations related to LLMs.

An overview of the related MWP datasets is shown in Table 1.

3.   Data Collection and Annotation

We create our dataset by collecting a diverse set of questions. We collect as many questions as possible from textbooks, workbooks and the Internet, from which high-quality questions are selected.

After collecting these questions, we conduct automatical preprocessing on the collected data, which includes removing questions that are not math word problems, discarding questions with fewer than 10 Chinese characters, and retaining only questions that contain definite answers. Additionally, any questions that require reference to images are also discarded.

On the preprocessed data, we further perform manual annotation and processing: MWP categorization, question standardization, reasoning step and answer standardization and multiple-choice question transformation. We elaborate these data curation steps in the following subsections.

3.1.   MWP Categorization

We categorize the collected questions into 17 types, each corresponding to a key or basic concept111https://www.nctm.org/Standards-and-Positions/Principles-and-Standards/Principles,-Standards,-and-Expectations/ inherent in the MWPs. We introduce these key concepts encompassed in our dataset, along with their corresponding categories as follows.

Number & Operations: This mathematical concept requires an understanding of numbers, ways of representing numbers, relationships among numbers, and number systems. It also necessitates an understanding of the meanings of operations and the ability to compute with these operations. This concept includes 7 MWP categories: Percents, Decimals, Fractions, Factors & Multiples, Counting, Proportions and Mixed Operations.

Measurement: Measurement requires an understanding of the measurable attributes of objects and the units, systems, and processes of measurement. It corresponds to two MWP categories, namely Spatial Sense and Time.

Data Analysis & Probability: Data analysis and probability requires one to select and use appropriate statistical methods to analyze data and to apply basic concepts of probability. This concept is related to Central Tendency and Probability.

Algebra: This concept involves understanding patterns, relations and functions. The MWP categories of this concept include Equations and Patterns.

Geometry: Geometry is to analyze the characteristics and properties of two- and three-dimensional geometric shapes, specify locations and describe spatial relationships using coordinate geometry and other representational systems. It contains two MWP categories: Two & Three Dimensional Geometry (Basic Geometry) and Analytic Geometry.

Others: We also categorized two types of special MWPs. Problem 1: simple optimization problems. Problem 2: tree planting problems that involve the relationship between points and segments.

3.2.   Question Standardization

Many questions in the selected data contain multiple queries. We normalize these questions so that each question contains only a single query. Ambiguous queries are also rephrased to enable the model to generate a unique answer.

3.3.   Mathematical Reasoning and Answer Standardization

The process of answering MWPs is manually conducted, and the ground-truth answers are manually double checked by humans. We ask annotators to provide the steps in answering each MWP. Each step should be atomic and indivisible. For calculations that use a fixed solution formula, e.g., computing the area of a circle, we consider them as single-step MWPs.The reasons for annotating the number of required mathematical reasoning steps are two-fold:

1. The number of required reasoning steps can be treated as a proxy to the difficulty level of MWPs. Intuitively, MWPs that require multiple steps to solve are more difficult than those solved in a single step. The progression from one step to the next also represents the reasoning process. Therefore, we categorize the difficulty of MWPs in our dataset into three levels. MWPs that can be solved in a single step are level-1 MWPs; MWPs that require two steps to solve are level-2 MWPs; level-3 MWPs are those that require three or more steps to solve.

2. Presenting the number of reasoning steps facilitates reviewing and analyzing the collected data, thereby ensuring data quality.

3.4.   Multiple-Choice Question Transformation

The original MWPs are accompanied with their single ground-truth answers. To facilitate automatic evaluation, we also transform them into multiple-choice question forms by manually providing additional contrastive answer options, similar to the AQUA dataset (Ling et al., 2017).

4.   Data Statistics and Analysis

We provide data statistics and analysis on contamination of our dataset in this section.

4.1.   Data Statistics

Concepts Type Total Level-1 Level-2 Level-3
Number & Operations Percents 60 20 20 20
Decimals 93 23 30 40
Fractions 81 24 27 30
Fac&Multi 61 20 21 20
Counting 60 20 20 20
Proportions 78 20 20 38
Mix Operations 267 91 107 69
Measurement Spatial Sense 89 20 47 22
Time 64 20 24 20
Data Analysis & Probability Central Tendency 189 22 98 69
Probability 68 28 20 20
Algebra Equations 100 0 25 75
Patterns 60 20 20 20
Geometry Basic Geometry 132 20 48 64
Analytic Geometry 60 20 20 20
Others Problem 1 62 20 21 21
Problem 2 60 20 20 20
Table 2: Overall statistics of FineMath. Level-1/2/3 denotes that a math word problem requires 1/2/3+ mathematical reasoning steps to solve.

The overall data statistics are displayed in Table 2. All 1,584 questions are categorized into five major mathematical concepts and two classic types of MWPs. Each type contains at least 60 questions, and each difficulty level contains at least 20 questions.

4.2.   Analysis on Contamination

Refer to caption
Figure 2: Contamination analysis. The overlap rate between FineMath and the training sets of Ape210k.

FineMath serves as a comprehensive benchmark, encompassing a diverse range of math word problems at the Chinese elementary school level. It is specifically designed to assess the mathematical reasoning capabilities of Chinese large language models. However, these language models are typically trained on a huge amount of data derived from multiple sources, including web pages, books, codes and so on. This raises the potential risk of test data contamination, as some test examples from FineMath may unintentionally be included in the training data of these language models. Test data contamination can lead to an overestimation of a model’s performance, potentially resulting in misleading conclusions regarding the model’s generalization capabilities. Consequently, investigating contamination and its impact on model performance for FineMath is of paramount importance.

Ape210K (Zhao et al., 2020) is a publicly available large-scale Chinese math word problem dataset, which has been splited into training, validation, and test sets. It is commonly utilized as a training dataset for mathematical problem-solving models (Hu and Jiang, 2022; Wu et al., 2021; Liang et al., 2023; Huang et al., 2021; Xiong et al., 2022; Huang et al., 2023; Yang et al., 2023; Liang et al., 2022). To determine potential contamination from Ape210K in FineMath, we adopt the identical methodology leveraged in GPT-3 (Brown et al., 2020) to compute the n-gram overlap between Ape210K and FineMath. In this approach, a test example in FineMath is considered as an overlapped example with Ape210k if any n-gram from this test example also appears in Ape210k. Specifically, we insert white spaces around any Chinese, Japanese, and Korean (CJK) characters, as well as between punctuation marks and words. Subsequently, we tokenize the text based on these white spaces. It is important to note that we disregard letter case when computing n-grams.

To perform a rigorous quantitative assessment of contamination, we define the overlap rate as the fraction of instances within FineMath that exhibit such overlap. Furthermore, for the purposes of computing overlap, we set the value of n to 13. The overlap rate between FineMath and the training sets of Ape210k is depicted in Figure 2. It suggests that the overlap rates of some question types are significantly higher than others, such as Basic Geometry and Proportions. To gain deeper insights into the impact of these overlapped examples on model performance, we partition the test examples into two datasets: a contaminated dataset composed of the overlapped examples, and a clean dataset whose test examples exhibit no overlap with the Ape210k training set. Subsequently, we examine the performance of the model on each of these datasets separately. We select GPT-4 and MathGLM-10B for analysis since GPT-4 is widely recognized as the most advanced LLM currently available and MathGLM-10B has been trained on the Ape210k training set which overlaps with some test examples of FineMath. The experimental results are presentend in Table 3. Notably, MathGLM-10B performs significantly better on the contaminated dataset compared to the clean dataset. In contrast, GPT-4 exhibits comparable performance on both datasets. This suggests that MathGLM-10B may be overfitting to the overlapped examples and that contamination can inflate a model’s performance. Consequently, to ensure a fair comparison between models and to draw accurate conclusions from the FineMath benchmark, we recommend filtering out overlapped examples between the training set and the FineMath benchmark.

Model/Dataset Level-1 Level-2 Level-3 Overall
MathGLM-10B (Clean) 0.45 0.42 0.22 0.37
MathGLM-10B (Contaminated) 0.65 0.75 0.70 0.71
GPT-4 (Clean) 0.83 0.76 0.61 0.74
GPT-4 (Contaminated) 0.83 0.65 0.63 0.67
Table 3: Accuracy results of GPT-4 and MathGLM-10B on the contaminated dataset and clean dataset.

5.   Experiments

We conducted experiments on the proposed FineMath to evaluate a series of LLMs, assessing the mathematical reasoning capabilities of them.

5.1.   Evaluated LLMs

We assessed three classes of LLMs: GPT-4 and GPT-3.5-Turbo developed by OpenAI; LLMs developed for Chinese; and LLMs finetuned with Chinese mathematics data. Specific information can be found in Table 4.

Model RLHF Parameters Training Token
GPT-4 w - -
GPT-3.5-Turbo w - -
GhatGLM2-6B w 6B 1.4T
Moss-SFT-16B w/o 16B 120B
InternLM-Chat-7B w 7B 1.6T
Qwen-7B-Chat w 7B 2.4T
Baichuan-7B w/o 7B 1.2T
Baichuan2-7B-Chat w 7B 2.6T
MathGLM-10B - 10B -
MathGLM-335M - 335M -
Table 4: All the LLMs that we evaluated in this paper. MathGLM-10B and MathGLM-335M, both of which were fine-tuned using an arithmetic training dataset.
Prompt 0: Nothing is provided, only the question is input into the model.
Prompt 1: Here is a math word problem; please provide the answer to this question.
Do not explain the reason.
Prompt 2: Here is a math word problem; please select the correct option.
Do not explain the reason.
Prompt 3: Here is a math word problem, please give the answer to this question.
And explain why.
Prompt 4: Answer:….
Table 5: Prompts used for evaluation and analysis.

5.2.   Prompts

All experiments were conducted under the zero-shot. We tried several prompts for evaluation and analysis, which are shown in Table 5.

5.3.   Main Results

Refer to caption
Figure 3: Main results of different evaluated LLMs on our dataset (under Prompt 0).
Model Percents Decimals Fractions Factors & Counting Proportions Mix Spatial Time Central Probability Equations Patterns Basic Analytic Problem 1 Problem 2
Multiples Operations Sense Tendency Geometry Geometry
GPT-4 0.8 0.76 0.67 0.74 0.38 0.78 0.89 0.71 0.77 0.87 0.68 0.64 0.68 0.7 0.68 0.39 0.55
GPT-3.5-Turbo 0.75 0.7 0.69 0.7 0.33 0.65 0.81 0.57 0.69 0.71 0.34 0.71 0.47 0.49 0.42 0.23 0.55
ChatGLM2-6B 0.65 0.46 0.3 0.52 0.3 0.33 0.63 0.63 0.38 0.35 0.24 0.38 0.2 0.46 0.18 0.1 0.3
Moss-SFT-16B 0.13 0.05 0.07 0.11 0.03 0.06 0.15 0.06 0.09 0.03 0.12 0.05 0.07 0.05 0.03 0.08 0.05
InternLM-Chat-7B 0.4 0.35 0.15 0.34 0.05 0.31 0.42 0.36 0.23 0.17 0.18 0.23 0.22 0.18 0.25 0.21 0.32
Qwen-7B-Chat 0.58 0.53 0.41 0.39 0.1 0.32 0.58 0.42 0.28 0.31 0.35 0.34 0.32 0.38 0.2 0.18 0.25
Baichuan-7B 0.08 0.13 0.01 0.08 0.05 0.03 0.14 0.08 0.05 0.1 0.12 0.06 0.07 0.07 0.07 0.05 0.08
Baichuan2-7B-Chat 0.58 0.52 0.48 0.54 0.2 0.4 0.65 0.58 0.25 0.32 0.26 0.4 0.25 0.37 0.13 0.11 0.22
MathGLM-10B 0.52 0.42 0.68 0.57 0.05 0.64 0.61 0.38 0.45 0.53 0.29 0.36 0.18 0.74 0.23 0.4 0.65
MathGLM-335M 0.45 0.3 0.63 0.49 0.07 0.47 0.57 0.34 0.3 0.47 0.29 0.31 0.08 0.68 0.23 0.39 0.58
Table 6: Results across the 17 MWP categories (under Prompt 0).
Refer to caption
Figure 4: Results in terms of the number of mathematical reasoning steps (under Prompt 0).

The overall accuracy results of assessed LLMs are visualized in Figure 3. GPT-4 and GPT-3.5-turbo perform outstandingly, with their accuracies reaching as high as 73% and 62%, respectively.

Among the evaluated Chinese LLMs, MathGLM-10B, MathGLM-335M, ChatGLM2-6B and Baichuan2-7B-Chat obtain an accuracy of > 40%. Qwen-7B-Chat and InternLM-Chat-7B are at a slightly below-average level. However, both Baichuan-7B and Moss-SFT-16B perform poorly on our dataset, with an accuracy of < 10%. Upon examining the responses generated by these two models, we find that their answers often stray from the MWPs, generating a lot of irrelevant content or repeatedly producing the same questions.

By considering both model accuracy and the detailed information provided in Table 4, we deduce that the lower accuracy of Moss-SFT-16B is due to an insufficient amount of training data. The performance of Baichuan-7B is hampered because it has not undergone RLHF fine-tuning, which prevents the model from fully understanding the question. In contrast, the accuracy of Baichuan2-7B-Chat, which has been fine-tuned, has significantly improved. In summary, RLHF fine-tuning, having model parameters exceeding 6 billion, and training data reaching the trillion level are all crucial for training an LLM with problem-solving and reasoning capabilities.

5.4.   Results across 17 MWP Categories

Results across the 17 MWP categories are displayed in Table 6. It is evident that the MWP types “Counting” and “Problem1” are more challenging than other MWP categories according to the results. This could be due to the complexities involved in the counting of Chinese numerals and their conversion to Arabic numerals, and the common sense issues encountered in the optimization problem for “Problem1”. All models demonstrate better performance on “Mixed Operation” than on other categories. We also observe that the performance of different models vary significantly.

GPT-4 outperforms all other models. It achieves an accuracy below 40% on only two MWP categories, surpasses 80% on three MWP categories, exceeds 70% in six MWP categories, and goes beyond 60% in five MWP categories. GPT-4 scores the highest, 89%, on the ’Mixed Operation’ MWPs, while it scores the lowest, 38%, on the ’Counting’ MWPs. Notably, GPT-4 excels in Probability and Analytic Geometry, outperforming all other models with an improvement of over 25%. GPT-3.5-Turbo’s performance on different MWP categories is similar to that of GPT-4, but there is a significant difference (over 20%) on Probability, Basic Geometry, and Analytic Geometry.

MathGLM-10B and ChatGLM2-6B exhibit similar performance trends across most MWP categories. However, significant differences are observed on Fractions, Counting, Spatial Sense, Basic Geometry, Proportions, Problem1, and Problem2. Due to contamination, MathGLM-10B surpasses GPT-4 on Basic Geometry, Problem 1 and Problem 2.

Baichuan2-7B-Chat, Qwen-7B-Chat and InternLM-Chat-7B are still at a medium or slightly below-average level. Both Moss-SFT-16B and Baichuan-7B perform poorly across all MWP categories, failing to achieve an accuracy of 20%.

Prompt Level-1 Level-2 Level-3 Overall
Prompt 0 0.83 0.77 0.62 0.73
Prompt 1 0.81 0.69 0.34 0.59
Prompt 4 0.8 0.65 0.36 0.58
Table 7: Accuracy results of GPT-4 with Prompt 0, Prompt 1, Prompt 4.
Question

学校买来120本课外读物,其中的40%分给高年级,剩下的课外读物按5:3分给中年级和低年级,低年级分到多少本?

The school bought 120 supplementary books, 40% of which were given to the upper grades. The remaining books were divided between the middle and early grades in a ratio of 5:3. How many books did the early grades receive?

Prompt 0+Question

GPT-4 Response: 高年级分到的课外读物为120*40%=48本;剩下的课外读物为120-48=72本;低年级分到的课外读物为72*3/(5+3)=27本。答:低年级分到27本。 Results: \usym2714

GPT-4 Response: The number of supplementary books allocated to the upper grades is 120*40%=48 books. The remaining books total 120-48=72 books. The supplementary books allocated to the early grades are calculated as 72*3/(5+3)=27 books. Answer: The early grades are allocated 27 books.

Prompt 1+Question

GPT-4 Response: 36本 Results: \usym2717

GPT-4 Response: 36 books

Prompt 3+Question

GPT-4 Response: 36本。首先,学校将120本书的40%分给高年级,即120*40%=48本。剩下的书就是120-48=72本。然后,这72本书按照5:3的比例分给中年级和低年级。这个比例的总和是5+3=8,所以低年级分到的书就是72*(3/8)=27本。Results: \usym2717

GPT-4 Response: 36 books. Firstly, the school allocates 40% of the 120 books to the upper grades. This totals to 120*40%=48 books. The remaining books amount to 120-48 = 72 books. Then, these 72 books are divided between the middle and early grades at a ratio of 5:3. The total of this ratio is 5+3=8. Therefore, the number of books allocated to the early grades is 72*(3/8) =27 books.

Prompt 4+Question

GPT-4 Response: 36本 Results: \usym2717

GPT-4 Response: 36 books

Prompt 0+Question

GPT-4 Response: C:27

GPT-4 Response: C:27

(with MCQs)
A:72 B:48 C:27 D:80

Results: \usym2714

Prompt 0+Question

GPT-4 Response: D:36

GPT-4 Response: D:36

(with D replaced)
A:72 B:48 C:27 D:36

Results: \usym2717

Table 8: Different prompts and their corresponding responses from GPT-4. MCQs: multiple-choice questions.

5.5.   Results in terms of the Number of Mathematical Reasoning Steps

We further conducted experiments to evaluate LLMs reasoning ability in terms of the number of mathematical reasoning steps. Results are shown in Figure 4. We observe that the performance of LLMs decreases as the number of reasoning steps increases. GPT-4 maintains an accuracy of over 60% at all difficulty levels, reaching as high as 82% on MWPs that require only one step of reasoning. The accuracy of GPT-3.5-Turbo is, on average, 10% lower than that of GPT-4. While ChatGLM2-6B, Baichuan2-7B-Chat and Qwen-7B-Chat outperform MathGLM-335M and MathGLM-10B on Level-1 MWPs, its accuracy falls below those of MathGLM-335M and MathGLM-10B on Level-2/3 MWPs. Similar to their performance across MWP categories, Moss-SFT-16B and Baichuan-7B show a significant difference in performance at all difficulty levels compared to the other models.

The accuracy difference between Qwen-7B-Chat and InternLM-Chat-7B on different reasoning steps is quite substantial, exceeding 30%. In the case of Qwen-7B-Chat, the accuracy on problems requiring only one-step reasoning is 62%, but this figure drops to just 21% for problems requiring three or more reasoning steps. This phenomenon suggests that the model may need more training in terms of inference.

6.   Analysis

Unlike other studies that only evaluate accuracy, we have further analyzed factors in the evaluation process. These overlooked factors greatly affect the evaluation results and our understanding of the true mathematical reasoning capabilities of LLMs.

6.1.   Prompts Really Does Matter

During the evaluation, instructions are generally used to guide the assessed model to produce answers. For instance, we might say, “Here is a math problem, please provide the answer to this question. Do not explain the reason.” Alternatively, we might provide an answer template such as “Question: … Answer:”. However, our experiments showed that even a single word like “Answer:” can significantly affect the model’s accuracy. We tested three prompt 0, prompt 1 and Prompt4 on GPT-4, results are shown in Table 7.

We can see that the overall accuracy results with the three prompts are 73%, 59%, and 58%, respectively, with a gap reaching up to 15%.

Prompt like “Answer:” appear to encourage the model to forego reasoning and directly provide the answer, which increases the likelihood of generating incorrect responses. An example is shown in Table 8: Prompt 4+Question.

6.2.   Evaluation Methods: Generation vs. Option Prediction

In our preliminary experiments, we have discovered that some newly developed LLMs do not follow instructions well, often generating large chunks of tokens unrelated to the answer. Therefore, we decide to transform our data into multiple-choice questions, for which the evaluated model can then select the correct answer option.

Model MCQs Level-1 Level-2 Level-3 Overall
GPT-4 w/o 0.83 0.77 0.62 0.73
w 0.67 0.6 0.6 0.62
GPT-3.5-Turbo w/o 0.73 0.65 0.51 0.62
w 0.65 0.66 0.57 0.63
ChatGLM2-6B w/o 0.58 0.46 0.25 0.42
w 0.69 0.62 0.46 0.58
Moss-SFT-16B w/o 0.18 0.04 0.05 0.08
w 0.16 0.13 0.14 0.14
InternLM-Chat-7B w/o 0.45 0.27 0.14 0.27
w 0.29 0.26 0.29 0.28
Qwen-7B-Chat w/o 0.62 0.39 0.21 0.38
w 0.47 0.42 0.39 0.42
Baichuan-7B w/o 0.15 0.09 0.03 0.08
w 0.32 0.25 0.25 0.27
Baichuan2-7B-Chat w/o 0.55 0.45 0.27 0.41
w 0.44 0.42 0.32 0.39
Table 9: Accuracy of LLMs with different evaluation methods: generation vs. option prediction.

Comparison results are displayed in Table 9. We can observe a significant difference in accuracy between option prediction (with multiple-choice questions) and direct answer generation, with a gap that can exceed 10%. Interestingly, restructuring the task in the form of multiple-choice questions seems to reduce the accuracy of high-performing models while increasing the accuracy of models that perform poorly. Upon examining instances, we have found that the answer option can act as another type of prompt influencing the model’s performance. For example in Table 8: Prompt 0+Question (with MCQs) and Prompt 0+Question (with D replaced).

Examples of GPT-4 outputs with different prompts and task forms are shown in Table 8. We want to understand why GPT-4 would provide the incorrect answer “36” under Prompt 1. Therefore, we utilize Prompt 3 to have GPT-4 explain its reasoning for choosing “36”. Interestingly, GPT-4 mentions the correct number “27” in the explanation, but still provides the incorrect answer, “36”. Given the seeming importance of “36”, we replace one of the options in the multiple-choice question with “36”. The model, which initially selected the correct answer, abandoned it in favor of “36”. Further examples can be observed in the responses generated by other models. For instance, the reasoning process may be incorrect, yet the correct answer is ultimately chosen. The options provided to the model also seem to influence the model’s generation probability to a certain degree. Therefore, we recommend using generation, rather than option prediction, for a more accurate evaluation of LLMs.

6.3.   Comparison of Response Lengths

We conducted a statistical analysis on the length of the responses generated by the models. We discover two phenomena. First, models like GPT-4 and GPT-3.5-Turbo tend to generate responses that are closely centered around the question, with shorter text. This may demonstrate the characteristics of models with high accuracy. Second, the more reasoning steps an MWP requires, the longer the response tends to be. Please refer to Table 10 for more details.

Model Level-1 Level-2 Level-3
GPT-4 33.18 56.92 102.37
GPT-4+Prompt 4 8.67 13.12 26.73
GPT-3.5-Turbo 71.10 104.78 173.87
GhatGLM2-6B 167.75 224.02 381.18
InternLM-Chat-7B 65.65 72.20 119.69
Qwen-7B-Chat 62.29 93.57 138.21
Baichuan2-7B-Chat 114.72 156.33 202.52
Baichuan-7B 128.48 174.73 130.21
Moss-SFT-16B 105.43 131.29 149.48
Table 10: Different response lengths of models (Since MathGLM-10B and MathGLM-335M, after fine-tuning, generates more formulas but fewer language descriptions, is not compared in this context).

We speculate that the model’s “confidence” in answering questions influences the length of its response. This tendency is also observed in some models that do not adhere strictly to instructions. For instance, even when instructed to provide only the answer without explaining, these models still generate logical reasoning for difficult MWPs.

7.   Conclusion

We present a fine-grained benchmark, FineMath, to comprehensively evaluate the mathematical capabilities of Chinese LLMs. We strive to evaluate as many LLMs as possible. We also conduct a contamination analysis, enabling researchers to examine whether the training data influences the evaluation results. Through testing various evaluation methods and processes, we demonstrate their potential to influence the results, underscoring the necessity for fair and effective evaluations to consider the interference caused by these two aspects.

8.   Bibliographical References

\c@NAT@ctr