FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models

Abstract

To thoroughly assess the mathematical reasoning abilities of Large Language Models (LLMs), we need to carefully curate evaluation datasets covering diverse mathematical concepts and mathematical problems at different difficulty levels. In pursuit of this objective, we propose FineMath in this paper, a fine-grained mathematical evaluation benchmark dataset for assessing Chinese LLMs. FineMath is created to cover the major key mathematical concepts taught in elementary school math, which are further divided into 17 categories of math word problems, enabling in-depth analysis of mathematical reasoning abilities of LLMs. All the 17 categories of math word problems are manually annotated with their difficulty levels according to the number of reasoning steps required to solve these problems. We conduct extensive experiments on a wide range of LLMs on FineMath and find that there is still considerable room for improvements in terms of mathematical reasoning capability of Chinese LLMs. We also carry out an in-depth analysis on the evaluation process and methods that have been overlooked previously. These two factors significantly influence the model results and our understanding of their mathematical reasoning capabilities. The dataset will be publicly available soon.

Keywords: Large Language Models, Mathematical Reasoning Evaluation, Benchmark

\NAT@set@cites

Yan Liu¹, Renren **¹, Lin Shi², Zheng Yao³, Deyi Xiong¹^†^†thanks: Corresponding author.

¹Tian** University, Tian**, China,

²China University of Geosciences, Wuhan, China,

³The University of Queensland, QLD, Australia,

{yan_liu,rr**,dyxiong}@tju.edu.cn

{lingshi0265}@gmail.com

{zheng.yao1}@uq.net.au

Abstract content

1. Introduction

Mathematics has always been an important part of the evaluation of LLMs (Wei et al., 2022), which not only assesses the ability of LLMs in understanding and solving mathematical problems, but also profoundly measures the essential capability of LLMs in abstract conceptualization, logical reasoning and so on. Therefore, a high-quality mathematical evaluation benchmark is of great importance to a comprehensive LLM evaluation.

Previous works (Hosseini et al., 2014; Roy and Roth, 2015) curate mathematical test sets in English, which serve as a repository for grade school math word problems with accuracy being used as the evaluation metric. Recent years have witnessed a substantial progress in Chinese LLMs. Hence, mathematical evaluation datasets in Chinese (Wei et al., 2023; Yang et al., 2023) have been created correspondingly. These two previous Chinese datasets categorize testing instances by grade levels, providing a preliminary evaluation of Chinese LLMs on these levels. Their evaluation results show that the accuracy of GPT-4 for any grade surpasses or is close to 60%. However, a simple accuracy does not help us understand which mathematical concepts or skills LLMs have mastered. There is an urgent need for a comprehensive test set that can provide fine-grained evaluation results.

Refer to caption — Figure 1: FineMath can evaluate LLMs’ mathematical ability from three aspects: accuracy of understanding abstract mathematical concepts, accuracy of reasoning, and overall accuracy.

Apart from arithmetic operations, mathematical ability involves diverse reasoning capabilities. We believe that the evaluation of the mathematical ability of LLMs should include two aspects:

•

Providing diverse abstract mathematical concepts.
•

Evaluating the logical and mathematic reasoning abilities of LLMs over mathematical problems at different difficulty levels.

In pursuit of these aspects, we propose FineMath, a benchmark composed of Math Word Problems (MWPs), designed to comprehensively assess LLMs’ mathematical capability in a fine-grained way. FineMath organizes MWPs according to key mathematical concepts taught in elementary school, and each type of MWPs contains three levels of difficulty, facilitating detailed reasoning ability analysis.

Specifically, FineMath consists of 17 types of MWPs. For defining and collecting these MWP types, we have referred to the mathematics curriculum standards established by China’s Ministry of Education and the principles and standards for school mathematics set by the American National Council of Teachers of Mathematics (NCTM). The key concepts and skills in grade school include Number & Operations, Algebra, Geometry, Measurement, Data Analysis & Probability, Problem Solving, and Reasoning. Different key concepts involve the use of different knowledge and abilities. We have also annotated the reasoning steps and process for each MWP, categorizing them into questions requiring one reasoning step, two reasoning steps, and three or more reasoning steps.

Based on the curated benchmark, we have conducted a thorough analysis of the evaluation process and methods. Evaluations in mathematics have always emphasized the accuracy of results. However, we have observed factors that greatly influence the model’s results, thus affecting our understanding of its capabilities:

•

The model is sensitive to the prompts used during the evaluation, and the results vary accordingly.
•

The methods of evaluation can also affect the model’s results. We have compared the model’s performance in selecting the final answers from the options of multi-choice questions, demonstrating that the form of evaluation tasks and options can influence the model’s results to a certain extent.
•

The length of the LLM-generated answers, to some degree, reflects the model’s “confidence” when handling questions.

The main contributions of our work are as follows:

•

We propose a fine-grained elementary school MWPs benchmark for Chinese LLMs, which can assess the mathematical capabilities of LLMs from three aspects: accuracy of understanding abstract mathematical concepts, accuracy of reasoning, and overall accuracy.
•

We conduct an in-depth analysis of the contamination in our dataset, enabling researchers of LLMs to conduct a credibility analysis on the evaluation results.
•

We evaluate GPT-4, GPT-3.5-Turbo, and 8 Chinese LLMs, revealing their mathematical reasoning capabilities, and provide detailed evaluation results in various aspects.

Test Sets	Size	Language
AddSub (Hosseini et al., 2014)	395	En
MultiArith (Roy and Roth, 2015)	600	En
SingleEq (Koncel-Kedziorski et al., 2015)	508	En
AQUA (Ling et al., 2017)	100K	En
AsDiv (Miao et al., 2020)	2,305	En
GSM8K (Cobbe et al., 2021)	8.5k	En
SVAMP (Patel et al., 2021)	8.5k	En
CMATH (Wei et al., 2023)	1.7K	Zh
K6 (Yang et al., 2023)	600	Zh
FineMath (ours)	1,584	Zh

Table 1: An overview of MWP datasets.

2. Related Work

Traditional MWP datasets like AddSub (Hosseini et al., 2014) and MultiArith (Roy and Roth, 2015) are integrated into a MWP repository. Other similar datasets include SingleEq (Koncel-Kedziorski et al., 2015), AQUA (Ling et al., 2017) and AsDiv (Miao et al., 2020). GSM8K (Cobbe et al., 2021) and SVAMP (Patel et al., 2021) take advantage of detailed annotations and have prevailed in recent evaluations.

Our work is most inspired by the MATH (Hendrycks et al., 2021) dataset. MATH collects problems from American high school mathematics competitions and categorizes problems into seven subjects. These subjects are Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra and Precalculus. However, these problems are very challenging, even when humans answer them, the accuracy rate is only 40%. Considering that many LLMs are still in their early versions, overly difficult problems may have limited significance for testing these models.

CMATH proposed by (Wei et al., 2023) and K6 proposed by (Yang et al., 2023) are the two datasets that are relatively similar to ours developed concurrently. All these datasets focus on math word problems of elementary school, and organize instances by grade level. CMATH contains 1.7K problems collected from workbooks and exams on the Internet. K6 is composed of 600 problems collected from an educational institution. However, neither of the two datasets have been publicly released, precluding us from conducting an empirical comparison to them.

Ape210K proposed in (Zhao et al., 2020) is a slightly earlier dataset. It contains 210K enormous Chinese math word problems from elementary school. Test sets alone in Ape210K contain as many as 5,000 problems. However, the test sets do not provide annotations related to LLMs.

An overview of the related MWP datasets is shown in Table 1.

3. Data Collection and Annotation

We create our dataset by collecting a diverse set of questions. We collect as many questions as possible from textbooks, workbooks and the Internet, from which high-quality questions are selected.

After collecting these questions, we conduct automatical preprocessing on the collected data, which includes removing questions that are not math word problems, discarding questions with fewer than 10 Chinese characters, and retaining only questions that contain definite answers. Additionally, any questions that require reference to images are also discarded.

On the preprocessed data, we further perform manual annotation and processing: MWP categorization, question standardization, reasoning step and answer standardization and multiple-choice question transformation. We elaborate these data curation steps in the following subsections.

3.1. MWP Categorization

We categorize the collected questions into 17 types, each corresponding to a key or basic concept¹¹1https://www.nctm.org/Standards-and-Positions/Principles-and-Standards/Principles,-Standards,-and-Expectations/ inherent in the MWPs. We introduce these key concepts encompassed in our dataset, along with their corresponding categories as follows.

Number & Operations: This mathematical concept requires an understanding of numbers, ways of representing numbers, relationships among numbers, and number systems. It also necessitates an understanding of the meanings of operations and the ability to compute with these operations. This concept includes 7 MWP categories: Percents, Decimals, Fractions, Factors & Multiples, Counting, Proportions and Mixed Operations.

Measurement: Measurement requires an understanding of the measurable attributes of objects and the units, systems, and processes of measurement. It corresponds to two MWP categories, namely Spatial Sense and Time.

Data Analysis & Probability: Data analysis and probability requires one to select and use appropriate statistical methods to analyze data and to apply basic concepts of probability. This concept is related to Central Tendency and Probability.

Algebra: This concept involves understanding patterns, relations and functions. The MWP categories of this concept include Equations and Patterns.

Geometry: Geometry is to analyze the characteristics and properties of two- and three-dimensional geometric shapes, specify locations and describe spatial relationships using coordinate geometry and other representational systems. It contains two MWP categories: Two & Three Dimensional Geometry (Basic Geometry) and Analytic Geometry.

Others: We also categorized two types of special MWPs. Problem 1: simple optimization problems. Problem 2: tree planting problems that involve the relationship between points and segments.

3.2. Question Standardization

Many questions in the selected data contain multiple queries. We normalize these questions so that each question contains only a single query. Ambiguous queries are also rephrased to enable the model to generate a unique answer.

3.3. Mathematical Reasoning and Answer Standardization

The process of answering MWPs is manually conducted, and the ground-truth answers are manually double checked by humans. We ask annotators to provide the steps in answering each MWP. Each step should be atomic and indivisible. For calculations that use a fixed solution formula, e.g., computing the area of a circle, we consider them as single-step MWPs.The reasons for annotating the number of required mathematical reasoning steps are two-fold:

1. The number of required reasoning steps can be treated as a proxy to the difficulty level of MWPs. Intuitively, MWPs that require multiple steps to solve are more difficult than those solved in a single step. The progression from one step to the next also represents the reasoning process. Therefore, we categorize the difficulty of MWPs in our dataset into three levels. MWPs that can be solved in a single step are level-1 MWPs; MWPs that require two steps to solve are level-2 MWPs; level-3 MWPs are those that require three or more steps to solve.

2. Presenting the number of reasoning steps facilitates reviewing and analyzing the collected data, thereby ensuring data quality.

3.4. Multiple-Choice Question Transformation

The original MWPs are accompanied with their single ground-truth answers. To facilitate automatic evaluation, we also transform them into multiple-choice question forms by manually providing additional contrastive answer options, similar to the AQUA dataset (Ling et al., 2017).

4. Data Statistics and Analysis

We provide data statistics and analysis on contamination of our dataset in this section.

4.1. Data Statistics

Concepts	Type	Total	Level-1	Level-2	Level-3
Number & Operations	Percents	60	20	20	20
	Decimals	93	23	30	40
	Fractions	81	24	27	30
	Fac&Multi	61	20	21	20
	Counting	60	20	20	20
	Proportions	78	20	20	38
	Mix Operations	267	91	107	69
Measurement	Spatial Sense	89	20	47	22
Measurement	Time	64	20	24	20
Data Analysis & Probability	Central Tendency	189	22	98	69
Data Analysis & Probability	Probability	68	28	20	20
Algebra	Equations	100	0	25	75
Algebra	Patterns	60	20	20	20
Geometry	Basic Geometry	132	20	48	64
Geometry	Analytic Geometry	60	20	20	20
Others	Problem 1	62	20	21	21
Others	Problem 2	60	20	20	20

Table 2: Overall statistics of FineMath. Level-1/2/3 denotes that a math word problem requires 1/2/3+ mathematical reasoning steps to solve.

The overall data statistics are displayed in Table 2. All 1,584 questions are categorized into five major mathematical concepts and two classic types of MWPs. Each type contains at least 60 questions, and each difficulty level contains at least 20 questions.

4.2. Analysis on Contamination

FineMath serves as a comprehensive benchmark, encompassing a diverse range of math word problems at the Chinese elementary school level. It is specifically designed to assess the mathematical reasoning capabilities of Chinese large language models. However, these language models are typically trained on a huge amount of data derived from multiple sources, including web pages, books, codes and so on. This raises the potential risk of test data contamination, as some test examples from FineMath may unintentionally be included in the training data of these language models. Test data contamination can lead to an overestimation of a model’s performance, potentially resulting in misleading conclusions regarding the model’s generalization capabilities. Consequently, investigating contamination and its impact on model performance for FineMath is of paramount importance.

Ape210K (Zhao et al., 2020) is a publicly available large-scale Chinese math word problem dataset, which has been splited into training, validation, and test sets. It is commonly utilized as a training dataset for mathematical problem-solving models (Hu and Jiang, 2022; Wu et al., 2021; Liang et al., 2023; Huang et al., 2021; Xiong et al., 2022; Huang et al., 2023; Yang et al., 2023; Liang et al., 2022). To determine potential contamination from Ape210K in FineMath, we adopt the identical methodology leveraged in GPT-3 (Brown et al., 2020) to compute the n-gram overlap between Ape210K and FineMath. In this approach, a test example in FineMath is considered as an overlapped example with Ape210k if any n-gram from this test example also appears in Ape210k. Specifically, we insert white spaces around any Chinese, Japanese, and Korean (CJK) characters, as well as between punctuation marks and words. Subsequently, we tokenize the text based on these white spaces. It is important to note that we disregard letter case when computing n-grams.

To perform a rigorous quantitative assessment of contamination, we define the overlap rate as the fraction of instances within FineMath that exhibit such overlap. Furthermore, for the purposes of computing overlap, we set the value of n to 13. The overlap rate between FineMath and the training sets of Ape210k is depicted in Figure 2. It suggests that the overlap rates of some question types are significantly higher than others, such as Basic Geometry and Proportions. To gain deeper insights into the impact of these overlapped examples on model performance, we partition the test examples into two datasets: a contaminated dataset composed of the overlapped examples, and a clean dataset whose test examples exhibit no overlap with the Ape210k training set. Subsequently, we examine the performance of the model on each of these datasets separately. We select GPT-4 and MathGLM-10B for analysis since GPT-4 is widely recognized as the most advanced LLM currently available and MathGLM-10B has been trained on the Ape210k training set which overlaps with some test examples of FineMath. The experimental results are presentend in Table 3. Notably, MathGLM-10B performs significantly better on the contaminated dataset compared to the clean dataset. In contrast, GPT-4 exhibits comparable performance on both datasets. This suggests that MathGLM-10B may be overfitting to the overlapped examples and that contamination can inflate a model’s performance. Consequently, to ensure a fair comparison between models and to draw accurate conclusions from the FineMath benchmark, we recommend filtering out overlapped examples between the training set and the FineMath benchmark.

Model/Dataset	Level-1	Level-2	Level-3	Overall
MathGLM-10B (Clean)	0.45	0.42	0.22	0.37
MathGLM-10B (Contaminated)	0.65	0.75	0.70	0.71
GPT-4 (Clean)	0.83	0.76	0.61	0.74
GPT-4 (Contaminated)	0.83	0.65	0.63	0.67

Table 3: Accuracy results of GPT-4 and MathGLM-10B on the contaminated dataset and clean dataset.

5. Experiments

We conducted experiments on the proposed FineMath to evaluate a series of LLMs, assessing the mathematical reasoning capabilities of them.

5.1. Evaluated LLMs

We assessed three classes of LLMs: GPT-4 and GPT-3.5-Turbo developed by OpenAI; LLMs developed for Chinese; and LLMs finetuned with Chinese mathematics data. Specific information can be found in Table 4.

Model	RLHF	Parameters	Training Token
GPT-4	w	-	-
GPT-3.5-Turbo	w	-	-
GhatGLM2-6B	w	6B	1.4T
Moss-SFT-16B	w/o	16B	120B
InternLM-Chat-7B	w	7B	1.6T
Qwen-7B-Chat	w	7B	2.4T
Baichuan-7B	w/o	7B	1.2T
Baichuan2-7B-Chat	w	7B	2.6T
MathGLM-10B	-	10B	-
MathGLM-335M	-	335M	-

Table 4: All the LLMs that we evaluated in this paper. MathGLM-10B and MathGLM-335M, both of which were fine-tuned using an arithmetic training dataset.

Prompt 0:	Nothing is provided, only the question is input into the model.
Prompt 1:	Here is a math word problem; please provide the answer to this question.
Prompt 1:	Do not explain the reason.
Prompt 2:	Here is a math word problem; please select the correct option.
Prompt 2:	Do not explain the reason.
Prompt 3:	Here is a math word problem, please give the answer to this question.
Prompt 3:	And explain why.
Prompt 4:	Answer:….

Table 5: Prompts used for evaluation and analysis.

5.2. Prompts

All experiments were conducted under the zero-shot. We tried several prompts for evaluation and analysis, which are shown in Table 5.

5.3. Main Results

Model	Percents	Decimals	Fractions	Factors &	Counting	Proportions	Mix	Spatial	Time	Central	Probability	Equations	Patterns	Basic	Analytic	Problem 1	Problem 2
Model	Percents	Decimals	Fractions	Multiples	Counting	Proportions	Operations	Sense	Time	Tendency	Probability	Equations	Patterns	Geometry	Geometry	Problem 1	Problem 2
GPT-4	0.8	0.76	0.67	0.74	0.38	0.78	0.89	0.71	0.77	0.87	0.68	0.64	0.68	0.7	0.68	0.39	0.55
GPT-3.5-Turbo	0.75	0.7	0.69	0.7	0.33	0.65	0.81	0.57	0.69	0.71	0.34	0.71	0.47	0.49	0.42	0.23	0.55
ChatGLM2-6B	0.65	0.46	0.3	0.52	0.3	0.33	0.63	0.63	0.38	0.35	0.24	0.38	0.2	0.46	0.18	0.1	0.3
Moss-SFT-16B	0.13	0.05	0.07	0.11	0.03	0.06	0.15	0.06	0.09	0.03	0.12	0.05	0.07	0.05	0.03	0.08	0.05
InternLM-Chat-7B	0.4	0.35	0.15	0.34	0.05	0.31	0.42	0.36	0.23	0.17	0.18	0.23	0.22	0.18	0.25	0.21	0.32
Qwen-7B-Chat	0.58	0.53	0.41	0.39	0.1	0.32	0.58	0.42	0.28	0.31	0.35	0.34	0.32	0.38	0.2	0.18	0.25
Baichuan-7B	0.08	0.13	0.01	0.08	0.05	0.03	0.14	0.08	0.05	0.1	0.12	0.06	0.07	0.07	0.07	0.05	0.08
Baichuan2-7B-Chat	0.58	0.52	0.48	0.54	0.2	0.4	0.65	0.58	0.25	0.32	0.26	0.4	0.25	0.37	0.13	0.11	0.22
MathGLM-10B	0.52	0.42	0.68	0.57	0.05	0.64	0.61	0.38	0.45	0.53	0.29	0.36	0.18	0.74	0.23	0.4	0.65
MathGLM-335M	0.45	0.3	0.63	0.49	0.07	0.47	0.57	0.34	0.3	0.47	0.29	0.31	0.08	0.68	0.23	0.39	0.58

Table 6: Results across the 17 MWP categories (under Prompt 0).

The overall accuracy results of assessed LLMs are visualized in Figure 3. GPT-4 and GPT-3.5-turbo perform outstandingly, with their accuracies reaching as high as 73% and 62%, respectively.

Among the evaluated Chinese LLMs, MathGLM-10B, MathGLM-335M, ChatGLM2-6B and Baichuan2-7B-Chat obtain an accuracy of > 40%. Qwen-7B-Chat and InternLM-Chat-7B are at a slightly below-average level. However, both Baichuan-7B and Moss-SFT-16B perform poorly on our dataset, with an accuracy of < 10%. Upon examining the responses generated by these two models, we find that their answers often stray from the MWPs, generating a lot of irrelevant content or repeatedly producing the same questions.

By considering both model accuracy and the detailed information provided in Table 4, we deduce that the lower accuracy of Moss-SFT-16B is due to an insufficient amount of training data. The performance of Baichuan-7B is hampered because it has not undergone RLHF fine-tuning, which prevents the model from fully understanding the question. In contrast, the accuracy of Baichuan2-7B-Chat, which has been fine-tuned, has significantly improved. In summary, RLHF fine-tuning, having model parameters exceeding 6 billion, and training data reaching the trillion level are all crucial for training an LLM with problem-solving and reasoning capabilities.

5.4. Results across 17 MWP Categories

Results across the 17 MWP categories are displayed in Table 6. It is evident that the MWP types “Counting” and “Problem1” are more challenging than other MWP categories according to the results. This could be due to the complexities involved in the counting of Chinese numerals and their conversion to Arabic numerals, and the common sense issues encountered in the optimization problem for “Problem1”. All models demonstrate better performance on “Mixed Operation” than on other categories. We also observe that the performance of different models vary significantly.

GPT-4 outperforms all other models. It achieves an accuracy below 40% on only two MWP categories, surpasses 80% on three MWP categories, exceeds 70% in six MWP categories, and goes beyond 60% in five MWP categories. GPT-4 scores the highest, 89%, on the ’Mixed Operation’ MWPs, while it scores the lowest, 38%, on the ’Counting’ MWPs. Notably, GPT-4 excels in Probability and Analytic Geometry, outperforming all other models with an improvement of over 25%. GPT-3.5-Turbo’s performance on different MWP categories is similar to that of GPT-4, but there is a significant difference (over 20%) on Probability, Basic Geometry, and Analytic Geometry.

MathGLM-10B and ChatGLM2-6B exhibit similar performance trends across most MWP categories. However, significant differences are observed on Fractions, Counting, Spatial Sense, Basic Geometry, Proportions, Problem1, and Problem2. Due to contamination, MathGLM-10B surpasses GPT-4 on Basic Geometry, Problem 1 and Problem 2.

Baichuan2-7B-Chat, Qwen-7B-Chat and InternLM-Chat-7B are still at a medium or slightly below-average level. Both Moss-SFT-16B and Baichuan-7B perform poorly across all MWP categories, failing to achieve an accuracy of 20%.

Prompt	Level-1	Level-2	Level-3	Overall
Prompt 0	0.83	0.77	0.62	0.73
Prompt 1	0.81	0.69	0.34	0.59
Prompt 4	0.8	0.65	0.36	0.58

Table 7: Accuracy results of GPT-4 with Prompt 0, Prompt 1, Prompt 4.

Question	学校买来120本课外读物，其中的40%分给高年级，剩下的课外读物按5:3分给中年级和低年级，低年级分到多少本?	The school bought 120 supplementary books, 40% of which were given to the upper grades. The remaining books were divided between the middle and early grades in a ratio of 5:3. How many books did the early grades receive?
Prompt 0+Question	GPT-4 Response: 高年级分到的课外读物为12040%=48本;剩下的课外读物为120-48=72本;低年级分到的课外读物为723/(5+3)=27本。答：低年级分到27本。 Results: \usym2714	GPT-4 Response: The number of supplementary books allocated to the upper grades is 12040%=48 books. The remaining books total 120-48=72 books. The supplementary books allocated to the early grades are calculated as 723/(5+3)=27 books. Answer: The early grades are allocated 27 books.
Prompt 1+Question	GPT-4 Response: 36本 Results: \usym2717	GPT-4 Response: 36 books
Prompt 3+Question	GPT-4 Response: 36本。首先，学校将120本书的40%分给高年级，即12040%=48本。剩下的书就是120-48=72本。然后，这72本书按照5:3的比例分给中年级和低年级。这个比例的总和是5+3=8，所以低年级分到的书就是72(3/8)=27本。Results: \usym2717	GPT-4 Response: 36 books. Firstly, the school allocates 40% of the 120 books to the upper grades. This totals to 12040%=48 books. The remaining books amount to 120-48 = 72 books. Then, these 72 books are divided between the middle and early grades at a ratio of 5:3. The total of this ratio is 5+3=8. Therefore, the number of books allocated to the early grades is 72(3/8) =27 books.
Prompt 4+Question	GPT-4 Response: 36本 Results: \usym2717	GPT-4 Response: 36 books
Prompt 0+Question	GPT-4 Response: C:27	GPT-4 Response: C:27
(with MCQs)
A:72 B:48 C:27 D:80	Results: \usym2714
Prompt 0+Question	GPT-4 Response: D:36	GPT-4 Response: D:36
(with D replaced)
A:72 B:48 C:27 D:36	Results: \usym2717

Table 8: Different prompts and their corresponding responses from GPT-4. MCQs: multiple-choice questions.

5.5. Results in terms of the Number of Mathematical Reasoning Steps

We further conducted experiments to evaluate LLMs reasoning ability in terms of the number of mathematical reasoning steps. Results are shown in Figure 4. We observe that the performance of LLMs decreases as the number of reasoning steps increases. GPT-4 maintains an accuracy of over 60% at all difficulty levels, reaching as high as 82% on MWPs that require only one step of reasoning. The accuracy of GPT-3.5-Turbo is, on average, 10% lower than that of GPT-4. While ChatGLM2-6B, Baichuan2-7B-Chat and Qwen-7B-Chat outperform MathGLM-335M and MathGLM-10B on Level-1 MWPs, its accuracy falls below those of MathGLM-335M and MathGLM-10B on Level-2/3 MWPs. Similar to their performance across MWP categories, Moss-SFT-16B and Baichuan-7B show a significant difference in performance at all difficulty levels compared to the other models.

The accuracy difference between Qwen-7B-Chat and InternLM-Chat-7B on different reasoning steps is quite substantial, exceeding 30%. In the case of Qwen-7B-Chat, the accuracy on problems requiring only one-step reasoning is 62%, but this figure drops to just 21% for problems requiring three or more reasoning steps. This phenomenon suggests that the model may need more training in terms of inference.

6. Analysis

Unlike other studies that only evaluate accuracy, we have further analyzed factors in the evaluation process. These overlooked factors greatly affect the evaluation results and our understanding of the true mathematical reasoning capabilities of LLMs.

6.1. Prompts Really Does Matter

During the evaluation, instructions are generally used to guide the assessed model to produce answers. For instance, we might say, “Here is a math problem, please provide the answer to this question. Do not explain the reason.” Alternatively, we might provide an answer template such as “Question: … Answer:”. However, our experiments showed that even a single word like “Answer:” can significantly affect the model’s accuracy. We tested three prompt 0, prompt 1 and Prompt4 on GPT-4, results are shown in Table 7.

We can see that the overall accuracy results with the three prompts are 73%, 59%, and 58%, respectively, with a gap reaching up to 15%.

Prompt like “Answer:” appear to encourage the model to forego reasoning and directly provide the answer, which increases the likelihood of generating incorrect responses. An example is shown in Table 8: Prompt 4+Question.

6.2. Evaluation Methods: Generation vs. Option Prediction

In our preliminary experiments, we have discovered that some newly developed LLMs do not follow instructions well, often generating large chunks of tokens unrelated to the answer. Therefore, we decide to transform our data into multiple-choice questions, for which the evaluated model can then select the correct answer option.

Model	MCQs	Level-1	Level-2	Level-3	Overall
GPT-4	w/o	0.83	0.77	0.62	0.73
GPT-4	w	0.67	0.6	0.6	0.62
GPT-3.5-Turbo	w/o	0.73	0.65	0.51	0.62
GPT-3.5-Turbo	w	0.65	0.66	0.57	0.63
ChatGLM2-6B	w/o	0.58	0.46	0.25	0.42
ChatGLM2-6B	w	0.69	0.62	0.46	0.58
Moss-SFT-16B	w/o	0.18	0.04	0.05	0.08
Moss-SFT-16B	w	0.16	0.13	0.14	0.14
InternLM-Chat-7B	w/o	0.45	0.27	0.14	0.27
InternLM-Chat-7B	w	0.29	0.26	0.29	0.28
Qwen-7B-Chat	w/o	0.62	0.39	0.21	0.38
Qwen-7B-Chat	w	0.47	0.42	0.39	0.42
Baichuan-7B	w/o	0.15	0.09	0.03	0.08
Baichuan-7B	w	0.32	0.25	0.25	0.27
Baichuan2-7B-Chat	w/o	0.55	0.45	0.27	0.41
Baichuan2-7B-Chat	w	0.44	0.42	0.32	0.39

Table 9: Accuracy of LLMs with different evaluation methods: generation vs. option prediction.

Comparison results are displayed in Table 9. We can observe a significant difference in accuracy between option prediction (with multiple-choice questions) and direct answer generation, with a gap that can exceed 10%. Interestingly, restructuring the task in the form of multiple-choice questions seems to reduce the accuracy of high-performing models while increasing the accuracy of models that perform poorly. Upon examining instances, we have found that the answer option can act as another type of prompt influencing the model’s performance. For example in Table 8: Prompt 0+Question (with MCQs) and Prompt 0+Question (with D replaced).

Examples of GPT-4 outputs with different prompts and task forms are shown in Table 8. We want to understand why GPT-4 would provide the incorrect answer “36” under Prompt 1. Therefore, we utilize Prompt 3 to have GPT-4 explain its reasoning for choosing “36”. Interestingly, GPT-4 mentions the correct number “27” in the explanation, but still provides the incorrect answer, “36”. Given the seeming importance of “36”, we replace one of the options in the multiple-choice question with “36”. The model, which initially selected the correct answer, abandoned it in favor of “36”. Further examples can be observed in the responses generated by other models. For instance, the reasoning process may be incorrect, yet the correct answer is ultimately chosen. The options provided to the model also seem to influence the model’s generation probability to a certain degree. Therefore, we recommend using generation, rather than option prediction, for a more accurate evaluation of LLMs.

6.3. Comparison of Response Lengths

We conducted a statistical analysis on the length of the responses generated by the models. We discover two phenomena. First, models like GPT-4 and GPT-3.5-Turbo tend to generate responses that are closely centered around the question, with shorter text. This may demonstrate the characteristics of models with high accuracy. Second, the more reasoning steps an MWP requires, the longer the response tends to be. Please refer to Table 10 for more details.

Model	Level-1	Level-2	Level-3
GPT-4	33.18	56.92	102.37
GPT-4+Prompt 4	8.67	13.12	26.73
GPT-3.5-Turbo	71.10	104.78	173.87
GhatGLM2-6B	167.75	224.02	381.18
InternLM-Chat-7B	65.65	72.20	119.69
Qwen-7B-Chat	62.29	93.57	138.21
Baichuan2-7B-Chat	114.72	156.33	202.52
Baichuan-7B	128.48	174.73	130.21
Moss-SFT-16B	105.43	131.29	149.48

Table 10: Different response lengths of models (Since MathGLM-10B and MathGLM-335M, after fine-tuning, generates more formulas but fewer language descriptions, is not compared in this context).

We speculate that the model’s “confidence” in answering questions influences the length of its response. This tendency is also observed in some models that do not adhere strictly to instructions. For instance, even when instructed to provide only the answer without explaining, these models still generate logical reasoning for difficult MWPs.

7. Conclusion

We present a fine-grained benchmark, FineMath, to comprehensively evaluate the mathematical capabilities of Chinese LLMs. We strive to evaluate as many LLMs as possible. We also conduct a contamination analysis, enabling researchers to examine whether the training data influences the evaluation results. Through testing various evaluation methods and processes, we demonstrate their potential to influence the results, underscoring the necessity for fair and effective evaluations to consider the interference caused by these two aspects.

8. Bibliographical References

\c@NAT@ctr

Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. CoRR, abs/2110.14168.
Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: general language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 320–335. Association for Computational Linguistics.
Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual.
Hosseini et al. (2014) Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–533, Doha, Qatar. Association for Computational Linguistics.
Hu and Jiang (2022) Zijian Hu and Meng Jiang. 2022. Heterogeneous line graph transformer for math word problems. CoRR, abs/2208.05645.
Huang et al. (2021) Shifeng Huang, Jiawei Wang, Jiao Xu, Da Cao, and Ming Yang. 2021. Recall and learn: A memory-augmented solver for math word problems. In Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, pages 786–796. Association for Computational Linguistics.
Huang et al. (2023) Zeyu Huang, Xiaofeng Zhang, Jun Bai, Wenge Rong, Yuanxin Ouyang, and Zhang Xiong. 2023. Solving math word problems following logically consistent template. In International Joint Conference on Neural Networks, IJCNN 2023, Gold Coast, Australia, June 18-23, 2023, pages 1–8. IEEE.
Koncel-Kedziorski et al. (2015) Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. 2015. Parsing algebraic word problems into equations. Trans. Assoc. Comput. Linguistics, 3:585–597.
Liang et al. (2023) Zhenwen Liang, Dian Yu, Xiaoman Pan, Wenlin Yao, Qingkai Zeng, Xiangliang Zhang, and Dong Yu. 2023. Mint: Boosting generalization in mathematical reasoning via multi-view fine-tuning. CoRR, abs/2307.07951.
Liang et al. (2022) Zhenwen Liang, Jipeng Zhang, Lei Wang, Wei Qin, Yunshi Lan, Jie Shao, and Xiangliang Zhang. 2022. MWP-BERT: numeracy-augmented pre-training for math word problem solving. In Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pages 997–1009. Association for Computational Linguistics.
Ling et al. (2017) Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 158–167, Vancouver, Canada. Association for Computational Linguistics.
Miao et al. (2020) Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and develo** English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975–984, Online. Association for Computational Linguistics.
Nijkamp et al. (2023) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2023. Codegen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
OpenAI (2023) OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In NeurIPS.
Patel et al. (2021) Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 2080–2094. Association for Computational Linguistics.
Roy and Roth (2015) Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752, Lisbon, Portugal. Association for Computational Linguistics.
Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. 2020. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.
Wei et al. (2023) Tianwen Wei, Jian Luan, Wei Liu, Shuang Dong, and Bin Wang. 2023. CMATH: can your language model pass chinese elementary school math test? CoRR, abs/2306.16636.
Wu et al. (2021) Qinzhuo Wu, Qi Zhang, Zhongyu Wei, and Xuan**g Huang. 2021. Math word problem solving with explicit numerical values. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 5859–5869. Association for Computational Linguistics.
Xiong et al. (2022) **g ** Hu, Min Yang, and Chengming Li. 2022. Self-consistent reasoning for solving math word problems. CoRR, abs/2210.15373.
Yang et al. (2023) Zhen Yang, Ming Ding, Qingsong Lv, Zhihuan Jiang, Zehai He, Yuyi Guo, **feng Bai, and Jie Tang. 2023. GPT can solve mathematical problems without a calculator. CoRR, abs/2309.03241.
Zhao et al. (2020) Wei Zhao, Mingyue Shang, Yang Liu, Liang Wang, and **gming Liu. 2020. Ape210k: A large-scale and template-rich dataset of math word problems.