We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Runqi Qiao¹, Qiuna Tan¹∗, Guanting Dong¹, Minhui Wu², Chong Sun², Xiaoshuai Song¹,
Zhuoma GongQue¹, Shanglin Lei³, Zhe Wei¹, Miaoxuan Zhang¹, Runfeng Qiao⁴,
Yifan Zhang¹, Xiao Zong¹, Yida Xu¹, Muxi Diao¹, Zhimin Bao²,
Chen Li², Honggang Zhang¹
¹Beijing University of Posts and Telecommunications, ²WeChat, Tencent Inc.,
³Huazhong University of Science and Technology, ⁴Beijing Institute of Technology
https://We-Math.github.io
Equal contribution. Corresponding author.

Abstract

Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks focus on result-oriented performance but neglect the underlying principles of knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce We-Math, the first benchmark specifically designed to explore problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and 5 layers of knowledge granularity. We first decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs’ reasoning process. With We-Math, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between the number of solving steps and problem-specific performance. We confirm that the IK issue of LMMs can be effectively alleviated via a knowledge augmentation strategy. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization: they correctly solve composite problems involving multiple knowledge concepts, yet fail to answer the sub-problems. We anticipate that We-Math will open new pathways for advancements in visual mathematical reasoning for LMMs. The We-Math data and evaluation code are available at https://github.com/We-Math/We-Math.

Figure 1: Overview of LMMs’ performance on We-Math. From left to right, the panels illustrate (1) the accuracy of different LMMs on problems with different numbers of solving steps, (2) the performance in different visual mathematics categories, and (3) the results of the knowledge-based reasoning evaluation.

1 Introduction

“I think, therefore I am.” — René Descartes

Human cognitive and reasoning patterns have profoundly shaped the progress of deep learning [1]. Initially, the design of neural networks [2] was inspired by the brain’s neuronal mechanisms, using convolution kernels and hierarchical networks to mimic the human cognitive process of knowledge acquisition. More recently, Transformers [3] employ attention mechanisms to handle multiple information flows and quickly focus on critical content, thereby achieving more efficient and in-depth sequential learning. Owing to the scalability of the Transformer architecture and pre-training techniques, Large Language Models (LLMs) [4, 5, 6, 7] and Large Multimodal Models (LMMs) [8, 9, 10, 11, 12, 13, 14, 15, 16] showcase strong reasoning abilities that parallel human performance across a wide range of tasks and provide a glimpse into the early outlines of Artificial General Intelligence (AGI).

Mathematical reasoning is a critical capability of foundation models. Existing methods employ Chain of Thought (CoT) [17], Program of Thought (PoT) [18, 19], tool-integrated techniques [20, 21], and data augmentation strategies [22, 23, 24, 25] to guide LLMs towards emulating human-like reasoning patterns. In a more challenging scenario, visual mathematical reasoning requires the model to accurately decode the visual information in the image and perform reasoning based on the textual problem. With the rapid advancement of LMMs [26, 27], researchers increasingly use them to solve visual mathematical problems [28, 29]. These studies provide valuable insights into the ongoing improvements in multimodal logical thinking capabilities.

To systematically evaluate visual mathematical reasoning capabilities, previous efforts [30, 31, 32, 33] have focused on challenging geometric problems. Recently, several benchmarks [34, 35] have expanded the scope to include a wider range of disciplines. However, these benchmarks rely solely on end-to-end results for assessment, which fails to identify inherent issues within the LMMs’ reasoning process. MathVerse [35] attempts to directly evaluate reasoning paths based on reference answers, but limitations remain due to the knowledge-intensive nature of mathematical reasoning. Noting that humans solve complex math problems by gradually mastering and generalizing knowledge concepts [36], we argue that a fair evaluation of a model’s reasoning process should be based on knowledge concepts. Therefore, we pose two questions about mathematical reasoning evaluation:

Q1: Does the correct answer truly reflect LMM’s ability to reason through such problems accurately?

Q2: Does an incorrect answer suggest a lack of foundational knowledge in LMM’s reasoning process?

In response, we present We-Math, a pioneering benchmark for conducting an in-depth analysis of the underlying principles of LMMs in visual mathematical reasoning. We-Math consists of over 6.5K meticulously selected visual math problems, which are categorized into 5 layers of knowledge granularity across 67 knowledge concepts to ensure comprehensive coverage. We observe that real-world math problems typically encompass multiple foundational knowledge concepts, and that their difficulty is directly related to the number of concepts involved. Building on this, we decouple the model’s ability to solve composite problems with $k$ knowledge concepts into two stages:

1) LMMs can solve the $k$ individual sub-problems, each corresponding to one knowledge concept;

2) LMMs reason out the final answer by integrating the $k$ individual knowledge concepts.

The above process can be formulated as follows:

$$P(Y|X)=\prod_{i=1}^{k}P(y_{i}|x_{i})\cdot P_{\text{reason}}, \tag{1}$$

where $(X,Y)$ and $(x_{i},y_{i})$ denote the $(question,answer)$ pairs of a composite mathematical problem and its $i$-th sub-problem, respectively, and $P_{\text{reason}}$ stands for the LMM’s reasoning capacity. Because the final answer depends on both the sub-problem terms and $P_{\text{reason}}$, the reasoning process cannot be assessed from final answers alone. To decompose composite problems into individual sub-problems according to the invoked knowledge concepts, we select 1.5K high-quality problems with multiple knowledge concepts in We-Math. Following Equation 1, these composite problems are gradually decomposed by expert annotators into one-step problems $(x_{i},y_{i})$. Motivated by human reasoning patterns, We-Math further introduces a four-dimensional metric to precisely evaluate the inherent gaps in LMMs’ problem-solving abilities, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM). To further tackle the fundamental IK issue, we propose a heuristic knowledge concept augmentation (KCA) strategy, constructing descriptions for the 67 knowledge concepts from Wikipedia [37] and textbooks, thereby providing essential knowledge for LMMs’ reasoning.
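As a purely illustrative instance of Equation 1 (the probabilities below are made-up values, not measurements from We-Math), consider a two-step problem with $k=2$:

$$\begin{aligned}
P(y_{1}|x_{1}) &= 0.9,\quad P(y_{2}|x_{2}) = 0.8,\quad P_{\text{reason}} = 0.7,\\
P(Y|X) &= 0.9\times 0.8\times 0.7 = 0.504.
\end{aligned}$$

Under this factorization, the composite accuracy cannot exceed the accuracy on the weakest sub-problem, so a model that answers the composite problem correctly while failing its sub-problems does not fit the assumed reasoning process.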

Figure 1 gives an overview of our experimental results. Not surprisingly, GPT-4o [38] achieves the best overall performance across different visual mathematics categories. Closed-source LMMs (GPT-4V, Gemini 1.5 Pro) and LMMs with larger parameter scales (LLaVA-NeXT-110B [39]) generally exhibit superior visual mathematical reasoning capabilities. However, most LMMs perform significantly worse on multi-step problems than on one-step problems, suggesting that the number of knowledge concepts is positively correlated with a question’s difficulty and negatively correlated with LMM performance. In specialized disciplines, most LMMs excel in calculation but consistently struggle with fine-grained visual measurement ("Angles and Length").

For reasoning evaluation, we emphasize that mastery of knowledge concepts is fundamental. Unfortunately, most LMMs still suffer from the Insufficient Knowledge issue, especially smaller-scale models (e.g., over 350 IK cases each for LLaVA-1.6-7B and DeepSeek-VL-1.3B). GPT-4o substantially narrows this knowledge gap, establishing it as the first LMM advancing towards the knowledge generalization stage. More notably, several LMMs still exhibit a marked inclination towards Rote Memorization (e.g., G-LLaVA-13B at nearly 36% in RM (Loose)), raising doubts about whether current LMMs truly possess mathematical reasoning capability. In addition, our proposed KCA strategy substantially reduces the IK issue in LMMs, and error analysis further provides empirical guidance towards human-like reasoning. We anticipate that We-Math will open new pathways for advancements in visual mathematical reasoning in LMMs.

2 We-Math

Overview of We-Math. As previously mentioned, existing benchmarks tend to be result-oriented and overlook the essence of solving mathematical problems. This leads to some counterintuitive conclusions. For example, results in MathVista [34] indicate that LMMs perform better on university-level problems than on elementary-level ones. Unlike existing benchmarks, and as shown in Figure 2, We-Math is constructed around textbook knowledge units, decomposing composite problem solutions into sub-problems based on knowledge concepts. We-Math has the following characteristics:

(1) Hierarchical Knowledge Structure. We-Math strictly adheres to the knowledge presented in mathematics textbooks, featuring a rigorous hierarchical and multi-category architecture. It ensures the independence of knowledge concepts within the same level, while establishing logical relationships among concepts at different hierarchical levels.

(2) Knowledge based Reasoning Evaluation. We-Math is designed to explore how LMMs solve problems. Drawing on the observation that humans tackle problems incrementally by leveraging fundamental knowledge concepts, we break down complex mathematical problems into more manageable sub-problems. Furthermore, we employ diverse measurement dimensions for fine-grained evaluation.

(3) Knowledge Concept Augmentation. To alleviate the inherent issues during the problem-solving process, we heuristically introduce descriptions for 67 knowledge concepts from Wikipedia and textbooks, thereby providing essential knowledge support for the reasoning processes of LMMs.

2.1 Hierarchical Structured Dataset Composition

Hierarchical Knowledge Structure. We-Math emphasizes fundamental math skills, on the premise that complex mathematical reasoning is built upon a foundation of basic mathematical reasoning processes. Based on extensive research, mathematical problems are categorized into five distinct types, namely Plane Figures, Solid Figures, Transformations and Movements of Shapes, Positions and Directions, and Measurements. These five categories are decomposed into 12 typical problems, which are further decomposed into 67 knowledge concepts (terminal nodes in the structure). We collect problems according to this tree structure and require that each terminal node contain between 10 and 40 samples. This rule ensures data balance across domains.
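The tree and the balance rule can be sketched as follows. This is a minimal illustration, not the released tooling: the two upper levels use the category and typical-problem names listed in the Table 1 caption, while the function name `unbalanced_concepts` and the concept names in the usage example are hypothetical.

```python
from typing import Dict, List

# Upper two levels of the hierarchy: 5 categories -> 12 typical problems,
# using the names given in the Table 1 caption.
HIERARCHY: Dict[str, List[str]] = {
    "Measurement": [
        "Understanding and Conversion of Units",
        "Angles and Length",
    ],
    "Plane Figures": [
        "Calculation of Plane Figures",
        "Understanding of Plane Figures",
    ],
    "Solid Figures": [
        "Calculation of Solid Figures",
        "Understanding of Solid Figures",
    ],
    "Transformations and Motion of Figures": [
        "Basic Transformations of Figures",
        "Cutting and Combining of Figures",
    ],
    "Position and Direction": [
        "Direction",
        "Position",
        "Route Map",
        "Correspondence of Coordinates and Positions",
    ],
}


def unbalanced_concepts(
    samples_per_concept: Dict[str, int], low: int = 10, high: int = 40
) -> List[str]:
    """Return terminal nodes whose sample count falls outside [low, high]."""
    return sorted(
        name
        for name, count in samples_per_concept.items()
        if not low <= count <= high
    )


# Hypothetical leaf counts, only to show the check.
example_counts = {"concept_a": 9, "concept_b": 10, "concept_c": 40, "concept_d": 41}
flagged = unbalanced_concepts(example_counts)
```

Here `flagged` contains the two hypothetical concepts that violate the 10-40 range.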

Data Collection and Annotation. All problems (6.5K) in We-Math are sourced from publicly available, authoritative mathematics websites and subsequently organized according to our defined knowledge structure. We employ three expert annotators to manually label each question with knowledge concepts. Cross-validation is performed to ensure that at least two experts give identical annotations for the same question. Samples with notably inconsistent labels are considered low quality and excluded. To prepare for the subsequent decomposition of problems, we further annotate problem-solving steps based on the knowledge concept labels. We categorize each problem into one of three classes: "One-Step", "Two-Step", and "Three-Step". This categorization enables a deeper understanding of how LMMs solve problems. Further details about annotation can be found in the Appendix. After annotation, all problems are double-checked by an expert team with respect to three aspects: (1) the consistency between the questions and diagrams; (2) the correctness of the answers to the questions; (3) the alignment between problems and the 67 knowledge concepts.
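The agreement rule above can be sketched in a few lines. This is an illustrative reading of the text, assuming each annotation is a set of concept names and that "identical" means exact set equality; the function names are hypothetical.

```python
from collections import Counter
from typing import FrozenSet, Optional, Sequence


def agreed_label(
    annotations: Sequence[FrozenSet[str]], min_agree: int = 2
) -> Optional[FrozenSet[str]]:
    """Return the concept set shared by at least `min_agree` annotators.

    Returns None when no label reaches the threshold, i.e. the sample
    would be excluded as inconsistently labeled.
    """
    if not annotations:
        return None
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes >= min_agree else None


def step_class(concepts: FrozenSet[str]) -> str:
    """Map the number of knowledge concepts to the step category."""
    names = {1: "One-Step", 2: "Two-Step", 3: "Three-Step"}
    if len(concepts) not in names:
        raise ValueError("expected between 1 and 3 knowledge concepts")
    return names[len(concepts)]
```

For example, if two of three annotators assign the same two concepts, `agreed_label` returns that set and `step_class` maps it to "Two-Step".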

2.2 Knowledge based Reasoning Evaluation

Problem Definition. For the visual mathematical reasoning task, each sample consists of a text question $Q_{i}$, an image $I_{i}$, and the corresponding answer $A_{i}$. We define the LMM evaluation dataset as $D_{\rm eval}=\{(Q_{i},I_{i},A_{i})|K_{i},C_{i}\}_{i=1}^{N}$, where $K_{i}$ and $C_{i}$ are two prior constraints for question $Q_{i}$. In detail, $K_{i}=\{k_{i}^{m}\}_{m=1}^{M}$ denotes the $M$ knowledge concepts within the question, and $C_{i}$ represents the prerequisite conditions needed to solve the problem $Q_{i}$ (see Figure 3 for an example). For convenience, we refer to a problem containing $k$ knowledge concepts as a "$k$-step problem" in this paper.

Knowledge-based Data Decomposition. Real-world mathematical problems are composed of multiple atomic knowledge concepts. However, existing benchmarks usually overlook this information, leading to unreasonable evaluation results. Inspired by Euclid’s Elements [36], we argue that evaluating the mathematical reasoning ability of LMMs essentially involves assessing their mastery of fundamental knowledge concepts, which makes basic knowledge concepts a natural and objective basis for reasoning evaluation. Given the $i$-th test sample $\{(Q_{i},I_{i},A_{i})|K_{i},C_{i}\}\in D_{\textsc{We-Math}}$ with $M$ concepts $K_{i}=\{k_{i}^{m}\}_{m=1}^{M}$, we ask human experts to decompose the problem step by step into $M$ sub-problems based on the knowledge concepts, which can be formulated as:

$$\{(q_{i}^{m},i_{i}^{m},a_{i}^{m})|k_{i}^{m},c_{i}^{m}\}_{m=1}^{M}=\mathop{\rm Decompose}_{(Q_{i},I_{i},A_{i})\in D_{\textsc{We-Math}}}\{(Q_{i},I_{i},A_{i})|K_{i},C_{i}\} \tag{2}$$

where $k_{i}^{m}$ and $c_{i}^{m}$ denote the individual knowledge concept and prior condition of the $m$-th sub-problem, and “$\mathop{\rm Decompose}$” represents the human decomposition process based on the $M$ knowledge concepts. To ensure the logical coherence of the decomposition, the first condition $c_{i}^{1}$ is initialized as $C_{i}$. Each subsequent condition is then computed recursively by concatenating the condition $c_{i}^{m-1}$ and the answer $a_{i}^{m-1}$ of the $(m-1)$-th sub-problem:

$$c_{i}^{m}=c_{i}^{m-1}+a_{i}^{m-1}\quad\text{for }m=2,3,\ldots,M \tag{3}$$

where “$+$” denotes the concatenation operation. In addition, the equations $q_{i}^{M}=Q_{i}$ and $a_{i}^{M}=A_{i}$ must be satisfied, which is also a constraint for logical coherence. Finally, we obtain the original multi-step problem and $M$ one-step sub-problems for reasoning evaluation. The overall pipeline of Knowledge-based Data Decomposition is displayed on the left side of Figure 3.
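The condition recursion in Equation 3 and the final-step constraint can be sketched as below. This is an illustration under stated assumptions: the decomposition itself is done by human experts, so the code only chains the conditions of already-written sub-problems; conditions are treated as plain strings joined by a space, and `SubProblem`, `chain_conditions`, and `satisfies_final_constraint` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class SubProblem:
    question: str   # q_i^m
    answer: str     # a_i^m
    concept: str    # k_i^m
    condition: str  # c_i^m


def chain_conditions(
    initial_condition: str, steps: Sequence[Tuple[str, str, str]]
) -> List[SubProblem]:
    """Attach conditions to expert-written (question, answer, concept) steps.

    The first sub-problem receives the original condition C_i; each later
    sub-problem receives the previous condition concatenated with the
    previous answer, as in Equation 3.
    """
    subs: List[SubProblem] = []
    condition = initial_condition
    for question, answer, concept in steps:
        subs.append(SubProblem(question, answer, concept, condition))
        # Assumed concatenation: plain text joined by a single space.
        condition = f"{condition} {answer}"
    return subs


def satisfies_final_constraint(
    subs: Sequence[SubProblem], question: str, answer: str
) -> bool:
    """Check q_i^M = Q_i and a_i^M = A_i for the last sub-problem."""
    return bool(subs) and subs[-1].question == question and subs[-1].answer == answer
```

The last sub-problem therefore sees the original condition plus every intermediate answer, which is what makes it equivalent to the original multi-step question.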

Metric for Reasoning Evaluation.

Based on the decomposed multi-step problems, we further reveal the inherent issues of LMMs in the problem-solving process. We feed both the $M$ one-step sub-problems and the original problem into LMMs and classify the responses into four categories:

1. Insufficient Knowledge (IK): At least one one-step problem is answered incorrectly, and the multi-step problem is also wrong. This outcome is reasonable, because a model’s insufficient grasp of a single knowledge concept may lead to errors in the multi-step problem.

2. Inadequate Generalization (IG): One-Step problems are all correct, but the multi-step problem is incorrect. This is also considered reasonable. While LMMs are capable of understanding individual knowledge concepts, they may struggle to generalize that knowledge to solve composite problems.

3. Complete Mastery (CM): One-Step problems are all correct, and multi-step problem is also answered correctly. This result demonstrates that the model’s results are both reliable and accurate.

4. Rote Memorization (RM): One-Step problems contain errors, but the multi-step problem is answered correctly, which contradicts human logical thinking. If a model can solve composite multi-step problems but fails to answer the one-step problems needed in the process, it raises doubts about the model’s reliability.

Considering IK, IG, and CM, results falling under the IG category are generally preferable to those classified as IK. The reason is that IK reflects the model’s struggle with both single and multiple knowledge concepts, while IG shows the model’s proficiency on one-step problems. By enhancing the model’s generalization ability in the reasoning process, we can potentially shift results from IG to CM. Therefore, we establish a reasoning capability hierarchy as $\textit{IK}<\textit{IG}<\textit{CM}$. We regard RM as an unreasonable scenario (models can solve multi-step problems without mastering one-step problems, which completely contradicts human reasoning intuition).

Because model outputs can be unstable, the criterion above for assigning a result to RM is strict. We thus also propose a more flexible loose metric. As illustrated in Figure 4, the TFT and FTT situations in two-step problems (the three letters mark whether the first sub-problem, the second sub-problem, and the composite problem are answered correctly) are regarded as CM rather than RM under the loose metric. We also discuss the four-dimensional metric on three-step problems in Appendix C. We propose the following metric to judge the reliability of the model’s reasoning process:

$$S_{\rm IK}=\frac{N_{\rm IK}}{N},\quad S_{\rm IG}=\frac{N_{\rm IG}}{N},\quad S_{\rm CM}=\frac{N_{\rm CM}}{N},\quad S_{\rm RM}=\frac{N_{\rm RM}}{N_{\rm RM}+N_{\rm CM}} \tag{4}$$

where $N$ denotes the total number of samples and $N_{\rm IK}$, $N_{\rm IG}$, $N_{\rm CM}$, $N_{\rm RM}$ represent the number of samples in each category. We then obtain our final reasoning confidence score:

$$\text{Score}_{\text{average}}=\alpha S_{\rm IK}+\beta S_{\rm IG}+S_{\rm CM} \tag{5}$$

where $\alpha$ and $\beta$ denote the weights for the IK and IG cases. To preserve the reasoning capability hierarchy "IK < IG < CM", we constrain the parameters to $\alpha<\beta<1$ and set the default values to $\alpha=0.0$ and $\beta=0.5$.
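The four categories and Equations 4 and 5 can be sketched as follows. This is a minimal illustration rather than the released evaluation code: the function names are hypothetical, $N$ is taken to be the sum of the four category counts, and the loose rule is implemented only for two-step problems as described above (the three-step loose rule is defined in Appendix C and is not reproduced here, so the sketch falls back to the strict rule in that case).

```python
from typing import Dict, Sequence


def classify(
    sub_correct: Sequence[bool], composite_correct: bool, loose: bool = False
) -> str:
    """Assign one decomposed problem to IK, IG, CM, or RM."""
    if not sub_correct:
        raise ValueError("at least one sub-problem result is required")
    all_subs_right = all(sub_correct)
    if composite_correct:
        if all_subs_right:
            return "CM"
        # Loose metric, two-step problems only: TFT and FTT count as CM.
        if loose and len(sub_correct) == 2 and sum(sub_correct) == 1:
            return "CM"
        return "RM"
    return "IG" if all_subs_right else "IK"


def reasoning_scores(
    counts: Dict[str, int], alpha: float = 0.0, beta: float = 0.5
) -> Dict[str, float]:
    """Compute S_IK, S_IG, S_CM, S_RM (Eq. 4) and the average score (Eq. 5)."""
    if not alpha < beta < 1:
        raise ValueError("expected alpha < beta < 1")
    n_ik, n_ig, n_cm, n_rm = (counts.get(k, 0) for k in ("IK", "IG", "CM", "RM"))
    n = n_ik + n_ig + n_cm + n_rm  # assumption: N is the sum of the four counts
    if n == 0:
        raise ValueError("no samples")
    s_ik, s_ig, s_cm = n_ik / n, n_ig / n, n_cm / n
    s_rm = n_rm / (n_rm + n_cm) if (n_rm + n_cm) else 0.0
    average = alpha * s_ik + beta * s_ig + s_cm
    return {"IK": s_ik, "IG": s_ig, "CM": s_cm, "RM": s_rm, "Avg": average}


# GPT-4o counts under the strict metric, taken from Table 2.
gpt4o_strict = reasoning_scores({"IK": 164, "IG": 80, "CM": 185, "RM": 96})
```

With the GPT-4o strict counts from Table 2, this yields an average score of about 42.86% and $S_{\rm RM}$ of about 34.16%, matching the reported values.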

2.3 Knowledge Concept Augmentation

In the previous section, we identify Insufficient Knowledge (IK) as the foundational challenge in mathematical reasoning. To heuristically tackle this issue, we enlist human experts to create 67 knowledge concept cards, which provide knowledge essential to the LMMs’ reasoning process. Initially, expert annotators write precise summaries derived from the definitions in Euclid’s Elements [36], Wikipedia, and textbooks. Subsequently, these experts condense the content examined by a series of questions related to a specific knowledge concept, extracting crucial knowledge hints for incorporation into the knowledge cards. After several rounds of review, we confirm the accuracy and utility of each card. Figure 5 showcases typical knowledge concept cases and their descriptions. Consequently, given a problem $Q_{i}$ and its respective knowledge concepts $K_{i}$, LMMs use the relevant knowledge cards to deduce the answer $A_{i}$. Detailed information about KCA can be found in the Appendix.
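One way to picture how a knowledge card reaches the model is as extra text placed ahead of the question. The sketch below is an assumption about prompt layout, not the template used in the paper: the function name `build_kca_prompt`, the wording of the prompt, and the card text in the usage example are all hypothetical.

```python
from typing import Dict, Sequence


def build_kca_prompt(
    question: str,
    options: Sequence[str],
    concepts: Sequence[str],
    cards: Dict[str, str],
) -> str:
    """Prepend the cards of the annotated concepts to a multiple-choice question."""
    hints = [f"- {name}: {cards[name]}" for name in concepts if name in cards]
    lines = []
    if hints:
        lines.append("Knowledge hints:")
        lines.extend(hints)
    lines.append(f"Question: {question}")
    # Label options A, B, C, ... in order.
    lines.extend(f"{chr(ord('A') + i)}. {text}" for i, text in enumerate(options))
    lines.append("Answer with the option letter.")
    return "\n".join(lines)


# Illustrative card text only; the real cards are written by expert annotators.
example_cards = {
    "Understanding Angles": "An angle is formed by two rays sharing a common endpoint."
}
example_prompt = build_kca_prompt(
    "Which angle in the figure is a right angle?",
    ["Angle 1", "Angle 2", "Uncertain"],
    ["Understanding Angles"],
    example_cards,
)
```

Keeping the cards in a dictionary keyed by concept name means the same question can be evaluated with and without augmentation by passing an empty `concepts` list.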

3 Experiment

Evaluation Protocols.

To speed up evaluation, We-Math provides a testmini set with 1740 samples, including 1215 one-step samples, 360 two-step samples, and 165 three-step samples. In subsequent experiments, we use the We-Math testmini subset for evaluation. For automated evaluation, we standardize all samples into a multiple-choice format. We use regular expressions to match the LMMs’ predictions and then calculate their accuracy against the ground-truth answers for the main results. For the analyses in Sections 3.2 and 3.3, we use the four-dimensional metric described in Section 2.2. To prevent LMMs from deducing answers from the options, we add an extra "uncertain" option to each question.
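Regex-based answer matching for multiple-choice outputs can be sketched as below. The patterns actually used in the We-Math evaluation code are not given in this section, so the two patterns, the option letters A-E, and the function names are assumptions made for illustration.

```python
import re
from typing import Optional, Sequence

# Assumed option letters; one of them would be the extra "uncertain" option.
# Explicit statements such as "The answer is (C)" or "Answer: B".
_EXPLICIT = re.compile(r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-E])\)?(?![A-Za-z])")
# Fallback: a capital option letter that is not part of a longer word.
_STANDALONE = re.compile(r"(?<![A-Za-z])([A-E])(?![A-Za-z])")


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter found in a response, or None."""
    explicit = _EXPLICIT.findall(response)
    if explicit:
        return explicit[-1]
    standalone = _STANDALONE.findall(response)
    return standalone[-1] if standalone else None


def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold answers must have the same length")
    if not responses:
        return 0.0
    hits = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return hits / len(responses)
```

A response with no recognizable option letter is scored as incorrect, which is a conservative choice for free-form model outputs.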

Evaluation Models.

We examine the performance of foundation models across two distinct categories on We-Math: (a) Closed-source LMMs: GPT-4o [38], GPT-4V [26], Gemini 1.5 Pro [40], Qwen-VL-Max [13]; (b) Open-source LMMs: LLaVA-NeXT-110B, LLaVA-NeXT-72B [39], LLaVA-1.6-13B, LLaVA-1.6-7B [41], DeepSeek-VL-1.3B, DeepSeek-VL-7B [42], Phi3-Vision-4.2B [43], MiniCPM-Llama3-V 2.5 [44], InternLM-XComposer2-VL-7B [45], InternVL-Chat-V1.5 [46], GLM-4V-9B [47], LongVA [48], G-LLaVA-13B [29].

3.1 Main Result

Table 1: Accuracy scores of LMMs on the testmini subset of We-Math. The first 3 columns report the overall performance on one-step, two-step, three-step problems, while the other columns display the result on one-step problems in different problem categories. The highest accuracy for closed-source and open-source LMMs is marked in blue and green respectively. (S1: one-step problem, S2: two-step problem, S3: three-step problem, Mem: Measurement, PF: Plane Figures, SF: Solid Figures, TMF: Transformations and Motion of Figures, PD: Position and Direction. AL: Angles and Length, UCU: Understanding and Conversion of Units, CPF: Calculation of Plane Figures, UPF: Understanding of Plane Figures, CSF: Calculation of Solid Figures, USF: Understanding of Solid Figures, BTF: Basic Transformations of Figures, CCF: Cutting and Combining of Figures, Dir: Direction, Pos: Position, RoM: Route Map, CCP: Correspondence of Coordinates and Positions).

Model	S1	S2	S3	Mem: UCU	Mem: AL	PF: CPF	PF: UPF	SF: CSF	SF: USF	TMF: BTF	TMF: CCF	PD: Dir	PD: Pos	PD: RoM	PD: CCP
Closed-source
GPT-4o	72.84	58.06	43.64	86.61	39.12	77.35	71.56	84.50	62.27	58.74	69.37	93.10	72.67	47.53	73.33
GPT-4V	65.51	49.17	38.18	82.54	38.42	70.67	60.22	76.58	56.32	57.76	67.67	79.29	57.48	47.80	63.33
Gemini 1.5 Pro	56.13	51.39	33.94	50.99	31.23	61.75	45.03	69.95	57.54	39.24	62.65	68.81	54.13	40.66	60.00
Qwen-VL-Max	40.82	30.28	20.61	19.35	25.26	39.82	41.44	43.64	48.02	43.82	43.39	41.43	35.09	40.66	26.67
Open-source
LLaVA-NeXT-110B	53.74	36.94	31.52	39.48	57.72	59.48	53.06	52.25	50.22	54.09	50.76	54.76	55.86	40.11	40.00
LLaVA-NeXT-72B	42.88	35.56	30.91	31.65	25.26	43.25	42.39	46.14	41.76	44.22	51.02	44.29	38.93	32.97	36.67
InternVL-Chat-V1.5	49.38	30.56	28.48	43.95	29.82	52.23	52.06	44.19	48.15	47.05	46.82	65.71	50.47	36.54	36.67
LLaVA-1.6-13B	29.38	25.28	32.73	21.73	23.16	23.37	34.72	25.26	26.36	37.52	41.65	26.90	28.87	37.09	30.00
G-LLaVA-13B	32.43	30.56	32.73	33.33	29.12	32.04	37.88	19.57	33.51	37.12	32.79	31.19	33.21	25.55	40.00
GLM-4V-9B	47.33	37.22	38.18	53.37	37.02	51.32	46.52	50.60	38.22	44.09	45.22	40.95	49.27	36.81	53.33
MiniCPM-LLaMA3-V 2.5	39.75	31.11	29.70	28.57	37.02	40.81	39.82	40.97	38.61	31.96	42.66	40.95	42.70	43.96	43.33
LongVA-7B	43.54	30.56	28.48	24.50	39.82	45.09	40.75	51.85	42.49	45.60	44.56	44.52	40.74	47.53	20.00
LLaVA-1.6-7B	22.96	20.83	15.76	18.45	20.53	16.92	29.63	15.57	18.60	42.67	24.05	17.62	43.31	28.85	26.67
DeepSeek-VL-7B	32.59	26.67	25.45	16.57	35.09	27.27	38.01	24.18	38.65	50.02	30.09	24.52	41.01	51.65	23.33
InternLM-XComposer2-VL-7B	47.00	33.06	33.33	31.25	46.49	47.70	42.57	51.44	43.87	41.13	50.58	65.48	53.87	55.22	40.00
Phi3-Vision-4.2B	42.14	34.17	27.88	28.67	15.96	47.23	38.83	49.99	44.41	28.76	31.22	48.57	49.19	26.37	50.00
DeepSeek-VL-1.3B	31.44	27.78	23.03	27.78	23.86	22.76	36.92	30.36	34.18	44.46	28.29	48.10	41.77	37.09	33.33

Table 1 shows the overall performance of different LMMs on One-Step / Two-Step / Three-Step problems and different problem domains. We have the following observations:

The Number of Knowledge Concepts is negatively correlated with LMMs’ Performance. Across problems of varying complexity (one-step vs. two-step vs. three-step), GPT-4o consistently achieves the best accuracy in all settings. Other closed-source models, such as GPT-4V and Gemini 1.5 Pro, also demonstrate competitive performance. However, most LMMs perform significantly worse on multi-step problems than on one-step problems. For instance, GPT-4o’s accuracy drops from 72.84% on one-step problems to 43.64% on three-step problems. This trend is even more pronounced in strong open-source models such as LLaVA-NeXT-110B and InternVL-Chat-V1.5. These observations suggest that the number of knowledge concepts in a question is positively correlated with its difficulty and negatively correlated with LMMs’ performance, which supports the rationale for decomposing questions.
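The size of the drop can be read directly from Table 1. The short sketch below computes the absolute and relative decline from one-step (S1) to three-step (S3) accuracy for the three models named above; the function names are hypothetical and the figures are copied from the table.

```python
from typing import Dict, Tuple

# (S1, S2, S3) accuracy in percent, copied from Table 1.
ACCURACY: Dict[str, Tuple[float, float, float]] = {
    "GPT-4o": (72.84, 58.06, 43.64),
    "LLaVA-NeXT-110B": (53.74, 36.94, 31.52),
    "InternVL-Chat-V1.5": (49.38, 30.56, 28.48),
}


def absolute_drop(s1: float, s3: float) -> float:
    """Percentage points lost between one-step and three-step problems."""
    return s1 - s3


def relative_drop(s1: float, s3: float) -> float:
    """Drop as a percentage of the one-step accuracy."""
    return (s1 - s3) / s1 * 100


drops = {
    name: (absolute_drop(s1, s3), relative_drop(s1, s3))
    for name, (s1, _s2, s3) in ACCURACY.items()
}
```

GPT-4o loses the most in absolute terms (29.20 points), while in relative terms the three models are close: roughly 40.1% for GPT-4o, 41.3% for LLaVA-NeXT-110B, and 42.3% for InternVL-Chat-V1.5.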

Larger Parameter Scales in LLMs generally achieve Better Generalization Abilities. To explore what role the LLM plays in LMMs, we conduct pairwise comparisons between LMMs from the same model family that differ in the scale of their LLM backbone (e.g., LLaVA-NeXT-110B vs. LLaVA-NeXT-72B; DeepSeek-VL-7B vs. DeepSeek-VL-1.3B). Focusing on the strict metric, we observe that LMMs with larger LLM backbones generally perform better, which suggests that the parameter scale of the text decoder is a key factor in achieving generalization ability in visual mathematical reasoning.

LMMs excel in Calculation but struggle with Fine-grained Visual Measurement. Across different math categories, GPT-4o maintains impressive results in most subfields. In contrast, as shown in Figure 6, other LMMs generally struggle with "Angle Measurement" and "Unit Conversion". After analyzing these cases, we find that the main challenge for LMMs lies in their inability to perform precise visual angle and unit measurements. Furthermore, most LMMs demonstrate better proficiency in calculation (e.g., Calculation of Solid Figures, Calculation of Plane Figures) than in conceptual understanding (e.g., Understanding of Solid Figures, Understanding of Plane Figures), which indicates that current LMMs excel at directly applying formulas to given conditions but are still limited in understanding and comprehensively applying knowledge.

Table 2: The performance of different LMMs on four-dimensional metrics for reasoning evaluation. The best and the two worst performances are marked in blue and red (Avg: $\text{Score}_{\text{average}}$).

Model	Strict Avg ( $\uparrow$ )	Strict IK ( $\downarrow$ )	Strict IG ( $\downarrow$ )	Strict CM ( $\uparrow$ )	Strict RM ( $\downarrow$ )	Loose Avg ( $\uparrow$ )	Loose IK ( $\downarrow$ )	Loose IG ( $\downarrow$ )	Loose CM ( $\uparrow$ )	Loose RM ( $\downarrow$ )
Closed-source
GPT-4o	42.86%	31.24% (164)	15.24% (80)	35.24% (185)	34.16% (96)	60.57%	31.24% (164)	15.24% (80)	52.95% (278)	1.07% (3)
GPT-4V	31.05%	39.81% (209)	14.48% (76)	23.81% (125)	47.92% (115)	51.43%	39.81% (209)	14.48% (76)	44.19% (232)	3.33% (8)
Gemini 1.5 Pro	26.38%	42.86% (225)	11.24% (59)	20.76% (109)	54.77% (132)	46.00%	42.86% (225)	11.24% (59)	40.38% (212)	12.03% (29)
Qwen-VL-Max	10.48%	65.14% (342)	7.62% (40)	6.67% (35)	75.52% (108)	25.52%	65.14% (342)	7.62% (40)	21.71% (114)	20.28% (29)
Open-source
LLaVA-NeXT-110B	19.24%	50.29% (264)	14.48% (76)	12.00% (63)	65.95% (122)	37.90%	50.29% (264)	14.48% (76)	30.67% (161)	12.97% (24)
LLaVA-NeXT-72B	13.43%	58.86% (309)	7.05% (37)	9.90% (52)	70.95% (127)	31.52%	58.86% (309)	7.05% (37)	28.00% (147)	17.88% (32)
InternVL-Chat-V1.5	14.95%	56.19% (295)	13.90% (73)	8.00% (42)	73.25% (115)	32.67%	56.19% (295)	13.90% (73)	25.71% (135)	14.01% (22)
LLaVA-1.6-13B	5.24%	69.14% (363)	3.24% (17)	3.62% (19)	86.90% (126)	22.00%	69.14% (363)	3.24% (17)	20.38% (107)	26.21% (38)
G-LLaVA-13B	6.48%	64.19% (337)	4.57% (24)	4.19% (22)	86.59% (142)	22.29%	64.19% (337)	4.57% (24)	20.00% (105)	35.98% (59)
GLM-4V-9B	14.86%	52.95% (278)	9.52% (50)	10.10% (53)	73.10% (144)	35.05%	52.95% (278)	9.52% (50)	30.29% (159)	19.29% (38)
MiniCPM-LLaMA3-V 2.5	9.52%	60.19% (316)	9.14% (48)	4.95% (26)	83.85% (135)	28.00%	60.19% (316)	9.14% (48)	23.43% (123)	23.60% (38)
LongVA-7B	11.52%	61.14% (321)	8.95% (47)	7.05% (37)	76.43% (120)	27.71%	61.14% (321)	8.95% (47)	23.24% (122)	22.29% (35)
LLaVA-1.6-7B	3.33%	78.29% (411)	2.48% (13)	2.10% (11)	89.11% (90)	13.81%	78.29% (411)	2.48% (13)	12.57% (66)	34.65% (35)
DeepSeek-VL-7B	6.29%	69.14% (363)	4.57% (24)	4.00% (21)	84.78% (117)	20.95%	69.14% (363)	4.57% (24)	18.67% (98)	28.99% (40)
InternLM-XComposer2-VL-7B	12.67%	56.38% (296)	10.48% (55)	7.43% (39)	77.59% (135)	30.95%	56.38% (296)	10.48% (55)	25.71% (135)	22.41% (39)
Phi3-Vision-4.2B	10.57%	58.86% (309)	8.95% (47)	6.10% (32)	81.07% (137)	29.81%	58.86% (309)	8.95% (47)	25.33% (133)	21.30% (36)
DeepSeek-VL-1.3B	5.90%	71.05% (373)	2.67% (14)	4.57% (24)	82.61% (114)	21.52%	71.05% (373)	2.67% (14)	20.19% (106)	23.19% (32)

LMMs exhibit Strong Potential for Parameter Compression. In terms of different LMMs, LLaVA-NeXT-110B demonstrates performance closest to GPT-4. Surprisingly, despite having smaller parameter scales, Phi3-Vision-4.2B and MiniCPM-Llama3-V 2.5 also show competitive performance compared to LLaVA-NeXT-72B. Moreover, the recent GLM-4V-9B and InternVL-Chat-V1.5 have allocated a larger proportion of parameters to the visual encoder (as shown in Table 8), thereby demonstrating notable capabilities. This underscores the importance of optimizing visual representations and suggests that LMMs still have significant potential for parameter compression.

3.2 Knowledge based Reasoning Analysis

Table 2 and Figures 7, 8, and 9 illustrate the results of the knowledge-based reasoning evaluation, covering the four distinct conditions (IK, IG, CM, RM). We have the following observations:

IK is the Greatest Vulnerability of LMMs. All LMMs consistently demonstrate an Insufficient Knowledge issue during the reasoning process, especially in models with smaller parameter scales (LLaVA-1.6-7B, DeepSeek-VL-1.3B). As discussed in section 2.2, addressing IK is crucial for progressing towards Inadequate Generalization (IG) and Complete Mastery (CM). This knowledge gap in solving one-step problems hinders further progress in reasoning about more composite mathematical problems. This finding also supports the rationale behind our proposed KCA strategy.

GPT-4o’s Main Challenge has gradually shifted from IK to IG, highlighting it as the First LMM towards the Knowledge Generalization Stage. Focusing on IK and IG, GPT-4o exhibits a substantial lead in addressing the IK issue, but the weakest performance in IG. Further analyzing the logical relationships between IK, IG, and CM (IK $\rightarrow$ IG $\rightarrow$ CM), we find that GPT-4o’s IK rate is 19.05 percentage points lower than that of the open-source LLaVA-NeXT-110B (31.24% vs. 50.29%), suggesting that it has converted a considerable share of IK cases into IG cases. This indicates that GPT-4o’s challenges in reasoning have shifted from addressing Insufficient Knowledge in one-step problems to the knowledge generalization stage, leading us to speculate that there may have been substantial changes in GPT-4o’s training strategy. However, other LMMs remain stuck at the IK phase. We argue that it is pointless to compare IG without a solid grasp of IK, highlighting the significance of our hierarchical metrics (IK < IG < CM).

The Unreasonable RM issue remains widespread across Most LMMs. GPT-4o achieves a significant lead on the RM issue, particularly on the loose metric ( $S_{RM}<2\%$ ). However, other LMMs still exhibit nearly 25% $S_{RM}$ on the loose metric. When focusing on the changes in $S_{RM}$ between strict and loose metrics, several models (LLaVA-NeXT-110B, GLM-4V-9B, DeepSeek-VL-1.3B, MiniCPM-Llama3-V 2.5) show significant variations. This is a beneficial phenomenon, indicating that these models possess a certain ability to solve one-step problems, but their performance fluctuates due to external factors such as prompting templates and hyper-parameters.

3.3 Quantitative Analysis

The Effectiveness of KCA. Figure 10 displays the quantitative analysis of LMMs with knowledge concept augmentation (KCA). We find that LMMs with different parameter scales show consistent performance improvements on both strict and loose metrics after introducing the KCA strategy. Moreover, the KCA strategy significantly alleviates IK issues but does not noticeably improve IG. This aligns with human intuition, as the knowledge descriptions primarily address gaps in reasoning knowledge. Alleviating IG issues, by contrast, requires a comprehensive enhancement of the LMMs’ knowledge generalization abilities, which we consider a direction for future exploration.
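As a rough illustration of the idea, the sketch below prepends short knowledge-concept descriptions to a question before it is sent to a model. The prompt template, the multiple-choice layout, the function name `build_kca_prompt`, and the example description are all hypothetical; the actual KCA prompt is not specified in this section.

```python
from typing import Dict, List


def build_kca_prompt(question: str, options: List[str],
                     concept_notes: Dict[str, str]) -> str:
    """Build a prompt that places knowledge descriptions before the question."""
    lines = ["Relevant knowledge concepts:"]
    for name, description in concept_notes.items():
        lines.append(f"- {name}: {description}")
    lines.append("")  # blank line between the knowledge and the question
    lines.append(f"Question: {question}")
    lines.extend(options)
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


# Hypothetical example; the description text is written for illustration only.
prompt = build_kca_prompt(
    "What is the measure of angle ABC?",
    ["A. 30 degrees", "B. 60 degrees"],
    {"Understanding Angles": "An angle is formed by two rays sharing an endpoint."},
)
```

Because such a prompt only supplies missing knowledge, it is plausible that it helps with IK while leaving IG largely unchanged, consistent with the observation above.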

Error Analysis. Figure 11 shows the occurrence of the four types of errors across the 67 knowledge concepts. Knowledge errors are the most frequent, appearing in over 45 knowledge concepts. Notably, although visual errors are the second most common, they are concentrated in specific concepts (e.g., "Understanding Angles" >10), and over 38 concepts have no visual errors. This finding underscores the urgent need to enhance the fine-grained measurement capabilities of visual encoders in LMMs for mathematical reasoning, rather than blindly improving their overall capabilities.

4 Related Work

Mathematical Reasoning Benchmarks. Assessing mathematical reasoning abilities is crucial for the development of large foundational models (LLMs and LMMs). Early efforts, such as MathQA [49], focus on solving mathematical word problems and highlight the importance of operation-based reasoning. Following this, datasets like GSM8K [50] and MATH [51] set the stage for evaluating text-based mathematical problems at various difficulty levels. Other benchmarks, such as MMLU [52] and MT-Bench [53], also consider mathematical evaluation as a key part of assessing LLMs. Beyond text-only evaluations, datasets like GeoQA [32], UniGeo [33], and Geometry3K [30] have pioneered the evaluation of geometric problems. Recently, several benchmarks [34] [54] have expanded their scope to cover a broader range of subjects. Additionally, MathVerse [35] aims to evaluate reasoning paths based on reference answers. However, challenges remain due to the complex nature of mathematical reasoning. In this paper, we introduce We-Math, a comprehensive benchmark designed to evaluate the reasoning abilities of LMMs across a wide range of mathematical categories.

Benchmarks for Large Multimodal Models. The rapid advancement of Large Language Models (LLMs) and Large Multimodal Models (LMMs) has highlighted the necessity for more comprehensive evaluation benchmarks. At first, the emergence of a series of text-only benchmarks and evaluations gave us a clearer understanding of the strengths and weaknesses of large language models [55, 52, 53, 56, 57, 58, 59, 60, 61, 50, 51, 62]. On the visual side, early benchmarks predominantly focused on narrow tasks like Visual Question Answering (VQA) [63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75] and image captioning [76, 77, 78], showcasing significant progress but not fully addressing the broader spectrum of multimodal perception and reasoning. This gap has driven recent research to assess LMMs from multiple angles. Notable efforts include MMBench [79] and SEED-bench [80, 81], which probe models’ abilities through common-sense queries in multiple-choice formats. For domain-specific expertise, MMMU [54] utilizes academic content to gauge deeper knowledge levels. Yet benchmarks such as MMStar [82] reveal that certain evaluations allow models to respond without images, risking data leakage and failing to adequately measure logic and reasoning skills. The challenge of understanding image implications, which requires multi-hop reasoning and theory of mind (ToM) [83, 84, 85, 86, 87], underscores this shortfall. In parallel, work at the intersection of Large Language Models (LLMs) and Large Multimodal Models (LMMs) has surged, extending the applicability of LMM evaluations across diverse modalities including 2D images [88, 89, 90], 3D point clouds [91, 92, 93], audio [94, 95, 96, 97], and video [98, 99, 100]. Moreover, a series of works have positioned LMMs as agents with various tools, such as APIs [101, 102, 103] and retrievers [104, 105], thereby broadening the development avenues for the model evaluation community [106, 107, 108, 109].

5 Conclusion

In this paper, we propose We-Math, a comprehensive benchmark for in-depth analysis of LMMs in visual mathematical reasoning. We-Math encompasses 6.5K visual math problems, covering 5 layers and 67 knowledge concepts. Moreover, we pioneer the decomposition of composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric for fine-grained reasoning evaluation. With We-Math, we thoroughly evaluate existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance. Furthermore, we identify IK issues as the greatest vulnerability of LMMs. However, GPT-4o’s main challenge has shifted from IK to IG, highlighting it as the first LMM towards the next stage. Lastly, our analyses of the KCA strategy and error cases offer heuristic guidance for advancing existing LMMs towards human-like visual mathematical reasoning.

References

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
[2] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[3] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[4] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022.
[5] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[6] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[7] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
[8] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
[9] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
[10] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
[11] Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
[12] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
[13] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.
[14] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023.
[15] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
[16] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
[17] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
[18] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. CoRR, 2022.
[19] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: program-aided language models. In Proceedings of the 40th International Conference on Machine Learning, pages 10764–10799, 2023.
[20] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving, 2024.
[21] Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning. CoRR, 2023.
[22] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct, 2023.
[23] Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models, 2023.
[24] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models, 2024.
[25] Chengpeng Li, Zheng Yuan, Guanting Dong, Keming Lu, Jiancan Wu, Chuanqi Tan, Xiang Wang, and Chang Zhou. Query and response augmentation cannot help out-of-domain math reasoning generalization. arXiv preprint arXiv:2310.05506, 2023.
[26] OpenAI. Gpt-4v(ision) system card, 2023.
[27] Gemini Team. Gemini: A family of highly capable multimodal models, 2024.
[28] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421, 9(1):1, 2023.
[29] Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-llava: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370, 2023.
[30] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165, 2021.
[31] Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. Solving geometry problems: Combining text and diagram interpretation. In Proceedings of the 2015 conference on empirical methods in natural language processing, pages 1466–1476, 2015.
[32] Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning. arXiv preprint arXiv:2105.14517, 2021.
[33] Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression. arXiv preprint arXiv:2212.02746, 2022.
[34] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
[35] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024.
[36] Richard Fitzpatrick. Euclid’s elements of geometry, 2008.
[37] Wikipedia contributors. Wikipedia, 2004.
[38] OpenAI. Hello gpt-4o, 2024.
[39] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
[40] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[41] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.
[42] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. Deepseek-vl: Towards real-world vision-language understanding, 2024.
[43] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
[44] Jinyi Hu, Yuan Yao, Chongyi Wang, Shan Wang, Yinxu Pan, Qianyu Chen, Tianyu Yu, Hanghao Wu, Yue Zhao, Haoye Zhang, et al. Large multilingual models pivot zero-shot multimodal learning across languages. arXiv preprint arXiv:2308.12038, 2023.
[45] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024.
[46] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
[47] Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, and Zihan Wang. Chatglm: A family of large language models from glm-130b to glm-4 all tools, 2024.
[48] Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision, 2024.
[49] Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
[50] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[51] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
[52] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021.
[53] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
[54] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023.
[55] Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models, 2023.
[56] Guanting Dong, Jinxu Zhao, Tingfeng Hui, Daichi Guo, Wenlong Wan, Boqi Feng, Yueyan Qiu, Zhuoma Gongque, Keqing He, Zechen Wang, and Weiran Xu. Revisit input perturbation problems for llms: A unified robustness evaluation framework for noisy slot filling task, 2023.
[57] Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, Weihao Zeng, Yejie Wang, Zhuoma GongQue, Jianing Yu, Qiuna Tan, and Weiran Xu. Cs-bench: A comprehensive benchmark for large language models towards computer science mastery, 2024.
[58] Xiaoshuai Song, Keqing He, Pei Wang, Guanting Dong, Yutao Mou, Jingang Wang, Yunsen Xian, Xunliang Cai, and Weiran Xu. Large language models meet open-world intent discovery and recognition: An evaluation of chatgpt, 2023.
[59] Guanting Dong, Yutao Zhu, Chenghao Zhang, Zechen Wang, Zhicheng Dou, and Ji-Rong Wen. Understand what llm needs: Dual preference alignment for retrieval-augmented generation, 2024.
[60] Mingfeng Xue, Dayiheng Liu, Kexin Yang, Guanting Dong, Wenqiang Lei, Zheng Yuan, Chang Zhou, and Jingren Zhou. Occuquest: Mitigating occupational bias for inclusive large language models, 2023.
[61] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?, 2019.
[62] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks, 2020.
[63] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
[64] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
[65] Kushal Kafle and Christopher Kanan. An analysis of visual question answering algorithms. In Proceedings of the IEEE international conference on computer vision, pages 1965–1973, 2017.
[66] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
[67] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
[68] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021.
[69] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019.
[70] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4291–4301, 2019.
[71] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
[72] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE, 2019.
[73] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2024.
[74] Jeffrey P Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, Samual White, et al. Vizwiz: nearly real-time answers to visual questions. In Proceedings of the 23nd annual ACM symposium on User interface software and technology, pages 333–342, 2010.
[75] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022.
[76] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[77] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8948–8957, 2019.
[78] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015.
[79] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
[80] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
[81] Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13299–13308, 2024.
[82] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024.
[83] Poorav Desai, Tanmoy Chakraborty, and Md Shad Akhtar. Nice perfume. how long did you marinate in it? multimodal sarcasm explanation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10563–10571, 2022.
[84] Jack Hessel, Ana Marasović, Jena D Hwang, Lillian Lee, Jeff Da, Rowan Zellers, Robert Mankoff, and Yejin Choi. Do androids laugh at electric sheep? humor "understanding" benchmarks from the new yorker caption contest. arXiv preprint arXiv:2209.06293, 2022.
[85] Winnie Street, John Oliver Siy, Geoff Keeling, Adrien Baranes, Benjamin Barnett, Michael McKibben, Tatenda Kanyere, Alison Lentz, Robin IM Dunbar, et al. Llms achieve adult human performance on higher-order theory of mind tasks. arXiv preprint arXiv:2405.18870, 2024.
[86] Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models, 2024.
[87] Ziqiang Liu, Feiteng Fang, Xi Feng, Xinrun Du, Chenhao Zhang, Zekun Wang, Yuelin Bai, Qixuan Zhao, Liyang Fan, Chengguang Gan, et al. Ii-bench: An image implication understanding benchmark for multimodal large language models. arXiv preprint arXiv:2406.05862, 2024.
[88] Chenshuang Zhang, Fei Pan, Junmo Kim, In So Kweon, and Chengzhi Mao. Imagenet-d: Benchmarking neural network robustness on diffusion synthetic object, 2024.
[89] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. 2009 IEEE conference on computer vision and pattern recognition, pages 248–255, 2009.
[90] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
[91] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
[92] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. Shapenet: An information-rich 3d model repository, 2015.
[93] Andreas Geiger, Philipp Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR), 32(11):1231–1237, 2013.
[94] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.
[95] Christophe Veaux, Junichi Yamagishi, and Kirsten MacDonald. The voice bank corpus: Design, collection and data analysis of a large regional accent speech database. In 12th Annual Conference of the International Speech Communication Association (Interspeech), pages 121–125, 2017.
[96] Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, and Jingren Zhou. Air-bench: Benchmarking large audio-language models via generative comprehension, 2024.
[97] Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy F. Chen. Audiobench: A universal benchmark for audio large language models, 2024.
[98] Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models, 2023.
[99] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. In arXiv preprint arXiv:1609.08675, 2016.
[100] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models, 2023.
[101] Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, and Guanbin Li. Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116, 2024.
[102] Chenyu Wang, Weixin Luo, Qianyu Chen, Haonan Mai, Jindi Guo, Sixun Dong, Xiaohua, Xuan, Zhengxin Li, Lin Ma, and Shenghua Gao. Mllm-tool: A multimodal large language model for tool agent learning, 2024.
[103] Xiao Liu, Jianfeng Lin, and Jiawei Zhang. Beyond text: Unveiling multimodal proficiency of large language models with multiapi benchmark, 2023.
[104] Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, Kaiyan Zhang, Bowen Zhou, and Jie Zhou. Generative multi-modal knowledge retrieval with large language models, 2024.
[105] Ruochen Zhao, Hailin Chen, Weishi Wang, Fangkai Jiao, Xuan Long Do, Chengwei Qin, Bosheng Ding, Xiaobao Guo, Minzhi Li, Xingxuan Li, and Shafiq Joty. Retrieving multimodal information for augmented generation: A survey. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4736–4756, Singapore, December 2023. Association for Computational Linguistics.
[106] Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Chen, Peter Wu, Michelle A. Lee, Yuke Zhu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Multibench: Multiscale benchmarks for multimodal representation learning, 2021.
[107] Wentao Ge, Shunian Chen, Guiming Hardy Chen, Zhihong Chen, Junying Chen, Shuo Yan, Chenghao Zhu, Ziyue Lin, Wenya Xie, Xinyi Zhang, Yichen Chai, Xiaoyu Liu, Dingjie Song, Xidong Wang, Anningzhe Gao, Zhiyi Zhang, Jianquan Li, Xiang Wan, and Benyou Wang. Mllm-bench: Evaluating multimodal llms with per-sample criteria, 2024.
[108] Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, and Wenqi Shao. Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi, 2024.
[109] Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. Uniir: Training and benchmarking universal multimodal information retrievers, 2023.

Appendix

Appendix A Broader Impact

Bridging Human-Like Inspiration and Reliability. As previously mentioned, works such as neural networks [2] and attention mechanisms [3] draw their design inspiration from human thinking patterns. This is fundamentally because the purpose of designing AI is to assist humans. Currently, LMMs have already been helping people in various scenarios, which was unimaginable in the past. Therefore, we firmly believe that a new era is coming, where people will focus not only on the performance of models in specific fields but also on the reliability of a model. In some fundamental scenarios, a reliable model is more important, which is one of the primary motivations behind the creation of We-Math. Furthermore, after completing our experiments, we find that in a loose setting, GPT-4o’s RM metric is only 1.07%, showing us the possibility of a reliable and accurate model emerging in the future.

Fine-grained Evaluation and Versatile Applications. From the model's perspective, We-Math provides LMMs with an assessment of their mathematical abilities. Its IK, IG, and CM metrics offer a fine-grained evaluation of the model's capabilities, while the RM metric reflects the model's reliability: it captures our concern about models that solve a complex problem yet make errors on the sub-problems within its solution process. Finally, we introduce the $\text{Score}_{\text{average}}$ metric to quantify the model's overall performance. Moreover, since We-Math is constructed by decomposing the necessary solution process of each multi-step problem, it provides new perspectives for interactive tasks (multi-turn dialogues), self-supervised learning, information extraction, and other tasks. It also offers crucial references and support for the deployment of models in education and other fields.

Ethics Statement. We ensure that We-Math complies with legal and ethical guidelines throughout its construction process, with no violations. We provide fair compensation to all annotators involved. We-Math focuses on elementary mathematics problems, and its data was collected from publicly available test questions, textbooks, and professional websites. Since mathematics problems inherently have standard answers, they are not subject to cultural differences. We guarantee that We-Math is solely for academic research purposes and strictly prohibit any commercial use. We also declare that we will bear full responsibility in the event of any rights violations and confirm the data license.

Appendix B More Details on We-Math

B.1 Hierarchical Knowledge Structure

Figures 12 and 13 show the detailed hierarchical structure of We-Math, which includes 5 levels, 99 nodes, and 67 leaf nodes.

In the initial stages of constructing the benchmark, we had two key objectives. First, we believe a benchmark should both evaluate the performance of models and indicate which areas need improvement, yet existing benchmarks offer only broad guidance in this respect. Second, as noted earlier, We-Math is the first benchmark specifically designed to study the mathematical problem-solving mechanisms of models. Inspired by the human learning paradigm, which is organized around knowledge concepts, We-Math constructs its dataset with the knowledge concept as the basic unit, which makes its evaluations more rigorous and its guidance more specific.

B.2 Knowledge-based Data Decomposition

Figures 14 and 15 illustrate the process of Knowledge-based Data Decomposition.

Collection. In each example, the Collection section presents specific information about each multi-step problem in the dataset.

Human reasoning. The Human reasoning section shows the process required before decomposing each multi-step problem, where educational experts extract the key information needed for each sub-problem based on the reasoning path for the knowledge concepts included in the multi-step problem.

Decompose. The Decompose section uses the key information extracted in the Human reasoning section to formulate sub-problems, refine the options, and ultimately achieve the decomposition of a multi-step problem.

To ensure that the sub-problems are logically rigorous and mutually independent, the text condition for the first sub-problem is derived from the text condition of the multi-step problem, and the image condition for the first sub-problem is the same as the image condition of the multi-step problem.

Furthermore, in constructing the second sub-problem, two situations may arise. The first situation is where the answer of the first sub-problem is injected as a key condition into the image condition of the second sub-problem, presenting the information visually. The second situation is where the answer of the first sub-problem is injected as a key condition into the text condition of the second sub-problem, while the image condition remains unchanged.

In We-Math, the vast majority of cases are of the first type. However, for some information that is extremely difficult to present in images, we opt for the second type, presenting the information in text form. To ensure fairness in the decomposition of the problems, only one of these situations will occur in the decomposition of the same multi-step problem. This approach ensures that the question of the final sub-problem will match the original multi-step problem, completing the decomposition.
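The decomposition rules above can be sketched as a small data structure with consistency checks. This is a minimal illustration under stated assumptions, not the authors' tooling: the class names, field names, `injection` flag, and image identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubProblem:
    question: str
    text_condition: str
    image_id: str  # hypothetical identifier for the image condition


@dataclass
class DecomposedProblem:
    question: str
    text_condition: str
    image_id: str
    injection: str  # "image" or "text": how earlier answers reach later sub-problems
    sub_problems: List[SubProblem] = field(default_factory=list)

    def check(self) -> List[str]:
        """Return a list of violated decomposition rules (empty if none)."""
        issues: List[str] = []
        if self.injection not in ("image", "text"):
            issues.append("injection must be 'image' or 'text'")
        if not self.sub_problems:
            return issues + ["no sub-problems"]
        # The first sub-problem reuses the image of the multi-step problem.
        if self.sub_problems[0].image_id != self.image_id:
            issues.append("first sub-problem must reuse the original image")
        # The final sub-problem asks the original question.
        if self.sub_problems[-1].question != self.question:
            issues.append("final sub-problem must match the original question")
        # Only one injection type is used within the same multi-step problem.
        for prev, cur in zip(self.sub_problems, self.sub_problems[1:]):
            same_image = cur.image_id == prev.image_id
            if self.injection == "image" and same_image:
                issues.append("image injection should produce an updated image")
            if self.injection == "text" and not same_image:
                issues.append("text injection should leave the image unchanged")
        return issues


# Illustrative two-step problem (content invented for the example).
problem = DecomposedProblem(
    question="What is the area of the square?",
    text_condition="The perimeter of the square is 20 cm.",
    image_id="img_0",
    injection="image",
    sub_problems=[
        SubProblem("What is the side length of the square?",
                   "The perimeter of the square is 20 cm.", "img_0"),
        SubProblem("What is the area of the square?", "", "img_0_side_marked"),
    ],
)
print(problem.check())
```

Keeping the injection type as a single field on the multi-step problem, rather than on each sub-problem, mirrors the constraint that only one of the two situations occurs within the same decomposition.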

Table 3: Prompt templates for response generations.

Type	Prompt Template
Multiple Choice	Now, we require you to solve a multiple-choice math question. Please briefly describe your thought process and provide the final answer(option).
	Question: <Question>
	Option: <Option>
	Regarding the format, please answer following the template below, and be sure to include two <> symbols:
Knowledge Concept Augmentation	Now, we require you to solve a multiple-choice math question. We will provide you with the relevant knowledge concepts of this question for your reference. Please briefly describe your thought process and provide the final answer(option).
	Knowledge concept: <Knowledge concept>
	Question: <Question>
	Option: <Option>
	Regarding the format, please answer following the template below, and be sure to include two <> symbols:

B.3 Knowledge Concepts Augmentation

Table 3 reports the prompt templates used in our experiments. For knowledge concept augmentation, we concatenate the textual descriptions of the relevant knowledge concepts into the prompt. Each knowledge concept description is also accompanied by its corresponding visual content, which helps the experimenter understand the concept and allows further enhancement once models can take sufficient visual information as part of the prompt.

Section F.1 lists the descriptions of all 67 knowledge concepts. For example, as shown in Figure 46, the knowledge concept "Perimeter of Squares" requires knowing that "c=4a". Because a textual description alone is insufficient for understanding this concept, we include visual information to aid comprehension.
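As a rough illustration of how the two templates in Table 3 differ, the sketch below assembles either prompt from its parts. The function name, argument names, and the way options are passed as one pre-formatted string are assumptions; the answer-format template that follows the final colon is not reproduced here, so it is left as a caller-supplied argument.

```python
from typing import Optional

INTRO_MC = (
    "Now, we require you to solve a multiple-choice math question. "
)
INTRO_KCA = (
    "We will provide you with the relevant knowledge concepts of this "
    "question for your reference. "
)
INSTRUCTION = (
    "Please briefly describe your thought process and provide the final "
    "answer(option)."
)
FORMAT_NOTE = (
    "Regarding the format, please answer following the template below, and "
    "be sure to include two <> symbols:"
)


def build_prompt(question: str, options: str,
                 knowledge_concept: Optional[str] = None,
                 answer_format: str = "") -> str:
    """Assemble a prompt; passing knowledge_concept switches to the KCA template."""
    lines = []
    if knowledge_concept is None:
        lines.append(INTRO_MC + INSTRUCTION)
    else:
        lines.append(INTRO_MC + INTRO_KCA + INSTRUCTION)
        # The textual concept description is concatenated into the prompt.
        lines.append(f"Knowledge concept: {knowledge_concept}")
    lines.append(f"Question: {question}")
    lines.append(f"Option: {options}")
    lines.append(FORMAT_NOTE)
    if answer_format:  # hypothetical: supplied by the caller, not taken from the paper
        lines.append(answer_format)
    return "\n".join(lines)


# Illustrative question; the concept text echoes the "c=4a" example above.
print(build_prompt(
    "The side length of the square is 5 cm. What is its perimeter?",
    "A. 10 cm; B. 20 cm; C. 25 cm; D. No correct answer",
    knowledge_concept="Perimeter of Squares: c=4a",
))
```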

B.4 Details of Data Collection

Guided by the hierarchical knowledge structure, we select geometric problems with images from publicly available, authoritative mathematics websites in various countries, including professional exams and practice tests (the detailed list of sources can be found in Section F.2). To ensure comprehensive coverage of the fundamental and critical areas of primary math, we select the five most foundational and prevalent domains within primary geometry:

• Plane figures: Questions involving identification and properties of two-dimensional shapes.

• Solid figures: Questions related to the recognition and characteristics of three-dimensional objects.

• Transformation and motion of figures: Problems focusing on geometric transformations such as translation, rotation, and reflection.

• Position and direction: Questions that involve understanding spatial relationships and directions.

• Measurement: Problems requiring the measurement of length, area, volume, and angles.

The selection criteria are as follows: (1) The problems include multiple knowledge concepts and can be decomposed into steps for solution. (2) The problems and images are consistent. (3) The correct answer is unique, and the distractor options are highly confusing.

B.5 Details of Data Statistics

Table 4: Key statistics of We-Math.

Statistic	Number
Total questions	6,524
Newly collected questions	6,524
Multiple-choice questions	6,524
-First-layer nodes	5
-Second-layer nodes	12
-Terminal nodes	67
Question options
-Total options	25,178
-Average options	3.859
-Proportion of answer A	6,524 (25.9%)
-Proportion of answer B	6,524 (25.9%)
-Proportion of answer C	6,505 (25.8%)
-Proportion of answer D	4,419 (17.6%)
-Proportion of answer E	1,198 (4.8%)
-Proportion of answer F&G	11 (0.04%)
Question length
-Maximum length (word)	143
-Maximum length (character)	852
-Average length (word)	25.8
-Average length (character)	135.3

Question distribution. We-Math consists entirely of English questions, all newly collected from publicly available, authoritative mathematics websites and presented as multiple-choice questions. As shown in Table 4, the average question length in We-Math is 25.81 words, and the longest question contains 143 words. Figure 16 further details the distribution of word counts, highlighting the diverse patterns of the questions.
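The option statistics in Table 4 can be re-derived with a few lines of arithmetic. The sketch below assumes that each "Proportion of answer X" row counts the questions that offer option X and that the percentages are taken relative to the total number of options; variable names are illustrative.

```python
# Values copied from Table 4.
total_questions = 6524
total_options = 25178
option_counts = {"A": 6524, "B": 6524, "C": 6505, "D": 4419, "E": 1198, "F&G": 11}

# Average number of options per question.
average_options = total_options / total_questions

# Share of each option letter among all options, in percent.
shares = {letter: 100 * count / total_options
          for letter, count in option_counts.items()}

print(f"average options per question: {average_options:.3f}")
for letter, share in shares.items():
    print(f"{letter}: {share:.2f}%")
print("sum of per-letter counts:", sum(option_counts.values()))
```

Under this reading, the average and the percentages agree with the table after rounding. The per-letter counts sum to 25,181, three more than the listed total of 25,178, so the F&G row may be counted on a slightly different basis.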

Advantages of Multiple-Choice Questions.

In We-Math, all problems are presented as multiple-choice questions. Even if some problems did not originally conform to the multiple-choice format during the initial selection, our researchers manually converted them into the format. Using multiple-choice questions offers several advantages:

Standardization: Ensures uniformity across all questions, facilitating consistent assessment and comparison across different hierarchical subjects.

Objective Grading: The use of single correct answers eliminates subjectivity in grading, enhancing the reliability of the evaluation.

Efficiency: Allows for rapid and scalable assessment, suitable for large datasets and automated systems.

Focused Assessment: Carefully designed distractors help in accurately identifying specific knowledge gaps and common misconceptions.

Appendix C More Details on the Metrics

Distinguishing Metric. To account for the instability of model outputs, we propose two criteria, strict and loose, for distinguishing RM from CM. Figure 4 illustrates them for two-step problems, and Figures 17 and 18 for three-step problems. Under the strict metric, a correctly answered multi-step problem is classified as RM (Rote Memorization) if any of its sub-problems is answered incorrectly, and as CM (Complete Mastery) only if all of its sub-problems are also answered correctly (TTTT for three-step problems, TTT for two-step problems). Under the loose metric, it is classified as RM only if the model answers all sub-problems incorrectly (FFFT, FFT); otherwise, it is classified as CM. Therefore, the $\text{Score}_{\text{average}}$ under the loose metric is slightly higher. GPT-4o [38] and GPT-4V [26] already perform nearly perfectly under the loose metric and are far ahead of other models; we hope they will bring even greater surprises under the strict metric in their next updates.

Metrics’ Intrinsic Logic. As shown in Figures 4, 17, and 18 and discussed in the Metric for Reasoning Evaluation section, IK, IG, and CM are logically related. In the early stages of constructing We-Math, we recorded all model responses and analyzed the answers to each multi-step problem and its corresponding sub-problems. We believe that for both humans and models, a reasonable learning process first masters each knowledge concept individually and then learns to apply the concepts together to achieve complete mastery. A case in which the multi-step problem is answered correctly but its sub-problems are answered incorrectly (RM) is therefore an unreasonable phenomenon. This motivated the four-dimensional fine-grained metric we use to further evaluate the model's performance.
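The strict and loose criteria can be expressed as a short classification routine. This is a minimal sketch, not the released evaluation code: the function name and signature are hypothetical, and the IK/IG branch assumes the main-text definitions (IG when all sub-problems are correct but the multi-step problem is wrong, IK when the multi-step problem and at least one sub-problem are wrong).

```python
from typing import Sequence


def classify(sub_correct: Sequence[bool], multi_correct: bool,
             strict: bool = True) -> str:
    """Map one multi-step problem and its sub-problems to IK, IG, CM, or RM."""
    all_sub = all(sub_correct)
    any_sub = any(sub_correct)
    if not multi_correct:
        # Assumed from the main text: knowledge is present but not generalized
        # (IG) versus at least one knowledge concept missing (IK).
        return "IG" if all_sub else "IK"
    if all_sub:
        return "CM"  # TTT / TTTT under both criteria
    if strict:
        return "RM"  # any sub-problem error counts as rote memorization
    # Loose: RM only when every sub-problem is wrong (FFT / FFFT).
    return "CM" if any_sub else "RM"


# A two-step problem answered correctly with one sub-problem wrong (pattern TFT).
print(classify([True, False], True, strict=True))
print(classify([True, False], True, strict=False))
```

The two calls differ only in the `strict` flag, which is the single point where the criteria diverge: partially correct sub-problems move from RM to CM under the loose metric.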

Appendix D More Details on Experiment Setup

D.1 Details of the Evaluated Models

To evaluate the mathematical reasoning abilities of various LMMs, we selected their latest model versions. Table 5 presents their release dates and specific sources. Given the intuition that smaller models (with parameters of 7B or less) perform poorly on Insufficient Knowledge (IK), we also included evaluations of the latest models with 7B, 4.2B, and 1.3B parameters. This was done to explore whether these models could achieve significant improvement under the KCA strategy.

Table 5: The release time and model source of LMMs used in We-Math

Model	Release Time	Source
GPT-4o [38]	2024-05	https://gpt4o.ai/
GPT-4V [26]	2024-04	https://openai.com/index/gpt-4v-system-card/
Gemini 1.5 Pro [40]	2024-05	https://deepmind.google/technologies/gemini/pro/
Qwen-VL-Max [13]	2024-01	https://huggingface.co/spaces/Qwen/Qwen-VL-Max/
LLaVA-NeXT-110B [39]	2024-05	https://huggingface.co/lmms-lab/llava-next-110b/
LLaVA-NeXT-72B [39]	2024-05	https://huggingface.co/lmms-lab/llava-next-72b/
LLaVA-1.6-13B [41]	2024-03	https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf/
LLaVA-1.6-7B [41]	2024-03	https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf/
DeepSeek-VL-1.3B [42]	2024-03	https://huggingface.co/deepseek-ai/deepseek-vl-1.3b-chat/
DeepSeek-VL-7B [42]	2024-03	https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat/
Phi3-Vision-4.2B [43]	2024-05	https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/
MiniCPM-LLaMA3-V 2.5 [44]	2024-05	https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/
InternLM-XComposer2-VL-7B [45]	2024-04	https://huggingface.co/internlm/internlm-xcomposer2-vl-7b/
InternVL-Chat-V1.5 [46]	2024-04	https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5/
GLM-4V-9B [47]	2024-06	https://huggingface.co/THUDM/glm-4v-9b
LongVA [48]	2024-06	https://huggingface.co/lmms-lab/LongVA-7B
G-LLaVA-13B [29]	2024-03	https://huggingface.co/renjiepi/G-LLaVA-13B/

D.2 Details of the Model Hyperparameters

For all closed-source models with API access, we adopt the generation scheme shown in Table 6 and run the inference from CPU-only machines, which typically completes within a day. For all open-source models, we run the inference on a cluster with 8 NVIDIA A800-SXM4-80GB GPUs and follow the hyperparameter settings specified in the inference samples of each model's source. If no specific instructions are provided, we use the default settings. Table 7 details the specific generation parameters.

Table 6: Generating parameters for Closed-Source LMMs.

Model	Generation Setup
GPT-4o	"model" : "gpt-4o", "temperature" : 0, "max_tokens" : 1024
GPT-4V	"model" : "gpt-4-turbo", "temperature" : 0, "max_tokens" : 1024
Gemini 1.5 Pro	"model" : "gemini-1.5-pro-latest", "temperature" : 0, "max_tokens" : 1024
Qwen-VL-Max	"model" : "qwen-vl-max", "temperature" : 0, "max_tokens" : 1024
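For reproducibility, the closed-source settings in Table 6 could be kept in a single configuration mapping, as in the sketch below. The dictionary and helper names are hypothetical; only the key-value pairs come from the table. How these parameters are passed to each provider's client is outside the scope of this sketch, and individual SDKs may name the corresponding parameters differently.

```python
import json

# Generation setup for closed-source LMMs, copied from Table 6.
CLOSED_SOURCE_SETUP = {
    "GPT-4o": {"model": "gpt-4o", "temperature": 0, "max_tokens": 1024},
    "GPT-4V": {"model": "gpt-4-turbo", "temperature": 0, "max_tokens": 1024},
    "Gemini 1.5 Pro": {"model": "gemini-1.5-pro-latest", "temperature": 0,
                       "max_tokens": 1024},
    "Qwen-VL-Max": {"model": "qwen-vl-max", "temperature": 0, "max_tokens": 1024},
}


def generation_params(display_name: str) -> dict:
    """Return a copy of the generation parameters for one evaluated model."""
    return dict(CLOSED_SOURCE_SETUP[display_name])


print(json.dumps(generation_params("GPT-4V"), indent=2))
```

All four entries use a temperature of 0 and a 1024-token output limit, so only the model identifier varies between runs.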

Table 7: Generating parameters for Open-Source LMMs.

Model

Generation Setup

LLaVA-NeXT-110B