We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Runqi Qiao1  , Qiuna Tan1, Guanting Dong1, Minhui Wu2, Chong Sun2, Xiaoshuai Song1,
Zhuoma GongQue1, Shanglin Lei3, Zhe Wei1, Miaoxuan Zhang1, Runfeng Qiao4,
Yifan Zhang1, Xiao Zong1, Yida Xu1, Muxi Diao1, Zhimin Bao2,
Chen Li2,
Honggang Zhang1
1Beijing University of Posts and Telecommunications, 2Wechat, Tencent Inc.,
3Huazhong University of Science and Technology, 4Beijing Institute of Technology
https://We-Math.github.io
Equal contribution. Corresponding author.
Abstract

Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks focus on result-oriented performance but neglect the underlying principles of knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce We-Math, the first benchmark specifically designed to explore problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and 5 layers of knowledge granularity. We first decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs’ reasoning process. With We-Math, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between the number of solving steps and problem-specific performance. We confirm that the IK issue of LMMs can be effectively alleviated via a knowledge augmentation strategy. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization: they correctly solve composite problems involving multiple knowledge concepts, yet fail to answer the corresponding sub-problems. We anticipate that We-Math will open new pathways for advancements in visual mathematical reasoning for LMMs. The We-Math data and evaluation code are available at https://github.com/We-Math/We-Math.

Figure 1: Overview of LMMs’ performance on We-Math. From left to right, the panels illustrate (1) the accuracy of different LMMs on problems with different numbers of solving steps, (2) the performance in different visual mathematics categories, and (3) the results of the knowledge-based reasoning evaluation.

1 Introduction

“I think, therefore I am.” — René Descartes

Human cognitive and reasoning patterns have profoundly shaped the progress of deep learning [1]. Initially, the design of neural networks [2] was inspired by the brain’s neuronal mechanisms: convolution kernels and hierarchical networks mimic the human cognitive process of knowledge acquisition. More recently, Transformers [3] employ attention mechanisms to handle multiple information flows and quickly focus on critical content, thereby achieving more efficient and in-depth sequential learning. Owing to the scalability of the Transformer architecture and pre-training techniques, Large Language Models (LLMs) [4, 5, 6, 7] and Large Multimodal Models (LMMs) [8, 9, 10, 11, 12, 13, 14, 15, 16] showcase strong reasoning abilities that parallel human performance across a wide range of tasks and provide a glimpse into the early outlines of Artificial General Intelligence (AGI).

Mathematical reasoning is a critical capability of foundation models. Existing methods employ Chain of Thought (COT) [17], Program of Thought (POT) [18, 19], tool-integrated techniques [20, 21] and data augmentation strategies [22, 23, 24, 25] to guide LLMs towards emulating human-like reasoning patterns. In a more challenging scenario, visual mathematical reasoning requires the model to accurately decode the visual information in the image and reason over it together with the textual problem. With the rapid advancement of LMMs [26, 27], researchers increasingly use them to solve visual mathematical problems [28, 29]. These studies provide valuable insights into the ongoing improvements in multimodal logical thinking capabilities.

To systematically evaluate visual mathematical reasoning capabilities, previous efforts [30, 31, 32, 33] have focused on challenging geometric problems. Recently, several benchmarks [34, 35] have expanded the scope to include a wider range of disciplines. However, these benchmarks rely solely on end-to-end results for assessment, which fails to identify inherent issues within the LMMs’ reasoning process. MathVerse [35] attempts to directly evaluate reasoning paths based on reference answers, but limitations remain due to the knowledge-intensive nature of mathematical reasoning. Noting that humans solve complex math problems by gradually mastering and generalizing knowledge concepts [36], we claim that a fair evaluation of a model’s reasoning process should be based on knowledge concepts. Therefore, we pose two questions about mathematical reasoning evaluation:

Q1: Does the correct answer truly reflect LMM’s ability to reason through such problems accurately?

Q2: Does an incorrect answer suggest a lack of foundational knowledge in LMM’s reasoning process?

In response, we present We-Math, a pioneering benchmark for conducting an in-depth analysis of the underlying principles of LMMs in visual mathematical reasoning. We-Math consists of over 6.5K meticulously selected visual math problems, which are categorized into 5 layers of knowledge granularity across 67 knowledge concepts to ensure comprehensive coverage. We observe that real-world math problems typically encompass multiple foundational knowledge concepts, and their difficulty is directly related to the number of concepts involved. Based on this, we decouple the model’s ability to solve composite problems with $k$ knowledge concepts into two stages:

1) LMMs can solve the $k$ individual sub-problems, each corresponding to one knowledge concept;

2) LMMs reason out the final answer by integrating the $k$ individual knowledge concepts.

The above process can be formulated as follows:

$$P(Y|X)=\prod_{i=1}^{k}P(y_{i}|x_{i})\cdot P_{\text{reason}}, \tag{1}$$

where $(X,Y)$ and $(x_{i},y_{i})$ denote the (question, answer) pairs of a composite mathematical problem and of its $i$-th sub-problem, respectively. $P_{\text{reason}}$ stands for the LMMs’ reasoning capacity. It is evident that assessing the reasoning process cannot be based solely on final answers. To decompose a composite problem into individual sub-problems according to the invoked knowledge concepts, we select 1.5K high-quality problems with multiple knowledge concepts in We-Math. Following Equation 1, these composite problems are gradually decomposed by expert annotators into one-step problems $(x_{i},y_{i})$. Motivated by human reasoning patterns, We-Math further introduces a four-dimensional metric to precisely evaluate the inherent gaps in LMMs’ problem-solving abilities, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM). To further tackle the fundamental IK issue, we propose a heuristic knowledge concept augmented (KCA) strategy, constructing descriptions for 67 knowledge concepts from Wikipedia [37] and textbooks, thereby providing essential knowledge for LMMs’ reasoning.
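As a rough numerical illustration of Equation 1, the Python sketch below multiplies per-sub-problem success probabilities by a reasoning term. The function name and all probability values are invented for illustration; they are not measurements from We-Math.

```python
import math

def composite_success(sub_probs, p_reason):
    """Equation 1: product of P(y_i | x_i) over the k sub-problems, times P_reason."""
    return math.prod(sub_probs) * p_reason

# Hypothetical values: three knowledge concepts, each solved 90% of the time,
# and an assumed reasoning (integration) term of 0.8.
p = composite_success([0.9, 0.9, 0.9], 0.8)
print(round(p, 4))  # 0.5832
```

Even with strong per-concept accuracy, the product shrinks as $k$ grows, which is why a correct or incorrect final answer alone says little about which factor is responsible.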

Figure 1 gives an overview of our experimental results. Not surprisingly, GPT-4o [38] achieves the best overall performance across different visual mathematics categories. Closed-source LMMs (GPT-4V, Gemini 1.5 Pro) and LMMs with larger parameter scales (LLaVA-NeXT-110B [39]) generally exhibit superior visual mathematical reasoning capabilities. However, most LMMs perform significantly worse on multi-step problems than on one-step problems, suggesting that the number of knowledge concepts is positively correlated with a question’s difficulty and negatively correlated with LMM performance. In specialized disciplines, most LMMs excel in calculation but consistently struggle with fine-grained visual measurement ("Angles and Length").

For reasoning evaluation, we emphasize that mastery of knowledge concepts is fundamental. Unfortunately, most LMMs still suffer from the Insufficient Knowledge issue, especially smaller-scale models (e.g., over 350 IK issues in LLaVA-1.6-7B and DeepSeek-VL-1.3B). GPT-4o significantly narrows this knowledge gap, establishing it as the first LMM advancing towards the knowledge generalization stage. More notably, several LMMs still exhibit a marked inclination towards Rote Memorization (e.g., G-LLaVA-13B reaches nearly 36% in RM (Loose)), raising doubts about whether current LMMs truly possess mathematical reasoning capability. In addition, our proposed KCA strategy substantially reduces the IK issue in LMMs, and error analysis further provides empirical guidance towards human-like reasoning. We anticipate that We-Math will open new pathways for advancements in visual mathematical reasoning in LMMs.

2 We-Math

Overview of We-Math. As previously mentioned, existing benchmarks tend to be result-oriented while overlooking the essence of solving mathematical problems. This leads to some counterintuitive evaluation conclusions. For example, conclusions in MathVista [34] indicate that LMMs exhibit superior performance on university-level problems compared to elementary-level ones. Different from existing benchmarks, as shown in Figure 2, We-Math is constructed around textbook knowledge units, decomposing composite problem solutions into sub-problems based on the knowledge concepts. We-Math has the following characteristics:

(1) Hierarchical Knowledge Structure. We-Math strictly adheres to the knowledge presented in mathematics textbooks, featuring a rigorous hierarchical and multi-category architecture. It ensures the independence of knowledge concepts within the same level, while establishing logical relationships among concepts at different hierarchical levels.

(2) Knowledge based Reasoning Evaluation. We-Math is designed to explore how LMMs solve problems. Drawing on the observation that humans tackle problems incrementally by leveraging fundamental knowledge concepts, we break down complex mathematical problems into more manageable sub-problems. Furthermore, we employ diverse measurement dimensions for meticulous evaluation.

(3) Knowledge Concept Augmentation. To alleviate the inherent issues during the problem-solving process, we heuristically introduce descriptions for 67 knowledge concepts from Wikipedia and textbooks, thereby providing essential knowledge support for the reasoning processes of LMMs.

Figure 2: Overview diagram and statistics of We-Math. The left side shows the first two layers of We-Math’s categories, and the right side shows information on different samples and terminal nodes.

2.1 Hierarchical Structured Dataset Composition

Hierarchical Knowledge Structure. We-Math emphasizes fundamental math skills, on the premise that complex mathematical reasoning is built upon a foundation of basic mathematical reasoning processes. Based on extensive research, mathematical problems are categorized into five distinct types, namely Plane Figures, Solid Figures, Transformations and Movements of Shapes, Positions and Directions, and Measurements. These five categories are decomposed into 12 typical problem types, which are further decomposed into 67 knowledge concepts (the terminal nodes of the structure). We collect problems according to this tree structure and require that each terminal node contain between 10 and 40 samples. This rule ensures data balance across domains.

Data Collection and Annotation. All problems (6.5K) in We-Math are sourced from publicly available, authoritative mathematics websites and subsequently organized according to our defined knowledge structure. We employ three expert annotators to manually label each question with knowledge concepts. Cross-validation is performed to ensure that at least two experts have identical annotations for the same question. Samples with notably inconsistent labels are considered low quality and excluded. To prepare for the subsequent decomposition of problems, we further annotate problem-solving steps based on the knowledge concept labels. We categorize each problem into one of three classes: "One-Step", "Two-Step", and "Three-Step". This categorization enables us to gain a deeper understanding of how LMMs solve problems. Further details about annotation can be found in the Appendix. After annotation, all problems are double-checked by an expert team in terms of three aspects: (1) the consistency between the questions and diagrams; (2) the correctness of the answers to the questions; (3) the alignment between problems and the 67 knowledge concepts.
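The agreement rule described above (keep a label only when at least two of the three annotators assign identical knowledge concepts) can be sketched in Python as follows. The function name and the label strings are hypothetical; the paper does not describe its annotation tooling.

```python
from collections import Counter

def consensus_label(annotations, min_agree=2):
    """Return the concept set shared by at least `min_agree` annotators, else None.

    `annotations` holds one collection of concept names per annotator;
    the order of concepts within an annotation is ignored.
    """
    counts = Counter(frozenset(a) for a in annotations)
    label, votes = counts.most_common(1)[0]
    return set(label) if votes >= min_agree else None

# Two of three (hypothetical) annotators agree, so the sample is kept.
print(consensus_label([["Area", "Perimeter"], ["Perimeter", "Area"], ["Area"]]))
```

A sample for which the function returns `None` would correspond to the inconsistent, low-quality case that is excluded.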

2.2 Knowledge based Reasoning Evaluation

Problem Definition. For the visual mathematical reasoning task, given a text question $Q_{i}$, an image $I_{i}$ and the corresponding answer $A_{i}$, we define the LMM evaluation dataset as $D_{\rm eval}=\{(Q_{i},I_{i},A_{i})\,|\,K_{i},C_{i}\}_{i=1}^{N}$, where $K_{i}$ and $C_{i}$ are two prior constraints for question $Q_{i}$. In detail, $K_{i}=\{k_{i}^{m}\}_{m=1}^{M}$ denotes the $M$ knowledge concepts within the question, and $C_{i}$ represents the prerequisite conditions needed to solve the problem $Q_{i}$ (see Figure 3 for an example). For convenience of presentation, we refer to a problem containing $k$ knowledge concepts as a "$k$-step problem" in this paper.

Figure 3: The pipeline of knowledge-based data decomposition (an example of a three-step problem in We-Math).

Knowledge-based Data Decomposition. Real-world mathematical problems are composed of multiple atomic knowledge concepts. However, existing benchmarks usually overlook this information, leading to unreasonable evaluation results. Inspired by Euclid’s Elements [36], we argue that evaluating the mathematical reasoning ability of LMMs essentially involves assessing their mastery of fundamental knowledge concepts, so basic knowledge concepts offer a natural and objective basis for reasoning evaluation. Given the $i$-th test sample $\{(Q_{i},I_{i},A_{i})\,|\,K_{i},C_{i}\}\in D_{\textsc{We-Math}}$ with $M$ concepts $K_{i}=\{k_{i}^{m}\}_{m=1}^{M}$, we ask human experts to decompose each problem step by step into $M$ sub-problems based on knowledge concepts, which can be formulated as:

$$\{(q_{i}^{m},i_{i}^{m},a_{i}^{m})\,|\,k_{i}^{m},c_{i}^{m}\}_{m=1}^{M}=\mathop{\rm Decompose}_{(Q_{i},I_{i},A_{i})\in D_{\textsc{We-Math}}}\{(Q_{i},I_{i},A_{i})\,|\,K_{i},C_{i}\} \tag{2}$$

where $k_{i}^{m}$ and $c_{i}^{m}$ denote the individual knowledge concept and the prior condition for the $m$-th sub-problem, and “$\mathop{\rm Decompose}$” represents the human decomposition process based on the $M$ knowledge concepts. To ensure the logical coherence of the decomposition, the condition $c_{i}^{m}$ is initialized as $C_{i}$. It is then recursively computed by concatenating the answer $a_{i}^{m-1}$ and the condition $c_{i}^{m-1}$ of the $(m-1)$-th concept:

$$c_{i}^{m}=c_{i}^{m-1}+a_{i}^{m-1}\quad\text{for } m=2,3,\ldots,M \tag{3}$$

where “$+$” denotes the concatenation operation. In addition, the constraints $q_{i}^{M}=Q_{i}$ and $a_{i}^{M}=A_{i}$ must be satisfied, which also enforces logical coherence. Finally, we obtain the original multi-step problem and $M$ one-step sub-problems for reasoning evaluation. The overall pipeline of Knowledge-based Data Decomposition is displayed on the left side of Figure 3.
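The recursion in Equation 3 and the final-step constraint can be sketched in Python as below. This is only an illustration: the decomposition itself is performed by human experts, and the class name, function names, separator, and the example problem are all hypothetical. The sketch also reads "initialized as $C_{i}$" as meaning that the first sub-problem receives $C_{i}$ as its condition.

```python
from dataclasses import dataclass

@dataclass
class SubProblem:
    question: str   # q_i^m
    answer: str     # a_i^m
    condition: str  # c_i^m

def build_sub_problems(initial_condition, steps, sep=" "):
    """Attach conditions to expert-written (question, answer) steps, m = 1..M.

    Assumed reading of Equation 3: c^1 = C_i and c^m = c^{m-1} + a^{m-1},
    where "+" is string concatenation (the separator is our choice).
    """
    subs = []
    condition = initial_condition
    for m, (question, answer) in enumerate(steps):
        if m > 0:
            prev = subs[-1]
            condition = prev.condition + sep + prev.answer
        subs.append(SubProblem(question, answer, condition))
    return subs

def is_coherent(subs, final_question, final_answer):
    """Check the constraint q^M = Q_i and a^M = A_i on the last sub-problem."""
    return subs[-1].question == final_question and subs[-1].answer == final_answer

# Hypothetical two-step problem.
C = "A rectangle is 4 cm long and 3 cm wide."
steps = [
    ("What is the perimeter of the rectangle?", "The perimeter is 14 cm."),
    ("A square has the same perimeter. What is its side length?", "The side length is 3.5 cm."),
]
subs = build_sub_problems(C, steps)
print(subs[1].condition)
```

Each later sub-problem therefore sees the original conditions plus every earlier answer, so it can be solved in isolation with a single knowledge concept.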

Metric for Reasoning Evaluation.

Figure 4: An example of the four-dimensional metrics for evaluating a two-step problem, using both loose and strict settings.

Based on the decomposed multi-step problems, we further reveal the inherent issues of LMMs in the problem-solving process. We feed both the $M$ one-step sub-problems and the original problem into LMMs, and classify the responses into four categories:

1. Insufficient Knowledge (IK): Some of the one-step problems are answered incorrectly, and the multi-step problem is also wrong. This is reasonable, because a model’s insufficient grasp of a single knowledge concept may lead to errors in the multi-step problem.

2. Inadequate Generalization (IG): The one-step problems are all correct, but the multi-step problem is incorrect. This is also considered reasonable. While LMMs are capable of understanding individual knowledge concepts, they may struggle to generalize that knowledge to solve composite problems.

3. Complete Mastery (CM): The one-step problems are all correct, and the multi-step problem is also answered correctly. This result demonstrates that the model’s results are both reliable and accurate.

4. Rote Memorization (RM): Some of the one-step problems are answered incorrectly, but the multi-step problem is answered correctly, which contradicts human logical thinking. If a model can solve composite multi-step problems but fails to answer the one-step problems needed in the process, it raises doubts about the model’s reliability.

Considering IK, IG, and CM, results falling under IG are generally preferable to those classified as IK. The reason is that IK reflects the model’s struggle with both single and multiple knowledge concepts, while IG shows the model’s proficiency on one-step problems. By enhancing the model’s generalization ability in the reasoning process, we can potentially shift results from IG to CM. Therefore, we establish a reasoning capability hierarchy as $\textit{IK}<\textit{IG}<\textit{CM}$. We regard RM as an unreasonable scenario: a model that solves multi-step problems without mastering the one-step problems completely contradicts human reasoning intuition.
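The four categories amount to a two-by-two decision on whether all one-step sub-problems are correct and whether the multi-step problem is correct. A minimal Python sketch of this strict assignment follows; the function name and input representation are our own and are not taken from the released evaluation code.

```python
def classify_strict(sub_correct, multi_correct):
    """Assign one multi-step sample to IK, IG, CM, or RM (strict setting).

    sub_correct:   list of bools, one per one-step sub-problem
    multi_correct: bool, whether the original multi-step problem was solved
    """
    all_subs_correct = all(sub_correct)
    if multi_correct:
        return "CM" if all_subs_correct else "RM"
    return "IG" if all_subs_correct else "IK"

# Two-step examples: (step 1, step 2) sub-problem results, then the composite result.
print(classify_strict([True, True], False))   # IG
print(classify_strict([True, False], True))   # RM
```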

In light of the model’s instability, the current criterion for determining whether a result belongs to RM is strict. We thus propose a more flexible loose metric. As illustrated in Figure 4, the TFT and FTT situations in two-step problems are regarded as CM (rather than RM) under the loose metric. We also discuss the four-dimensional metrics for three-step problems in Appendix C. We propose the following metric to judge the reliability of the model’s reasoning process:

$$S_{\rm IK}=\frac{N_{\rm IK}}{N},\quad S_{\rm IG}=\frac{N_{\rm IG}}{N},\quad S_{\rm CM}=\frac{N_{\rm CM}}{N},\quad S_{\rm RM}=\frac{N_{\rm RM}}{N_{\rm RM}+N_{\rm CM}} \tag{4}$$

where $N$ denotes the total number of samples and $N_{\rm IK}$, $N_{\rm IG}$, $N_{\rm CM}$, $N_{\rm RM}$ represent the numbers of samples in the respective categories. We then obtain our final reasoning confidence score:

$$\text{Score}_{\text{average}}=\alpha S_{\rm IK}+\beta S_{\rm IG}+S_{\rm CM} \tag{5}$$

where $\alpha$ and $\beta$ denote the weights for the IK and IG cases. To preserve the reasoning capability hierarchy "IK < IG < CM", we constrain the parameters to $\alpha<\beta<1$ and set the default values to $\alpha=0.0$ and $\beta=0.5$.
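Equations 4 and 5 can be computed directly from the four category counts. The Python sketch below assumes that every evaluated sample falls into exactly one category, so that $N$ is the sum of the four counts; the function name and the example counts are hypothetical and do not come from the paper’s results.

```python
def reasoning_scores(n_ik, n_ig, n_cm, n_rm, alpha=0.0, beta=0.5):
    """Compute the Equation 4 rates and the Equation 5 score from category counts."""
    if not alpha < beta < 1:
        raise ValueError("expected alpha < beta < 1")
    # Assumption: N is the sum of the four mutually exclusive categories.
    n = n_ik + n_ig + n_cm + n_rm
    if n == 0:
        raise ValueError("no samples to score")
    s_ik, s_ig, s_cm = n_ik / n, n_ig / n, n_cm / n
    # S_RM is normalized by the samples whose multi-step answer was correct.
    answered_correctly = n_rm + n_cm
    s_rm = n_rm / answered_correctly if answered_correctly else 0.0
    score = alpha * s_ik + beta * s_ig + s_cm
    return {"IK": s_ik, "IG": s_ig, "CM": s_cm, "RM": s_rm, "score": score}

# Hypothetical counts for 100 multi-step samples.
result = reasoning_scores(n_ik=40, n_ig=20, n_cm=30, n_rm=10)
print(result)
```

Note that $S_{\rm RM}$ uses a different denominator from the other three rates, so the four values are not expected to sum to one.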

2.3 Knowledge Concept Augmentation

In the previous section, we identified Insufficient Knowledge (IK) as the fundamental challenge in mathematical reasoning. To heuristically tackle this issue, we enlist human experts to create 67 knowledge concept cards, which provide knowledge essential to the LMMs’ reasoning process. Initially, expert annotators write precise summaries derived from the definitions in Euclid’s Elements [36], Wikipedia and textbooks. Subsequently, these experts further condense the content examined by a series of questions related to a specific knowledge concept, extracting crucial knowledge hints for incorporation into the knowledge cards. After several rounds of review, we confirm the accuracy and utility of each card. Figure 5 showcases typical knowledge concept cards and their descriptions. Consequently, given a problem $Q_{i}$ and its respective knowledge concepts $K_{i}$, LMMs utilize the relevant knowledge cards to deduce the answer $A_{i}$. Detailed information about KCA can be found in the Appendix.
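One plausible way to supply a knowledge card at inference time is to prepend its text to the question prompt. The Python sketch below illustrates that idea only: the prompt template, the function name, and the card content are hypothetical, and the actual KCA prompt is described in the paper’s Appendix, which is not reproduced here.

```python
def build_kca_prompt(question, options, concept_cards):
    """Prepend knowledge-card text to a multiple-choice question.

    The template wording is a hypothetical illustration, not the paper's prompt.
    concept_cards maps a concept name to its short description.
    """
    knowledge = "\n".join(f"- {name}: {text}" for name, text in concept_cards.items())
    choices = "\n".join(f"{letter}. {opt}" for letter, opt in zip("ABCDEFGH", options))
    return (
        "Knowledge hints:\n" + knowledge + "\n\n"
        "Question: " + question + "\n" + choices + "\n"
        "Answer with the letter of the correct option."
    )

# Hypothetical card and question.
cards = {"Perimeter of a rectangle": "P = 2 * (length + width)."}
prompt = build_kca_prompt(
    "A rectangle is 4 cm long and 3 cm wide. What is its perimeter?",
    ["7 cm", "12 cm", "14 cm", "Uncertain"],
    cards,
)
print(prompt)
```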

Figure 5: The cases of knowledge concept cards.

3 Experiment

Evaluation Protocols.

To speed up evaluation, We-Math includes a testmini set of 1740 samples, comprising 1215 one-step samples, 360 two-step samples, and 165 three-step samples. In subsequent experiments, we use the We-Math testmini subset for evaluation. For automated evaluation, we standardize all samples into a multiple-choice format. We use regular expressions to extract the LMMs’ predictions and then calculate their accuracy against the ground-truth answers for the main results. For the analyses in Sections 3.2 and 3.3, we use the four-dimensional metric described in Section 2.2. To prevent LMMs from deducing answers from the options, we add an extra "uncertain" option to each question.
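The paper states that regular expressions are used to match predictions but does not give the expression. The Python sketch below shows one possible approach under that caveat: the pattern, the assumed option letters A to E, and the function names are all our own assumptions, and unmatched responses are simply counted as incorrect.

```python
import re

# Hypothetical pattern: "answer", optionally followed by "is" or ":", then an
# option letter. Only the keywords are case-insensitive, so that an article
# such as "a" in "the answer is a square" is not mistaken for option A.
ANSWER_RE = re.compile(
    r"(?i:answer)\s*(?:(?i:is)\s*|:\s*)?\(?([A-E])\)?(?![A-Za-z])"
)

def extract_choice(response):
    """Return the last option letter announced in the response, or None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None

def accuracy(responses, gold):
    """Fraction of responses whose extracted letter equals the gold letter."""
    hits = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return hits / len(gold)

print(extract_choice("The perimeter is 14 cm, so the answer is (C)."))
```

Taking the last match rather than the first is a design choice that favors the model’s final stated answer when it revises itself mid-response.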

Evaluation Models.

We examine the performance of foundation models in two distinct categories on We-Math: (a) Closed-source LMMs: GPT-4o [38], GPT-4V [26], Gemini 1.5 Pro [40], Qwen-VL-Max [13]; (b) Open-source LMMs: LLaVA-NeXT-110B, LLaVA-NeXT-72B [39], LLaVA-1.6-13B, LLaVA-1.6-7B [41], DeepSeek-VL-1.3B, DeepSeek-VL-7B [42], Phi3-Vision-4.2B [43], MiniCPM-Llama3-V 2.5 [44], InternLM-XComposer2-VL-7B [45], InternVL-Chat-V1.5 [46], GLM-4V-9B [47], LongVA [48], G-LLaVA-13B [29].

3.1 Main Result

Table 1: Accuracy scores of LMMs on the testmini subset of We-Math. The first 3 columns report the overall performance on one-step, two-step, three-step problems, while the other columns display the result on one-step problems in different problem categories. The highest accuracy for closed-source and open-source LMMs is marked in blue and green respectively. (S1: one-step problem, S2: two-step problem, S3: three-step problem, Mem: Measurement, PF: Plane Figures, SF: Solid Figures, TMF: Transformations and Motion of Figures, PD: Position and Direction. AL: Angles and Length, UCU: Understanding and Conversion of Units, CPF: Calculation of Plane Figures, UPF: Understanding of Plane Figures, CSF: Calculation of Solid Figures, USF: Understanding of Solid Figures, BTF: Basic Transformations of Figures, CCF: Cutting and Combining of Figures, Dir: Direction, Pos: Position, RoM: Route Map, CCP: Correspondence of Coordinates and Positions).
| Model | S1 | S2 | S3 | Mem: UCU | Mem: AL | PF: CPF | PF: UPF | SF: CSF | SF: USF | TMF: BTF | TMF: CCF | PD: Dir | PD: Pos | PD: RoM | PD: CCP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Closed-source | | | | | | | | | | | | | | | |
| GPT-4o | 72.84 | 58.06 | 43.64 | 86.61 | 39.12 | 77.35 | 71.56 | 84.50 | 62.27 | 58.74 | 69.37 | 93.10 | 72.67 | 47.53 | 73.33 |
| GPT-4V | 65.51 | 49.17 | 38.18 | 82.54 | 38.42 | 70.67 | 60.22 | 76.58 | 56.32 | 57.76 | 67.67 | 79.29 | 57.48 | 47.80 | 63.33 |
| Gemini 1.5 Pro | 56.13 | 51.39 | 33.94 | 50.99 | 31.23 | 61.75 | 45.03 | 69.95 | 57.54 | 39.24 | 62.65 | 68.81 | 54.13 | 40.66 | 60.00 |
| Qwen-VL-Max | 40.82 | 30.28 | 20.61 | 19.35 | 25.26 | 39.82 | 41.44 | 43.64 | 48.02 | 43.82 | 43.39 | 41.43 | 35.09 | 40.66 | 26.67 |
| Open-source | | | | | | | | | | | | | | | |
| LLaVA-NeXT-110B | 53.74 | 36.94 | 31.52 | 39.48 | 57.72 | 59.48 | 53.06 | 52.25 | 50.22 | 54.09 | 50.76 | 54.76 | 55.86 | 40.11 | 40.00 |
| LLaVA-NeXT-72B | 42.88 | 35.56 | 30.91 | 31.65 | 25.26 | 43.25 | 42.39 | 46.14 | 41.76 | 44.22 | 51.02 | 44.29 | 38.93 | 32.97 | 36.67 |
| InternVL-Chat-V1.5 | 49.38 | 30.56 | 28.48 | 43.95 | 29.82 | 52.23 | 52.06 | 44.19 | 48.15 | 47.05 | 46.82 | 65.71 | 50.47 | 36.54 | 36.67 |
| LLaVA-1.6-13B | 29.38 | 25.28 | 32.73 | 21.73 | 23.16 | 23.37 | 34.72 | 25.26 | 26.36 | 37.52 | 41.65 | 26.90 | 28.87 | 37.09 | 30.00 |
| G-LLaVA-13B | 32.43 | 30.56 | 32.73 | 33.33 | 29.12 | 32.04 | 37.88 | 19.57 | 33.51 | 37.12 | 32.79 | 31.19 | 33.21 | 25.55 | 40.00 |
| GLM-4V-9B | 47.33 | 37.22 | 38.18 | 53.37 | 37.02 | 51.32 | 46.52 | 50.60 | 38.22 | 44.09 | 45.22 | 40.95 | 49.27 | 36.81 | 53.33 |
| MiniCPM-LLaMA3-V 2.5 | 39.75 | 31.11 | 29.70 | 28.57 | 37.02 | 40.81 | 39.82 | 40.97 | 38.61 | 31.96 | 42.66 | 40.95 | 42.70 | 43.96 | 43.33 |
| LongVA-7B | 43.54 | 30.56 | 28.48 | 24.50 | 39.82 | 45.09 | 40.75 | 51.85 | 42.49 | 45.60 | 44.56 | 44.52 | 40.74 | 47.53 | 20.00 |
| LLaVA-1.6-7B | 22.96 | 20.83 | 15.76 | 18.45 | 20.53 | 16.92 | 29.63 | 15.57 | 18.60 | 42.67 | 24.05 | 17.62 | 43.31 | 28.85 | 26.67 |
| DeepSeek-VL-7B | 32.59 | 26.67 | 25.45 | 16.57 | 35.09 | 27.27 | 38.01 | 24.18 | 38.65 | 50.02 | 30.09 | 24.52 | 41.01 | 51.65 | 23.33 |
| InternLM-XComposer2-VL-7B | 47.00 | 33.06 | 33.33 | 31.25 | 46.49 | 47.70 | 42.57 | 51.44 | 43.87 | 41.13 | 50.58 | 65.48 | 53.87 | 55.22 | 40.00 |
| Phi3-Vision-4.2B | 42.14 | 34.17 | 27.88 | 28.67 | 15.96 | 47.23 | 38.83 | 49.99 | 44.41 | 28.76 | 31.22 | 48.57 | 49.19 | 26.37 | 50.00 |
| DeepSeek-VL-1.3B | 31.44 | 27.78 | 23.03 | 27.78 | 23.86 | 22.76 | 36.92 | 30.36 | 34.18 | 44.46 | 28.29 | 48.10 | 41.77 | 37.09 | 33.33 |
Figure 6: The visualization of different LMMs’ performances on each category.

Table 1 shows the overall performance of different LMMs on One-Step / Two-Step / Three-Step problems and different problem domains. We have the following observations:

The Number of Knowledge Concepts is negatively correlated with LMMs’ Performance. Across problems of varying complexity (one-step vs. two-step vs. three-step), GPT-4o consistently achieves the best results in all settings. Other closed-source models, such as GPT-4V and Gemini 1.5 Pro, also demonstrate competitive performance. However, most LMMs perform significantly worse on multi-step problems than on one-step problems. For instance, GPT-4o’s accuracy drops from 72.84% on one-step problems to 43.64% on three-step problems. In relative terms, this trend is even more pronounced in stronger open-source models like LLaVA-NeXT-110B and InternVL-Chat-V1.5. These observations suggest that the number of knowledge concepts in a question is positively correlated with its difficulty and negatively correlated with LMMs’ performance, which supports, to a certain extent, the rationale for decomposing questions.
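The size of this drop can be rechecked directly from Table 1. The sketch below recomputes the absolute and relative decline from one-step (S1) to three-step (S3) accuracy for the three models named above; the accuracies are copied from Table 1, while the helper name `accuracy_drop` and the dictionary layout are illustrative assumptions rather than part of the released evaluation code.

```python
# S1 and S3 accuracies (%) copied from Table 1.
table1 = {
    "GPT-4o": (72.84, 43.64),
    "LLaVA-NeXT-110B": (53.74, 31.52),
    "InternVL-Chat-V1.5": (49.38, 28.48),
}


def accuracy_drop(s1, s3):
    """Return (absolute drop in points, relative drop as a % of S1)."""
    absolute = s1 - s3
    relative = 100.0 * absolute / s1
    return round(absolute, 2), round(relative, 2)


# Hypothetical helper output: one (absolute, relative) pair per model.
drops = {model: accuracy_drop(s1, s3) for model, (s1, s3) in table1.items()}

for model, (absolute, relative) in drops.items():
    print(f"{model}: -{absolute:.2f} points ({relative:.1f}% of S1)")
```

Under this reading, GPT-4o loses the most points in absolute terms, while the two open-source models lose a slightly larger fraction of their one-step accuracy.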

Larger Parameter Scales in LLMs generally achieve Better Generalization Abilities. To explore the role the LLM plays in LMMs, we conduct pairwise comparisons of LMMs with the same LLM backbone (e.g., LLaVA-NeXT-110B vs. LLaVA-NeXT-72B; DeepSeek-VL-7B vs. DeepSeek-VL-1.3B). Focusing on the strict metric, we observe that models with larger LLMs generally perform better, which suggests that the parameter scale of the text decoder is a key factor in achieving generalization ability in visual mathematical reasoning.
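As a small illustration of these pairwise comparisons, the sketch below takes the strict-metric average scores of the two example pairs from Table 2 and computes the gap between the larger and the smaller model in each pair. The variable names are illustrative; only the scores themselves come from the paper.

```python
# Strict-metric average scores (%) copied from Table 2.
strict_avg = {
    "LLaVA-NeXT-110B": 19.24,
    "LLaVA-NeXT-72B": 13.43,
    "DeepSeek-VL-7B": 6.29,
    "DeepSeek-VL-1.3B": 5.90,
}

# (larger model, smaller model) pairs named in the text.
pairs = [
    ("LLaVA-NeXT-110B", "LLaVA-NeXT-72B"),
    ("DeepSeek-VL-7B", "DeepSeek-VL-1.3B"),
]

# Positive gap means the larger model scores higher.
gaps = {
    (large, small): round(strict_avg[large] - strict_avg[small], 2)
    for large, small in pairs
}

for (large, small), gap in gaps.items():
    print(f"{large} vs. {small}: {gap:+.2f} points")
```

Both gaps are positive, although the DeepSeek-VL pair differs by well under one point, so the trend is better read as "generally" than "always decisively".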

LMMs excel in Calculation but struggle with Fine-grained Visual Measurement. Across the different math categories, GPT-4o maintains impressive results in most subfields. In contrast, as shown in Figure 6, other LMMs generally struggle with "Angle Measurement" and "Unit Conversion". After analyzing these cases, we find that the main challenge for LMMs lies in their inability to perform precise visual measurement of angles and units. Furthermore, most LMMs are more proficient in calculation (e.g., Calculation of Solid Figures, Calculation of Plane Figures) than in conceptual understanding (e.g., Understanding of Solid Figures, Understanding of Plane Figures), which indicates that current LMMs excel at directly applying formulas to given conditions but are still limited in understanding and comprehensively applying knowledge.

Table 2: The performance of different LMMs on the four-dimensional metrics for reasoning evaluation. The best and the two worst performances are marked in blue and red, respectively (Avg: $\text{Score}_{\text{average}}$).
Model | Strict Avg ($\uparrow$) | Strict IK ($\downarrow$) | Strict IG ($\downarrow$) | Strict CM ($\uparrow$) | Strict RM ($\downarrow$) | Loose Avg ($\uparrow$) | Loose IK ($\downarrow$) | Loose IG ($\downarrow$) | Loose CM ($\uparrow$) | Loose RM ($\downarrow$)
Closed-source
GPT-4o | 42.86% | 31.24% (164) | 15.24% (80) | 35.24% (185) | 34.16% (96) | 60.57% | 31.24% (164) | 15.24% (80) | 52.95% (278) | 1.07% (3)
GPT-4V | 31.05% | 39.81% (209) | 14.48% (76) | 23.81% (125) | 47.92% (115) | 51.43% | 39.81% (209) | 14.48% (76) | 44.19% (232) | 3.33% (8)
Gemini 1.5 Pro | 26.38% | 42.86% (225) | 11.24% (59) | 20.76% (109) | 54.77% (132) | 46.00% | 42.86% (225) | 11.24% (59) | 40.38% (212) | 12.03% (29)
Qwen-VL-Max | 10.48% | 65.14% (342) | 7.62% (40) | 6.67% (35) | 75.52% (108) | 25.52% | 65.14% (342) | 7.62% (40) | 21.71% (114) | 20.28% (29)
Open-source
LLaVA-NeXT-110B | 19.24% | 50.29% (264) | 14.48% (76) | 12.00% (63) | 65.95% (122) | 37.90% | 50.29% (264) | 14.48% (76) | 30.67% (161) | 12.97% (24)
LLaVA-NeXT-72B | 13.43% | 58.86% (309) | 7.05% (37) | 9.90% (52) | 70.95% (127) | 31.52% | 58.86% (309) | 7.05% (37) | 28.00% (147) | 17.88% (32)
InternVL-Chat-V1.5 | 14.95% | 56.19% (295) | 13.90% (73) | 8.00% (42) | 73.25% (115) | 32.67% | 56.19% (295) | 13.90% (73) | 25.71% (135) | 14.01% (22)
LLaVA-1.6-13B | 5.24% | 69.14% (363) | 3.24% (17) | 3.62% (19) | 86.90% (126) | 22.00% | 69.14% (363) | 3.24% (17) | 20.38% (107) | 26.21% (38)
G-LLaVA-13B | 6.48% | 64.19% (337) | 4.57% (24) | 4.19% (22) | 86.59% (142) | 22.29% | 64.19% (337) | 4.57% (24) | 20.00% (105) | 35.98% (59)
GLM-4V-9B | 14.86% | 52.95% (278) | 9.52% (50) | 10.10% (53) | 73.10% (144) | 35.05% | 52.95% (278) | 9.52% (50) | 30.29% (159) | 19.29% (38)
MiniCPM-LLaMA3-V 2.5 | 9.52% | 60.19% (316) | 9.14% (48) | 4.95% (26) | 83.85% (135) | 28.00% | 60.19% (316) | 9.14% (48) | 23.43% (123) | 23.60% (38)
LongVA-7B | 11.52% | 61.14% (321) | 8.95% (47) | 7.05% (37) | 76.43% (120) | 27.71% | 61.14% (321) | 8.95% (47) | 23.24% (122) | 22.29% (35)
LLaVA-1.6-7B | 3.33% | 78.29% (411) | 2.48% (13) | 2.10% (11) | 89.11% (90) | 13.81% | 78.29% (411) | 2.48% (13) | 12.57% (66) | 34.65% (35)
DeepSeek-VL-7B | 6.29% | 69.14% (363) | 4.57% (24) | 4.00% (21) | 84.78% (117) | 20.95% | 69.14% (363) | 4.57% (24) | 18.67% (98) | 28.99% (40)
InternLM-XComposer2-VL-7B | 12.67% | 56.38% (296) | 10.48% (55) | 7.43% (39) | 77.59% (135) | 30.95% | 56.38% (296) | 10.48% (55) | 25.71% (135) | 22.41% (39)
Phi3-Vision-4.2B | 10.57% | 58.86% (309) | 8.95% (47) | 6.10% (32) | 81.07% (137) | 29.81% | 58.86% (309) | 8.95% (47) | 25.33% (133) | 21.30% (36)
DeepSeek-VL-1.3B | 5.90% | 71.05% (373) | 2.67% (14) | 4.57% (24) | 82.61% (114) | 21.52% | 71.05% (373) | 2.67% (14) | 20.19% (106) | 23.19% (32)
Figure 7: The leaderboard of different LMMs under the strict and loose metrics (average score, %). "$\sim$" denotes an approximate estimate of the total number of parameters in an LMM.

LMMs exhibit Strong Potential for Parameter Compression. Among the evaluated LMMs, LLaVA-NeXT-110B demonstrates performance closest to GPT-4. Surprisingly, despite their smaller parameter scales, Phi3-Vision-4.2B and MiniCPM-LLaMA3-V 2.5 also show competitive performance compared to LLaVA-NeXT-72B. Moreover, the recent GLM-4V-9B and InternVL-Chat-V1.5 allocate a larger proportion of their parameters to the visual encoder (as shown in Table 8) and demonstrate notable capabilities. This underscores the importance of optimizing visual representations and suggests that LMMs still have significant potential for parameter compression.

3.2 Knowledge-based Reasoning Analysis

Table 2 and Figures 7, 8, and 9 illustrate the results of the knowledge-based reasoning evaluation, covering four distinct conditions (IK, IG, CM, RM). We have the following observations:

IK is the Greatest Vulnerability of LMMs. All LMMs consistently exhibit the Insufficient Knowledge issue during the reasoning process, especially models with smaller parameter scales (LLaVA-1.6-7B, DeepSeek-VL-1.3B). As discussed in Section 2.2, addressing IK is crucial for progressing towards Inadequate Generalization (IG) and Complete Mastery (CM). This knowledge gap in solving one-step problems hinders further progress in reasoning about more complex composite mathematical problems. This finding also supports the rationale behind our proposed KCA strategy.
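To make the IK column of Table 2 concrete, the sketch below recomputes the strict-metric IK rates from the raw counts shown in parentheses. It assumes, based on the table values alone, that each percentage is the count divided by the sum of the four counts in a row (525 for every row checked here); the function and variable names are illustrative.

```python
# Strict-metric counts (IK, IG, CM, RM) copied from Table 2.
strict_counts = {
    "GPT-4o": (164, 80, 185, 96),
    "LLaVA-1.6-7B": (411, 13, 11, 90),
    "DeepSeek-VL-1.3B": (373, 14, 24, 114),
}


def ik_rate(counts):
    """IK count as a percentage of all evaluated composite problems.

    Assumption: the four categories partition the evaluated problems,
    so their counts sum to the normalising total.
    """
    total = sum(counts)
    return round(100.0 * counts[0] / total, 2)


ik = {model: ik_rate(counts) for model, counts in strict_counts.items()}

for model, rate in ik.items():
    print(f"{model}: IK = {rate:.2f}% of {sum(strict_counts[model])} problems")
```

The recomputed rates match the percentages reported in Table 2 for these three rows, which supports the assumed normalisation.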

Figure 8: The performance of different LMMs on four-dimensional metrics under strict metric.
Figure 9: The performance of different LMMs on four-dimensional metrics under loose metric.
Figure 10: Quantitative Analysis on KCA. The left two figures show the impact of KCA on the average performance of LMMs under strict and loose conditions. The right two figures compare the results between IK and IG.

GPT-4o’s Main Challenge has gradually shifted from IK to IG, highlighting it as the First LMM towards the Knowledge Generalization Stage. Focusing on IK and IG, GPT-4o exhibits a substantial lead in addressing the IK issue but the weakest performance on IG. Further analyzing the logical relationship between IK, IG, and CM (IK $\rightarrow$ IG $\rightarrow$ CM), we are pleasantly surprised to find that GPT-4o’s IK rate is markedly lower than that of the open-source LLaVA-NeXT-110B (by 19.05 percentage points), suggesting that it has successfully converted a considerable share of IK cases into IG cases. This indicates that GPT-4o’s reasoning challenges have shifted from Insufficient Knowledge in one-step problems to the knowledge generalization stage, leading us to speculate that there may have been groundbreaking changes in GPT-4o’s training strategy. In contrast, other LMMs remain stuck at the IK phase. We argue that it is pointless to compare IG without a solid grasp of IK, which highlights the significance of our hierarchical metrics (IK < IG < CM).
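The figures quoted above can be rechecked from Table 2. The sketch below recomputes the 19.05-point IK gap and shows that the reported Avg values are numerically consistent with giving IG half credit and CM full credit (Avg = 0.5 · IG + CM). That weighting is inferred here from the table values; the formal definition of the average score is the one given with the metric earlier in the paper, and the function name and `ig_weight` parameter are illustrative assumptions.

```python
# Strict-metric IK, IG, CM rates (%) and reported Avg copied from Table 2.
rows = {
    "GPT-4o": {"IK": 31.24, "IG": 15.24, "CM": 35.24, "Avg": 42.86},
    "LLaVA-NeXT-110B": {"IK": 50.29, "IG": 14.48, "CM": 12.00, "Avg": 19.24},
}


def score_average(ig, cm, ig_weight=0.5):
    """Average score with partial credit for IG (weight inferred from Table 2)."""
    return round(ig_weight * ig + cm, 2)


# Gap in IK rate between the open-source model and GPT-4o.
ik_gap = round(rows["LLaVA-NeXT-110B"]["IK"] - rows["GPT-4o"]["IK"], 2)

# Recomputed averages, to compare against the reported Avg column.
recomputed = {m: score_average(r["IG"], r["CM"]) for m, r in rows.items()}

print(f"IK gap: {ik_gap:.2f} points")
for model, value in recomputed.items():
    print(f"{model}: recomputed {value:.2f}, reported {rows[model]['Avg']:.2f}")
```

Because IK receives no credit under this reading, a model only raises its average by moving problems out of IK and into IG or CM, which mirrors the IK $\rightarrow$ IG $\rightarrow$ CM ordering discussed above.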

The Unreasonable RM issue remains widespread across Most LMMs. GPT-4o achieves a significant lead on the RM issue, particularly on the loose metric ($S_{RM} < 2\%$). However, other LMMs still exhibit nearly 25% $S_{RM}$ on the loose metric. When focusing on the changes in $S_{RM}$ between the strict and loose metrics, several models (LLaVA-NeXT-110B, GLM-4V-9B, DeepSeek-VL-1.3B, MiniCPM-LLaMA3-V 2.5) show significant variations. This is an encouraging sign, indicating that these models possess a certain ability to solve one-step problems, but that their performance fluctuates due to external factors such as prompting templates and hyper-parameters.
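The RM column can be rechecked in the same way. The values in Table 2 are numerically consistent with $S_{RM}$ = RM / (CM + RM), that is, the share of correctly solved composite problems whose sub-problems were not all solved. This relation is inferred here from the table and should be read against the formal definition given earlier in the paper; the names below are illustrative.

```python
# (CM count, RM count) under the strict and loose metrics, copied from Table 2.
counts = {
    "GPT-4o": {"strict": (185, 96), "loose": (278, 3)},
    "LLaVA-NeXT-110B": {"strict": (63, 122), "loose": (161, 24)},
    "GLM-4V-9B": {"strict": (53, 144), "loose": (159, 38)},
}


def rm_rate(cm, rm):
    """RM as a percentage of composite problems answered correctly (CM + RM)."""
    return round(100.0 * rm / (cm + rm), 2)


rates = {
    model: {metric: rm_rate(*pair) for metric, pair in by_metric.items()}
    for model, by_metric in counts.items()
}

for model, r in rates.items():
    shift = round(r["strict"] - r["loose"], 2)
    print(f"{model}: strict {r['strict']:.2f}%, loose {r['loose']:.2f}%, shift {shift:.2f}")
```

For the rows shown, CM + RM is identical under both metrics, so relaxing the metric only moves problems from RM to CM; this is why a large strict-to-loose shift signals partial, rather than absent, one-step ability.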

Figure 11: Error analysis of GPT-4o. The definitions of the four error types are listed in the Appendix.

3.3 Quantitative Analysis

The Effectiveness of KCA. Figure 10 presents the quantitative analysis of LMMs with knowledge concept augmentation (KCA). We find that LMMs of different parameter scales show consistent performance improvements on both the strict and loose metrics after introducing the KCA strategy. Moreover, the KCA strategy significantly alleviates IK issues but does not noticeably improve IG. This aligns with human intuition, as the knowledge descriptions primarily address gaps in reasoning knowledge. Alleviating IG issues, in contrast, requires a comprehensive enhancement of the LMMs’ knowledge generalization abilities, which we consider a direction for future exploration.
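One way to picture knowledge concept augmentation is as a prompt that places short knowledge descriptions in front of the question. The sketch below is purely illustrative: the function name `build_kca_prompt`, the prompt wording, and the example description are assumptions made for this example and are not taken from the We-Math data or evaluation code.

```python
def build_kca_prompt(question, concept_descriptions):
    """Prepend knowledge-concept descriptions to a question (hypothetical format).

    concept_descriptions maps a concept name to a short textual description.
    With an empty mapping the prompt reduces to the plain question.
    """
    lines = []
    if concept_descriptions:
        lines.append("Relevant knowledge:")
        for name, description in concept_descriptions.items():
            lines.append(f"- {name}: {description}")
        lines.append("")  # blank line between the hints and the question
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Made-up example inputs, not drawn from the benchmark.
example_prompt = build_kca_prompt(
    "What is the measure of the angle shown in the figure?",
    {"Understanding Angles": "An angle is formed by two rays that share an endpoint."},
)
print(example_prompt)
```

Such hints supply missing knowledge, which fits the observation that KCA helps with IK; they do not by themselves teach a model to combine concepts, which is what IG measures.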

Error Analysis. Figure 11 shows the occurrence of the four types of errors across the 67 knowledge concepts. Knowledge errors are the most frequent, appearing in over 45 knowledge concepts. Notably, although visual errors are the second most common, they are concentrated in specific concepts (e.g., more than 10 for "Understanding Angles"), and over 38 concepts have no visual errors at all. This finding underscores the urgent need to enhance the fine-grained measurement capabilities of visual encoders in LMMs for mathematical reasoning, rather than blindly improving their overall capabilities.

4 Related Work

Mathematical Reasoning Benchmarks. Assessing mathematical reasoning abilities is crucial for the development of large foundational models (LLMs and LMMs). Early efforts, such as MathQA [49], focus on solving mathematical word problems and highlight the importance of operation-based reasoning. Following this, datasets like GSM8K [50] and MATH [51] set the stage for evaluating text-based mathematical problems at various difficulty levels. Other benchmarks, such as MMLU [52] and MT-Bench [53], also consider mathematical evaluation as a key part of assessing LLMs. Beyond text-only evaluations, datasets like GeoQA [32], UniGeo [33], and Geometry3K [30] have pioneered the evaluation of geometric problems. Recently, several benchmarks [34] [54] have expanded their scope to cover a broader range of subjects. Additionally, MathVerse [35] aims to evaluate reasoning paths based on reference answers. However, challenges remain due to the complex nature of mathematical reasoning. In this paper, we introduce We-Math, a comprehensive benchmark designed to evaluate the reasoning abilities of LMMs across a wide range of mathematical categories.

Benchmarks for Large Multimodal Models. The rapid advancement of Large Language Models (LLMs) and Large Multimodal Models (LMMs) has highlighted the need for more comprehensive evaluation benchmarks. Initially, a series of text-only benchmarks and evaluations gave us a clearer understanding of the strengths and weaknesses of LLMs [55, 52, 53, 56, 57, 58, 59, 60, 61, 50, 51, 62]. On the visual side, early benchmarks predominantly focused on narrow tasks such as Visual Question Answering (VQA) [63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75] and image captioning [76, 77, 78], showcasing significant progress but not fully addressing the broader spectrum of multimodal perception and reasoning. This gap has driven recent research to assess LMMs from multiple angles. Notable efforts include MMBench [79] and SEED-bench [80, 81], which probe models’ abilities through common-sense queries in multiple-choice formats. For domain-specific expertise, MMMU [54] utilizes academic content to gauge deeper knowledge levels. Yet benchmarks such as MMStar [82] reveal that certain evaluations allow models to respond without images, risking data leakage and failing to adequately measure logic and reasoning skills. The challenge of understanding image implications, which requires multi-hop reasoning and theory of mind (ToM) [83, 84, 85, 86, 87], underscores this shortfall. In parallel, work at the intersection of LLMs and LMMs has surged, extending LMM evaluation across diverse modalities including 2D images [88, 89, 90], 3D point clouds [91, 92, 93], audio [94, 95, 96, 97], and video [98, 99, 100]. Moreover, a series of works have positioned LMMs as agents equipped with various tools, such as APIs [101, 102, 103] and retrievers [104, 105], thereby broadening the development avenues for the model evaluation community [106, 107, 108, 109].

5 Conclusion

In this paper, we propose We-Math, a comprehensive benchmark for in-depth analysis of LMMs in visual mathematical reasoning. We-Math encompasses 6.5K visual math problems, covering 5 layers and 67 knowledge concepts. Moreover, we decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric for fine-grained reasoning evaluation. With We-Math, we thoroughly evaluate existing LMMs in visual mathematical reasoning and reveal a negative correlation between the number of solving steps and problem-specific performance. Furthermore, we identify IK issues as the greatest vulnerability of LMMs. However, GPT-4o’s main challenge has shifted from IK to IG, marking it as the first LMM advancing towards the next stage. Lastly, our analyses of the KCA strategy and error cases offer heuristic guidance for moving existing LMMs towards human-like visual mathematical reasoning.

References

  • [1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • [2] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [3] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [4] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022.
  • [5] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [6] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • [7] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  • [8] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
  • [9] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
  • [10] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
  • [11] Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
  • [12] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
  • [13] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.
  • [14] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023.
  • [15] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
  • [16] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  • [17] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
  • [18] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. CoRR, 2022.
  • [19] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: program-aided language models. In Proceedings of the 40th International Conference on Machine Learning, pages 10764–10799, 2023.
  • [20] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving, 2024.
  • [21] Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning. CoRR, 2023.
  • [22] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct, 2023.
  • [23] Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models, 2023.
  • [24] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models, 2024.
  • [25] Chengpeng Li, Zheng Yuan, Guanting Dong, Keming Lu, Jiancan Wu, Chuanqi Tan, Xiang Wang, and Chang Zhou. Query and response augmentation cannot help out-of-domain math reasoning generalization. arXiv preprint arXiv:2310.05506, 2023.
  • [26] OpenAI. Gpt-4v(ision) system card, 2023.
  • [27] Gemini Team. Gemini: A family of highly capable multimodal models, 2024.
  • [28] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421, 9(1):1, 2023.
  • [29] Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-llava: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370, 2023.
  • [30] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165, 2021.
  • [31] Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. Solving geometry problems: Combining text and diagram interpretation. In Proceedings of the 2015 conference on empirical methods in natural language processing, pages 1466–1476, 2015.
  • [32] Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning. arXiv preprint arXiv:2105.14517, 2021.
  • [33] Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression. arXiv preprint arXiv:2212.02746, 2022.
  • [34] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
  • [35] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024.
  • [36] Richard Fitzpatrick. Euclid’s elements of geometry, 2008.
  • [37] Wikipedia contributors. Wikipedia, 2004.
  • [38] OpenAI. Hello gpt-4o, 2024.
  • [39] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
  • [40] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • [41] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.
  • [42] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. Deepseek-vl: Towards real-world vision-language understanding, 2024.
  • [43] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
  • [44] Jinyi Hu, Yuan Yao, Chongyi Wang, Shan Wang, Yinxu Pan, Qianyu Chen, Tianyu Yu, Hanghao Wu, Yue Zhao, Haoye Zhang, et al. Large multilingual models pivot zero-shot multimodal learning across languages. arXiv preprint arXiv:2308.12038, 2023.
  • [45] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024.
  • [46] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
  • [47] Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, and Zihan Wang. Chatglm: A family of large language models from glm-130b to glm-4 all tools, 2024.
  • [48] Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision, 2024.
  • [49] Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
  • [50] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  • [51] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
  • [52] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021.
  • [53] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
  • [54] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023.
  • [55] Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models, 2023.
  • [56] Guanting Dong, Jinxu Zhao, Tingfeng Hui, Daichi Guo, Wenlong Wan, Boqi Feng, Yueyan Qiu, Zhuoma Gongque, Keqing He, Zechen Wang, and Weiran Xu. Revisit input perturbation problems for llms: A unified robustness evaluation framework for noisy slot filling task, 2023.
  • [57] Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, Weihao Zeng, Yejie Wang, Zhuoma GongQue, Jianing Yu, Qiuna Tan, and Weiran Xu. Cs-bench: A comprehensive benchmark for large language models towards computer science mastery, 2024.
  • [58] Xiaoshuai Song, Keqing He, Pei Wang, Guanting Dong, Yutao Mou, Jingang Wang, Yunsen Xian, Xunliang Cai, and Weiran Xu. Large language models meet open-world intent discovery and recognition: An evaluation of chatgpt, 2023.
  • [59] Guanting Dong, Yutao Zhu, Chenghao Zhang, Zechen Wang, Zhicheng Dou, and Ji-Rong Wen. Understand what llm needs: Dual preference alignment for retrieval-augmented generation, 2024.
  • [60] Mingfeng Xue, Dayiheng Liu, Kexin Yang, Guanting Dong, Wenqiang Lei, Zheng Yuan, Chang Zhou, and Jingren Zhou. Occuquest: Mitigating occupational bias for inclusive large language models, 2023.
  • [61] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?, 2019.
  • [62] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks, 2020.
  • [63] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
  • [64] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  • [65] Kushal Kafle and Christopher Kanan. An analysis of visual question answering algorithms. In Proceedings of the IEEE international conference on computer vision, pages 1965–1973, 2017.
  • [66] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
  • [67] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
  • [68] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021.
  • [69] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019.
  • [70] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4291–4301, 2019.
  • [71] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
  • [72] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE, 2019.
  • [73] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2024.
  • [74] Jeffrey P Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, Samual White, et al. Vizwiz: nearly real-time answers to visual questions. In Proceedings of the 23nd annual ACM symposium on User interface software and technology, pages 333–342, 2010.
  • [75] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022.
  • [76] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • [77] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8948–8957, 2019.
  • [78] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015.
  • [79] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
  • [80] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
  • [81] Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench: Benchmarking multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13299–13308, 2024.
  • [82] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024.
  • [83] Poorav Desai, Tanmoy Chakraborty, and Md Shad Akhtar. Nice perfume. how long did you marinate in it? multimodal sarcasm explanation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10563–10571, 2022.
  • [84] Jack Hessel, Ana Marasović, Jena D Hwang, Lillian Lee, Jeff Da, Rowan Zellers, Robert Mankoff, and Yejin Choi. Do androids laugh at electric sheep? humor "understanding" benchmarks from the new yorker caption contest. arXiv preprint arXiv:2209.06293, 2022.
  • [85] Winnie Street, John Oliver Siy, Geoff Keeling, Adrien Baranes, Benjamin Barnett, Michael McKibben, Tatenda Kanyere, Alison Lentz, Robin IM Dunbar, et al. Llms achieve adult human performance on higher-order theory of mind tasks. arXiv preprint arXiv:2405.18870, 2024.
  • [86] Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models, 2024.
  • [87] Ziqiang Liu, Feiteng Fang, Xi Feng, Xinrun Du, Chenhao Zhang, Zekun Wang, Yuelin Bai, Qixuan Zhao, Liyang Fan, Chengguang Gan, et al. Ii-bench: An image implication understanding benchmark for multimodal large language models. arXiv preprint arXiv:2406.05862, 2024.
  • [88] Chenshuang Zhang, Fei Pan, Junmo Kim, In So Kweon, and Chengzhi Mao. Imagenet-d: Benchmarking neural network robustness on diffusion synthetic object, 2024.
  • [89] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. 2009 IEEE conference on computer vision and pattern recognition, pages 248–255, 2009.
  • [90] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • [91] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
  • [92] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. Shapenet: An information-rich 3d model repository, 2015.
  • [93] Andreas Geiger, Philipp Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR), 32(11):1231–1237, 2013.
  • [94] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.
  • [95] Christophe Veaux, Junichi Yamagishi, and Kirsten MacDonald. The voice bank corpus: Design, collection and data analysis of a large regional accent speech database. In 12th Annual Conference of the International Speech Communication Association (Interspeech), pages 121–125, 2017.
  • [96] Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, and Jingren Zhou. Air-bench: Benchmarking large audio-language models via generative comprehension, 2024.
  • [97] Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy F. Chen. Audiobench: A universal benchmark for audio large language models, 2024.
  • [98] Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models, 2023.
  • [99] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. In arXiv preprint arXiv:1609.08675, 2016.
  • [100] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models, 2023.
  • [101] Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, and Guanbin Li. Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116, 2024.
  • [102] Chenyu Wang, Weixin Luo, Qianyu Chen, Haonan Mai, Jindi Guo, Sixun Dong, Xiaohua, Xuan, Zhengxin Li, Lin Ma, and Shenghua Gao. Mllm-tool: A multimodal large language model for tool agent learning, 2024.
  • [103] Xiao Liu, Jianfeng Lin, and Jiawei Zhang. Beyond text: Unveiling multimodal proficiency of large language models with multiapi benchmark, 2023.
  • [104] Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, Kaiyan Zhang, Bowen Zhou, and Jie Zhou. Generative multi-modal knowledge retrieval with large language models, 2024.
  • [105] Ruochen Zhao, Hailin Chen, Weishi Wang, Fangkai Jiao, Xuan Long Do, Chengwei Qin, Bosheng Ding, Xiaobao Guo, Minzhi Li, Xingxuan Li, and Shafiq Joty. Retrieving multimodal information for augmented generation: A survey. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4736–4756, Singapore, December 2023. Association for Computational Linguistics.
  • [106] Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Chen, Peter Wu, Michelle A. Lee, Yuke Zhu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Multibench: Multiscale benchmarks for multimodal representation learning, 2021.
  • [107] Wentao Ge, Shunian Chen, Guiming Hardy Chen, Zhihong Chen, Junying Chen, Shuo Yan, Chenghao Zhu, Ziyue Lin, Wenya Xie, Xinyi Zhang, Yichen Chai, Xiaoyu Liu, Dingjie Song, Xidong Wang, Anningzhe Gao, Zhiyi Zhang, Jianquan Li, Xiang Wan, and Benyou Wang. Mllm-bench: Evaluating multimodal llms with per-sample criteria, 2024.
  • [108] Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, and Wenqi Shao. Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi, 2024.
  • [109] Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. Uniir: Training and benchmarking universal multimodal information retrievers, 2023.

Appendix

Appendix A Broader Impact

Bridging Human-Like Inspiration and Reliability. As previously mentioned, works such as neural networks [2] and attention mechanisms [3] draw their design inspiration from human thinking patterns. This is fundamentally because the purpose of designing AI is to assist humans. Currently, LMMs have already been helping people in various scenarios, which was unimaginable in the past. Therefore, we firmly believe that a new era is coming, where people will focus not only on the performance of models in specific fields but also on the reliability of a model. In some fundamental scenarios, a reliable model is more important, which is one of the primary motivations behind the creation of We-Math. Furthermore, after completing our experiments, we find that in a loose setting, GPT-4o’s RM metric is only 1.07%, showing us the possibility of a reliable and accurate model emerging in the future.

Fine-grained Evaluation and Versatile Applications. From the model’s perspective, We-Math can provide LMMs with an assessment of mathematical abilities. Additionally, We-Math’s IK, IG, and CM metrics offer a fine-grained evaluation of the model’s capabilities. Furthermore, the RM metric reflects a model’s reliability: it addresses our concern that a model should not solve complex problems while making errors on the sub-problems within the solution process. Finally, we introduce the $\text{Score}_{\text{average}}$ metric to quantify the model’s overall performance. Moreover, since We-Math is constructed by decomposing the necessary solution process of each multi-step problem, it provides new perspectives for interactive tasks (multi-turn dialogues), self-supervised learning, information extraction, and other tasks. It also offers crucial references and support for the deployment of models in education and other fields.

Ethics Statement. We ensure that We-Math complies with legal and ethical guidelines throughout its construction process, with no violations. We provide fair compensation to all annotators involved. We-Math focuses on elementary mathematics problems, and during its construction, data collection was sourced from publicly available test questions, textbooks, and professional websites. Since mathematics problems inherently have standard answers, they are not subject to cultural differences. Additionally, we guarantee that We-Math is solely for academic research purposes, and we uphold the strict prohibition of any commercial use. Additionally, we declare that we will bear full responsibility in the event of any rights violations and confirm the data license.

Appendix B More Details on We-Math

B.1 Hierarchical Knowledge Structure

Figure 12: The Hierarchical Knowledge Structure of We-Math (1).
Figure 13: The Hierarchical Knowledge Structure of We-Math (2).

Figures 12 and 13 show the detailed hierarchical structure of We-Math, which includes 5 levels, 99 nodes, and 67 leaf nodes.

In the initial stages of constructing the benchmark, we aimed to address two key objectives. We believe that the purpose of designing a benchmark is to evaluate the performance of models and provide guidance on areas that need improvement. However, existing benchmarks offer only broad guides in these aspects. Additionally, the core contribution mentioned earlier is that We-Math is the first benchmark specifically designed to study the mathematical problem-solving mechanisms of models. Inspired by the learning paradigm of humans, which is based on knowledge concepts, We-Math constructs its dataset with knowledge concepts as the basic unit, resulting in evaluations with rigorous scientific accuracy and better guidance.

B.2 Knowledge-based Data Decomposition

Figures 14 and 15 illustrate the process of Knowledge-based Data Decomposition.

Collection. In each example, the Collection section presents specific information about each multi-step problem in the dataset.

Human reasoning. The Human reasoning section shows the process required before decomposing each multi-step problem, where educational experts extract the key information needed for each sub-problem based on the reasoning path for the knowledge concepts included in the multi-step problem.

Decompose. The Decompose section uses the key information extracted in the Human reasoning section to formulate sub-problems, refine the options, and ultimately achieve the decomposition of a multi-step problem.

To ensure that the sub-problems are logically connected yet independent of one another, the text condition for the first sub-problem is derived from the text condition of the multi-step problem, and the image condition for the first sub-problem is the same as the image condition of the multi-step problem.

Furthermore, in constructing the second sub-problem, two situations may arise. The first situation is where the answer of the first sub-problem is injected as a key condition into the image condition of the second sub-problem, presenting the information visually. The second situation is where the answer of the first sub-problem is injected as a key condition into the text condition of the second sub-problem, while the image condition remains unchanged.

In We-Math, the vast majority of cases are of the first type. However, for some information that is extremely difficult to present in images, we opt for the second type, presenting the information in text form. To ensure fairness in the decomposition of the problems, only one of these situations will occur in the decomposition of the same multi-step problem. This approach ensures that the question of the final sub-problem will match the original multi-step problem, completing the decomposition.

Figure 14: An example of a two-step problem in We-Math.
Figure 15: An example of a three-step problem in We-Math.
Table 3: Prompt templates for response generation.

Type: Multiple Choice
Prompt template:
Now, we require you to solve a multiple-choice math question. Please briefly describe your thought process and provide the final answer(option).
Question: <Question>
Option: <Option>
Regarding the format, please answer following the template below, and be sure to include two <> symbols:
<Thought process>: <<your thought process>> <Answer>: <<your option>>

Type: Knowledge Concept Augmentation
Prompt template:
Now, we require you to solve a multiple-choice math question. We will provide you with the relevant knowledge concepts of this question for your reference. Please briefly describe your thought process and provide the final answer(option).
Knowledge concept: <Knowledge concept>
Question: <Question>
Option: <Option>
Regarding the format, please answer following the template below, and be sure to include two <> symbols:
<Thought process>: <<your thought process>> <Answer>: <<your option>>

B.3 Knowledge Concepts Augmentation

Table 3 reports the prompt templates used in our experiments. We concatenate the textual descriptions into the prompt. Additionally, each knowledge concept description is accompanied by its corresponding visual content, which helps the experimenter understand the concept and facilitates further enhancement once models can incorporate sufficient visual information as part of the prompt.

In Section F.1, we illustrate the specific content of the descriptions for the 67 knowledge concepts. For example, as shown in Figure 46, the knowledge concept "Perimeter of Squares" requires knowing that "c=4a". Because textual descriptions alone are insufficient for understanding this concept, we include visual information to aid comprehension.
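The way the two templates in Table 3 are filled can be sketched as follows. This is a minimal illustration rather than the evaluation code released with We-Math: the template strings follow Table 3, while the function names `build_prompt` and `parse_answer`, the regular expression, and the sample inputs are hypothetical.

```python
import re
from typing import Optional

# Template text follows Table 3; the {placeholders} stand in for <Question> etc.
FORMAT_NOTE = (
    "Regarding the format, please answer following the template below, and be "
    "sure to include two <> symbols:\n"
    "<Thought process>: <<your thought process>> <Answer>: <<your option>>"
)

MULTIPLE_CHOICE = (
    "Now, we require you to solve a multiple-choice math question. Please briefly "
    "describe your thought process and provide the final answer(option).\n"
    "Question: {question}\n"
    "Option: {option}\n" + FORMAT_NOTE
)

KNOWLEDGE_AUGMENTED = (
    "Now, we require you to solve a multiple-choice math question. We will provide "
    "you with the relevant knowledge concepts of this question for your reference. "
    "Please briefly describe your thought process and provide the final answer(option).\n"
    "Knowledge concept: {concept}\n"
    "Question: {question}\n"
    "Option: {option}\n" + FORMAT_NOTE
)


def build_prompt(question: str, option: str, concept: Optional[str] = None) -> str:
    """Fill the plain template, or the augmented one when a concept is given."""
    if concept is None:
        return MULTIPLE_CHOICE.format(question=question, option=option)
    return KNOWLEDGE_AUGMENTED.format(concept=concept, question=question, option=option)


def parse_answer(response: str) -> Optional[str]:
    """Hypothetical helper: pull the option letter out of '<Answer>: <<B>>'."""
    match = re.search(r"<Answer>:\s*<<\s*([A-G])", response)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(build_prompt("What is the perimeter of the square?", "A. 8 B. 16", "c=4a"))
```

Keeping the format note in a single shared string means both prompt types request the same answer layout, so one parser can handle either response.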

B.4 Details of Data Collection

With the hierarchical knowledge structure in place, we select geometric problems with images from publicly available, authoritative mathematics websites from various countries, including professional exams and practice tests (a detailed list of sources can be found in Section F.2). To ensure comprehensive coverage of fundamental and critical areas in primary math, we select the five most foundational and prevalent domains within the field of primary geometry, including:

  • Plane figures: Questions involving identification and properties of two-dimensional shapes.

  • Solid figures: Questions related to the recognition and characteristics of three-dimensional objects.

  • Transformation and motion of figures: Problems focusing on geometric transformations such as translation, rotation, and reflection.

  • Position and direction: Questions that involve understanding spatial relationships and directions.

  • Measurement: Problems requiring the measurement of length, area, volume, and angles.

The selection criteria are as follows: (1) The problems include multiple knowledge concepts and can be decomposed into steps for solution. (2) The problems and images are consistent. (3) The correct answer is unique, and the distractor options are highly confusing.

B.5 Details of Data Statistics

Figure 16: The distribution of the number of words per question in We-Math. Questions with a length greater than 80 are categorized as 81 for visualization simplicity.
Table 4: Key statistics of We-Math.
Statistic Number
Total questions 6,524
Newly collected questions 6,524
Multiple-choice questions 6,524
-First-layer nodes 5
-Second-layer nodes 12
-Terminal nodes 67
Question options
-Total options 25,178
-Average options 3.859
-Proportion of answer A 6,524 (25.9%)
-Proportion of answer B 6,524 (25.9%)
-Proportion of answer C 6,505 (25.8%)
-Proportion of answer D 4,419 (17.6%)
-Proportion of answer E 1,198 (4.8%)
-Proportion of answer F&G 11 (0.04%)
Question length
-Maximum length (word) 143
-Maximum length (character) 852
-Average length (word) 25.8
-Average length (character) 135.3

Question distribution. We-Math consists entirely of English questions, all newly collected from publicly available, authoritative mathematics websites and presented in multiple-choice format. As shown in Table 4, the average number of words per question in We-Math is 25.81, and the longest question contains 143 words. Figure 16 further elaborates on the distribution of word counts, highlighting the diverse patterns of the questions.
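As a quick consistency check on Table 4, the derived figures can be recomputed from the raw counts. The sketch below is illustrative only: the counts are copied from Table 4, and the variable and function names are hypothetical. Dividing each per-option count by the total number of options reproduces the reported percentages.

```python
# Counts copied from Table 4.
TOTAL_QUESTIONS = 6524
TOTAL_OPTIONS = 25178
OPTION_COUNTS = {"A": 6524, "B": 6524, "C": 6505, "D": 4419, "E": 1198, "F&G": 11}


def average_options() -> float:
    """Average number of options per question."""
    return TOTAL_OPTIONS / TOTAL_QUESTIONS


def option_share(label: str) -> float:
    """Percentage of all options that carry the given label."""
    return 100 * OPTION_COUNTS[label] / TOTAL_OPTIONS


if __name__ == "__main__":
    print(f"average options: {average_options():.3f}")
    for label in OPTION_COUNTS:
        print(f"{label}: {option_share(label):.2f}%")
```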

Advantages of Multiple-Choice Questions.

In We-Math, all problems are presented as multiple-choice questions. Even if some problems did not originally conform to the multiple-choice format during the initial selection, our researchers manually converted them into the format. Using multiple-choice questions offers several advantages:

  • Standardization: Ensures uniformity across all questions, facilitating consistent assessment and comparison across different hierarchical subjects.

  • Objective Grading: The use of single correct answers eliminates subjectivity in grading, enhancing the reliability of the evaluation.

  • Efficiency: Allows for rapid and scalable assessment, suitable for large datasets and automated systems.

  • Focused Assessment: Carefully designed distractors help in accurately identifying specific knowledge gaps and common misconceptions.

Appendix C More Details on the Metrics

Distinguishing Metric. Considering the model’s instability, Figure 4 and Figures 17 and 18 illustrate the two criteria, strict and loose, that we propose for distinguishing between the RM and CM metrics. Figure 4 covers the two-step problem, while Figures 17 and 18 cover the three-step problem. In the patterns below, each letter marks a response as true (T) or false (F); the leading letters correspond to the sub-problems and the final letter to the multi-step problem. Specifically, under the strict metric, a correctly answered multi-step problem is classified as RM (Rote Memorization) if any of its corresponding sub-problems is answered incorrectly; it is classified as CM (Complete Mastery) only if all corresponding sub-problems are answered correctly (TTTT, TTT). Under the loose metric, it is classified as RM only if the model answers all sub-problems incorrectly (FFFT, FFT); otherwise, it is classified as CM. Therefore, the $\text{Score}_{\text{average}}$ under the loose metric is slightly higher. We hope that models such as GPT-4o [38] and GPT-4V [26], which already perform nearly perfectly under the loose metric and are far ahead of other models, will bring even greater surprises under the strict metric in their next updates.

Metrics’ Intrinsic Logic. As shown in Figures 4, 17, and 18, and as described in the Metric for Reasoning Evaluation section, IK, IG, and CM have a logical relationship. In the early stages of constructing We-Math, we recorded all the model’s responses and analyzed the answers to each multi-step problem and its corresponding sub-problems. We believe that for both humans and models, a reasonable learning process should involve first mastering each knowledge concept individually and then learning to apply them together to achieve complete mastery. The situation where the multi-step problem is answered correctly but the sub-problems are answered incorrectly (RM) is an unreasonable phenomenon. Therefore, we developed a four-dimensional fine-grained metric to further evaluate the model’s performance.
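The strict and loose rules can be sketched as a small classifier. This is an illustrative reading of the text, not the released evaluation code. The function name `classify` and its arguments are hypothetical. The RM/CM branch follows the rules stated above; the IK/IG branch (multi-step problem answered incorrectly) is an assumption: a wrong multi-step answer is labeled IG when every sub-problem is correct and IK otherwise, under both settings.

```python
from typing import Sequence


def classify(sub_correct: Sequence[bool], multi_correct: bool, strict: bool = True) -> str:
    """Label one multi-step problem as IK, IG, CM, or RM.

    sub_correct holds the correctness of each sub-problem, in order;
    multi_correct is the correctness of the multi-step problem itself.
    """
    if not sub_correct:
        raise ValueError("at least one sub-problem is required")

    all_right = all(sub_correct)
    all_wrong = not any(sub_correct)

    if multi_correct:
        if strict:
            # Strict: any sub-problem error turns a correct answer into RM.
            return "CM" if all_right else "RM"
        # Loose: RM only when every sub-problem is wrong (e.g. FFT, FFFT).
        return "RM" if all_wrong else "CM"

    # Assumption (not spelled out in this appendix): IG when all sub-problems
    # are right but the multi-step answer is wrong, IK otherwise.
    return "IG" if all_right else "IK"


if __name__ == "__main__":
    # Pattern TFT: first sub-problem right, second wrong, multi-step right.
    print(classify([True, False], True, strict=True))   # strict setting
    print(classify([True, False], True, strict=False))  # loose setting
```

A mixed pattern such as TFT is the case where the two settings disagree, which is why scores under the loose metric can only be equal to or higher than those under the strict metric.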

Figure 17: Diagram illustrating strict metric in three-step problem.
Figure 18: Diagram illustrating loose metric in three-step problem.

Appendix D More Details on Experiment Setup

D.1 Details of the Evaluated Models

To evaluate the mathematical reasoning abilities of various LMMs, we selected their latest model versions. Table 5 presents their release dates and specific sources. Given the intuition that smaller models (with parameters of 7B or less) perform poorly on Insufficient Knowledge (IK), we also included evaluations of the latest models with 7B, 4.2B, and 1.3B parameters. This was done to explore whether these models could achieve significant improvement under the KCA strategy.

Table 5: The release time and model source of LMMs used in We-Math

D.2 Details of the Model Hyperparameters

For all closed-source models with API access, we adopt the generation scheme shown in Table 6 and simply run the inference on CPUs, which typically completes within a day. For all open-source models, we use a cluster with 8 NVIDIA A800-SXM4-80GB GPUs to run the inference, and we follow the hyper-parameter settings specified in the inference samples of each model’s source. If no specific instructions are provided, we use the default settings. Table 7 details the specific generation parameters.

Table 6: Generating parameters for Closed-Source LMMs.

| Model | Generation Setup |
| --- | --- |
| GPT-4o | "model" : "gpt-4o", "temperature" : 0, "max_tokens" : 1024 |
| GPT-4V | "model" : "gpt-4-turbo", "temperature" : 0, "max_tokens" : 1024 |
| Gemini 1.5 Pro | "model" : "gemini-1.5-pro-latest", "temperature" : 0, "max_tokens" : 1024 |
| Qwen-VL-Max | "model" : "qwen-vl-max", "temperature" : 0, "max_tokens" : 1024 |

Table 7: Generating parameters for Open-Source LMMs.

| Model | Generation Setup |
| --- | --- |
| LLaVA-NeXT-110B | do_sample = False, temperature = 0, max_new_tokens = 1024 |
| LLaVA-NeXT-72B | do_sample = False, temperature = 0, max_new_tokens = 1024 |
| LLaVA-1.6-13B | do_sample = False, temperature = 0, max_new_tokens = 1024 |
| LLaVA-1.6-7B | do_sample = False, temperature = 0, max_new_tokens = 1024 |
| DeepSeek-VL-1.3B | do_sample = False, max_new_tokens = 1024 |
| DeepSeek-VL-7B | do_sample = False, max_new_tokens = 1024 |
| Phi3-Vision-4.2B | do_sample = False, temperature = 0, max_new_tokens = 1024 |
| MiniCPM-LLaMA3-V 2.5 | sampling = True, temperature = 0.7 |
| InternLM-XComposer2-VL-7B | do_sample = False |
| InternVL-Chat-V1.5 | num_beams = 1, do_sample = False, max_new_tokens = 1024 |
| GLM-4V-9B | do_sample = True, max_length = 1024, top_k = 1 |
| LongVA | do_sample = False, temperature = 0, max_new_tokens = 1024, num_beams = 1 |
| G-LLaVA-13B | do_sample = True, temperature = 0.2, max_new_tokens = 1024 |

Table 8: Model architecture of 17 LMMs evaluated on We-Math.

| Models | LLM | Vision Encoder |
| --- | --- | --- |
| GPT-4o | - | - |
| GPT-4V | - | - |
| Gemini 1.5 Pro | - | - |
| Qwen-VL-Max | - | - |
| LLaVA-NeXT-110B | Qwen1.5-110B-Chat | CLIP-ViT-L-P14-336 |
| LLaVA-NeXT-72B | Qwen1.5-72B-Chat | CLIP-ViT-L-P14-336 |
| LLaVA-1.6-13B | Vicuna-13B-v1.5 | CLIP-ViT-L-P14-336 |
| LLaVA-1.6-7B | Vicuna-7B-v1-5 | CLIP-ViT-L-P14-336 |
| DeepSeek-VL-1.3B | DeepSeek-LLM-1.3B-base | SigLIp-L-P16-384 |
| DeepSeek-VL-7B | DeepSeek-LLM-7B-base | SigLIp-L-P16-384 & SAM-B |
| Phi3-Vision-4.2B | Phi-3-mini-128K-instruct | CLIP-ViT-L-P14-336 |
| MiniCPM-LLaMA3-V 2.5 | Llama3-8B-Instruct | SigLIp-L-P14-384 |
| InternLM-XComposer2-VL-7B | InternLM2-7B-ChatSFT | CLIP-ViT-L-P14-336 |
| InternVL-Chat-V1.5 | InternLM2-Chat-20B | InternViT-6B-448px-V1-5 (6B) |
| GLM-4V-9B | GLM-9B | EVA_02_CLIP-E-P14 (4.7B) |
| LongVA | Qwen2-7B-Instruct | CLIP-ViT-L-P14-336 |
| G-LLaVA-13B | Vicuna-13B-v1.5 | CLIP-ViT-L-P14-336 |

Appendix E More Details on Experiment Results

E.1 Details of Model Performance

The Leaderboard on We-Math. We present the visualization results of the $\text{Score}_{\text{average}}$ under the loose and strict metrics in Figure 7. GPT-4o shows a significant lead under both metrics, and LLaVA-NeXT-110B performs the best among open-source models. Impressively, InternVL-Chat-V1.5 and GLM-4V-9B achieve excellent scores, surpassing the closed-source model Qwen-VL-Max. Additionally, some recently proposed smaller models (such as Phi3-Vision-4.2B, InternLM-XComposer2-VL-7B, and MiniCPM-LLaMA3-V 2.5) also demonstrate outstanding performance, suggesting that optimizing training methods might partially substitute for the performance gains typically achieved by increasing the parameter count.

Detailed Performance of Four-Dimensional Metrics. Figure 8 and Figure 9 display the specific performance of LMMs on the four metrics under both the loose and strict settings. Focusing on the IK metric, GPT-4o has the fewest instances under both settings, indicating that GPT-4o has the best grasp of the knowledge concepts. Furthermore, for the IG metric, we find that GPT-4o and GPT-4V exhibit the most IG cases among all models. As discussed in Appendix C, IG issues only arise after IK issues have been addressed, which further indicates that the GPT-4 series is progressing to the next stage. Focusing on the CM and RM metrics, GPT-4o and GPT-4V continue to show a significant lead. Both models excel in the CM metric, where the number of correctly answered multi-step problems and their corresponding sub-problems is significantly higher than that of other models. Additionally, comparing GPT-4o and GPT-4V under the strict metric, GPT-4o consistently outperforms GPT-4V, aligning with the $\text{Score}_{\text{average}}$ results.

Detailed Performance on Each Category. In Figure 6, we present the performance of open-source and closed-source models under the second-level nodes. In Figures 19 to 35, we detail the specific performance of the 17 models across the 67 knowledge concepts (based on statistics from one-step problems). It is evident that GPT-4o consistently leads in overall performance, but its main weakness lies in measurement-related tasks. Notably, some open-source models perform worse on the simpler "Understanding and Conversion of Units" knowledge concepts than on "Angles and Length" related concepts, while InternVL-Chat-V1.5 and MiniCPM-LLaMA3-V 2.5 exhibit more logically consistent results.

Figure 19: Detailed performance of GPT-4o across 67 knowledge concepts.
Figure 20: Detailed performance of GPT-4V across 67 knowledge concepts.
Figure 21: Detailed performance of Gemini 1.5 Pro across 67 knowledge concepts.
Figure 22: Detailed performance of Qwen-VL-Max across 67 knowledge concepts.
Figure 23: Detailed performance of LLaVA-NeXT-110B across 67 knowledge concepts.
Figure 24: Detailed performance of LLaVA-NeXT-72B across 67 knowledge concepts.
Figure 25: Detailed performance of InternVL-Chat-V1.5 across 67 knowledge concepts.
Figure 26: Detailed performance of LLaVA-1.6-13B across 67 knowledge concepts.
Figure 27: Detailed performance of G-LLaVA-13B across 67 knowledge concepts.
Figure 28: Detailed performance of GLM-4V-9B across 67 knowledge concepts.
Figure 29: Detailed performance of MiniCPM-LLaMA3-V 2.5 across 67 knowledge concepts.
Figure 30: Detailed performance of LongVA-7B across 67 knowledge concepts.
Figure 31: Detailed performance of LLaVA-1.6-7B across 67 knowledge concepts.
Figure 32: Detailed performance of DeepSeek-VL-7B across 67 knowledge concepts.
Figure 33: Detailed performance of InternLM-XComposer2-VL-7B across 67 knowledge concepts.
Figure 34: Detailed performance of Phi3-Vision-4.2B across 67 knowledge concepts.
Figure 35: Detailed performance of DeepSeek-VL-1.3B across 67 knowledge concepts.

E.2 Specific Error Analysis

Table 9: Detailed Descriptions of Error Types.

| Error Type | Explanation |
| --- | --- |
| Knowledge Error | For a specific knowledge concept, the model is unclear or confused about it, or it misuses another knowledge concept to solve the problem. |
| Reason Error | Errors that occur in the logical reasoning process while using knowledge concepts to solve the problem step by step. |
| Visual Error | Errors in visual perception, where the model incorrectly identifies shapes or numbers in an image. |
| Hallucination | The thought process introduces factors that are not consistent with the facts, which are not mentioned in the context of the image or question. |

Error Types. To delve into the failure cases of models, we detailed four typical error types in Table 9. Furthermore, to facilitate a better understanding of each error type, we provide examples of each error made by GPT-4o from Figure 36 to Figure 39. Since a single thought process in a problem can involve multiple errors and a single logical error is enough to derail a much larger solution, we consider the first error that occurs in the reasoning steps as the key error and include only this error in our statistics.
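The "first error only" counting rule can be sketched as below. This is a hypothetical illustration: it assumes each response has been annotated by hand with one label per reasoning step (`None` for a correct step), which is not a format described in the paper. The error type names come from Table 9; the function names are made up for this example.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence

ERROR_TYPES = ("Knowledge Error", "Reason Error", "Visual Error", "Hallucination")


def key_error(step_labels: Sequence[Optional[str]]) -> Optional[str]:
    """Return the first error label in a response, or None if every step is correct."""
    for label in step_labels:
        if label is not None:
            if label not in ERROR_TYPES:
                raise ValueError(f"unknown error type: {label}")
            return label
    return None


def tally_key_errors(responses: Iterable[Sequence[Optional[str]]]) -> Counter:
    """Count only the key (first) error of each response."""
    counts: Counter = Counter()
    for step_labels in responses:
        label = key_error(step_labels)
        if label is not None:
            counts[label] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical annotations for three responses.
    annotated = [
        [None, "Visual Error", "Reason Error"],  # counted as Visual Error
        [None, None],                            # no error, not counted
        ["Knowledge Error"],                     # counted as Knowledge Error
    ]
    print(tally_key_errors(annotated))
```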

Correspondence of Errors in Multi-Step and One-Step Problems. Focusing on Insufficient Knowledge, the errors in multi-step problems often correspond to those in one-step problems. This supports our approach of decomposing problems to accurately associate error types with specific knowledge concepts. Furthermore, we observe a positive correlation between the quantity of knowledge concepts and the errors in the reasoning process. As the complexity of knowledge concepts increases, the difficulty for the model to perform multi-step reasoning also increases, leading to a higher likelihood of visual recognition errors and incorrect application of knowledge concepts.

Figure 36: Specific examples of Visual Error.
Figure 37: Specific examples of Reason Error.
Figure 38: Specific examples of Knowledge Error.
Figure 39: Specific examples of Hallucination.

Appendix F Example Demonstration

F.1 Description of the Knowledge Concepts

Figure 5 and Figures 40 to 49 present detailed information on the knowledge concepts.

Figure 40: The description of the knowledge concept "Understanding and Conversion of Units"
Figure 41: The description of the knowledge concept "Angles and Length"
Figure 42: The description of the knowledge concept "Basic Transformations of Figures"
Figure 43: The description of the knowledge concepts "Direction" and "Position"
Figure 44: The description of the knowledge concept "Calculation of Solid Figures"
Figure 45: The description of the knowledge concept "Understanding of Solid Figures"
Figure 46: The description of the knowledge concept "Calculation of Plane Figures"
Figure 47: The description of the knowledge concept "Understanding of Plane Figures"
Figure 48: The description of the knowledge concept "Route Map"
Figure 49: The description of the knowledge concept "Correspondence of Coordinates and Positions"

F.2 Data Sources of We-Math

Tables 10 to 14 list the data sources of We-Math in detail.

Table 10: The data sources of We-Math (Part 1, Sources 1 to 50).
Number Data Source
1 [Beijing Chaoyang New Target Detection] 2022 Edition of People’s Education Edition Grade 3 Volume 1: Mathematics
2 [Beijing New Target Detection] 2023 Printing People’s Education Edition Grade 3 Volume 1: Mathematics
3 [Beijing Chaoyang New Target Detection] 2023 Printing People’s Education Edition Grade 3 Volume 2: Mathematics
4 [Learning Objectives and Assessment] 2023-2024 Academic Year People’s Education Edition Grade 3 Volume 2: Mathematics
5 [Beijing Chaoyang New Target Detection] 2020 Edition of the People’s Education Edition Grade 4 Volume 1: Mathematics
6 [New Target Detection] 2022 Edition of the People’s Education Press Grade 4 Volume 2: Mathematics
7 [Beijing Chaoyang New Target Detection] 2022 Edition of People’s Education Edition Grade 4 Volume 1: Mathematics
8 [Beijing Chaoyang New Target Detection] 2023 Printing People’s Education Edition Grade 4 Volume 2: Mathematics
9 [Beijing New Target Detection] 2023 Printing People’s Education Edition Grade 4 Volume 1: Mathematics
10 [Learning Objectives and Assessment] 2023-2024 Academic Year People’s Education Press Grade 4 Volume 2: Mathematics
11 [Beijing Chaoyang New Target Detection] 2021 Edition of the People’s Education Edition Grade 5 Volume 1: Mathematics
12 [New Target Detection] 2022 Edition of the People’s Education Edition Grade 5 Volume 2: Mathematics
13 [Beijing Chaoyang New Target Detection] 2022 Edition of the People’s Education Edition Grade 5 Volume 1: Mathematics
14 [Beijing Chaoyang New Target Detection] 2023 Printing People’s Education Edition Grade 5 Volume 2: Mathematics
15 [Beijing New Target Detection] 2023 Printing People’s Education Edition Grade 5 Volume 1: Mathematics
16 [Learning Objectives and Tests] 2023-2024 Academic Year People’s Education Edition Grade 5 Volume 2: Mathematics
17 [Beijing Chaoyang New Target Detection] 2023 Printing People’s Education Edition Sixth Grade Volume 2: Mathematics
18 [New Target Detection] 2022 Edition of the People’s Education Press Grade 6 Volume 2: Mathematics
19 [Beijing Chaoyang New Target Detection] 2022 Edition of People’s Education Edition Grade 6 Volume 1: Mathematics
20 [Beijing Chaoyang New Target Detection] 2023 Edition of the People’s Education Press Sixth Grade Volume 1: Mathematics
21 [Learning Objectives and Tests] 2023-2024 Academic Year Sixth Grade People’s Education Press Volume 2: Mathematics
22 [Beijing Xicheng Learning Exploration Diagnosis] 2022 Edition of People’s Education Edition Grade 3 Volume 1: Mathematics
23 [Beijing Learning Inquiry Diagnosis] 2023 Printing People’s Education Press Grade 3 Volume 2: Mathematics
24 [Beijing Learning Inquiry Diagnosis] 2023 Printing People’s Education Press Grade 3 Volume 1: Mathematics
25 [Learning Exploration Diagnosis] 2023-2024 Academic Year People’s Education Edition Grade 3 Volume 2: Mathematics
26 [Learning Exploration Diagnosis] 2022 Edition of the People’s Education Press Grade 4 Volume 2: Mathematics
27 [Beijing Xicheng Learning Exploration Diagnosis] 2022 Edition of People’s Education Edition Grade 4 Volume 1: Mathematics
28 [Beijing Learning Inquiry Diagnosis] 2023 Printing People’s Education Press Grade 4 Volume 2: Mathematics
29 [Beijing Learning Inquiry Diagnosis] 2023 Printing People’s Education Press Grade 4 Volume 1: Mathematics
30 [Learning Exploration Diagnosis] 2023-2024 Academic Year People’s Education Edition Grade 4 Volume 2: Mathematics
31 [Learning Exploration Diagnosis] 2022 Edition of the People’s Education Press Grade 5 Volume 2: Mathematics
32 [Beijing Xicheng Learning Exploration Diagnosis] 2022 Edition of the People’s Education Press Fifth Grade Volume 1: Mathematics
33 [Beijing Learning Inquiry Diagnosis] 2023 Printing People’s Education Press Grade 5 Volume 2: Mathematics
34 [Beijing Learning Inquiry Diagnosis] 2023 Printing People’s Education Press Grade 5 Volume 1: Mathematics
35 [Learning Exploration Diagnosis] 2023-2024 Academic Year People’s Education Edition Grade 5 Volume 2: Mathematics
36 [Learning Exploration Diagnosis] 2022 Edition of the People’s Education Press Grade 6 Volume 2: Mathematics
37 [Beijing Learning Inquiry Diagnosis] 2022 Edition of People’s Education Press Grade 6 Volume 1: Mathematics
38 [Beijing Learning Exploration Diagnosis] 2023 Printing People’s Education Press Grade 6 Volume 2: Mathematics
39 [Beijing Learning Inquiry Diagnosis] 2023 Printing People’s Education Press Sixth Grade Volume 1: Mathematics
40 [Learning Exploration Diagnosis] 2023-2024 Academic Year People’s Education Edition Sixth Grade Volume 2: Mathematics
41 [Beijing Haidian famous teachers accompany you to study and practice synchronously] 2022 Beijing Normal University Edition Grade 3 Volume 1: Mathematics
42 [Beijing Haidian famous teachers accompany you to study and practice synchronously] 2022 print Beijing Normal University edition Grade 3, Volume 2: Mathematics
43 [Beijing Companion Learning Synchronous Learning Handbook] 2023 Printing Beijing Normal University Edition Grade 3 Volume 1: Mathematics
44 [Synchronous Learning Handbook for You] 2023-2024 Academic Year Beijing Normal University Edition Grade 3 Volume 2: Mathematics
45 [Famous teachers from Haidian accompany you to study, practice and test] 2022 Beijing Normal University Grade 4 Volume 2: Mathematics
46 [Beijing Haidian famous teachers accompany you to study and practice synchronously] 2022 Beijing Normal University Edition Grade 4 Volume 1: Mathematics
47 [Beijing Haidian famous teachers accompany you to study and practice synchronously] 2022 Beijing Normal University Edition Grade 4 Volume 2: Mathematics
48 [Beijing Companion Learning Synchronous Learning Handbook] 2023 Printing Beijing Normal University Edition Grade 4 Volume 1: Mathematics
49 [Synchronous Learning Handbook for You] 2023-2024 Academic Year Beijing Normal University Edition Grade 4 Volume 2: Mathematics
50 [Haidian famous teachers accompany you to study synchronous learning and practice book] 2021 edition of fifth grade volume 1: Mathematics
Table 11: The data sources of We-Math (Part 2, Sources 51 to 100).
Number Data Source
51 [Haidian famous teachers accompany you to study and practice synchronously] 2022 edition of Beijing Normal University Grade 5 Volume 2: Mathematics
52 [Beijing Haidian famous teachers accompany you to study synchronously] 2022 Beijing Normal University Edition Grade 5 Volume 2: Mathematics
53 [Beijing Companion Learning Synchronous Learning Handbook] 2023 Printing Beijing Normal University Edition Grade 5 Volume 1: Mathematics
54 [Synchronous Learning Handbook for You] 2023-2024 Beijing Normal University Edition Grade 5 Volume 2: Mathematics
55 [Haidian famous teachers accompany you to study synchronous learning and practice book] 2021 edition of sixth grade volume: Mathematics
56 [Haidian famous teachers accompany you to study and practice synchronously] 2022 Beijing Normal University Grade 6 Volume 2: Mathematics
57 [Beijing Haidian Famous Teacher] 2022 Beijing Normal University Edition Sixth Grade Volume 1: Mathematics
58 [Beijing Haidian famous teachers accompany you to study and practice synchronously] 2022 Beijing Normal University Edition Grade 6 Volume 2: Mathematics
59 [Beijing Companion Learning Synchronous Learning Handbook] 2023 Printing Beijing Normal University Edition Sixth Grade Volume 1: Mathematics
60 [Synchronous Learning Handbook for You] 2023-2024 Academic Year Beijing Normal University Edition Sixth Grade Volume 2: Mathematics
61 [Beijing Dongcheng Formative Independent Evaluation] 2022 Edition of People’s Education Edition Grade 3 Volume 1: Mathematics
62 [Formative Self-Evaluation] 2023-2024 Academic Year Grade 3, Volume 2: Mathematics
63 [Beijing Formative Independent Assessment] 2023 Printing People’s Education Press Grade 3 Volume 1: Mathematics
64 [Beijing Formative Independent Assessment] 2023 Printing People’s Education Press Grade 3 Volume 1: Mathematics
65 [Beijing Dongcheng Formative Independent Evaluation] 2022 Edition of the People’s Education Press Grade 4 Volume 1: Mathematics
66 [Formative Self-Evaluation] 2023-2024 Academic Year People’s Education Edition Grade 4 Volume 2: Mathematics
67 [Beijing Formative Independent Assessment] 2023 Printing People’s Education Press Grade 4 Volume 1: Mathematics
68 [Formative Self-Evaluation] 2022 Edition of the People’s Education Press Grade 5 Volume 2: Mathematics
69 [Beijing Dongcheng Formative Independent Evaluation] 2022 Edition of the People’s Education Press Fifth Grade Volume 1: Mathematics
70 [Formative Self-Evaluation] 2023-2024 Academic Year People’s Education Edition Grade 5 Volume 2: Mathematics
71 [Beijing Formative Independent Assessment] 2023 Printing People’s Education Edition Grade 5 Volume 1: Mathematics
72 [Formative Self-Evaluation] 2022 Edition of the People’s Education Press Grade 6 Volume 2: Mathematics
73 [Beijing Dongcheng Formative Independent Evaluation] 2022 Edition of the People’s Education Press Sixth Grade Volume 1: Mathematics
74 [Formative Self-Evaluation] 2023-2024 Academic Year Sixth Grade People’s Education Edition Volume 2: Mathematics
75 [Beijing Formative Independent Assessment] 2023 Printing People’s Education Press Sixth Grade Volume 1: Mathematics
76 [All-in-one study, practice and examination] 2023-2024 school year Beijing edition Grade 3, Volume 2: Mathematics
77 [All-in-one study, practice and examination] 2023-2024 school year People’s Education Press Grade 3 Volume 2: Mathematics
78 [Beijing all-in-one study, practice and examination] 2023 Printing People’s Education Press Grade 3 Volume 1: Mathematics
79 [Beijing all-in-one study, practice and examination] 2023 Beijing Edition Grade 3 Volume 2: Mathematics
80 [Beijing all-in-one study, practice and examination] 2022 Beijing Edition Grade 3 Volume 2: Mathematics
81 [Beijing all-in-one study, practice and examination] 2022 Edition of People’s Education Press Grade 3 Volume 2: Mathematics
82 [Beijing all-in-one study, practice and examination] 2022 Beijing Normal University Edition Grade 3 Volume 1: Mathematics
83 [All-in-one study, practice and examination] 2023-2024 school year Beijing Edition Grade 4 Volume 2: Mathematics
84 [All-in-one study, practice and examination] 2023-2024 school year People’s Education Press Grade 4 Volume 2: Mathematics
85 [Beijing all-in-one study, practice and examination] 2023 Beijing Edition Grade 4 Volume 1: Mathematics
86 [Beijing all-in-one study, practice and examination] 2023 Printing People’s Education Press Grade 4 Volume 1: Mathematics
87 [Beijing all-in-one study, practice and examination] 2022 Beijing Edition Grade 4 Volume 2: Mathematics
88 [Beijing all-in-one study, practice and examination] 2022 Edition of People’s Education Press Grade 4 Volume 2: Mathematics
89 [Beijing all-in-one study, practice and examination] 2022 Beijing Normal University Edition Grade 4 Volume 1: Mathematics
90 [Beijing all-in-one study, practice and examination] 2021 Edition Grade 4 Volume 1: Mathematics
91 [All-in-one study, practice and examination] 2023-2024 school year Beijing Edition Grade 5 Volume 2: Mathematics
92 [All-in-one study, practice and examination] 2023-2024 school year People’s Education Press Grade 5 Volume 2: Mathematics
93 [Beijing all-in-one study, practice and examination] 2023 Beijing Edition Grade 5 Volume 1: Mathematics
94 [Beijing all-in-one study, practice and examination] 2023 Printing People’s Education Edition Grade 5 Volume 1: Mathematics
95 [Beijing all-in-one study, practice and examination] 2022 Beijing Edition Grade 5, Volume 2: Mathematics
96 [Beijing all-in-one study, practice and examination] 2022 Edition of People’s Education Edition Grade 5 Volume 2: Mathematics
97 [All-in-one study, practice and examination] 2022 Beijing Normal University Edition Grade 5 Volume 1: Mathematics
98 [Beijing all-in-one study, practice and examination] 2021 Edition Grade 5 Volume 1: Mathematics
99 [All-in-one study, practice and examination] 2023-2024 school year Beijing edition sixth grade volume 2: Mathematics
100 [All-in-one study, practice and examination] 2023-2024 school year People’s Education Press Grade 6 Volume 2: Mathematics
Table 12: The data sources of We-Math (Part 3, Sources 101 to 150).
Number Data Source
101 [Beijing all-in-one study, practice and examination] 2023 Beijing Edition Sixth Grade Volume 1: Mathematics
102 [Beijing all-in-one study, practice and examination] 2023 Printing People’s Education Press Sixth Grade Volume 1: Mathematics
103 [Beijing all-in-one study, practice and examination] 2022 Beijing Edition Grade 6 Volume 2: Mathematics
104 [Beijing all-in-one study, practice and examination] 2022 Edition of People’s Education Press Sixth Grade Volume 2: Mathematics
105 [Beijing all-in-one study, practice and examination] 2022 Printing Beijing Normal University Edition Grade 6 Volume 2: Mathematics
106 [Beijing all-in-one study, practice and examination] 2022 Beijing Normal University Edition Sixth Grade Volume 1: Mathematics
107 [Beijing all-in-one study, practice and examination] 2021 Edition Sixth Grade Volume 1: Mathematics
108 [Beijing Class Workbook] 2023 Beijing Edition Grade 4 Volume 1: Mathematics
109 [Class Workbook] 2023-2024 Academic Year Beijing Edition Grade 4 Volume 2: Mathematics
110 [Zhejiang Class Workbook] 2022 Edition People’s Education Edition Grade 4 Volume 1: Mathematics
111 [Class Workbook] 2023-2024 Beijing Edition Grade 5, Volume 2: Mathematics
112 [Beijing Class Workbook] 2023 Beijing Edition Grade 5 Volume 1: Mathematics
113 [Zhejiang Class Workbook] 2022 Beijing Normal University Edition Grade 5 Volume 1: Mathematics
114 [Zhejiang Class Workbook] 2022 Edition People’s Education Edition Grade 5 Volume 1: Mathematics
115 [Beijing Class Workbook] 2023 Beijing Edition Grade 6 Volume 1: Mathematics
116 [Class Workbook] 2023-2024 Academic Year Beijing Edition Grade 6 Volume 2: Mathematics
117 [Zhejiang Class Workbook] 2022 Beijing Normal University Edition Sixth Grade Volume 1: Mathematics
118 [Zhejiang Class Workbook] 2022 Edition People’s Education Edition Sixth Grade Volume 1: Mathematics
119 [Mathematics Textbook] 2023-2024 Academic Year People’s Education Press Grade 3 Volume 2: Mathematics
120 [Mathematics Textbook] 2022 Shanghai Education Edition Grade 3 Volume 1: Mathematics
121 [Beijing Mathematics Textbook] 2022 Beijing Edition Grade 3 Volume 1: Mathematics
122 [Beijing Mathematics Textbook] 2022 Beijing Normal University Edition Grade 3 Volume 1: Mathematics
123 [Shanghai Mathematics Textbook] 2021 Shanghai Education Edition Grade 3 Volume 2: Mathematics
124 [Beijing Mathematics Textbook] 2021 Beijing Edition Grade 3 Volume 2: Mathematics
125 [Mathematics Textbook] 2020 Beijing Normal University Edition Grade 3 Volume 2: Mathematics
126 [Beijing Mathematics Textbook] 2020 Edition People’s Education Press Grade 3 Volume 1: Mathematics
127 [Mathematics Textbook] 2023-2024 Academic Year People’s Education Press Grade 4 Volume 2: Mathematics
128 [Shanghai Mathematics Textbook] 2022 Shanghai Education Edition Grade 4 Volume 1: Mathematics
129 [Beijing Mathematics Textbook] 2022 Beijing Edition Grade 4 Volume 1: Mathematics
130 [Beijing Mathematics Textbook] 2022 Beijing Normal University Edition Grade 4 Volume 1: Mathematics
131 [Shanghai Mathematics Textbook] 2021 Shanghai Education Edition Grade 4 Volume 2: Mathematics
132 [Beijing Mathematics Textbook] 2021 Beijing Edition Grade 4 Volume 2: Mathematics
133 [Beijing Mathematics Textbook] 2020 Beijing Normal University Edition Grade 4 Volume 2: Mathematics
134 [Beijing Mathematics Textbook] 2019 Edition People’s Education Press Grade 4 Volume 1: Mathematics
135 [Mathematics Textbook] 2023-2024 Academic Year People’s Education Edition Grade 5 Volume 2: Mathematics
136 [Shanghai Mathematics Textbook] 2022 Shanghai Education Edition Grade 5 Volume 1: Mathematics
137 [Mathematics Textbook] 2022 Beijing Normal University Edition Grade 5 Volume 1: Mathematics
138 [Shanghai Mathematics Textbook] 2022 Shanghai Education Edition Grade 5 Volume 1: Mathematics
139 [Mathematics Textbook] 2022 Beijing Normal University Edition Grade 5 Volume 1: Mathematics
140 [Beijing Mathematics Textbook] 2022 Beijing Edition Grade 5 Volume 1: Mathematics
141 [Beijing Mathematics Textbook] 2021 Beijing Edition Grade 5 Volume 2: Mathematics
142 [Shanghai Mathematics Textbook] 2020 Shanghai Education Edition Grade 5 Volume 2: Mathematics
143 [Beijing Mathematics Textbook] 2020 Beijing Normal University Edition Grade 5 Volume 2: Mathematics
144 [Beijing Mathematics Textbook] 2020 Edition People’s Education Edition Grade 5 Volume 1: Mathematics
145 [Beijing Mathematics Textbook] 2022 Edition People’s Education Press Sixth Grade Volume 2: Mathematics
146 [Beijing Mathematics Textbook] 2022 Beijing Edition Sixth Grade Volume 1: Mathematics
147 [Mathematics Textbook] 2022 Beijing Normal University Edition Grade 6 Volume 1: Mathematics
148 [Beijing Mathematics Textbook] 2021 Beijing Normal University Edition Grade 6 Volume 2: Mathematics
149 [Beijing Mathematics Textbook] 2021 Beijing Edition Sixth Grade Volume 2: Mathematics
150 [Beijing Mathematics Textbook] 2018 Edition People’s Education Press Sixth Grade Volume 1: Mathematics
Table 13: The data sources of We-Math (Part 4, Sources 151 to 200).
Number Data Source
151 2022 Sichuan Liangshan Primary School to Junior High School Examination Paper (People’s Education Edition): Mathematics
152 2022 Chongqing Yubei District Primary School to Junior High School Examination Paper (People’s Education Edition Examination): Mathematics
153 2022 Guizhou Qiandongnan Primary School to Junior High School Examination Paper (People’s Education Edition Examination): Mathematics
154 2022 Anhui Fuyang Taihe County Primary School to Junior High School Examination Paper (Beijing Normal University Edition Examination): Mathematics
155 2022 Guangdong Huizhou Huiyang District Primary School to Junior High School Examination Paper (Beijing Normal University Edition): Mathematics
156 2022 Guangdong Shaoguan Xinfeng County Primary School to Junior High School Examination Paper (People’s Education Edition): Mathematics
157 2022 Guangdong Zhanjiang Mazhang District Primary School to Junior High School Examination Paper (Beijing Normal University Edition Examination): Mathematics
158 2022 Gansu Dingxi Minxian Primary School to Junior High School Examination Paper (Beijing Normal University Edition): Mathematics
159 2022 Guangdong Jieyang Jiedong District Primary School to Junior High School Examination Paper (Beijing Normal University Edition Examination): Mathematics
160 2022 Henan Luohe Wuyang County Primary School Entrance Examination Paper (People’s Education Edition): Mathematics
161 2022 Tianjin Primary School to Junior High School Examination Paper (Primary School to Junior High School in Some Districts): Mathematics
162 2022 Hebei Tangshan Lunan District Primary School Entrance Examination Paper (Hebei Education Edition Examination): Mathematics
163 2022 Hebei Baoding Qingyuan District Primary School to Junior High School Examination Paper (People’s Education Edition): Mathematics
164 2022 Xinjiang Turpan Primary School to Junior High School Examination Paper (People’s Education Edition): Mathematics
165 2022 Hebei Shijiazhuang Luquan District Primary School Entrance Examination Paper (Hebei Education Edition): Mathematics
166 2022 Hainan Ledong Li Autonomous County Primary School to Junior High School Examination Paper (People’s Education Edition Examination): Mathematics
167 2022 Henan Jiyuan Primary School to Junior High School Examination Paper (People’s Education Edition): Mathematics
168 2021 Beijing Fengtai District Primary School to Junior High School Examination Paper (People’s Education Edition Examination): Mathematics
169 2021 Yunnan Kunming Wuhua District Primary School to Junior High School Examination Paper: Mathematics
170 2021 Yunnan Kunming Xishan District Primary School Entrance Examination Paper: Mathematics
171 2021 Shaanxi Xi’an Beilin District Primary School to Junior High School Examination Paper (Part 2): Mathematics
172 2021 Shaanxi Xi’an Beilin District Primary School to Junior High School Examination Paper: Mathematics
173 2021 Zhejiang Ningbo Haishu District Primary School to Junior High School Examination Paper: Mathematics
174 2021 Shaanxi Xi’an Weiyang District Primary School to Junior High School Examination Paper: Mathematics
175 2021 Shaanxi Xi’an Yanta District Primary School to Junior High School Examination Paper (Part 5): Mathematics
176 2021 Shaanxi Xi’an Beilin District Primary School to Junior High School Examination Paper: Mathematics
177 2021 Primary to Junior High School Examination Paper in Baqiao District, Xi’an, Shaanxi: Mathematics
178 2021 Shaanxi Xi’an Weiyang District Primary School to Junior High School Examination Paper (Part 3): Mathematics
179 2021 Primary to Junior High School Examination Paper in Erqi District, Zhengzhou, Henan: Mathematics
180 2021 Jiangsu Nantong Primary School to Junior High School Examination Paper (Main Urban Area): Mathematics
181 2021 Jiangsu Suzhou Xiangcheng District Primary School to Junior High School Examination Paper 5 to 6 Direct Promotion Class: Mathematics
182 2021 Primary to Junior High School Examination Paper for Baqiao District, Xi’an, Shaanxi (Part 5): Mathematics
183 2021 Hunan Changsha Yuhua District Primary School to Junior High School Examination Paper: Mathematics
184 2021 Primary to Junior High School Examination Paper of Jingxing County, Shijiazhuang, Hebei (Hebei Education Edition): Mathematics
185 2021 Hebei Shijiazhuang Pingshan County Primary School Entrance Examination Paper (Hebei Education Edition): Mathematics
186 2021 Hebei Shijiazhuang Lingshou County Primary School to Junior High School Examination Paper (Hebei Education Edition): Mathematics
187 2021 Primary to Junior High School Examination Paper in Yanta District, Xi’an, Shaanxi: Mathematics
188 2021 Shaanxi Xi’an Xincheng District Primary School to Junior High School Examination Paper (Part 3): Mathematics
189 2021 Primary to Junior High School Examination Paper of Yuanshi County, Shijiazhuang, Hebei (Hebei Education Edition): Mathematics
190 2021 Primary to Junior High School Examination Paper in Zhengding County, Shijiazhuang, Hebei (Hebei Education Edition): Mathematics
191 2021 Shaanxi Xi’an Yanta District Primary School to Junior High School Examination Paper (Part 14): Mathematics
192 2021 Chongqing Shapingba District Chongqing Nankai Middle School Primary School to Junior High School Examination Paper (Part 3): Mathematics
193 2021 Shaanxi Xi’an Yanta District Primary School to Junior High School Examination Paper (Part 4A): Mathematics
194 2021 Primary to Junior High School Examination Paper in Yanta District, Xi’an, Shaanxi: Mathematics
195 2021 Shaanxi Xi’an Beilin District Primary School to Junior High School Examination Paper (II) (Hanguang Campus): Mathematics
196 2020 Beijing Chaoyang District Primary School to Junior High School Examination Paper: Mathematics
197 2020 Beijing Haidian District Primary School to Junior High School Examination Paper (Paper A): Mathematics
198 2020 Beijing Haidian District Primary School to Junior High School Examination Paper (Paper B): Mathematics
199 2020 Beijing Haidian District Primary School to Junior High School Examination Paper: Mathematics
200 2020 Beijing Changping District Primary School to Junior High School Examination Paper: Mathematics
Table 14: The data sources of We-Math (Part 5, Sources 201 to 223).
Number Data Source
201 2020 Hunan Changsha Yuhua District Yali Experimental Middle School Primary to Junior High School Mathematics Test Paper
202 2020 Shenzhen Futian District Shenzhen Senior High School Primary to Junior High School Mathematics Test Paper
203 2020 Heilongjiang Jixi Hulin City Primary School Entrance Examination Paper: Mathematics
204 2020 Heilongjiang Qiqihar Primary School to Junior High School Examination Paper: Mathematics
205 2020 Ningxia Wuzhong Hongsibao District Primary School Entrance Examination Paper (People’s Education Edition): Mathematics
206 2020 Liaoning Fushun Wanghua District Primary School to Junior High School Examination Paper: Mathematics
207 2020 Primary to Junior High School Examination Paper in Xincheng District, Hohhot, Inner Mongolia: Mathematics
208 2020 Guangdong Zhaoqing Huaiji County Primary School to Junior High School Examination Paper: Mathematics
209 2020 Guangdong Zhaoqing Huaiji County Primary School to Junior High School Examination Paper: Mathematics
210 2020 Fujian Quanzhou Licheng District Primary School Entrance Examination Paper: Mathematics
211 2020 Primary to Junior High School Examination Paper in Huimin District, Hohhot, Inner Mongolia: Mathematics
212 2020 Hebei Baoding Jingxiu District Primary School to Junior High School Examination Paper: Mathematics
213 2020 Sichuan Chengdu Pidu District Primary School to Junior High School Examination Paper (Part 1): Mathematics
214 2020 Sichuan Mianyang Youxian District Primary School Entrance Examination Paper (Part 6): Mathematics
215 2020 Guangdong Shaoguan Zhenjiang District Primary School to Junior High School Examination Paper: Mathematics
216 2020 Sichuan Mianyang Fucheng District Primary School to Junior High School Examination Paper (Part 10): Mathematics
217 2020 Sichuan Mianyang Fucheng District Primary School to Junior High School Examination Paper (Part 5): Mathematics
218 2020 Sichuan Chengdu Pidu District Primary School to Junior High School Examination Paper (Part 5): Mathematics
219 2020 Hunan Changsha Yuelu District Primary School to Junior High School Examination Paper: Mathematics
220 2020 Hunan Changsha Yuelu District Primary School to Junior High School Examination Paper: Mathematics
221 2020 Gansu Lanzhou Chengguan District Primary School to Junior High School Examination Paper (Class Examination): Mathematics
222 2020 Hunan Changsha Primary School to Junior High School Examination Paper: Mathematics
223 2020 Liaoning Shenyang Primary School Primary Exam Paper: Mathematics