^{*} Equal Contribution. ^{\ddagger} Corresponding Authors.

ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning

Renqiu Xia^{1,2,*}, Bo Zhang^{1,*}, Hancheng Ye^{1,*}, Xiangchao Yan^{1}, Qi Liu^{1,2}, Hongbin Zhou^{1}, Zijun Chen^{1,2}, Min Dou^{1}, Botian Shi^{1,\ddagger}, Junchi Yan^{1,2,\ddagger}, Yu Qiao^{1}

^{1} Shanghai Artificial Intelligence Laboratory

^{2} Shanghai Jiao Tong University

Abstract

Recently, many versatile Multi-modal Large Language Models (MLLMs) have emerged. However, their capacity to query information depicted in visual charts and engage in reasoning based on the queried contents remains under-explored. In this paper, to comprehensively and rigorously benchmark the ability of the off-the-shelf MLLMs in the chart domain, we construct ChartX, a multi-modal evaluation set covering 18 chart types, 7 chart tasks, 22 disciplinary topics, and high-quality chart data. Besides, we develop ChartVLM to offer a new perspective on handling multi-modal tasks that strongly depend on interpretable patterns, such as reasoning tasks in the field of charts or geometric images. We evaluate the chart-related ability of mainstream MLLMs and our ChartVLM on the proposed ChartX evaluation set. Extensive experiments demonstrate that ChartVLM surpasses both versatile and chart-related large models, achieving results comparable to GPT-4V. We believe that our study can pave the way for further exploration in creating a more comprehensive chart evaluation set and developing more interpretable multi-modal models. Both ChartX and ChartVLM are available at: https://github.com/UniModal4Reasoning/ChartVLM

1 Introduction

Versatile Multi-modal Large Language Models (MLLMs) have made promising progress in general-purpose vision-language applications such as multi-modal Question Answering (QA) (Lin et al., 2023; Bai et al., 2023; Wang et al., 2023; Liu et al., 2023b), embodied AI (Huang et al., 2023b), and mathematical reasoning (Trinh et al., 2024; Yang et al., 2023a; Jiang et al., 2022). Although MLLMs have demonstrated their powerful generalization ability in a wide range of multi-modal tasks, their performance in multi-modal reasoning tasks still falls short of human abilities (Yang et al., 2023b; Bubeck et al., 2023; Achiam et al., 2023). For instance, humans can easily extract numerical values from a given visual chart and engage in a series of complicated logical reasoning based on the extracted values. However, at present, the MLLMs’ ability to perform complicated logical reasoning based on chart data has not been fully explored.

In this paper, to further validate their capabilities in more complicated reasoning tasks involving chart data, we propose a multi-modal benchmark for comprehensive chart understanding. As illustrated in Fig. 1, our work comprises two contributions: (1) ChartX, which is a comprehensive, high-quality evaluation set designed to sufficiently assess the chart understanding abilities of the off-the-shelf MLLMs, and (2) An interpretable Chart-domain Vision-Language Model (ChartVLM) for general-purpose chart applications.

To construct a comprehensive chart evaluation set, we collected 48K multi-modal chart samples covering 22 topics, 18 chart types, and 7 tasks. Each chart sample includes four modalities: image, Comma-Separated Values (CSV), Python code, and text description. According to task complexity, we classify the 7 chart tasks into two general categories: perception tasks (chart structural extraction, chart type classification, and chart title extraction) and cognition tasks (QA, chart description, chart summarization, and chart re-drawing).

For scientific domains such as chart reasoning, where interpretability is paramount, our primary observation is that perception tasks should be prioritized before complicated reasoning tasks are attempted. The statistical information extracted via the perception tasks plays a pivotal role in supporting the interpretability of the model’s reasoning tasks. Building upon this observation, we introduce ChartVLM, characterized by the integration of perception task predictions (e.g., structural data extraction) into reasoning tasks to enhance the interpretability of the reasoning results. Furthermore, ChartVLM utilizes an instruction adapter to dynamically select the task that a user expects to perform according to the user’s instruction, ensuring both interpretability and interactivity.

On top of this, the existing open-source chart datasets are consolidated for the training of ChartVLM, including ChartQA (Masry et al., 2022), Chart-to-text (Obeid and Hoque, 2020), PlotQA (Methani et al., 2020), and SimChart9K (Xia et al., 2023). Note that during the training process, ChartVLM has no access to any data from the ChartX evaluation set. Then, we conduct comprehensive comparisons of ChartVLM with current MLLMs (Bai et al., 2023; Liu et al., 2023b) on the ChartX evaluation set, including base abilities, e.g., data extraction, and advanced abilities, e.g., complicated problem-solving, where we demonstrate the superiority of our ChartVLM.

2 Related Work

Figure 1: Our work offers two insights: a) ChartX: a comprehensive multi-modal chart evaluation set encompassing 22 disciplinary topics, 18 chart types, and 7 tasks, where models are evaluated using task-specific metrics such as EM, GPT-acc, GPT-score, and SCRM (Xia et al., 2023), and b) ChartVLM: a novel framework to perform multiple tasks in the chart domain. Our key point is to leverage the instruction adapter to dynamically choose the task that needs to be executed. For downstream tasks that rely on querying chart information, we prioritize chart structural extraction before engaging in chart reasoning tasks. This sequence aims to enhance the interpretability of the reasoning results.

Study Works	# Chart Topic	# Chart Type	# Task Type	# Evaluation Chart Images	Evaluation Dataset	Evaluation Metric	Open-source
Single-task Evaluation
PlotQA (Methani et al., 2020)	N/A	3	1	33.7K	PlotQA	EM & AP	✓
Chart-to-text (Obeid and Hoque, 2020)	6	6	1	6.6K	Chart-to-text	EM	✓
ChartQA (Masry et al., 2022)	15	3	1	1.5K	ChartQA	EM	✓
OpenCQA (Kantharaj et al., 2022)	10	5	1	1.2K	OpenCQA	EM & BLEU & ROUGE	✓
Multi-task Evaluation
ChartLlama (Han et al., 2023)	N/A	10	7	1.5K	ChartQA & Chart-to-text	EM & GPT	✗
ChartBench (Xu et al., 2023)	N/A	9	4	2K	ChartBench	Accuracy	✗
MMC (Liu et al., 2023a)	5	6	9	2.1K	MMC	GPT	✗
ChartAssistant (Meng et al., 2024)	N/A	9	5	1.5K	ChartQA & OpenCQA	EM & BLEU	✗
Ours	22	18	7	6K	ChartX	EM & SCRM & GPT-acc & GPT-score	✓

Table 1: Comparison with the existing chart-related benchmarks, where ChartX is constructed for comprehensively evaluating the off-the-shelf vision-language large models from more chart types and topics. Besides, EM denotes Exact Match and SCRM represents the Structuring Chart-oriented Representation Metric described in StructChart (Xia et al., 2023).

Chart Perception aims to extract the numerical and textual information from a given visual chart. By leveraging the OCR tools (Luo et al., 2021) to supplement the textual information, the basic function of extracting chart information can be achieved. Recently, some researchers (Hassan et al., 2023; Rane et al., 2021; Huang et al., 2023a) have attempted to perform a chart-to-table transformation for the visual chart perception task, by means of self-supervision from image-table pairs. For example, Deplot (Liu et al., 2022a) fine-tuned an image-to-text transformer for such conversion. StructChart (Xia et al., 2023) utilizes the encoder-decoder framework to achieve transformation. These methods extract the tabular format of a visual chart and leverage the external module such as GPT (Ouyang et al., 2022; Brown et al., 2020) to perform downstream tasks. However, their chart-related reasoning abilities strongly depend on external modules, whose scalability is hard to guarantee.

Chart Cognition is defined as a process to deal with intricate tasks related to both chart-related knowledge and common sense knowledge. A typical example is to query numerical points from a chart and give the prediction results using mathematical or logical reasoning. Recent studies (He et al., 2023; Tian et al., 2023; Zha et al., 2023; Lee et al., 2023; Liu et al., 2022b) focus on showing the reasoning ability of their models in the chart domain. Pix2Struct (Lee et al., 2023) presents a pre-training method using masked screenshots from web pages, which is verified to be effective in chart understanding tasks such as the ChartQA dataset (Masry et al., 2022). Besides, MatCha (Liu et al., 2022b) decodes the answers to chart questions in an end-to-end manner, where the chart reasoning ability can be enhanced by MATH data (Saxton et al., 2019).

Multi-Modal Chart Generation and Benchmark. Chart data generation is a crucial step for scaling up the model ability (Tian et al., 2023; Liu et al., 2022b; Akhtar et al., 2023). Previous chart-related benchmarks cover only three general types of charts (line, pie, and bar charts) and focus on a few tasks, such as chart-to-table tasks for ChartQA (Masry et al., 2022), PlotQA (Methani et al., 2020), and Chart-to-Text (Obeid and Hoque, 2020), and QA tasks for DVQA (Kafle et al., 2018) and OpenCQA (Kantharaj et al., 2022). Recently, several benchmarks have been proposed, e.g., MMC (Liu et al., 2023a), ChartLlama (Han et al., 2023), ChartBench (Xu et al., 2023), and ChartAssistant (Meng et al., 2024), which share the characteristics of more types, more tasks, and more modalities of chart data and are insightful for the chart community. However, as shown in Table 1, the data and metric diversity of charts used for evaluating multi-modal large models is relatively limited. For example, ChartBench (Xu et al., 2023) merely uses a two-sided judgment (yes or no) to evaluate model performance. The types of charts and data in MMC (Liu et al., 2023a) are also insufficient for verifying the chart ability of the off-the-shelf MLLMs.

3 ChartX: Multi-task Chart Evaluation Set

3.1 Coverage Analysis of the Evaluation Set

We describe the coverage range of ChartX from chart types, chart topics, and chart-related tasks, respectively.

Chart Types. ChartX covers all chart types where chart data can be directly converted into a structural data format, e.g., CSV format, resulting in a total of 18 chart types. For a clear visualization, we categorize different chart types into three groups based on their usage frequency and application fields. (1) General Chart Types: bar chart (with or w/o numerical data), line chart (with or w/o numerical data), and pie chart. These five chart types are commonly employed to represent a wide range of chart data distributions. (2) Fine-grained Chart Types: ring chart, radar chart, box plot, 3D-bar chart, histogram, treemap, rose chart, bubble chart, multi-axes chart, and area chart. These 10 chart types are mostly variations of the general chart types that present complex data distributions more vividly. (3) Domain-specific Chart Types: heatmap, funnel, and candlestick. These three chart types are specially designed to visualize data distributions within domain-specific fields. For example, heatmaps are commonly used to visualize trends of significant differences in a 2D space, funnel charts are widely used in the analysis of market sales, and candlestick charts are primarily utilized for depicting stock trends. The distribution statistics of chart types in ChartX are shown in Fig. 1. Specifically, we generate more images for the general chart types, which are used more frequently and in more varied forms, to expand chart diversity. For the fine-grained chart types, the number of images of each type is balanced to avoid a long-tail distribution in our benchmark.

Chart Topics. ChartX covers as many themes as possible. Specifically, the high-level topics in ChartX can be divided into five perspectives: commerce, industry, society, culture, and lifestyle. These are further subdivided into 22 fine-grained sub-disciplines, which are listed in Fig. 1. The topic distribution of ChartX is also presented in Fig. 1. More statistical results of chart topics are shown in Appendix A.1.

Chart Tasks. Unlike previous chart benchmarks focusing on the category of visual logic reasoning tasks, the ChartX benchmark emphasizes the interpretability for all downstream chart-related tasks. Given that interpretability relies heavily on the ability to perceive chart information, ChartX categorizes perception-related tasks as base tasks, including title perception, chart type recognition, and Structural Extraction (SE). On the other hand, other chart-related tasks are classified as intricate cognition tasks, including chart-related Question Answering (QA), Chart Description, Chart Summarization, and Chart Redrawing. In the context of ChartX, QA refers to answering questions that are formulated solely based on the chart data, requiring reasoning derived directly from the provided chart information. This characteristic distinguishes ChartX from previous chart-related QA datasets like ChartQA (Masry et al., 2022). In ChartQA (Masry et al., 2022), there exists a certain number of QA pairs that cannot be answered solely based on the information presented in the given chart image. Chart Description aims at presenting detailed information and some insights from the distribution of chart data, while Chart Summarization features summarizing the trend-like or high-level characteristics from the given data in a few sentences. Chart Redrawing refers to plotting the given data into a new chart image with the same chart type of original data. The distribution of each task is listed in Fig. 1. For each image, together with labels of base tasks, we collect two QA samples, one description sample, one summarization sample, and one redrawing code sample. Overall, the samples from multi-tasks reach 48K in ChartX.
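The 48K figure can be checked with a short calculation. The sketch below assumes about 6,000 chart images (the evaluation image count reported for ChartX in Table 1) and assumes that the "labels of base tasks" amount to one label for each of the three perception tasks; the dictionary keys are illustrative names, not identifiers from the released dataset.

```python
# Per-image sample counts, following the description in the text.
# Assumption: one label per base (perception) task.
SAMPLES_PER_IMAGE = {
    "structural_extraction": 1,
    "chart_type": 1,
    "chart_title": 1,
    "qa": 2,
    "description": 1,
    "summarization": 1,
    "redrawing": 1,
}

# Assumption: about 6,000 chart images, the count reported in Table 1.
NUM_IMAGES = 6_000


def total_samples(num_images: int, per_image: dict) -> int:
    """Total number of task samples across all images."""
    return num_images * sum(per_image.values())


print(total_samples(NUM_IMAGES, SAMPLES_PER_IMAGE))  # 48000
```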

3.2 Distribution Analysis of the Evaluation Set

We analyze the distribution diversity of the ChartX benchmark by considering both style distribution and content distribution. Fig. 2 visually depicts the diversity comparison among various chart benchmarks using t-SNE.

Style Distribution. In terms of style distribution, the inner-class diversity is considered to augment the style fashion of each chart type. Such diversification is achieved by both package and hyper-parameter diversity performed by human efforts. For each chart type, we design an individual diversification scheme with different plotting package candidates and different hyper-parameter settings. A general alternative plotting scheme includes matplotlib, seaborn, and plotly packages, etc, while some domain-specific packages like mplfinance are also employed to increase the diversity. The hyper-parameter diversity involves the adjustment of all possible hyper-parameter settings in plotting, e.g., figure size, background setting, axis/legend location, line, marker style, tick, filling styles, alpha, annotation, etc.
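To make the idea of a per-type diversification scheme concrete, the following minimal sketch samples a plotting package and a set of hyper-parameters for a given chart type. The candidate lists, the option names, and the function `sample_style` are hypothetical; the actual schemes used for ChartX were designed by hand for each chart type and are not reproduced here.

```python
import random

# Hypothetical candidate spaces; the real per-type schemes are not given in the text.
PACKAGES = {
    "default": ["matplotlib", "seaborn", "plotly"],
    "candlestick": ["mplfinance", "plotly"],  # domain-specific package for stock charts
}
STYLE_SPACE = {
    "figsize": [(6, 4), (8, 6), (10, 6), (12, 8)],
    "background": ["white", "whitesmoke", "lightgray"],
    "legend_loc": ["upper right", "upper left", "lower right", "best"],
    "linestyle": ["-", "--", "-.", ":"],
    "marker": ["o", "s", "^", "D", None],
    "alpha": [0.6, 0.8, 1.0],
    "annotate_values": [True, False],
}


def sample_style(chart_type: str, rng: random.Random) -> dict:
    """Draw one plotting configuration for the given chart type."""
    packages = PACKAGES.get(chart_type, PACKAGES["default"])
    style = {name: rng.choice(options) for name, options in STYLE_SPACE.items()}
    style["package"] = rng.choice(packages)
    style["chart_type"] = chart_type
    return style


rng = random.Random(0)  # fixed seed so the sampled styles are reproducible
for chart_type in ["bar", "candlestick"]:
    print(sample_style(chart_type, rng))
```

Each sampled configuration would then be passed to the plotting code of the chosen package, so that charts of the same type differ in appearance.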

Content Distribution. As for content diversity, the CSV data length distribution and task-wise token distribution for each chart are visualized for different chart benchmarks to compare the content distribution diversity. As shown in Fig. 2, the ChartX benchmark presents a higher diversity in both CSV data length and token distribution than the existing benchmarks.

3.3 Two-stage Chart Data Generation

Utilizing the strong generation capabilities of GPT-4 (Achiam et al., 2023), ChartX is created through an automated online generation process with manual instructions. This involves a data-centric two-stage generation paradigm, encompassing the creation of perception and cognition data.

Data Acquisition: Chart Perception. As mentioned earlier, chart perception data includes chart data, chart title, and chart type. To generate chart titles and types, we initialize selection spaces with GPT-4, which are later refined by human adjustment to align closely with real-world chart contents and ensure practical conversion potential to CSV-format data. For chart data generation, GPT-4 is employed to generate the actual data distribution based on the specified length requirements for the given chart type and chart topic.

Data Acquisition: Chart Cognition. The generation of chart cognition data is based on the generated chart perception data. For each chart perception data sample, we design individual instructions with special task templates (refer to Appendix A.2) to generate different cognition task data. Additionally, some chart type-specific instruction examples, designed for the corresponding chart type and topic, are randomly sampled to guide the data generation. Among these tasks, the generated redrawing code is utilized to further render the chart image, thus constructing the image-label pairs as metadata for the ChartX benchmark, which is further illustrated in Fig. 3 and Appendix A.2.

3.4 Task Evaluation Metrics

SCRM. Given that data in the chart has matrix-like row-column transformation invariance and transpose transformation invariance, the Structuring Chart-oriented Representation Metric (SCRM) (Xia et al., 2023) is employed to evaluate the extracted chart information (i.e., the SE task). The linearized CSV tokens predicted by each model are transformed into triplet format before the SCRM evaluation is performed.
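The role of the triplet format can be illustrated with a small sketch. It is not the SCRM implementation from StructChart: it only shows one way to turn a CSV table into (row entity, column entity, value) triplets so that reordering rows or columns, or transposing the table, leaves the result unchanged. The function name `csv_to_triplets`, the use of an unordered entity pair, and the assumption that all cells are numeric are choices made for this illustration; how SCRM scores the triplets is not reproduced.

```python
import csv
import io


def csv_to_triplets(csv_text: str) -> set:
    """Convert a CSV table into a set of (entity pair, value) triplets.

    Assumes the first row and first column hold entity names and that
    every other cell is numeric.
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, body = rows[0], rows[1:]
    triplets = set()
    for row in body:
        row_entity = row[0].strip()
        for col_entity, value in zip(header[1:], row[1:]):
            # An unordered entity pair makes the triplet insensitive to transposition.
            pair = frozenset((row_entity, col_entity.strip()))
            triplets.add((pair, float(value)))
    return triplets


table = "Year,Sales,Profit\n2020,10,2\n2021,12,3"
reordered = "Year,Profit,Sales\n2021,3,12\n2020,2,10"
transposed = "Metric,2020,2021\nSales,10,12\nProfit,2,3"

print(csv_to_triplets(table) == csv_to_triplets(reordered))   # True
print(csv_to_triplets(table) == csv_to_triplets(transposed))  # True
```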

GPT-acc & GPT-score. The GPT-acc metric is designed for tasks with unambiguous answers like question-answering, where outputs are evaluated against an exact ground truth using GPT-4. To make a rational evaluation, GPT-acc incorporates a 5% margin of error for numerical responses. Conversely, the GPT-score metric addresses open-ended tasks where responses are subjectively graded. Here, GPT-4 rates summarization, description, and code-redrawing outputs on a 0-5 scale based on manually adjudicated scoring criteria. All the prompts about the manual criteria for each task are described in Appendix B.1, which considers completeness, relevance, accuracy, and creativity of responses.
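The 5% margin for numerical answers can be sketched as a simple relative-tolerance check. In the benchmark the judgment is made by GPT-4 through prompting, so the code below is only a deterministic stand-in; treating the margin as relative to the ground-truth value, and the helper names `within_margin` and `accuracy`, are assumptions of this illustration.

```python
def within_margin(predicted: float, truth: float, margin: float = 0.05) -> bool:
    """Return True if the prediction lies within the relative margin of the truth."""
    if truth == 0:
        # A relative margin is undefined at zero, so require an exact match.
        return predicted == 0
    return abs(predicted - truth) <= margin * abs(truth)


def accuracy(pairs) -> float:
    """Fraction of (predicted, truth) pairs accepted under the margin."""
    pairs = list(pairs)
    return sum(within_margin(p, t) for p, t in pairs) / len(pairs)


# Example: 104 is accepted for a ground truth of 100, while 106 is rejected.
print(within_margin(104.0, 100.0), within_margin(106.0, 100.0))
```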

4 ChartVLM: Chart Vision-Language Model

4.1 Overall Model Design

Here, we introduce ChartVLM, an innovative framework illustrated in Fig. 4. This architecture comprises an instruction adapter, a pixel-level encoder, and a pair of text-level cascaded decoders. The instruction adapter serves as the initial chart task routing module, selecting chart tasks to be executed based on the user’s instructions. For base tasks, such as the prediction of chart title, type, and CSV data, only the base decoder engages. Conversely, the auxiliary decoder will be activated for more intricate generative tasks, building upon the CSV predictions obtained by the base decoder.

The motivations of the cascaded mechanism are: 1) to augment the model’s interpretability in cognition tasks through the incorporation of intermediate chart representations, such as CSV data, title, and type, and 2) to improve computational efficiency by allocating the workload across decoders of different sizes, wherein the base decoder is significantly smaller than the auxiliary decoder.

4.2 Instruction Adapter: Instruction Selection

The purposes of designing an instruction adapter are: 1) to meet a broad spectrum of user instructions, and 2) to dynamically select the decoder assigned based on user instructions. The instruction adapter has a simple structure, consisting of only three linear layers, efficiently mapping diverse user instructions to one of seven chart task categories. For training the instruction adapter, we construct a simple dataset using GPT-3.5, containing 7K pairs of user instructions and their task labels. The instruction adapter achieves 100% accuracy on the validation subset we constructed.

4.3 Cascaded Decoders Design

The base decoder is developed to extract chart information (mainly CSV data) from a visual chart. If the instruction adapter classifies a task as a basic perception task, the base decoder directly converts the pixel-level chart into a textual representation (e.g., chart title, type, or CSV data) without the intervention of the auxiliary decoder. Conversely, for complicated tasks that require intricate generative processes, the auxiliary decoder is activated. It takes both the textual outputs of the base decoder and the user instruction as input. Once the chart task is determined by the adapter, the cascaded decoders are dynamically allocated to meet the task requirements.
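The routing behavior described above can be summarized in a short sketch. The task names, the function `run_chart_task`, and the toy stand-ins for the adapter and decoders are hypothetical and only mirror the control flow in the text; they are not the released ChartVLM interface.

```python
from typing import Callable

# Task labels are illustrative names for the seven categories in the text.
BASE_TASKS = {"structural_extraction", "chart_type", "chart_title"}
COGNITION_TASKS = {"qa", "description", "summarization", "redrawing"}


def run_chart_task(
    image: object,
    instruction: str,
    adapter: Callable[[str], str],
    base_decoder: Callable[[object, str], str],
    auxiliary_decoder: Callable[[str, str], str],
) -> str:
    """Route an instruction through the cascaded decoders."""
    task = adapter(instruction)
    if task in BASE_TASKS:
        # Perception tasks are answered by the base decoder alone.
        return base_decoder(image, task)
    if task in COGNITION_TASKS:
        # Cognition tasks first extract CSV data, then reason over it.
        csv_text = base_decoder(image, "structural_extraction")
        return auxiliary_decoder(csv_text, instruction)
    raise ValueError(f"unknown task: {task}")


# Toy stand-ins so the control flow can be exercised without any model.
def toy_adapter(instruction: str) -> str:
    return "qa" if "?" in instruction else "chart_title"


def toy_base_decoder(image: object, task: str) -> str:
    return "Year,Sales\n2020,10" if task == "structural_extraction" else "Annual Sales"


def toy_auxiliary_decoder(csv_text: str, instruction: str) -> str:
    return f"answer derived from {len(csv_text.splitlines()) - 1} data row(s)"


print(run_chart_task(None, "Show the title", toy_adapter, toy_base_decoder, toy_auxiliary_decoder))
print(run_chart_task(None, "What were sales in 2020?", toy_adapter, toy_base_decoder, toy_auxiliary_decoder))
```

Because the intermediate CSV text is an explicit value in this flow, it can be inspected to see what the reasoning step was conditioned on, which is the interpretability argument made for the cascaded design.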

For basic perception tasks, we fine-tune all the network weights of the pre-trained Pix2Struct-base and Pix2Struct-large models (Lee et al., 2023), using image-CSV pair data. The fine-tuned encoder and decoder are regarded as the chart image encoder and base decoder in ChartVLM. After the fine-tuning stage is completed, the encoder-decoder can effectively transform a chart in image format into CSV format (i.e., the chart representation in Fig. 4). For intricate cognition tasks, we utilize LoRA (Hu et al., 2021) to fine-tune the pre-trained Vicuna-7B and Vicuna-13B as auxiliary decoders using text-text pair data including CSV, QA, summarization, and drawing codes.

Ultimately, two model variants are developed: ChartVLM-Base-7.3B (0.3B chart image encoder & base decoder + 7B auxiliary decoder) and ChartVLM-Large-14.3B (1.3B chart image encoder & base decoder + 13B auxiliary decoder). All the data used during the fine-tuning stage come from ChartQA (Masry et al., 2022), PlotQA (Methani et al., 2020), Chart2Text (Kanthara et al., 2022), and SimChart9K (Xia et al., 2023). Besides, ChartVLM is trained using 32 NVIDIA Tesla A100 GPUs.

Model	#Params	Perception Tasks					Cognition Tasks
		Structural Extraction			Chart Type	Chart Title	QA	Chart Desc.	Chart Summ.	Chart Redraw.
		AP@Strict	AP@Slight	AP@High	EM	EM	GPT-acc	GPT-score	GPT-score	GPT-score
Multi-modal Models
LLaVA-1.5 Liu et al. (2023b)	13B	0.04	0.04	0.24	47.05	44.18	17.19	1.48	1.29	0.75
CogVLM Wang et al. (2023)	18B	0.38	0.56	1.01	59.46	94.01	28.30	2.21	1.48	1.38
QWen-VL Bai et al. (2023)	9.6B	4.18	5.86	8.99	69.53	94.62	23.26	1.67	1.45	0.86
SPHINX-V2 Lin et al. (2023)	13B	10.95	23.75	32.07	43.66	92.71	31.16	1.53	1.39	0.96
GPT-4V OpenAI (2023)	-	20.91	26.00	36.09	70.43	95.22	33.04	3.17	3.12	2.63
Chart-related Models
Deplot Liu et al. (2022a)	1.3B	8.89	19.04	24.08	-	89.84	-	-	-	-
Matcha Liu et al. (2022b)	0.3B	0.92	1.10	1.16	5.03	7.90	14.41	-	-	-
ChartLlama Han et al. (2023)	13B	1.63	2.01	3.19	50.52	40.36	13.80	1.04	1.02	0.94
StructChart Xia et al. (2023)	1.3B	0.46	0.94	1.77	-	-	-	-	-	-
ChartAst Meng et al. (2024)	13B	11.35	22.77	30.18	43.23	92.71	30.99	0.33	1.03	0.82
Ours
ChartVLM-B	7.3B	18.49	26.02	32.65	95.67	94.27	36.46	2.05	1.84	1.36
ChartVLM-L	14.3B	23.18	30.68	38.30	96.82	97.05	40.71	2.17	2.05	1.58

Table 2: Zero-shot results on both perception and cognition tasks. Comparison with state-of-the-art multi-modal language methods and chart-oriented large models on our proposed ChartX, where Desc. and Summ. denote the chart description and summarization tasks, respectively. The evaluation metric used for each task is introduced in Sec. 3.4.

5 Experiments

5.1 Evaluation Settings

Considering the diversity in different chart types and downstream tasks, the evaluation process of each task should be meticulously designed. Here, we present each necessary post-processing of model predictions on different chart tasks to achieve a more objective evaluation and comparison.

Post-processing of Structural Extraction. The SCRM mechanism is based on triplet-format matching, and in some chart types certain entities may be invisible or irrelevant to the visual data. Therefore, for the evaluation of the SE task, the perceived data of several chart types should be post-processed to avoid prediction errors induced by meaningless perceptions. Specifically, for the percentage-related chart types, e.g., pie chart, ring chart, treemap, funnel chart, etc., the column label of values is usually invisible. Thus, the prediction of this entity for all task evaluations is manually replaced with ‘value’ or ‘percentage’ to unify the value representation, which we call entity replacement.
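Entity replacement can be sketched as a small normalization step on a predicted CSV. The text describes the replacement as manual; the automated version below is only an illustration, and it assumes a two-column CSV whose first row is a header and whose cells contain no quoted commas. The set `PERCENTAGE_CHART_TYPES` lists only the types named in the text, and `replace_value_entity` is a hypothetical helper name.

```python
# Percentage-related chart types named in the text (the list there ends with "etc.").
PERCENTAGE_CHART_TYPES = {"pie chart", "ring chart", "treemap", "funnel chart"}


def replace_value_entity(csv_text: str, chart_type: str, label: str = "value") -> str:
    """Replace the value-column label of a two-column CSV with a uniform name."""
    if chart_type not in PERCENTAGE_CHART_TYPES:
        return csv_text
    lines = csv_text.strip().splitlines()
    header = lines[0].split(",")
    if len(header) == 2:
        # The value-column label is usually not visible in the image,
        # so any predicted name is mapped to the same token.
        header[1] = label
    return "\n".join([",".join(header)] + lines[1:])


predicted = "Category,Share of market (%)\nA,60\nB,40"
print(replace_value_entity(predicted, "pie chart"))
```

After this step, a prediction is no longer penalized for guessing a different name for a label that does not appear in the chart image.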

Prompt Setting for Evaluation. To make a fair comparison between different model performances on the ChartX benchmark, the prompts for each task are tuned for each baseline model to achieve its best performance on that task. The detailed prompt content for each task is illustrated in Figs. A.5 and A.6 of Appendix B.1.

5.2 Baseline Models and Main Results

We select two kinds of MLLMs to make a comprehensive comparison. One group of MLLMs is made up of multi-modal large models, where models are trained towards general capability for various vision-language tasks. Here we select five of the most advanced MLLMs for evaluation comparison: LLaVA-1.5 (Liu et al., 2023b), CogVLM (Wang et al., 2023), QWen-VL (Bai et al., 2023), SPHINX-V2 (Lin et al., 2023), and GPT-4V (OpenAI, 2023). The other group of MLLMs represents the chart-related large models that are especially fine-tuned on chart-related tasks, including Deplot (Liu et al., 2022a), Matcha (Liu et al., 2022b), StructChart (Xia et al., 2023), ChartLlama (Han et al., 2023), and ChartAssistant (Meng et al., 2024).

Table 2 shows the main comparison results with various models on ChartX benchmark, from which we can observe the comprehensive evaluation results for each model across various chart tasks and the superiority of ChartVLM. Notably, the proposed ChartVLM-B and ChartVLM-L consistently outperform most models in these tasks (except GPT-4V in the cognition tasks), showcasing the effectiveness of ChartVLMs in understanding information from charts.

Results on Each Chart Type. The class-wise performance of ChartVLMs in seven tasks is shown in Fig. 6. For better visualization, we skip six relatively difficult chart types (rose chart, area chart, 3D-bar chart, bubble chart, multi-axes chart, and radar chart) whose performance is zero in all AP metrics for almost all models. The numerical accuracy of these models on the seven tasks can be found in Appendix B.3. The four subfigures show how the compared models and our ChartVLM perform on each chart type, giving a finer-grained view of model differences than the aggregate results.

Comparison with GPT-4V. As shown in Table 2, among all models, GPT-4V (OpenAI, 2023) is the only model that outperforms our ChartVLM in a few cognition tasks of the ChartX benchmark. This result is reasonable as GPT-4V is currently regarded as the most powerful MLLM for its strong ability to understand and describe information from images, e.g., summarization ability and description ability. However, for the perception tasks, since GPT-4V is a relatively general model, the structural extraction ability is inferior to our ChartVLM, which is specially designed for chart-related tasks. Furthermore, ChartVLM’s stronger ability to extract structural data from a chart image partially leads to a higher accuracy on the chart QA task (40.71%).

Method	CSV Source	QA Task (GPT-acc)	Chart Summ. (GPT-score)
ChartVLM-B	Golden Table	50.6	3.01
ChartVLM-B	Predicted	36.5	1.84
GPT-4V OpenAI (2023)	/	33.0	3.12

Table 3: Accumulated prediction errors of structural extraction task towards other downstream reasoning tasks such as chart QA and chart summarization.

Model	Perception Tasks				Cognition Tasks
Model	SE	Title	Type	Avg.	QA	Summ.	Desc.	Redraw	Avg.
Inference Speed (s):
LLaVA-1.5 Liu et al. (2023b)	12.29	0.56	0.41	4.42	0.99	3.48	3.50	11.63	4.90
QWen-VL Bai et al. (2023)	4.96	0.93	1.00	2.30	0.38	2.98	2.81	7.43	3.40
SPHINX-V2 Lin et al. (2023)	5.53	1.51	1.21	2.75	1.38	3.96	4.09	9.73	4.79
Deplot Liu et al. (2022a)	3.82	-	-	3.82	-	-	-	-	-
ChartLlama Han et al. (2023)	8.13	0.53	0.42	3.03	0.48	4.13	4.35	13.09	5.51
ChartAst Meng et al. (2024)	55.24	3.55	1.37	20.05	3.81	6.06	6.04	34.14	12.51
ChartVLM-B (ours)	2.28	0.39	0.25	0.97	3.41	5.05	4.90	5.85	4.80
ChartVLM-L (ours)	2.87	0.42	0.29	1.19	4.38	6.02	5.98	7.14	5.88

Table 4: Inference speed for both perception and cognition tasks tested on a single Tesla A100 with batch size of 1. The maximum number of tokens generated for each task remains consistent.

Model	#Params	Structural Extraction
Model	#Params	AP@Strict	AP@Slight	AP@High
SCRM without Entity Replacement:
LLaVA-1.5 Liu et al. (2023b)	13B	0	0	0
QWen-VL Bai et al. (2023)	9.6B	1.14	2.40	4.70
SPHINX-V2 Lin et al. (2023)	13B	4.70	12.46	18.86
GPT-4V OpenAI (2023)	-	14.35	19.00	27.22
Deplot Liu et al. (2022a)	1.3B	7.03	16.22	20.76
ChartLlama Han et al. (2023)	13B	1.39	1.68	2.37
ChartAst Meng et al. (2024)	13B	5.99	14.93	21.19
ChartVLM-L (ours)	14.3B	22.38	29.22	36.77
SCRM with Entity Replacement:
LLaVA-1.5 Liu et al. (2023b)	13B	0.04	0.04	0.24
QWen-VL Bai et al. (2023)	9.6B	4.18	5.86	8.99
SPHINX-V2 Lin et al. (2023)	13B	10.95	23.75	32.07
GPT-4V OpenAI (2023)	-	20.91	26.00	36.09
Deplot Liu et al. (2022a)	1.3B	8.89	19.04	24.08
ChartLlama Han et al. (2023)	13B	1.63	2.01	3.19
ChartAst Meng et al. (2024)	13B	11.35	22.77	30.18
ChartVLM-L (ours)	14.3B	23.18	30.68	38.30

Table 5: Evaluation results of structural extraction with or without entity replacement.

5.3 Insightful Analyses

In this part, we summarize five important findings as follows:

1) In our cascaded decoder mechanism, increased precision in structural data extraction by the base decoder is positively correlated with improved outcomes in intricate reasoning task performance. In Table 2, it is evident that the ChartVLM-L model outperforms ChartVLM-B in SE task, also exhibiting superior performance in intricate cognition tasks, including QA, summarization, etc. Notably, when SE accuracy attains 100% (corresponding to ‘golden table’ in Table 3), our model’s performance on cognition tasks peaks, indicating a direct correlation of performance between basic perception tasks and complicated cognition tasks.

2) Our ChartVLM exhibits stronger performance in complicated reasoning tasks because the reasoning tasks take the text representations obtained by the perception task as a conditional input. Table 2 shows that, although SPHINX-V2 (32.07% AP@High) performs close to ChartVLM-B (32.65% AP@High) in the SE task, ChartVLM-B still achieves better reasoning performance in downstream tasks such as QA (36.46% vs. 31.16% GPT-acc). This improvement mainly stems from the design of the cascaded decoder mechanism, in which the base decoder enhances complicated reasoning tasks by incorporating the basic perceived results.

3) Our ChartVLM demonstrates faster inference speed while maintaining a parameter count comparable to the existing open-source models. Table 4 illustrates a comparative analysis of inference speeds between ChartVLM and other open-source models. Although the inference performance on cognitive tasks is comparable across all models, a significant enhancement in speed is observed for perception tasks in ChartVLM, which is attributed to the exclusive involvement of the lightweight base decoder.

4) The post-processing implementation of entity replacement significantly alleviates assessment biases. As shown in Table 5, entity replacement has led to enhanced performance across all baseline models in the SE task, verifying its effectiveness in refining evaluation outcomes.

5) Current MLLMs exhibit a significant deficit in their capacity to interpret type-specific charts, yielding inferior results in downstream cognitive tasks when benchmarked against GPT-4V. As evidenced in Tables A.1, A.2, A.3, A.4, and A.5, the existing open-source models demonstrate markedly inferior performance in both the perception and cognition tasks of specialized chart types, such as rose, area, 3D-bar, bubble, multi-axes, and radar charts.

6 Conclusion

In this study, to comprehensively evaluate the chart-related capabilities of MLLMs, we construct ChartX, which is a high-quality, multi-modal, multi-type, multi-topic, and multi-task chart evaluation set. Besides, the ChartVLM framework is developed, which leverages a new cascaded decoder mechanism to boost the interpretability of MLLMs in handling scientific chart data.

References

Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Akhtar et al. [2023] Mubashara Akhtar, Nikesh Subedi, Vivek Gupta, Sahar Tahmasebi, Oana Cocarascu, and Elena Simperl. Chartcheck: An evidence-based fact-checking dataset over real-world chart images. arXiv preprint arXiv:2311.07453, 2023.
Bai et al. [2023] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Bubeck et al. [2023] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
Han et al. [2023] Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. Chartllama: A multimodal llm for chart understanding and generation. arXiv preprint arXiv:2311.16483, 2023.
Hassan et al. [2023] Muhammad Yusuf Hassan, Mayank Singh, et al. Lineex: Data extraction from scientific line charts. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6213–6221, 2023.
He et al. [2023] Xinyi He, Mengyu Zhou, Xinrun Xu, Xiaojun Ma, Rui Ding, Lun Du, Yan Gao, Ran Jia, Xu Chen, Shi Han, et al. Text2analysis: A benchmark of table question answering with advanced data analysis and unclear queries. arXiv preprint arXiv:2312.13671, 2023.
Hu et al. [2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Huang et al. [2023a] Kung-Hsiang Huang, Mingyang Zhou, Hou Pong Chan, Yi R Fung, Zhenhailong Wang, Lingyu Zhang, Shih-Fu Chang, and Heng Ji. Do lvlms understand charts? analyzing and correcting factual errors in chart captioning. arXiv preprint arXiv:2312.10160, 2023a.
Huang et al. [2023b] Siyuan Huang, Zhengkai Jiang, Hao Dong, Yu Qiao, Peng Gao, and Hongsheng Li. Instruct2act: Mapping multi-modality instructions to robotic actions with large language model. arXiv preprint arXiv:2305.11176, 2023b.
Jiang et al. [2022] Albert Q Jiang, Sean Welleck, Jin Peng Zhou, Wenda Li, Jiacheng Liu, Mateja Jamnik, Timothée Lacroix, Yuhuai Wu, and Guillaume Lample. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint arXiv:2210.12283, 2022.
Kafle et al. [2018] Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656, 2018.
Kanthara et al. [2022] Shankar Kanthara, Rixie Tiffany Ko Leong, Xiang Lin, Ahmed Masry, Megh Thakkar, Enamul Hoque, and Shafiq R. Joty. Chart-to-text: A large-scale benchmark for chart summarization. In Annual Meeting of the Association for Computational Linguistics, 2022.
Kantharaj et al. [2022] Shankar Kantharaj, Xuan Long Do, Rixie Tiffany Ko Leong, Jia Qing Tan, Enamul Hoque, and Shafiq Joty. Opencqa: Open-ended question answering with charts. arXiv preprint arXiv:2210.06628, 2022.
Lee et al. [2023] Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.
Lin et al. [2023] Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023.
Liu et al. [2022a] Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. Deplot: One-shot visual language reasoning by plot-to-table translation. arXiv preprint arXiv:2212.10505, 2022a.
Liu et al. [2022b] Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Martin Eisenschlos. Matcha: Enhancing visual language pretraining with math reasoning and chart derendering. arXiv preprint arXiv:2212.09662, 2022b.
Liu et al. [2023a] Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, and Dong Yu. Mmc: Advancing multimodal chart understanding with large-scale instruction tuning. arXiv preprint arXiv:2311.10774, 2023a.
Liu et al. [2023b] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023b.
Luo et al. [2021] Junyu Luo, Zekun Li, Jinpeng Wang, and Chin-Yew Lin. Chartocr: Data extraction from charts images via a deep hybrid framework. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1917–1925, 2021.
Masry et al. [2022] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
Meng et al. [2024] Fanqing Meng, Wenqi Shao, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, and Ping Luo. Chartassisstant: A universal chart multimodal language model via chart-to-table pre-training and multitask instruction tuning. arXiv preprint arXiv:2401.02384, 2024.
Methani et al. [2020] Nitesh Methani, Pritha Ganguly, Mitesh M Khapra, and Pratyush Kumar. Plotqa: Reasoning over scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1527–1536, 2020.
Obeid and Hoque [2020] Jason Obeid and Enamul Hoque. Chart-to-text: Generating natural language descriptions for charts by adapting the transformer model. arXiv preprint arXiv:2010.09142, 2020.
OpenAI [2023] OpenAI. Gpt-4v(ision) system card. https://openai.com/contributions/gpt-4v, 2023.
Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
Rane et al. [2021] Chinmayee Rane, Seshasayee Mahadevan Subramanya, Devi Sandeep Endluri, Jian Wu, and C Lee Giles. Chartreader: Automatic parsing of bar-plots. In 2021 IEEE 22nd International Conference on Information Reuse and Integration for Data Science (IRI), pages 318–325. IEEE, 2021.
Saxton et al. [2019] David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557, 2019.
Tian et al. [2023] Yuan Tian, Weiwei Cui, Dazhen Deng, Xinjing Yi, Yurun Yang, Haidong Zhang, and Yingcai Wu. Chartgpt: Leveraging llms to generate charts from abstract natural language. arXiv preprint arXiv:2311.01920, 2023.
Trinh et al. [2024] Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482, 2024.
Wang et al. [2023] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
Xia et al. [2023] Renqiu Xia, Bo Zhang, Haoyang Peng, Ning Liao, Peng Ye, Botian Shi, Junchi Yan, and Yu Qiao. Structchart: Perception, structuring, reasoning for visual chart understanding. arXiv preprint arXiv:2309.11268, 2023.
Xu et al. [2023] Zhengzhuo Xu, Sinan Du, Yiyan Qi, Chengjin Xu, Chun Yuan, and Jian Guo. Chartbench: A benchmark for complex visual reasoning in charts. arXiv preprint arXiv:2312.15915, 2023.
Yang et al. [2023a] Kaiyu Yang, Aidan M Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. Leandojo: Theorem proving with retrieval-augmented language models. arXiv preprint arXiv:2306.15626, 2023a.
Yang et al. [2023b] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421, 2023b.
Zha et al. [2023] Liangyu Zha, Junlin Zhou, Liyao Li, Rui Wang, Qingyi Huang, Saisai Yang, Jing Yuan, Changbao Su, Xiang Li, Aofeng Su, et al. Tablegpt: Towards unifying tables, nature language and commands into one gpt. arXiv preprint arXiv:2307.08674, 2023.

Appendix A Details of ChartX Evaluation Set

We present the fine-grained characteristics of the ChartX evaluation set by detailing its data distribution and generation pipeline.

A.1 Introduction of Chart Topics

The categories of chart topics have been concisely displayed in Fig. 1 of the main text. Here a more detailed distribution is introduced for a clear visualization. As shown in Fig. A.1, there are a total of 22 chart topics, generally covering the fields of commerce, industry, lifestyle, society, and culture. Each topic is evenly distributed in ChartX, demonstrating its comprehensiveness.

A.2 Overview of the Data Generation Pipeline

We first describe the overall data generation pipeline, including perception data and cognition data. Then, the prompt templates for different data generation are provided.

Data Generation Pipeline. As shown in Fig. 3, during the first stage, we prepare a chart type pool and a chart topic pool whose candidates are pre-selected using GPT-4. Only chart types that have an explicit connection or mapping with CSV-format data are kept as candidates in the chart type pool. After obtaining these two pools, we iteratively and randomly sample candidates from both pools and fill them into the pre-designed prompt template to generate CSV data together with an associated chart title. Once a pair of CSV data and the corresponding chart title is generated, both are filled into various task-specific and type-specific prompt templates to generate cognition task samples.
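
The two-stage sampling described above can be sketched as follows. This is only an illustration: the pool contents, the template wording, and the function names are placeholders we assume for the example, and the actual prompts are those shown in Fig. A.2 and Fig. A.3.

```python
import random

# Hypothetical pools (assumption); the real pools are pre-selected using GPT-4.
CHART_TYPES = ["bar chart", "rose chart", "candlestick"]
CHART_TOPICS = ["commerce", "industry", "culture"]

# Hypothetical first-stage template (assumption); see Fig. A.2 for the real one.
CSV_PROMPT = (
    "Generate CSV data and a chart title suitable for a {chart_type} "
    "on the topic of {topic}."
)


def sample_csv_prompt(rng: random.Random) -> dict:
    """Stage 1: sample one chart type and one topic, then fill the template."""
    chart_type = rng.choice(CHART_TYPES)
    topic = rng.choice(CHART_TOPICS)
    prompt = CSV_PROMPT.format(chart_type=chart_type, topic=topic)
    return {"chart_type": chart_type, "topic": topic, "prompt": prompt}


def build_task_prompt(task: str, chart_type: str, title: str, csv_data: str) -> str:
    """Stage 2: fill the generated title and CSV into a task-specific prompt."""
    return (
        f"Task: {task}\n"
        f"Chart type: {chart_type}\n"
        f"Title: {title}\n"
        f"CSV data:\n{csv_data}"
    )
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is convenient when regenerating a fixed evaluation set.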

Prompt Design for Overall Data Generation. We provide a general prompt template for overall data generation, including perception data and cognition data in Fig. A.2. For perception data generation, we impose constraints on the magnitude and length of the data to make most data visible and recognizable in the chart image. For cognition data generation, we impose task-specific guidance to generate the corresponding ground-truth labels for each task. The diversity in different tasks is achieved through designing type-specific prompts. Here we provide two examples to illustrate type-specific prompts (marked red in Fig. A.2) in overall data generation. Fig. A.3 shows the detailed type-specific prompts to generate code data and QA samples of 3D-bar charts, rose charts, box plots and candlesticks.

A.3 Examples of ChartX

Fig. A.4 provides more examples of metadata in the generated dataset, including the chart type, title, topic, CSV data, QA pairs, summarization, description, and the redrawing code. It can be observed that:

(1) The generated data are closely related to the assigned chart types and topics.

(2) The generated QA pairs are closely related to the characteristics of the given chart types and topics, increasing the overall diversity.

(3) The generated summary and description concisely and accurately describe the content of the assigned chart data.

Appendix B Experimental Details

We provide detailed experimental information in this section, including the evaluation criteria of all tasks, the quantitative results for each chart type, and more visualizations of prediction results.

B.1 Evaluation Settings

Prompt Design for GPT-acc and GPT-score. We adopt GPT-acc as the evaluation metric for the QA task, and GPT-score for the description, summarization, and redrawing tasks. The complete prompts and manual criteria are summarized in Fig. A.5 and A.6.
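
As a rough sketch of how per-sample judgements could be aggregated into the reported numbers, the snippet below assumes that GPT-acc is the percentage of QA answers judged correct and that GPT-score is the mean of per-sample integer scores from 0 to 5. Both function names are hypothetical, and the judging prompts themselves (Fig. A.5 and A.6) are not reproduced here.

```python
from statistics import mean


def gpt_acc(judgements: list[bool]) -> float:
    """Percentage of QA answers that the judge marked as correct (assumed binary)."""
    return 100.0 * sum(judgements) / len(judgements)


def gpt_score(scores: list[int]) -> float:
    """Mean of per-sample integer scores, each expected to lie in 0..5."""
    if any(not (0 <= s <= 5) for s in scores):
        raise ValueError("each GPT-score must be an integer between 0 and 5")
    return mean(scores)
```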

Employed Threshold of SCRM. According to the definition of SCRM metric proposed in StructChart Xia et al. (2023), three different levels of tolerance ( $tol:=\{strict,slight,high\}$ ) are set for fine-grained evaluation of SE task. Considering the different perception difficulties of different types of charts, we divide all 18 types of charts into two difficulty levels: normal and difficult, and set different thresholds for tolerance respectively.

For normal charts, including bar chart, line chart, pie chart, bar chart with number, line chart with number, ring chart, heatmap, box plot, candlestick, funnel chart, histogram, and treemap:

$$\begin{aligned}
strict &:= \left\{J_{thr}|_{tol}=0 \wedge e_{thr}|_{tol}=0\right\}, \\
slight &:= \left\{J_{thr}|_{tol}=2 \wedge e_{thr}|_{tol}=0.05\right\}, \\
high &:= \left\{J_{thr}|_{tol}=5 \wedge e_{thr}|_{tol}=0.1\right\}.
\end{aligned} \tag{A.1}$$

For difficult charts, including rose chart, area chart, 3D-Bar chart, bubble chart, multi-axes chart, and radar chart:

$$\begin{aligned}
strict &:= \left\{J_{thr}|_{tol}=0 \wedge e_{thr}|_{tol}=0.1\right\}, \\
slight &:= \left\{J_{thr}|_{tol}=2 \wedge e_{thr}|_{tol}=0.3\right\}, \\
high &:= \left\{J_{thr}|_{tol}=5 \wedge e_{thr}|_{tol}=0.5\right\},
\end{aligned} \tag{A.2}$$

where $J_{thr}|_{tol}$ denotes the edit distance threshold between the predicted string and the GT string, and $e_{thr}|_{tol}$ denotes the relative error threshold between the predicted numeric value and the GT value.
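
The thresholds in Eq. (A.1) and (A.2) can be encoded as a small lookup, shown below as a minimal sketch. The helper names are hypothetical, and the "less than or equal to" comparisons as well as the handling of a zero GT value are our assumptions. The way SCRM turns such matches into a precision score is defined in StructChart and is not reproduced here.

```python
# (J_thr, e_thr) per tolerance level, copied from Eq. (A.1) and Eq. (A.2).
NORMAL = {"strict": (0, 0.0), "slight": (2, 0.05), "high": (5, 0.1)}
DIFFICULT = {"strict": (0, 0.1), "slight": (2, 0.3), "high": (5, 0.5)}
DIFFICULT_TYPES = {"rose", "area", "3D-bar", "bubble", "multi-axes", "radar"}


def thresholds(chart_type: str, tol: str) -> tuple[int, float]:
    """Return (edit-distance threshold, relative-error threshold)."""
    table = DIFFICULT if chart_type in DIFFICULT_TYPES else NORMAL
    return table[tol]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def entity_matches(pred: str, gt: str, j_thr: int) -> bool:
    """Assumed rule: strings match if their edit distance is within J_thr."""
    return edit_distance(pred, gt) <= j_thr


def value_matches(pred: float, gt: float, e_thr: float) -> bool:
    """Assumed rule: values match if the relative error is within e_thr."""
    if gt == 0:
        return pred == 0  # relative error is undefined for gt == 0 (assumption)
    return abs(pred - gt) <= e_thr * abs(gt)
```

For instance, under slight tolerance a radar chart accepts a predicted value within 30% of the GT value, whereas a bar chart accepts only a 5% deviation.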

B.2 Maximum Number of Generated Tokens

To fairly compare the performance of models on various tasks, we unify the maximum number of generated tokens (max_token) across models on the same task. The max_token values are as follows: 1) 1280 for SE, 2) 100 for title, 3) 20 for type, 4) 100 for QA, 5) 512 for description and summarization, and 6) 1024 for redrawing code. The same settings are used for inference speed testing.
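
For reproduction, these limits can be kept in a single table that every evaluated model shares. The dictionary keys and the helper name below are our own labels (assumptions), while the values are the ones listed above.

```python
# Per-task generation budget; values from Sec. B.2, key names are illustrative.
MAX_TOKEN = {
    "SE": 1280,
    "title": 100,
    "type": 20,
    "QA": 100,
    "description": 512,
    "summarization": 512,
    "redrawing": 1024,
}


def generation_budget(task: str) -> int:
    """Return the shared max_token limit for a task, failing loudly on typos."""
    try:
        return MAX_TOKEN[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None
```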

B.3 Quantitative Results for Each Chart Type

We have presented part of the class-wise performance in Fig. 6 of the main text. Here, more comprehensive results of various models on all tasks are listed in Tables A.1, A.2, A.3, A.4, and A.5. Specifically, we compare recent multi-modal language models and chart-related models with ChartVLMs on the QA, SE, description, summarization, and redrawing tasks. The results show that ChartVLMs outperform the existing models on most chart types and tasks. It should be noted that, except for GPT-4V, whose summarization and description scores are higher than the average score, the downstream reasoning tasks remain quite tough for all models. This sheds light on a common challenge in learning chart-related language models: how to fully learn multiple tasks in a single model without sacrificing the generalization ability to a new chart domain.

			General Chart Types					Fine-grained Chart Types
Models	Tasks	Tol.	bar	bar_num	line	line_num	pie	ring	box	hist	treemap	rose	area	3D-bar	bubble	multi	radar	heatmap	funnel	candle	Avg.
SPHINX-V2	SE	strict	2.50	20.10	7.20	9.90	35.10	9.00	0.00	2.00	17.60	15.40	0.00	0.00	0.00	0.00	0.00	9.81	26.00	0.00	10.95
		slight	17.40	34.20	36.40	27.80	65.40	22.60	0.00	20.20	17.60	51.00	0.00	2.60	0.00	0.00	8.00	14.81	28.20	0.00	23.75
		high	39.40	46.00	47.90	39.70	76.20	27.80	0.80	25.60	18.40	71.40	1.40	4.60	0.00	0.60	16.00	18.46	34.20	0.00	32.07
Deplot		strict	2.20	33.70	16.00	22.30	0.00	14.20	0.00	20.20	2.40	0.00	0.00	0.00	0.00	0.00	0.00	0.00	19.60	0.00	8.89
		slight	21.70	41.30	51.20	52.90	0.00	14.20	0.00	66.00	2.40	0.20	0.20	0.00	0.00	0.20	0.40	0.00	20.80	0.00	19.04
		high	42.10	48.70	60.10	61.20	0.00	14.60	0.00	82.20	3.00	4.20	0.60	0.00	0.00	1.40	1.00	0.00	23.60	0.00	24.08
ChartAst		strict	7.80	22.10	8.20	11.50	44.30	4.40	0.00	8.40	13.60	2.40	0.00	0.00	1.40	0.00	0.00	13.65	9.20	0.00	11.35
		slight	21.70	33.80	40.10	35.20	53.00	14.80	0.00	24.80	14.60	25.20	0.00	3.80	1.80	0.00	26.00	20.00	11.20	0.00	22.77
		high	38.40	44.60	48.00	41.70	63.70	14.80	0.00	30.80	15.80	40.60	0.00	7.00	4.20	0.00	38.00	24.04	15.00	0.00	30.18
GPT-4V		strict	0.00	25.00	0.00	15.50	65.50	60.00	0.00	20.00	33.00	0.00	0.00	0.00	0.00	0.00	0.00	76.00	80.00	0.00	20.91
		slight	0.00	46.00	2.50	21.00	65.50	60.00	0.00	23.00	33.00	10.00	0.00	12.00	3.00	0.00	20.00	87.00	80.00	0.00	26.00
		high	0.00	53.00	24.50	41.00	67.00	80.00	9.00	21.00	49.00	18.00	0.00	20.00	22.00	0.00	62.00	88.00	80.00	0.00	36.09
ChartVLM-B		strict	10.60	20.40	26.30	29.10	40.70	15.80	0.00	38.00	12.80	0.00	0.00	0.00	0.00	0.00	0.00	28.08	76.00	0.00	18.49
		slight	17.70	27.50	42.90	45.00	41.50	15.80	1.60	67.00	12.80	5.80	0.00	2.20	0.80	0.00	12.20	33.46	77.00	20.40	26.02
		high	21.20	33.00	51.90	54.80	43.20	20.60	13.20	75.00	15.20	22.40	4.20	9.60	1.60	1.20	18.40	35.77	77.80	47.60	32.65
ChartVLM-L		strict	16.30	34.00	37.60	34.70	49.90	24.80	0.00	45.80	21.20	0.00	0.00	0.00	0.00	0.00	0.40	23.65	72.20	0.00	23.18
		slight	19.50	37.50	55.80	48.10	49.90	24.80	0.40	77.20	21.20	3.00	1.80	2.40	2.00	0.00	19.60	23.65	72.20	36.00	30.68
		high	27.90	42.00	60.40	55.10	51.80	28.40	19.80	87.20	21.80	27.60	9.80	8.00	4.00	0.60	32.20	25.19	73.20	69.20	38.30

Table A.1: Class-wise mean precision for the Structural Extraction (SE) task evaluated using SCRM Xia et al. (2023). For some hard fine-grained classes, such as bubble charts and radar charts, we use relatively high tolerance thresholds when evaluating the SCRM results, as introduced in Sec. B.1. Note that the three rows listed for each model correspond to strict, slight, and high tolerance, from top to bottom.

		General Chart Types					Fine-grained Chart Types
Models	Tasks	bar	bar_num	line	line_num	pie	ring	box	hist	treemap	rose	area	3D-bar	bubble	multi	radar	heatmap	funnel	candle	Avg.
QWen-VL	QA	33.00	31.00	22.00	22.00	45.00	24.00	16.00	24.00	20.00	10.00	10.00	16.00	16.00	8.00	16.00	26.92	28.00	14.00	23.26
SPHINX-V2		35.00	51.00	31.00	25.00	64.00	30.00	16.00	30.00	30.00	22.00	14.00	18.00	16.00	12.00	20.00	40.38	42.00	14.00	31.16
ChartLlama		14.00	13.00	9.00	10.00	39.00	14.00	20.00	12.00	18.00	10.00	8.00	14.00	12.00	4.00	16.00	5.77	10.00	4.00	13.80
ChartAst		36.00	51.00	29.00	22.00	67.00	36.00	14.00	38.00	28.00	24.00	14.00	16.00	14.00	4.00	22.00	38.46	40.00	14.00	30.99
LLaVA-1.5		24.00	26.00	10.00	16.00	29.00	6.00	30.00	10.00	22.00	20.00	12.00	20.00	22.00	8.00	18.00	9.62	8.00	0.00	17.18
Matcha		10.00	18.00	13.00	12.00	35.00	6.00	4.00	10.00	26.00	10.00	6.00	8.00	4.00	4.00	8.00	19.23	44.00	6.00	14.41
GPT-4V		20.00	40.00	25.00	35.00	65.00	50.00	30.00	30.00	70.00	20.00	10.00	10.00	30.00	0.00	30.00	50.00	60.00	0.00	33.04
ChartVLM-B		34.00	38.00	32.00	37.00	62.00	44.00	54.00	40.00	38.00	16.00	14.00	16.00	26.00	18.00	26.00	40.38	74.00	26.00	36.46
ChartVLM-L		41.00	46.00	33.00	39.00	68.00	52.00	56.00	44.00	44.00	26.00	26.00	28.00	24.00	10.00	24.00	34.62	80.00	38.00	40.71

Table A.2: Class-wise accuracy for Question Answering (QA) task evaluated using GPT-acc.

		General Chart Types					Fine-grained Chart Types
Models	Tasks	bar	bar_num	line	line_num	pie	ring	box	hist	treemap	rose	area	3D-bar	bubble	multi	radar	heatmap	funnel	candle	Avg.
QWen-VL	Desc	1.58	1.30	1.80	1.75	2.40	1.60	1.50	1.70	1.90	1.50	1.70	1.50	1.60	1.80	1.30	1.60	1.90	1.30	1.67
SPHINX-V2		1.36	1.60	1.50	1.75	2.35	1.60	1.00	1.10	1.70	1.80	1.30	1.20	1.40	1.60	1.30	1.20	1.80	0.70	1.53
ChartLlama		1.05	1.00	1.05	1.00	1.20	1.10	0.70	1.10	1.30	1.20	0.90	0.90	0.90	1.20	1.10	1.50	0.90	0.60	1.04
ChartAst		0.00	0.40	0.25	0.15	2.00	0.90	0.40	0.00	0.00	0.60	0.00	0.00	0.00	0.00	0.00	0.00	0.20	0.00	0.34
LLaVA-1.5		1.79	1.30	1.60	1.70	1.45	1.10	1.20	1.20	1.90	2.00	1.20	1.80	1.30	1.60	1.30	1.60	1.20	1.10	1.48
GPT-4V		2.84	3.00	2.95	2.90	3.55	3.20	3.10	3.40	3.60	3.60	3.40	2.90	3.50	2.90	3.00	3.70	3.70	2.40	3.17
ChartVLM-B		1.95	2.70	2.05	1.90	3.90	2.40	2.00	2.40	2.60	1.60	1.70	1.70	1.30	1.50	2.00	2.60	2.40	2.40	2.05
ChartVLM-L		1.47	2.75	2.45	1.85	4.00	2.50	2.60	3.00	2.50	1.40	0.90	1.50	1.00	1.40	1.00	2.00	3.30	1.70	2.17

Table A.3: Class-wise accuracy for Chart Description (Desc) evaluated using GPT-score. The score of each individual description is an integer from 0 to 5.

		General Chart Types					Fine-grained Chart Types
Models	Tasks	bar	bar_num	line	line_num	pie	ring	box	hist	treemap	rose	area	3D-bar	bubble	multi	radar	heatmap	funnel	candle	Avg.
QWen-VL	Summ	1.58	1.10	1.55	1.65	1.95	1.50	1.40	1.50	1.60	1.50	1.50	1.20	1.20	1.30	1.40	1.30	1.50	1.00	1.45
SPHINX-V2		1.16	1.60	1.25	1.10	2.50	1.40	1.40	1.50	1.80	1.40	1.10	1.10	1.30	1.10	1.00	1.10	1.60	1.10	1.39
ChartLlama		1.05	1.00	0.95	1.25	1.00	1.00	1.00	1.30	1.10	1.20	0.80	1.00	0.70	0.60	1.30	1.00	0.70	1.20	1.02
ChartAst		1.00	1.05	0.85	1.00	2.40	1.70	2.70	1.00	0.30	1.30	0.50	0.30	1.60	0.20	0.70	0.20	0.30	0.40	1.03
LLaVA-1.5		1.42	1.05	2.00	1.65	1.30	1.10	1.10	1.50	1.30	1.10	1.00	1.40	1.30	0.90	1.20	1.00	0.80	1.20	1.29
GPT-4V		3.10	2.80	3.20	2.75	3.30	3.10	2.70	4.00	3.50	3.60	2.40	2.70	3.00	3.10	3.10	4.10	3.60	2.70	3.12
ChartVLM-B		1.26	2.20	1.95	1.20	3.30	2.30	2.70	2.40	2.40	1.50	1.00	1.30	1.00	1.40	1.00	1.80	2.30	1.50	1.84
ChartVLM-L		1.37	2.50	2.35	1.90	3.80	3.00	2.40	2.90	2.10	1.30	0.90	1.00	1.00	1.40	1.00	1.70	3.20	1.30	2.05

Table A.4: Class-wise accuracy for Chart Summarization (Summ) evaluated using GPT-score. The score of each individual summarization is an integer from 0 to 5.

		General Chart Types					Fine-grained Chart Types
Models	Tasks	bar	bar_num	line	line_num	pie	ring	box	hist	treemap	rose	area	3D-bar	bubble	multi	radar	heatmap	funnel	candle	Avg.
QWen-VL	Redraw	0.89	0.60	0.80	1.30	1.25	1.10	0.80	0.80	0.80	1.10	0.60	1.10	0.90	0.60	0.50	0.50	0.70	0.70	0.86
SPHINX-V2		1.00	1.75	1.60	1.65	1.80	0.50	0.40	1.60	0.60	1.10	0.20	0.50	0.40	0.20	0.00	0.50	0.30	0.20	0.96
ChartLlama		1.16	1.05	0.90	1.15	1.80	0.70	0.80	1.00	1.00	0.70	0.70	1.10	0.70	0.40	0.50	1.00	0.70	0.30	0.94
ChartAst		0.95	1.35	0.00	0.60	0.30	0.00	1.50	2.40	0.60	1.70	1.80	0.60	0.60	1.20	0.00	0.00	2.10	0.00	0.82
LLaVA-1.5		0.95	0.75	0.80	0.95	0.90	0.60	0.60	0.80	0.70	1.00	0.60	0.80	0.90	0.40	0.60	0.70	0.50	0.50	0.75
GPT-4V		2.05	2.70	2.05	2.75	3.55	3.40	2.00	2.70	2.70	2.80	2.20	2.70	2.40	2.80	2.30	3.20	3.50	1.60	2.63
ChartVLM-B		1.63	1.50	1.70	1.65	1.90	1.10	1.90	1.10	0.40	1.20	0.80	1.00	1.70	1.30	0.80	1.20	1.00	1.10	1.36
ChartVLM-L		1.53	1.85	1.85	1.70	2.75	1.90	1.40	1.20	0.90	1.00	1.10	1.60	1.30	1.50	0.80	1.90	1.20	1.10	1.58

Table A.5: Class-wise accuracy for Chart Re-drawing (Redraw) evaluated using GPT-score. The score of each individual redrawing code is an integer from 0 to 5.

B.4 Visualization Results of Perception Tasks

We provide four visualized perception results for different types of charts in Fig. A.7, including a funnel chart, a histogram, a radar chart, and a line chart. The results demonstrate that our ChartVLM performs well on the chart title and chart type prediction tasks. Although the SE result for the radar chart contains minor errors, ChartVLM still shows strong SE performance on the funnel chart, histogram, and line chart.