MC-MKE: A Fine-Grained Multimodal Knowledge Editing Benchmark Emphasizing Modality Consistency

Junzhe Zhang, Huixuan Zhang, Xunjian Yin, Baizhou Huang, Xu Zhang, Xinyu Hu, Xiaojun Wan
Wangxuan Institute of Computer Technology, Peking University
{junzhezhang, zhanghuixuan}@stu.pku.edu.cn
{xjyin, hbz19, zhangxu, huxinyu, wanxiaojun}@pku.edu.cn
Abstract

Multimodal large language models (MLLMs) are prone to non-factual or outdated knowledge issues, which can manifest as misreading and misrecognition errors due to the complexity of multimodal knowledge. Previous benchmarks have not systematically analyzed the performance of editing methods in correcting these two error types. To better represent and correct these errors, we decompose multimodal knowledge into its visual and textual components. Different error types correspond to different editing formats, which edit distinct parts of the multimodal knowledge. We present MC-MKE, a fine-grained Multimodal Knowledge Editing benchmark emphasizing Modality Consistency. Our benchmark facilitates independent correction of misreading and misrecognition errors by editing the corresponding knowledge component. We evaluate three multimodal knowledge editing methods on MC-MKE, revealing their limitations, particularly in terms of modality consistency. Our work highlights the challenges posed by multimodal knowledge editing and motivates further research in developing effective techniques for this task.

1 Introduction

With the development of multimodal large language models (MLLMs), their application has become widespread across various fields. However, the knowledge stored within these models can be inaccurate or outdated. This issue manifests as two types of errors: misreading and misrecognition Cheng et al. (2024). As shown in Figure 1, misrecognition occurs when a model incorrectly identifies an image, such as mistaking Mac Allister for Messi. Misreading, on the other hand, refers to incorrect textual knowledge, such as misremembering Messi’s football team. Recent research has introduced knowledge editing in multimodal contexts to address these issues.

Figure 1: An illustration of multimodal knowledge and the two types of multimodal errors: misrecognizing a picture of Mac Allister as Messi, and misreading Messi’s football team.

Following the conventional definition of knowledge editing in LLMs, a few studies have proposed benchmarks for knowledge editing in MLLMs Cheng et al. (2024); Huang et al. (2024); Li et al. (2024). However, these benchmarks oversimplify the evaluation of multimodal knowledge editing and do not distinguish between misreading and misrecognition errors Cheng et al. (2024); Huang et al. (2024). Evaluating the two error types together leads to inaccurate assessments of knowledge editing methods in real-world scenarios: a method may appear to inject the target multimodal knowledge successfully while actually performing an incorrect edit. Take the misrecognition error in Figure 1 as an example, where an MLLM misrecognizes the image of Messi as Mac Allister, leading to the erroneous knowledge that "the person in the image plays for Liverpool". If a knowledge editing method wrongly injects the knowledge triplet (Mac Allister, Play for, Inter Miami), it may still achieve high performance on prior benchmarks, since the multimodal knowledge (Image of Messi, Play for, Inter Miami) is indeed corrected.

To better handle and evaluate these two types of knowledge editing scenarios, we define, for the first time, multimodal knowledge in a decomposed format consisting of visual knowledge and textual knowledge. In this way, misreading and misrecognition errors can be distinguished and independently corrected by editing different knowledge components. The decomposition of multimodal knowledge also brings up another requirement: Consistency. We believe that a knowledge editing method should always ensure the consistency of knowledge across different modalities. This property is the essential difference between multimodal knowledge editing and uni-modal knowledge editing.

Following the decomposed definition of multimodal knowledge, we propose a multimodal knowledge editing benchmark emphasizing modality consistency (MC-MKE). MC-MKE consists of three subsets, corresponding to the three different formats of multimodal knowledge. Our benchmark aligns more closely with multimodal knowledge editing in real-life scenarios and can more systematically and comprehensively evaluate the performance of a multimodal knowledge editing method in a fine-grained manner.

We evaluate three well-known multimodal knowledge editing methods, namely fine-tuning, MEND (Mitchell et al., 2022), and IKE (Zheng et al., 2023), on the three subsets of different editing formats. We find that the performance of these methods on MC-MKE is far from satisfactory. None of them achieves strong performance on all three editing formats, especially on the consistency metric. This demonstrates that multimodal knowledge editing remains challenging and requires further exploration.

In summary, our contributions are as follows (our code and data will be released to the community to facilitate future research):

  • We are the first to propose a decomposed definition of multimodal knowledge according to different multimodal knowledge error types.

  • We present MC-MKE, a new multimodal knowledge editing benchmark that can evaluate Reliability, Locality, Generality, and Consistency of multimodal editing methods under different editing formats.

  • We conduct experiments with various knowledge editing methods on MC-MKE. The results reveal the limitations of existing methods especially for modality consistency. Different from previous research, we find that editing the corresponding component sometimes yields better performance.

2 Related Works

2.1 Knowledge Editing

Knowledge editing aims to provide efficient and lightweight solutions for updating knowledge in models (Zhu et al., 2020). Several benchmarks have been developed for this task, including COUNTERFACT (Meng et al., 2022) for counterfactual knowledge, MQuAKE (Zhong et al., 2023) for multi-hop knowledge, AToKE (Yin et al., 2024) for retaining old knowledge, and WIKIUPDATE (Wu et al., 2024) for unstructured knowledge.

These benchmarks primarily address language model editing, leaving multimodal model editing underexplored. To address this gap, Cheng et al. (2024) introduced the MMEdit benchmark based on Visual QA (Antol et al., 2015) and Image Captioning (Herdade et al., 2019). Wu et al. (2024) developed KEBench, which uses multimodal Knowledge Graphs (Liu et al., 2019) to evaluate vision knowledge editing. Additionally, MIKE (Li et al., 2024) focuses on fine-grained multimodal entity knowledge editing. However, as shown in Table 1, all previous work has neglected the organization of multimodal knowledge and lacked a more careful definition of multimodal knowledge editing, which is what our work focuses on.

2.2 Multimodal Models

Multimodal large language models have developed rapidly in recent years. BLIP-2 (Li et al., 2023b) applies the Q-Former architecture to transform image inputs into input tokens for the LLM. LLaVA (Liu et al., 2024b) and LLaVA-v1.5 (Liu et al., 2024a) use linear layers or perceptrons to map vision features into the input space of the LLM. Through instruction tuning on BLIP-2, InstructBLIP (Dai et al., 2024) gains the ability to follow instructions across different tasks. MiniGPT-4 (Zhu et al., 2023) and MiniGPT-v2 (Chen et al., 2023) are also powerful MLLMs that exhibit strong performance across various vision-language tasks. Other MLLMs include mPLUG-Owl (Ye et al., 2023), Otter (Li et al., 2023a), and Qwen-VL (Bai et al., 2023). Among all MLLMs, GPT-4V (OpenAI, 2023) is currently the most powerful. We select some of these MLLMs for our experiments.

Figure 2: The upper part illustrates editing different components of MLLMs. The lower part provides an overview of the different editing formats. Given an input image and its corresponding textual knowledge $(s,r,o)$, we show three different editing formats. Although the final output is the same, the edited multimodal knowledge differs depending on whether its visual or textual knowledge is edited, and the consistency property also differs for different edit inputs.

3 Multimodal Knowledge Editing

3.1 Definition of Multimodal Knowledge

There are two types of knowledge updating scenarios, namely misrecognition and misreading. In the misrecognition scenario, the entity that the model recognizes from the image is incorrect and needs to be corrected. We therefore define visual knowledge $(i,e)$ for this scenario, where $i$ represents an image and $e$ represents the recognized entity.

In contrast, in the misreading scenario the model successfully recognizes the entity in the image but fails to provide the correct object given the entity and relation. This scenario involves the corresponding textual knowledge $(s,r,o)$, where $s$, $r$, and $o$ denote the subject, relation, and object.

Therefore, we believe a piece of multimodal knowledge can be represented as a combination of visual knowledge $(i,e)$, obtained from recognizing an entity in an image, and a textual knowledge triplet $(s,r,o)$ about the recognized entity. We finally decompose a piece of multimodal knowledge as:

$$K(i,e,s,r,o) = (i,e) \times_{e=s} (s,r,o) \tag{1}$$

Further, in many cross-modal datasets, most instances represent knowledge in the final form $(i,r,o)$ because there is no need to explicitly mention the intermediate entity $e$ (and $s$). So another combined form of multimodal knowledge can be denoted as:

$$(i,e) \times_{e=s} (s,r,o) = (i,r,o) \tag{2}$$

In summary, $(i,e)$, $(s,r,o)$, and $(i,r,o)$ are the three types of knowledge involved in multimodal knowledge editing. However, regardless of the type of knowledge being edited, a good editing method must ensure that the consistency of multimodal knowledge is maintained after editing the corresponding type of knowledge.
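The decomposition in Eqs. (1) and (2) can be illustrated with a small data-structure sketch. The class names, field names, and image identifier below are hypothetical and chosen only for illustration; the `compose` function mirrors the join on $e = s$.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class VisualKnowledge:
    """Visual knowledge (i, e): an image and the entity recognized in it."""
    image: str   # hypothetical image identifier
    entity: str


@dataclass(frozen=True)
class TextualKnowledge:
    """Textual knowledge triplet (s, r, o)."""
    subject: str
    relation: str
    obj: str


def compose(v: VisualKnowledge, t: TextualKnowledge) -> Optional[Tuple[str, str, str]]:
    """Join (i, e) and (s, r, o) on e = s; return (i, r, o), or None if e != s."""
    if v.entity != t.subject:
        return None
    return (v.image, t.relation, t.obj)


visual = VisualKnowledge("img_001.jpg", "Messi")
textual = TextualKnowledge("Messi", "plays for", "Inter Miami")
combined = compose(visual, textual)  # the (i, r, o) form of Eq. (2)
```

Note that the intermediate entity disappears from the composed form, which is why an error in $(i,r,o)$ alone does not reveal which component is wrong.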

3.2 Definition of MMEdit

We define three different edit formats, IE_edit, SRO_edit, IRO_edit.

IE_edit IE_edit focuses on editing knowledge related to image-to-entity recognition, denoted as $(i,e)$. To edit the model’s recognition of an entity in an image, we input the image and change the model’s entity output for this image to a new output, denoted $(i, e \rightarrow \tilde{e})$.

Benchmark Edit_formats Edit_requirements
IE SRO IRO Fine-grained Reliability Locality Generality Portability Consistency
MMEdit
KEBench
MIKE
MC-MKE
Table 1: Comparisons of current multimodal knowledge editing benchmarks, MMEdit (Cheng et al., 2024), KEBench (Wu et al., 2024) and MIKE (Li et al., 2024). IE, SRO, IRO represent different editing formats. ✓ and ✗ mean whether the benchmark can provide data of corresponding editing format. In Fine-grained, ✓ means that the corresponding benchmark is constructed based on fine-grained entity information, while ✗ means that the benchmark is constructed around multimodal task data. Edit_requirements are the properties we expect from a good editing method. ✓ and ✗ indicate whether the benchmark contains the ability to test these properties of editing methods.

SRO_edit SRO_edit focuses on editing specific textual knowledge triplets $(s,r,o)$. When we know exactly how to edit the corresponding textual knowledge triplet, $(s,r,o \rightarrow \tilde{o})$, we do not need to find the corresponding multimodal data pair and can instead edit the textual knowledge directly. To keep the input format of multimodal language models consistent, we use a black image as the visual input. Experiments in Appendix A show that when questions generated from textual knowledge are used as input, the type of input image does not significantly affect answer accuracy. In this case, the model’s textual input is the same as in the textual knowledge editing task.

IRO_edit In many multimodal datasets, examples do not provide the complete construction information for a piece of multimodal knowledge. We only possess the final multimodal data $(i,r,o)$ and may not be able to accurately decompose it into the corresponding visual knowledge and textual knowledge. Nonetheless, we still need to edit such multimodal knowledge. Even though we may not explicitly identify the corresponding visual knowledge and textual knowledge, an effective method should implicitly understand and update the corresponding knowledge.

Therefore, we hope that a good multimodal knowledge editing method can maintain consistency even when editing with the final multimodal knowledge input. Theoretically, modifying only $(i,r,o \rightarrow \tilde{o})$ should lead to consistency, whether it is realized through $(i, e \rightarrow \tilde{e})$ or $(s,r,o \rightarrow \tilde{o})$. However, the candidate entity $e'$ is not unique. Our dataset therefore provides automatically generated reasons indicating that the edit is a modification of $(s,r,o)$. A good editing method should use this information to determine that, in the IRO_edit subset of our benchmark, the modification should be applied to the corresponding textual knowledge triplet.
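To make the three editing formats concrete, the following is a minimal sketch using an idealized symbolic knowledge store rather than a real MLLM. The class `ToyKnowledgeStore`, its methods, and the image identifier are assumptions introduced here for illustration; the facts about Messi and Mac Allister follow the running example in Figure 1.

```python
import copy


class ToyKnowledgeStore:
    """Idealized symbolic stand-in for an MLLM's knowledge (not a real model)."""

    def __init__(self, visual, textual):
        self.visual = dict(visual)    # image -> entity, i.e. (i, e)
        self.textual = dict(textual)  # (subject, relation) -> object, i.e. (s, r, o)

    def answer(self, image, relation):
        """Answer the multimodal query (i, r) by joining on e = s."""
        return self.textual[(self.visual[image], relation)]

    def ie_edit(self, image, new_entity):
        """IE_edit: change the recognized entity of an image."""
        self.visual[image] = new_entity

    def sro_edit(self, subject, relation, new_obj):
        """SRO_edit: change the object of a textual triplet."""
        self.textual[(subject, relation)] = new_obj

    def iro_edit(self, image, relation, new_obj):
        """IRO_edit, resolved as a textual edit on the recognized entity."""
        self.sro_edit(self.visual[image], relation, new_obj)


base = ToyKnowledgeStore(
    visual={"img_001.jpg": "Messi"},
    textual={("Messi", "plays for"): "Inter Miami",
             ("Mac Allister", "plays for"): "Liverpool"},
)

after_ie = copy.deepcopy(base)
after_ie.ie_edit("img_001.jpg", "Mac Allister")

after_sro = copy.deepcopy(base)
after_sro.sro_edit("Messi", "plays for", "Liverpool")

after_iro = copy.deepcopy(base)
after_iro.iro_edit("img_001.jpg", "plays for", "Liverpool")
```

All three edited stores give the same answer to the image query, yet `after_ie` changed only the visual component while `after_sro` and `after_iro` changed only the textual one. A benchmark that checks only the final $(i,r,o)$ output cannot tell these cases apart.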

3.3 Requirements of MMEdit Method

Consistency Consistency means that a piece of multimodal knowledge is answered consistently across different modalities after multimodal knowledge editing. In IE_edit, if we modify the corresponding visual knowledge $(i, e \rightarrow \tilde{e})$, consistency means that the corresponding multimodal knowledge should also change as $(i,\tilde{r},\tilde{o}) = (i, e \rightarrow \tilde{e}) \times_{\tilde{e}=\tilde{s}} (\tilde{s},\tilde{r},\tilde{o})$. In SRO_edit, if we modify the corresponding textual knowledge $(s,r,o \rightarrow \tilde{o})$ while keeping the visual knowledge unchanged, the corresponding multimodal knowledge will also be modified to $(i,e) \times_{e=s} (s,r,o \rightarrow \tilde{o}) = (i,r,o \rightarrow \tilde{o})$. In IRO_edit, for the reasons mentioned above, our dataset provides information that allows the corresponding multimodal knowledge to change as $(i,r,o \rightarrow \tilde{o}) = (i,e) \times_{e=s} (s,r,o \rightarrow \tilde{o})$. The definition of portability is similar to consistency in IE_edit; however, our consistency also covers the SRO_edit and IRO_edit directions.

The property of consistency imposes higher demands on the multimodal knowledge editing method, requiring that the edited knowledge remains unified across different modalities in the multimodal model.

Reliability Reliability requirement of multimodal knowledge editing refers to the success rate of edits under the corresponding editing format.

Locality Locality means that multimodal editing should not affect unrelated knowledge when editing the corresponding knowledge.

Generality Generality means that after a piece of multimodal knowledge is edited, the model should not only respond to edited output under the format used for editing. It needs to provide correct edited responses under various generalizations, such as rephrased textual input or different images of the same entity.

4 MC-MKE Benchmark Construction

Since pure textual knowledge editing datasets are constructed from textual knowledge triplets $(s,r,o)$ and contain editing information $(s,r,o \rightarrow \tilde{o})$, we use the textual knowledge editing dataset MQuAKE as the starting point for constructing our multimodal knowledge editing dataset MC-MKE. Each instance in MQuAKE corresponds to a textual knowledge triplet and its editing information.

4.1 Data Selection

Unlike previous editing datasets, we filter the original MQuAKE dataset $D_{raw}$ in three successive steps to obtain a high-quality dataset.

First, we filter the data using a completely black image paired with generated questions, keeping only the data that our MLLMs answer correctly. This step ensures that all edited knowledge is originally known by the model, so that we are “editing” rather than “learning”. The filtered dataset is referred to as $D_{filter_1}$.

For the data in $D_{filter_1}$, we obtain images of the subject $s$ of each textual knowledge triplet $(s,r,o)$ from Google. We then use ChatGPT to generate fine-grained entity categories for these subjects and construct image queries using specific templates. If the subject in the image is correctly recognized by all MLLMs, the data is retained. This step ensures that all entities in our dataset are known by the models. The result is the dataset $D_{filter_2}$.

Finally, we replace the subject in the questions generated from $(s,r)$ with “the {category} in the picture”, where {category} is the entity type previously generated by ChatGPT (see Appendix D). If the combined question is correctly answered by all models, the data is retained. This step ensures the consistency of the original multimodal knowledge. The retained multimodal knowledge $(i,r,o) = (i,e) \times (s,r,o)$ constitutes our knowledge editing source dataset $D_{orig}$.

More details are shown in appendix C.
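The three filtering steps can be sketched as a single pass over the records. This is only an illustration under stated assumptions: a "model" is any callable mapping an image and a question to an answer string, correctness is checked by case-insensitive exact match, and the record field names, the `BLACK_IMAGE` placeholder, and the toy model are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Model = Callable[[str, str], str]   # (image, question) -> answer (assumed interface)
BLACK_IMAGE = "black.png"           # hypothetical placeholder for an all-black image


def all_correct(models: Sequence[Model], image: str, question: str, gold: str) -> bool:
    """True if every model's answer matches gold (case-insensitive exact match)."""
    gold_norm = gold.strip().lower()
    return all(m(image, question).strip().lower() == gold_norm for m in models)


def filter_records(records: List[Dict[str, str]], models: Sequence[Model]) -> List[Dict[str, str]]:
    """Keep records that pass all three checks, evaluated in order."""
    kept = []
    for rec in records:
        checks = [
            # Step 1: the textual knowledge (s, r, o) is known (black image + text question).
            (BLACK_IMAGE, rec["text_question"], rec["object"]),
            # Step 2: the entity in the image is recognized, i.e. (i, e) is known.
            (rec["image"], rec["entity_question"], rec["subject"]),
            # Step 3: the combined question (i, r) is answered with o.
            (rec["image"], rec["image_question"], rec["object"]),
        ]
        if all(all_correct(models, img, q, gold) for img, q, gold in checks):
            kept.append(rec)
    return kept


records = [
    {"subject": "Messi", "object": "Inter Miami", "image": "img_001.jpg",
     "text_question": "Which team does Messi play for?",
     "entity_question": "Who is the football player in the picture?",
     "image_question": "Which team does the football player in the picture play for?"},
    {"subject": "Entity B", "object": "Object B", "image": "img_002.jpg",
     "text_question": "Question about Entity B?",
     "entity_question": "Who is the person in the picture?",
     "image_question": "Question about the person in the picture?"},
]


def toy_model(image: str, question: str) -> str:
    """Lookup-table stand-in for an MLLM; it only knows the first record."""
    answers = {
        (BLACK_IMAGE, "Which team does Messi play for?"): "Inter Miami",
        ("img_001.jpg", "Who is the football player in the picture?"): "Messi",
        ("img_001.jpg", "Which team does the football player in the picture play for?"): "Inter Miami",
    }
    return answers.get((image, question), "unknown")


kept = filter_records(records, [toy_model])
```

Because the checks are evaluated lazily and in order, a record that fails the first step is never queried with an image, which matches the step-by-step filtering described above.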

| Model | Method | $\text{Score}_R$ | $\text{Score}_L$ | $\text{Score}^T_G$ | $\text{Score}^M_G$ | $\text{Score}_C$ |
|---|---|---|---|---|---|---|
| InstructBLIP | FT(Vision) | 89.57 | 0.34 | 24.10 | 90.30 | 38.07 |
| | FT(LLM) | 98.60 | 3.00 | 77.43 | 96.77 | 10.15 |
| | MEND(Vision) | 32.39 | 93.15 | 29.73 | 23.43 | 18.37 |
| | MEND(LLM) | 88.58 | 53.23 | 86.49 | 85.21 | 9.46 |
| | IKE | 68.26 | / | 76.33 | / | 49.05 |
| MiniGPT-v2 | FT(Vision) | 96.08 | 2.07 | 94.52 | 54.02 | 10.79 |
| | FT(LLM) | 95.87 | 0.78 | 93.91 | 93.20 | 10.80 |
| | MEND(Vision) | 4.34 | 26.08 | 4.23 | 5.13 | 6.81 |
| | MEND(LLM) | 45.21 | 24.30 | 44.41 | 26.17 | 11.74 |
| | IKE | 47.50 | / | 68.76 | / | 26.41 |
Table 2: Experimental results on IE_edit data for three editing methods editing two different model components on two MLLMs.

4.2 Dataset Construction

Editing Dataset Construction

For a piece of multimodal knowledge $(i,r,o) = (i,e) \times_{e=s} (s,r,o)$ in our filtered source dataset $D_{orig}$, we sequentially construct editing data under the different editing formats. For IE_edit, the editing inputs consist of images and automatically generated questions, and we choose an entity $\tilde{e}$ of the same category as the entity $e$ as the editing target. For SRO_edit, the editing inputs consist of generated questions, and the editing target is the corresponding new knowledge $\tilde{o}$ given in the MQuAKE dataset; we require that $\tilde{o}$ is of the same entity category as $o$. For IRO_edit, the editing input is constructed from the SRO_edit input, combined with entity types and templates, and the target $\tilde{o}$ is taken from the corresponding data in the SRO_edit editing dataset. Stricter requirements are described in Appendix C.
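As a rough sketch of this construction, the function below builds one edit input per format from a single source record. The field names, the category table, the `BLACK_IMAGE` placeholder, and the example new object are hypothetical, and uniform random sampling of $\tilde{e}$ is an assumption; the stricter requirements of Appendix C are not modeled.

```python
import random

BLACK_IMAGE = "black.png"  # hypothetical placeholder for an all-black image


def build_edits(rec, entities_by_category, rng):
    """Build IE_edit, SRO_edit and IRO_edit inputs for one source record."""
    # IE_edit target: an entity e~ of the same category as e, different from e.
    candidates = [e for e in entities_by_category[rec["category"]] if e != rec["subject"]]
    if not candidates:
        raise ValueError("no same-category replacement entity available")
    new_entity = rng.choice(candidates)
    return {
        # (i, e -> e~): image plus a recognition question.
        "IE_edit": {"image": rec["image"], "question": rec["entity_question"],
                    "target": new_entity},
        # (s, r, o -> o~): text question with a black image as visual input.
        "SRO_edit": {"image": BLACK_IMAGE, "question": rec["text_question"],
                     "target": rec["new_object"]},
        # (i, r, o -> o~): templated image question, same target as SRO_edit.
        "IRO_edit": {"image": rec["image"], "question": rec["image_question"],
                     "target": rec["new_object"]},
    }


rec = {
    "subject": "Messi", "category": "football player", "image": "img_001.jpg",
    "entity_question": "Who is the football player in the picture?",
    "text_question": "Which team does Messi play for?",
    "image_question": "Which team does the football player in the picture play for?",
    "new_object": "Liverpool",  # illustrative o~, not taken from MQuAKE
}
edits = build_edits(rec, {"football player": ["Messi", "Mac Allister"]}, random.Random(0))
```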

Reliability Dataset Construction Our Reliability metric is calculated as shown in the following formula. $D_e$ is the editing dataset corresponding to the editing format. For each piece of multimodal knowledge $k = (i,e) \times (s,r,o)$ in $D_e$, $\tilde{k}$ is the corresponding edited knowledge. $p_r$ is the multimodal input used to test Reliability under the corresponding editing format, and $t_r$ is the target reliability output after knowledge editing. $F$ is the multimodal model, and $\theta_{k\tilde{k}}$ represents the parameters of the model after editing a piece of multimodal knowledge $k \rightarrow \tilde{k}$.

$$\text{Score}_R = \mathbb{E}_{(k,\tilde{k},p_r,t_r) \sim D_e}\left[\mathbbm{1}_{F(p_r;\theta_{k\tilde{k}}) = t_r}\right] \tag{3}$$

Consistency Dataset Construction Our consistency test data is constructed according to the editing format. In IE_edit, consistency is defined as $(i, e \rightarrow \tilde{e}) \times_{\tilde{e}=\tilde{s}} (\tilde{s},\tilde{r},\tilde{o}) = (i,r,o) \rightarrow (i,\tilde{r},\tilde{o})$. Therefore, we construct the input $p_c$ corresponding to the multimodal knowledge $(i,\tilde{r},\tilde{o})$; the edited model should output the corresponding $\tilde{o}$ for this input to ensure consistency. In SRO_edit, we edit the textual knowledge triplet $(s,r,o \rightarrow \tilde{o})$ and then, following the definition of consistency, construct the input $p_c$ for the multimodal knowledge $(i,r,\tilde{o})$ to test whether the edited model provides the consistent edited answer $\tilde{o}$. In IRO_edit, for each piece of knowledge $(i,r,o)$, we find its corresponding textual knowledge $(s,r,o)$. After editing the multimodal knowledge $(i,r,o \rightarrow \tilde{o})$, we test whether the model gives a consistent response for the corresponding textual knowledge $(s,r,o \rightarrow \tilde{o})$.
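The per-format consistency probes described above can be summarized in a short sketch that returns the probe input $p_c$ (an image and a question) together with the target $t_c$. All field names are hypothetical, and pairing the textual probe in IRO_edit with a black image is an assumption carried over from the SRO_edit input convention.

```python
BLACK_IMAGE = "black.png"  # hypothetical placeholder for an all-black image


def consistency_probe(edit_format, rec):
    """Return (image, question, target), i.e. (p_c, t_c), for one record."""
    if edit_format == "IE_edit":
        # After (i, e -> e~): ask about the image and expect the object o~
        # of the new entity's triplet (s~, r~, o~).
        return rec["image"], rec["new_entity_question"], rec["new_entity_object"]
    if edit_format == "SRO_edit":
        # After (s, r, o -> o~): ask the image question (i, r) and expect o~.
        return rec["image"], rec["image_question"], rec["new_object"]
    if edit_format == "IRO_edit":
        # After (i, r, o -> o~): ask the text question (s, r) and expect o~.
        return BLACK_IMAGE, rec["text_question"], rec["new_object"]
    raise ValueError(f"unknown edit format: {edit_format}")


rec = {
    "image": "img_001.jpg",
    "text_question": "Which team does Messi play for?",
    "image_question": "Which team does the football player in the picture play for?",
    "new_object": "Liverpool",  # illustrative o~
    "new_entity_question": "Which team does the football player in the picture play for?",
    "new_entity_object": "Liverpool",  # object of (Mac Allister, plays for, Liverpool)
}
probes = {fmt: consistency_probe(fmt, rec) for fmt in ("IE_edit", "SRO_edit", "IRO_edit")}
```

In each case the probe targets the modality that was not used for the edit, which is what distinguishes consistency from reliability.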

The consistency score is given by the following formula, where $p_c$ is the multimodal input, $\theta_{k\tilde{k}}$ denotes the edited parameters, and $t_c$ is the corresponding consistency output under each editing format. The other symbols are the same as in Eq. (3).

$$\text{Score}_C = \mathbb{E}_{(k,\tilde{k},p_c,t_c) \sim D_e}\left[\mathbbm{1}_{F(p_c;\theta_{k\tilde{k}}) = t_c}\right] \tag{4}$$

Locality Dataset Construction For each of the three editing formats, we use data that is unrelated to the currently edited knowledge but of the same type as locality data. In IE_edit, we randomly select visual knowledge $(i_{loc}, e_{loc})$ in $D_{orig}$ whose entity differs from the current entity. In SRO_edit, we randomly select a triplet $(s_{loc}, r_{loc}, o_{loc})$ in $D_{orig}$ that differs from the current textual knowledge triplet $(s,r,o)$. In IRO_edit, we randomly select multimodal knowledge $(i,e) \times_{e=s} (s,r,o)$ in $D_{orig}$ for which $i$, $e$, $s$, $r$, and $o$ all differ from the current ones, forming the locality data $(i_{loc}, e_{loc}) \times_{e_{loc}=s_{loc}} (s_{loc}, r_{loc}, o_{loc})$.

The locality score is given by the following formula, where $p_l$ is the multimodal input, $\theta_{k\tilde{k}}$ denotes the edited parameters, and $t_l$ is the corresponding locality output under each editing format. In Eq. (5), this reference output is the output of the unedited model, $F(p_l;\theta)$, where $\theta$ denotes the original parameters.

ScoreL=𝔼(k,k~,pl)De[𝟙F(pl;θkk~)=F(pl;θ)]subscriptScore𝐿subscript𝔼similar-to𝑘~𝑘subscript𝑝𝑙subscript𝐷𝑒delimited-[]subscript1𝐹subscript𝑝𝑙subscript𝜃𝑘~𝑘𝐹subscript𝑝𝑙𝜃\text{Score}_{L}=\mathbb{E}_{(k,\tilde{k},p_{l})\sim D_{e}}\left[\mathbbm{1}_{% F(p_{l};\theta_{k\tilde{k}})=F(p_{l};\theta)}\right]Score start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT = blackboard_E start_POSTSUBSCRIPT ( italic_k , over~ start_ARG italic_k end_ARG , italic_p start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) ∼ italic_D start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ blackboard_1 start_POSTSUBSCRIPT italic_F ( italic_p start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_k over~ start_ARG italic_k end_ARG end_POSTSUBSCRIPT ) = italic_F ( italic_p start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ; italic_θ ) end_POSTSUBSCRIPT ] (5)
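As a rough illustration of Equation (5), the sketch below estimates the locality score as the share of locality prompts whose answer is unchanged by the edit. The `Model` callable interface, the `(image_id, question)` prompt encoding, and the toy answers are assumptions made for this example only; they are not part of the benchmark code.

```python
from typing import Callable, Hashable, Sequence

# Hypothetical interface: a model maps a multimodal prompt to an answer string.
Model = Callable[[Hashable], str]


def locality_score(original: Model, edited: Model, prompts: Sequence[Hashable]) -> float:
    """Share of locality prompts whose answer is unchanged by the edit."""
    if not prompts:
        raise ValueError("at least one locality prompt is required")
    unchanged = sum(original(p) == edited(p) for p in prompts)
    return unchanged / len(prompts)


# Toy stand-ins for F(.; theta) and F(.; theta_edited).
# Prompts are (image_id, question) pairs; all values are illustrative.
answers = {
    ("img_a", "Which TV channel is shown in the picture?"): "ESPN",
    ("img_b", "The country in the picture is"): "Lithuania",
    ("img_c", "The country in the picture is"): "Cameroon",
    ("img_d", "The capital of the country in the picture is"): "Vilnius",
}


def original_model(prompt):
    return answers[prompt]


def edited_model(prompt):
    # Simulate an edit that leaks into one unrelated prompt.
    return "Rupnagar" if prompt[0] == "img_d" else answers[prompt]


score = locality_score(original_model, edited_model, list(answers))
print(score)
```

Note that the comparison is between the two models' outputs, not against a gold answer, so a model that was wrong before the edit and stays wrong in the same way still counts as local.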
| Model | Method | $\text{Score}_R$ | $\text{Score}_L$ | $\text{Score}^T_G$ | $\text{Score}_C$ |
|---|---|---|---|---|---|
| InstructBLIP | FT(Vision) | 91.75 | 4.23 | 17.84 | 87.57 |
| InstructBLIP | FT(LLM) | 99.49 | 3.95 | 79.59 | 90.43 |
| InstructBLIP | MEND(Vision) | 13.64 | 95.03 | 10.00 | 3.86 |
| InstructBLIP | MEND(LLM) | 66.49 | 79.34 | 72.85 | 55.90 |
| InstructBLIP | IKE | 81.06 | 94.18 | 55.87 | 73.73 |
| MiniGPT-v2 | FT(Vision) | 82.48 | 2.36 | 81.38 | 1.93 |
| MiniGPT-v2 | FT(LLM) | 97.34 | 2.63 | 96.00 | 94.49 |
| MiniGPT-v2 | MEND(Vision) | 4.78 | 86.53 | 4.94 | 6.72 |
| MiniGPT-v2 | MEND(LLM) | 71.89 | 19.93 | 69.79 | 6.41 |
| MiniGPT-v2 | IKE | 38.59 | 58.10 | 24.78 | 26.37 |
Table 3: Experimental results on SRO_edit data for three editing methods editing two different model components on two MLLMs.

Generality Dataset Construction For the three forms of multimodal knowledge editing, IE_edit, SRO_edit, and IRO_edit, we constructed corresponding generalization test datasets from both image and text perspectives. For the image generalization dataset, we used CLIP to process the images previously crawled from the web, calculated the relevance between each image and its entity, and selected the top 5 most relevant images as the test images for entity image generalization. For the text generalization dataset, we used ChatGPT to rewrite the textual input into 5 variations that serve as the test inputs for text generalization. The prompts can be seen in Appendix D.
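The image selection step can be sketched as a top-k ranking over similarity scores. The example below assumes the CLIP embeddings have already been computed and uses cosine similarity as the relevance measure; both the similarity choice and the names `cosine` and `top_k_images` are assumptions for illustration, since the text does not specify them.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_images(
    entity_emb: Sequence[float],
    image_embs: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[str]:
    """Return the names of the k images most similar to the entity embedding."""
    ranked = sorted(
        image_embs,
        key=lambda name: cosine(entity_emb, image_embs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-d embeddings standing in for precomputed CLIP features.
entity = [1.0, 0.0]
images = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
    "e": [0.5, 0.5],
    "f": [0.2, 0.8],
    "g": [0.7, 0.3],
}
selected = top_k_images(entity, images, k=5)
print(selected)
```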

The generality scores are given by the following formulas, where $p^T_g$ and $p^M_g$ are the multimodal inputs for text and image generalization testing, respectively, $\theta_{k\tilde{k}}$ denotes the edited parameters, and $t^T_g$ and $t^M_g$ are the corresponding text and image generality outputs, respectively, in the different editing formats.

$$\text{Score}^T_G = \mathbb{E}_{(k,\tilde{k},p^T_g,t^T_g)\sim D_e}\left[\mathbbm{1}_{F(p^T_g;\theta_{k\tilde{k}}) = t^T_g}\right] \tag{6}$$
$$\text{Score}^M_G = \mathbb{E}_{(k,\tilde{k},p^M_g,t^M_g)\sim D_e}\left[\mathbbm{1}_{F(p^M_g;\theta_{k\tilde{k}}) = t^M_g}\right] \tag{7}$$

Construction details about the multimodal input $p$ and the corresponding $t$ can be seen in Appendix C.
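Equations (6) and (7) share the same shape: the edited model's answer is compared against a target. A minimal sketch of that computation with alias-aware matching follows. The normalization (lowercasing and whitespace collapsing), the strict-equality matching rule, and the rephrased questions are assumptions for illustration; the paper only states that matching is performed against entities and their aliases.

```python
from typing import Callable, Iterable, Sequence, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def matches(prediction: str, target: str, aliases: Iterable[str] = ()) -> bool:
    """True if the prediction equals the target or any alias after normalization."""
    accepted = {normalize(target), *(normalize(a) for a in aliases)}
    return normalize(prediction) in accepted


Case = Tuple[str, str, Sequence[str]]  # (prompt, target, aliases)


def generality_score(edited_model: Callable[[str], str], cases: Sequence[Case]) -> float:
    """Share of generality prompts answered with the edited target or an alias."""
    if not cases:
        raise ValueError("at least one test case is required")
    hits = sum(matches(edited_model(p), t, a) for p, t, a in cases)
    return hits / len(cases)


# Hypothetical rephrasings of one SRO_edit question and toy model outputs.
aliases = ["Rupar", "Ropar"]
cases = [
    ("What is the capital of United Kingdom?", "Rupnagar", aliases),
    ("Name the capital of United Kingdom.", "Rupnagar", aliases),
    ("Which city is the capital of United Kingdom?", "Rupnagar", aliases),
    ("United Kingdom's capital is", "Rupnagar", aliases),
]
toy_outputs = dict(zip((c[0] for c in cases), ["Ropar", "  rupnagar ", "London", "London"]))
g_score = generality_score(toy_outputs.__getitem__, cases)
print(g_score)
```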

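| Edit format | IE_edit | SRO_edit | IRO_edit | All |
|---|---|---|---|---|
| #Data | 920 | 982 | 982 | 2884 |
| #Relation | 28 | 30 | 30 | 30 |
| #Entity | 810 | 1041 | 1041 | 1424 |
| #Alias (avg.) | 20.46 | 17.02 | 17.02 | 18.11 |
| #Image | 2358 | - | 1311 | 2550 |
| #Category | 49 | 76 | 76 | 76 |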
Table 4: Statistics of the different subsets of MC-MKE. #Entity refers to the total number of entities that appear, including $s$, $o$, and $e$. #Alias refers to the number of answer aliases.

Benchmark statistics Eventually, we create MC-MKE, consisting of a total of 2884 pieces of knowledge across three different edit formats. The associated knowledge involves a large number of entities and relations, indicating the diversity of MC-MKE. It also has an average of 18.11 answer aliases per sample, significantly reducing misjudgements by the exact-match metrics. More details about dataset statistics are presented in Table 4.

5 Experiments

| Model | Method | $\text{Score}_R$ | $\text{Score}_L$ | $\text{Score}^T_G$ | $\text{Score}^M_G$ | $\text{Score}_C$ |
|---|---|---|---|---|---|---|
| InstructBLIP | FT(Vision) | 84.83 | 2.75 | 34.25 | 85.07 | 76.37 |
| InstructBLIP | FT(LLM) | 91.56 | 4.86 | 81.50 | 91.34 | 86.33 |
| InstructBLIP | MEND(Vision) | 24.13 | 85.88 | 33.11 | 19.20 | 5.49 |
| InstructBLIP | MEND(LLM) | 70.57 | 64.78 | 86.00 | 72.05 | 50.50 |
| InstructBLIP | IKE | 71.59 | / | 82.83 | / | 48.17 |
| MiniGPT-v2 | FT(Vision) | 86.86 | 3.12 | 86.33 | 52.87 | 6.01 |
| MiniGPT-v2 | FT(LLM) | 95.82 | 2.40 | 94.30 | 94.65 | 87.35 |
| MiniGPT-v2 | MEND(Vision) | 6.61 | 50.06 | 6.06 | 7.61 | 2.74 |
| MiniGPT-v2 | MEND(LLM) | 63.13 | 46.55 | 59.79 | 38.77 | 5.09 |
| MiniGPT-v2 | IKE | 17.51 | / | 38.57 | / | 11.09 |
Table 5: Experimental results on IRO_edit data for three editing methods editing two different model components on two MLLMs.

5.1 Multimodal Large Language Models

InstructBLIP InstructBLIP is a multimodal large language model that consists of three modules. Its multimodal alignment module consists of a Q-Former and a linear layer that connect its vision module to its large language model module. We use InstructBLIP equipped with Vicuna-7B Chiang et al. (2023).

MiniGPT-v2 MiniGPT-v2 utilizes a linear projection layer as an alignment module to map visual features to LLM feature space. Compared with InstructBLIP, MiniGPT-v2 has a smaller alignment module but still more input visual features. We use MiniGPT-v2 equipped with Llama-2-Chat-7B Touvron et al. (2023).

5.2 MMEdit Methods

There have been many language knowledge editing methods, while multi-modality knowledge editing methods have not been fully explored. Therefore, we select the following representative editing methods in multimodal knowledge editing according to the editing requirements of the task and the applicability of the methods.

Finetune

Finetune is one of the most widely used and straightforward methods for improving or modifying the abilities of pre-trained models, and it is also commonly used as a baseline for knowledge editing. Since one can select the model component to finetune, it is natural to explore the differences between finetuning different model components. We focus on finetuning two parts: the alignment module and the LLM component of an MLLM. For the LLM component, we only finetune the last layer of the LLM.

MEND

Model Editor Networks with Gradient Decomposition (MEND) (Mitchell et al., 2022) is an editor network mapping a single desired input-output knowledge pair to the corresponding parameter update of the original model. Specifically, the input-output knowledge pair provides a standard fine-tuning gradient as a starting point for editing updates. MEND then directly transforms this gradient into a better parameter update that ensures both generality and locality.
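For readers unfamiliar with the "gradient decomposition" in the name, the idea can be summarized as follows. This is a sketch based on the description in Mitchell et al. (2022); the symbols $u_\ell^i$, $\delta_\ell^i$, $g_\phi$, and $\alpha_\ell$ are notation introduced here for illustration and do not appear elsewhere in this paper.

```latex
% For a linear layer \ell with weight W_\ell, let u_\ell^i be the layer input and
% \delta_\ell^i the gradient of the loss w.r.t. the layer output at token position i.
% The fine-tuning gradient is then a sum of rank-1 outer products:
\nabla_{W_\ell} L = \sum_i \delta_\ell^i \, (u_\ell^i)^{\top}

% The editor network g_\phi maps each pair of factors to pseudo-factors:
(\tilde{u}_\ell^i, \tilde{\delta}_\ell^i) = g_\phi(u_\ell^i, \delta_\ell^i)

% The edited weight applies the transformed rank-1 terms with a learned step size:
\tilde{W}_\ell = W_\ell - \alpha_\ell \sum_i \tilde{\delta}_\ell^i \, (\tilde{u}_\ell^i)^{\top}
```

Because the editor only sees the two low-dimensional factors rather than the full gradient matrix, it stays small even when the edited layer is large.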

IKE

In-Context Knowledge Editing (IKE) (Zheng et al., 2023) enables knowledge editing by incorporating demonstration examples within the input data to update and acquire new factual knowledge without the requirement of further training. Considering the instruction-following ability of the MLLM and the limitation on the number of input images, we choose to implement the zero-shot model of IKE. More experimental details can be seen in appendix B.
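In the zero-shot setting, the edit amounts to placing the new fact in the context ahead of the test question, with no demonstration examples and no parameter update. The sketch below shows one way to assemble such an input; the template wording and the function name `build_zero_shot_prompt` are hypothetical and are not the prompts used in our experiments.

```python
def build_zero_shot_prompt(new_fact: str, question: str) -> str:
    """Prepend the edited fact to the test question (hypothetical template)."""
    return f"New fact: {new_fact.strip()}\nQuestion: {question.strip()}\nAnswer:"


prompt = build_zero_shot_prompt(
    "The capital of United Kingdom is Rupnagar.",
    "What is the capital of United Kingdom?",
)
print(prompt)
```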

5.3 Results & Analysis

Pros and Cons of Different Editing Methods We observe that across all editing formats and model modules, no existing editing method perfectly meets our editing requirements.

The Finetune method is characterized by high Reliability, demonstrating good Generality and Consistency when editing the LLM part in SRO_edit and IRO_edit. However, its Locality is very low, meaning it has a significant impact on unrelated knowledge.

MEND, although it also modifies the model’s parameters, uses a meta-learning approach to control the model’s changes to other unrelated knowledge. As the results show, MEND’s Reliability is much lower than that of Finetune, but its Locality is much higher.

As for IKE, since many MLLMs do not support in-context learning for image inputs, we do not test it for Locality and M-Generality in IE_edit and IRO_edit. Since it relies on in-context learning, this method is inherently sensitive to prompts. Different types of editing formats correspond to different prompts, and different models have varying degrees of sensitivity to prompts, resulting in significant fluctuations in all of IKE’s metrics. Overall, IKE performs better on InstructBLIP, achieving the second-highest Locality in SRO_edit (94.18, just below MEND(Vision) at 95.03; see Table 3). It also shows high Consistency on InstructBLIP, achieving the highest Consistency metric in IE_edit, indicating that the model can infer the edited multimodal knowledge $(i, \tilde{r}, \tilde{o})$ based on context and the image.

Consistency On Different Editing Formats In SRO_edit and IRO_edit, the output of their corresponding Consistency test data matches the required edited output, with only the input information being different. Therefore, if the model can correctly understand the different formats of questions, it can accurately answer the Consistency questions. In these two editing formats, Finetune and IKE achieve high Consistency on InstructBLIP. However, on MiniGPT-v2, these methods only maintain some degree of Consistency. In these two editing formats, high Consistency without high Locality may come from overfitting. Thus, to accurately assess the Consistency property, we need to analyze the IE_edit format. On InstructBLIP, the Finetune method maintains high Consistency with high Reliability, indicating that the Finetune method is not solely overfitting since $\tilde{e} \neq \tilde{o}$. Only when a method achieves high Consistency across all three editing formats can its Consistency property be considered trustworthy. Considering the results across the two models, IKE shows the best Consistency, managing to maintain high Consistency while ensuring a certain degree of Locality.

Editing Different Components Cheng et al. (2024) mentioned that the visual module is harder to edit than the text module. In SRO_edit and IRO_edit, apart from Locality, the effectiveness of editing the visual module is much lower than that of editing the LLM module. For Locality, editing the alignment module using MEND has a smaller impact, possibly because MEND’s parameter fitting is not as strong, and the editing knowledge in SRO_edit and IRO_edit is the textual knowledge triplet $(s, r, o)$. However, in IE_edit, although editing the LLM module generally yields higher Reliability and slightly higher Generality, in the case of InstructBLIP, editing the LLM module has a lower Consistency. This indicates that editing the LLM module often leads to overfitting, where the model outputs the edited knowledge regardless of the input information. In contrast, editing the Vision module, although resulting in lower other metrics, still maintains high Consistency. This shows that in IE_edit, editing the visual module can still ensure the Consistency of the corresponding knowledge.

6 Conclusion

We refine the definition of multimodal knowledge and introduce a new benchmark MC-MKE. We conduct experiments to analyze the effectiveness of several multimodal knowledge editing methods across different models, editing formats, and components. We find that these methods have limitations, and cannot achieve perfect performance on different editing formats. To maintain consistency, it may be better to edit the model components corresponding to the specific knowledge part.

Limitations

The main limitations of our work are related to the limited set of knowledge editing methods and multimodal large language models covered. We only provide results on two recent MLLMs, InstructBLIP and MiniGPT-v2, leaving many others behind. As we study the latest MLLMs on knowledge editing methods which have not been discussed in prior work, we only analyze three knowledge editing methods: Finetune, MEND, and IKE.

Ethical Considerations

CMKoBe is a synthetic dataset constructed by randomly modifying the factual knowledge triplets, rather than being crafted by humans. The data samples could accidentally involve context which is toxic or offensive in nature. ChatGPT is used for data annotation and assisting writing.

References

  • Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2425–2433.
  • Bai et al. (2023) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.
  • Chen et al. (2023) Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. 2023. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. CoRR, abs/2310.09478.
  • Cheng et al. (2024) Siyuan Cheng, Bozhong Tian, Qingbin Liu, Xi Chen, Yongheng Wang, Huajun Chen, and Ningyu Zhang. 2024. Can we edit multimodal large language models? Preprint, arXiv:2310.08475.
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  • Dai et al. (2024) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2024. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36.
  • Herdade et al. (2019) Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares. 2019. Image captioning: Transforming objects into words. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Huang et al. (2024) Han Huang, Haitian Zhong, Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. 2024. Kebench: A benchmark on knowledge editing for large vision-language models. Preprint, arXiv:2403.07350.
  • Li et al. (2023a) Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 2023a. Otter: A multi-modal model with in-context instruction tuning. Preprint, arXiv:2305.03726.
  • Li et al. (2024) Jiaqi Li, Miaozeng Du, Chuanyi Zhang, Yongrui Chen, Nan Hu, Guilin Qi, Haiyun Jiang, Siyuan Cheng, and Bozhong Tian. 2024. Mike: A new benchmark for fine-grained multimodal entity knowledge editing. Preprint, arXiv:2402.14835.
  • Li et al. (2023b) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023b. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR.
  • Lin et al. (2015) Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. 2015. Microsoft coco: Common objects in context. Preprint, arXiv:1405.0312.
  • Liu et al. (2024a) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024a. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306.
  • Liu et al. (2024b) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024b. Visual instruction tuning. Advances in neural information processing systems, 36.
  • Liu et al. (2019) Ye Liu, Hui Li, Alberto Garcia-Duran, Mathias Niepert, Daniel Onoro-Rubio, and David S. Rosenblum. 2019. Mmkg: Multi-modal knowledge graphs. Preprint, arXiv:1903.05485.
  • Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual knowledge in GPT. CoRR, abs/2202.05262.
  • Mitchell et al. (2022) Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. 2022. Fast model editing at scale. Preprint, arXiv:2110.11309.
  • OpenAI (2023) OpenAI. 2023. Gpt-4 technical report. Preprint, arXiv:2303.08774.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models. Preprint, arXiv:2302.13971.
  • Wu et al. (2024) Xiaobao Wu, Liangming Pan, William Yang Wang, and Anh Tuan Luu. 2024. Updating language models with unstructured facts: Towards practical knowledge editing. Preprint, arXiv:2402.18909.
  • Ye et al. (2023) Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. 2023. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178.
  • Yin et al. (2024) Xunjian Yin, Jin Jiang, Liming Yang, and Xiaojun Wan. 2024. History matters: Temporal knowledge editing in large language model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19413–19421.
  • Zheng et al. (2023) Ce Zheng, Lei Li, Qingxiu Dong, Yuxuan Fan, Zhiyong Wu, Jingjing Xu, and Baobao Chang. 2023. Can we edit factual knowledge by in-context learning? Preprint, arXiv:2305.12740.
  • Zhong et al. (2023) Zexuan Zhong, Zhengxuan Wu, Christopher D. Manning, Christopher Potts, and Danqi Chen. 2023. Mquake: Assessing knowledge editing in language models via multi-hop questions. Preprint, arXiv:2305.14795.
  • Zhu et al. (2020) Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, and Sanjiv Kumar. 2020. Modifying memories in transformer models. Preprint, arXiv:2012.00363.
  • Zhu et al. (2023) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. CoRR, abs/2304.10592.

Appendix A Pre-experiments

SRO_edit focuses on editing textual knowledge triplets $(s, r, o)$, inherently requiring no additional visual input. However, to align with the standard input format of MLLMs, we input a black image as a visual placeholder. In this section, we present a preliminary experiment exploring different choices for the input image: black images, white images, and random noise. The results of InstructBLIP with these three types of images on SRO_edit are 95.11, 96.53, and 94.70, respectively, showing that these uninformative images have barely any influence on the results.

Appendix B Experiment Details

Finetuning Details

We list the hyper-parameters used for finetuning in Table 6. MiniGPT-v2 and InstructBLIP share the same hyper-parameters.

Learning Rate 5e-4
Steps 16
Optimizer AdamW
Weight Decay 0.05
Table 6: Hyper-Parameters used for finetuning.

MEND Details

The training process of MEND requires additional training data specific to the underlying model. Following Mitchell et al. (2022), we construct an edit dataset and a locality dataset for both InstructBLIP and MiniGPT-v2. We leverage the data filtered in Section 4.1 as the edit dataset, sharing an identical distribution with MC-MKE. Since both InstructBLIP and MiniGPT-v2 leverage MS COCO Lin et al. (2015) for pretraining, we include it as the locality dataset. We search over three important hyper-parameters, $c_{loc}$, $c_{edit}$, and the learning rate, ten times for each experimental setting. We found that MEND is very sensitive to hyper-parameters, especially when the target module is small (e.g., the MEND(Vision) setting in our main experiment).

Appendix C Data Details

Entity Alias To facilitate entity evaluation, we collect aliases of entities for all answers from the original dataset $D_{raw}$. However, since we edit some of the subject entities, we also use alias data from Wiki as a supplement to construct the final entity alias library. All of our matching is performed with entities and their corresponding aliases.

Edit input Construction Details We choose an entity $\tilde{e}$ of the same category as the entity $e$, and we require that the corresponding textual knowledge triplet $(\tilde{s}, \tilde{r}, \tilde{o})$, where $\tilde{s} = \tilde{e}$, exists in $D_{filter_1}$.

Locality Construction Details We ensure that the selected entities differ from those of the current knowledge. Formally, the knowledge $K_{loc}(i', e', s', r', o')$ for the locality test of knowledge $K(i, e, s, r, o)$ must satisfy the condition $i' \neq i, e' \neq e, s' \neq s, r' \neq r, o' \neq o$. We randomly sample five pieces of knowledge to serve as the locality test data.
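The sampling rule above can be sketched as a filter followed by a random draw. The `Knowledge` record, the placeholder identifiers, and the fixed seed are assumptions made for this illustration; only the component-wise inequality condition and the sample size of five come from the text.

```python
import random
from typing import List, NamedTuple, Sequence


class Knowledge(NamedTuple):
    i: str  # image identifier
    e: str  # entity shown in the image
    s: str  # subject of the textual triplet
    r: str  # relation
    o: str  # object


def fully_different(a: Knowledge, b: Knowledge) -> bool:
    """True if every component of a differs from the matching component of b."""
    return all(x != y for x, y in zip(a, b))


def sample_locality(
    current: Knowledge, pool: Sequence[Knowledge], k: int = 5, seed: int = 0
) -> List[Knowledge]:
    """Draw k pieces of knowledge that share no component with the current one."""
    candidates = [c for c in pool if fully_different(c, current)]
    if len(candidates) < k:
        raise ValueError("not enough fully different candidates in the pool")
    return random.Random(seed).sample(candidates, k)


# Placeholder data: six valid candidates plus two that overlap with `current`.
current = Knowledge("img0", "ent0", "ent0", "rel0", "obj0")
pool = [Knowledge(f"img{n}", f"ent{n}", f"ent{n}", f"rel{n}", f"obj{n}") for n in range(1, 7)]
pool.append(Knowledge("img7", "ent7", "ent7", "rel0", "obj7"))  # shares the relation
pool.append(Knowledge("img0", "ent8", "ent8", "rel8", "obj8"))  # shares the image
picked = sample_locality(current, pool)
print(picked)
```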

Appendix D Prompts

We designed specific prompts and instructions for GPT-3.5-turbo-16k to rephrase the textual input for the text generalization dataset and generate fine-grained entity types, as shown in Table 7 and Table 8, respectively.

We provide editing and testing inputs of different types of multimodal knowledge editing in Table 9, Table 10, and Table 11.

Prompts and Instructions
You are a helpful assistant.
Please rephrase the following original text with 10 different and diverse expressions, maintaining exactly the same meanings.
Note that you must not add any additional information and not delete or lose any information of the original text.
Original Text:
{source}
5 Rephrased Texts:
Table 7: Prompts and instructions used for rephrasing the textual input for the text generalization dataset.
Prompts and Instructions
You are a powerful fine-grained entity category generator. User will give the name of entity, and you will help answer the fine-grained categoty of the entity. The answer is the categoty only.
There are some examples: Given entity Cameroon, a possible answer should be "country".
Given entity David Beckham, a possible answer should be "person".
Given entity The Great Gatsby, a possible answer should be "book".
Given entity Producers’ Showcase, a possible answer should be "TV show".
Given entity Lady Madonna, a possible answer should be "song".
Given entity Cox Enterprises, a possible answer should be "company".
The given entity is {}, a possible answer is:
Table 8: Prompts and instructions used for generating fine-grained entity types.
| Input | Visual Inputs | Textual Inputs |
|---|---|---|
| Edit input | [image] | Question: The country in the picture is |
| | | $\tilde{e}$: Lithuania |
| $p_r$ | [image] | Question: The country in the picture is |
| | | $t_r$: Lithuania |
| | | Alias: Lietuva, Lietuvos Respublika, … |
| $p_c$ | [image] | Question: The capital of the country in the picture is |
| | | $t_c$: Vilnius |
| | | Alias: Vilnia, Vilna, Wilno, … |
| $p_l$ | [image] | Question: Which TV channel is shown in the picture? |
| | | $t_l$: ESPN |
| | | Alias: Entertainment and Sports Programming Network |
| $p^M_g$ | [image] | Question: The country in the picture is |
| | | $t^M_g$: Lithuania |
| | | Alias: Lietuva, Lietuvos Respublika, … |
| $p^T_g$ | [image] | Question: Can you tell me which country is depicted in the image? |
| | | $t^T_g$: Lithuania |
| | | Alias: Lietuva, Lietuvos Respublika, … |
Table 9: IE_edit multimodal input examples.
| Input | Visual Inputs | Textual Inputs |
|---|---|---|
| Edit input | / | Question: The capital of United Kingdom is |
| | | $\tilde{o}$: Rupnagar |
| $p_r$ | / | Question: The capital of United Kingdom is |
| | | $t_r$: Rupnagar |
| | | Alias: Rupar, Ropar |
| $p_c$ | [image] | Question: The capital of the country in the picture is |
| | | $t_c$: Rupnagar |
| | | Alias: Rupar, Ropar |
| $p_l$ | / | Question: What is the country of citizenship of Warren Buffett? |
| | | $t_l$: United States of America |
| | | Alias: the United States, the United States of America, … |
| $p^T_g$ | / | Question: What is the capital of United Kingdom? |
| | | $t^T_g$: Rupnagar |
| | | Alias: Rupar, Ropar |
Table 10: SRO_edit multimodal input examples.
| Input | Visual Inputs | Textual Inputs |
|---|---|---|
| Edit input | [image] | Question: As a result of World War III, the country in the picture moves its capital. The capital of the country in the picture is |
| | | $\tilde{o}$: Rupnagar |
| $p_r$ | [image] | Question: The capital of the country in the picture is |
| | | $t_r$: Rupnagar |
| | | Alias: Rupar, Ropar |
| $p_c$ | / | Question: The capital of United Kingdom is |
| | | $t_c$: Rupnagar |
| | | Alias: Rupar, Ropar |
| $p_l$ | [image] | Question: Which TV channel is shown in the picture? |
| | | $t_l$: English |
| | | Alias: English language, … |
| $p^M_g$ | [image] | Question: The capital of the country in the picture is |
| | | $t^M_g$: Rupnagar |
| | | Alias: Rupar, Ropar |
| $p^T_g$ | [image] | Question: Can you tell me the capital of the country shown in the picture? |
| | | $t^T_g$: Rupnagar |
| | | Alias: Rupar, Ropar |
Table 11: IRO_edit multimodal input examples.