GenEARL: A Training-Free Generative Framework for Multimodal Event Argument Role Labeling

Hritik Bansal, Po-Nien Kung, P. Jeffrey Brantingham, Kai-Wei Chang, Nanyun Peng
University of California Los Angeles
[email protected], [email protected]

Abstract

Multimodal event argument role labeling (EARL), a task that assigns a role to each event participant (object) in an image, is a complex challenge. It requires reasoning over the entire image, the depicted event, and the interactions between various objects participating in the event. Existing models rely heavily on high-quality event-annotated training data to understand event semantics and structures, and they fail to generalize to new event types and domains. In this paper, we propose GenEARL, a training-free generative framework that harnesses the power of modern generative models to understand event task descriptions given image contexts to perform the EARL task. Specifically, GenEARL comprises two stages of generative prompting with a frozen vision-language model (VLM) and a frozen large language model (LLM). First, a generative VLM learns the semantics of the event argument roles and generates event-centric object descriptions based on the image. Subsequently, an LLM is prompted with the generated object descriptions in a predefined template for EARL (i.e., assigning an event argument role to an object). We show that GenEARL outperforms the contrastive pretraining (CLIP) baseline by 9.4% and 14.2% accuracy for zero-shot EARL on the M ${}^{2}$ E ${}^{2}$ and SWiG datasets, respectively. In addition, we outperform CLIP-Event by $22\%$ precision on the M ${}^{2}$ E ${}^{2}$ dataset. The framework also allows flexible adaptation and generalization to unseen domains.


1 Introduction

Multimodal event extraction (EE) aims to identify the event types depicted in an image (e.g., ‘Arrest’) and the event participants (objects), and to perform event argument role labeling (EARL) for the objects (e.g., a person is the ‘Agent’ performing an ‘Arrest’ event). In many real-world applications such as analysis of news articles and social media, important event-related information is grounded in multimodal image-text data Stephens et al. (1998); Blandfort et al. (2019). For example, a tweet about a protest may not mention whether the protest is violent, yet this can be readily discerned from the accompanying image Giorgi et al. (2022). Hence, it becomes increasingly important to extract multimodal events and relations, beyond text, as they provide a useful lens for analyzing social events.

Figure 1: Overview of the multimodal Event Argument Role Labeling task. The input is an image that depicts an Arrest event type, a list of possible event argument roles (Agent, Person, Instrument), and three participant objects (bounding boxes) colored ‘red’, ‘blue’, and ‘yellow’. The task is to assign an event argument role to each of the objects based on its role in the depicted event. Here, the object in the ‘blue’ bounding box plays the role of the Person who gets arrested, whereas the objects in the ‘red’ and ‘yellow’ bounding boxes play the role of the Agent performing the arrest.

In this work, our focus is on multimodal Event Argument Role Labeling (EARL) (illustrated in Figure 1) given its practical significance and the unique technical challenges it presents within the multimodal EE pipeline. For example, social scientists will benefit from understanding the Demonstrators, Police, and Instrument from a Protest scene. Meanwhile, performing accurate multimodal EARL requires a fine-grained understanding of the entire image depicting the event, along with comprehension of the roles and interactions between the various objects present in the image, which poses challenges for large-scale pretrained vision-language models such as CLIP Radford et al. (2019).

Existing methods Li et al. (2020, 2022b) trained on event-annotated data, such as ACE Doddington et al. (2004), imSitu Yatskar et al. (2016), and VOA Li et al. (2020), tend to overfit to the event types seen during training and fail to generalize well to unseen event types and new domains Parekh et al. (2023). Subsequent finetuning of these models requires the creation of new event-annotated data, which is expensive and hard to acquire.

To tackle these challenges, our major contribution is the proposal of GenEARL, a two-stage training-free generative framework for multimodal EARL. Our framework utilizes the ability of instruction-following generative vision-language models (GVLM) and large language models (LLM) to effectively comprehend multimodal and text-only input prompts. First, we prompt the GVLM with visual features of the image and object, along with the target event and its associated argument roles in a predefined template, such that it generates an event-centric object description. This object description summarizes the role of the object in the context of the event depicted in the image. Since the generative VLM reasons over the task description in the input prompt, it does not require expensive event-annotated data for generating object role descriptions. In the next stage, we prompt an instruction-following LLM with the generated object descriptions and the image caption, along with the target event and its argument role labels in a predefined template, to predict event argument role labels. We further show that in-context learning Dong et al. (2022); Liu et al. (2023c) for each of the steps can improve EARL extraction.

Our second contribution is a comprehensive empirical study of the generalization ability of our framework to perform multimodal EARL on the M ${}^{2}$ E ${}^{2}$ and SWiG datasets, as used in Li et al. (2022b). We find that GenEARL outperforms the CLIP-L/14 Radford et al. (2019) baseline by $9.4\%$ and $14.2\%$ accuracy in the zero-shot EARL on the M ${}^{2}$ E ${}^{2}$ Li et al. (2020) and SWiG Pratt et al. (2020) datasets, respectively (§5.1). In addition, we find that combining CLIP for event detection and GenEARL for EARL outperforms CLIP-Event Lu et al. (2021) on M ${}^{2}$ E ${}^{2}$ by 22% on precision.

We further study the discrepancies between the training-free and supervised paradigms by finetuning a CLIP-B/32 on the training data of the SWiG dataset (§4.3), which enhances the CLIP’s understanding of event semantics and structures. We observe that our framework reduces the gap between the supervised and training-free paradigms from $44.5\%$ to $16.1\%$ on the SWiG dataset. In addition, our framework outperforms this finetuned CLIP model on the M ${}^{2}$ E ${}^{2}$ dataset by $3\%$ accuracy with just three-shot in-context learning examples. This highlights that our generative paradigm is competitive with supervised methods without any training on event data.

2 Background: Multimodal EARL

In this section, we first describe the task of multimodal event argument role labeling (EARL). We are given an input image $\mathcal{I}$ that depicts an event type $E$ , which belongs to the set of possible event types $\mathcal{E}$ under the predefined event ontology, i.e., $E\in\mathcal{E}$ . Further, each event type $E$ comes with a set of event arguments $\mathcal{A}_{E}=\{a_{1},\ldots,a_{m}\}$ that specifies the roles associated with it. In our datasets, each image contains a set of objects $\mathcal{O}=\{o_{1},\ldots,o_{n}\}$ where each object $o_{i}$ is represented as a bounding box and participates in the depicted event.

The task of multimodal event argument role labeling (EARL) is to label every participating object (bounding box) $o_{i}$ with an event argument role label $a_{j}$ . The predicted argument role label for the participating object highlights its interactions with the other objects in the scene and its contribution to the understanding of the event scenario. For example, in Figure 1, the object in the ‘blue’ bounding box plays the role of the Person who gets arrested, whereas the objects in the ‘red’ and ‘yellow’ bounding boxes play the role of the Agent performing the arrest. Overall, EARL is a challenging task since it requires a model to reason over the fine-grained features of the image and the interactions between the participating objects. We highlight the assumptions of our work in Appendix §A.
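The task definition above can be made concrete with a small data structure. The following is a minimal sketch only: the class names `BoundingBox` and `EarlInstance`, the field names, the file name, and the pixel coordinates are assumptions made for illustration and are not the datasets' actual formats. The roles follow the Arrest example of Figure 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BoundingBox:
    """One participating object o_i as pixel corners (hypothetical format)."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int


@dataclass
class EarlInstance:
    """An image I depicting event E, its role set A_E, and its objects O."""
    image_path: str
    event_type: str
    argument_roles: List[str]
    objects: List[BoundingBox]


def is_valid_labeling(instance: EarlInstance, labels: List[str]) -> bool:
    """A labeling assigns exactly one role from A_E to every object in O."""
    return len(labels) == len(instance.objects) and all(
        label in instance.argument_roles for label in labels
    )


# Made-up coordinates for the red, blue, and yellow boxes of Figure 1.
arrest = EarlInstance(
    image_path="arrest.jpg",
    event_type="Arrest",
    argument_roles=["Agent", "Person", "Instrument"],
    objects=[
        BoundingBox(10, 20, 110, 220),
        BoundingBox(120, 30, 200, 230),
        BoundingBox(210, 25, 300, 225),
    ],
)
figure_1_labels = ["Agent", "Person", "Agent"]  # red, blue, yellow
```

Note that several objects may share a role (two Agents here) and that some roles (Instrument) may remain unassigned.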

3 The GenEARL Model

To tackle the challenging task of multimodal EARL, we propose GenEARL, a two-stage training-free generative framework that utilizes the unique capabilities of modern generative models Liu et al. (2023a); OpenAI (2021) to understand event task descriptions and perform well on them. Specifically, the framework consists of (a) a frozen generative vision-language model (GVLM) and (b) a frozen large language model (LLM). Consider an input instance $\mathcal{Q}_{i}=\{\mathcal{I},E,o_{i},\mathcal{A}_{E}\}$ where we aim to label the event argument role for the object $o_{i}$ in the event $E$ depicted in the image $\mathcal{I}$ . $\mathcal{A}_{E}$ comprises the possible event argument role labels for the event $E$ along with their definitions from the dataset guidelines.

We propose to decouple the task of multimodal EARL on the given input instance into two distinct sub-tasks. This approach allows each sub-task to be effectively addressed by the individual capabilities of the GVLM and LLM respectively. Specifically, we (a) generate an event-centric object description that captures the role played by the object $o_{i}$ in the context of the event $E$ in the image $\mathcal{I}$ (§3.1) followed by (b) extracting the EARL from the generated object role descriptions (§3.2).

3.1 Generative Vision-Language Model

We consider a generative vision-language model (GVLM) $\mathcal{G}$ that can reason over the text and visual input and generate an output text. Here, the GVLM is capable of capturing the fine-grained details in the event task description with image(s) in its context. We illustrate this stage in Figure 2 (left).

For the input instance $\mathcal{Q}_{i}$ , we prompt the GVLM with a multimodal prompt $\mathcal{P}=\mathcal{T}(\mathcal{Q}_{i})$ where $\mathcal{T}$ is a predefined template that converts the input query details into a multimodal prompt (Appendix Table 8). We show an example multimodal input prompt in Figure 2 (left). Subsequently, we generate the event-centric object role description $d_{o_{i}}$ from the GVLM with the context $\mathcal{P}$ .
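To illustrate how a template such as $\mathcal{T}$ can turn an input instance into a prompt, the sketch below assembles the text part of a first-stage prompt. It is a hypothetical stand-in: the wording is not that of Appendix Table 8, the bracketed image placeholders are invented markers for where the event image and the object crop would be attached, and the role definitions are illustrative rather than taken from the dataset guidelines.

```python
from typing import Dict


def build_gvlm_prompt(event: str, role_definitions: Dict[str, str]) -> str:
    """Hypothetical stand-in for the template T (not the paper's wording)."""
    roles = "\n".join(
        f"- {role}: {definition}" for role, definition in role_definitions.items()
    )
    return (
        "[EVENT IMAGE] [OBJECT CROP]\n"  # invented placeholders for the two images
        f"The first image depicts the event '{event}'. "
        "The second image shows one object from it.\n"
        f"Possible argument roles:\n{roles}\n"
        "Describe the role this object plays in the event."
    )


# Illustrative definitions only; the real ones come from the dataset guidelines.
arrest_roles = {
    "Agent": "the entity performing the arrest",
    "Person": "the person who gets arrested",
    "Instrument": "the tool used in the arrest",
}
gvlm_prompt = build_gvlm_prompt("Arrest", arrest_roles)
```

The text returned by the GVLM for this prompt would play the part of the object role description $d_{o_{i}}$ .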

3.2 Large Language Model

We illustrate this stage in Figure 2 (right). In the first stage, we generate an object role description for a given image $\mathcal{I}$ and participant object $o_{i}$ . Here, we use a large language model (LLM) $\mathcal{L}$ since LLMs are capable of generalizing to novel input queries to solve a task Wei et al. (2021). We provide the LLM with a text-only prompt $\mathcal{C}=\mathcal{T}_{\ell}({d_{\mathcal{I}},E,\mathcal{A}_{E},d_{o_{i}}})$ . Here, $\mathcal{T}_{\ell}$ is a predefined template that converts the input information, including the generated object role description $d_{o_{i}}$ from the previous stage, the event $E$ , the image description $d_{\mathcal{I}}$ (the image caption generated using the GVLM $\mathcal{G}$ with the prompt ‘Describe this image in detail.’), and the possible argument roles $\mathcal{A}_{E}$ , into a detailed task description for the LLM. We illustrate an example input prompt at the top of Figure 2 (right).

The template $\mathcal{T}_{\ell}$ is designed in a way that prompts the LLM to output the predicted event argument role $\tilde{a}_{i}$ for the object $o_{i}$ . Following this, we can evaluate if the predicted EARL matches with the ground-truth argument role label $a_{i}$ . We show the example predicted EARL in the lower half of Figure 2 (right). In our work, we use ChatGPT OpenAI (2021), a state-of-the-art LLM that can adapt to new tasks and perform well on them.
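The second stage can be sketched as a text template plus a small parser for the model's answer. The code below is an illustration under stated assumptions: `build_llm_prompt` is a hypothetical stand-in for $\mathcal{T}_{\ell}$ (not the paper's wording), the caption and object description are made up, and `parse_predicted_role` uses a simple earliest-mention heuristic that the paper does not describe. No LLM is called; a fixed string stands in for the model response.

```python
from typing import List, Optional


def build_llm_prompt(image_caption: str, event: str, roles: List[str],
                     object_description: str) -> str:
    """Hypothetical stand-in for the template T_l."""
    return (
        f"Image description: {image_caption}\n"
        f"Event: {event}\n"
        f"Possible argument roles: {', '.join(roles)}\n"
        f"Object description: {object_description}\n"
        "Which argument role does the object play? Answer with one role."
    )


def parse_predicted_role(response: str, roles: List[str]) -> Optional[str]:
    """Return the role mentioned earliest in the response, or None if none appears."""
    lowered = response.lower()
    hits = [(lowered.find(role.lower()), role)
            for role in roles if role.lower() in lowered]
    return min(hits)[1] if hits else None


roles = ["Agent", "Person", "Instrument"]
prompt = build_llm_prompt(
    image_caption="Two officers restrain a man on a street.",  # made-up d_I
    event="Arrest",
    roles=roles,
    object_description="The object is an officer holding the man's arm.",  # made-up d_o_i
)
# A made-up response string stands in for the LLM output.
predicted = parse_predicted_role("Agent", roles)
is_correct = predicted == "Agent"  # comparison with the ground-truth label a_i
```

Returning `None` when no role is found makes unparseable answers easy to count as errors during evaluation.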

4 Experimental Setup

4.1 Dataset

We evaluate our approach on two datasets: M ${}^{2}$ E ${}^{2}$ Li et al. (2020) and SWiG Yatskar et al. (2016). In both these datasets, each image has an associated event type, a bounding box for the participating objects, and their associated event argument roles.

M ${}^{2}$ E ${}^{2}$ consists of 391 images with event mentions taken from the Voice of America (VOA) news website (https://www.voanews.com/), where the event types and their argument roles are derived from the ACE ontology Doddington et al. (2004). To maintain a good argument role density for evaluation, we filter out the event types with fewer than three argument roles (Appendix Table 6). In total, we label 990 bounding boxes spanning 275 images in the M ${}^{2}$ E ${}^{2}$ dataset.

The SWiG dataset consists of images that can belong to one of 504 action verbs. Following Li et al. (2022b), we consider the action verbs to be analogous to the event types. As in our M ${}^{2}$ E ${}^{2}$ setup, we filter out the action verbs with fewer than three event argument roles (Appendix Table 7). In our experiments, we label 1600 bounding boxes spanning 600 test images from this dataset.

Following Li et al. (2022b), we do not consider ‘place’ as an event argument role since it cannot be grounded as a bounding box in either dataset. In the zero-shot setting, we evaluate GenEARL’s capability to reason about the examples, without training on event data, by prompting the pretrained generative models with event information, its associated argument role labels, and the visual context from the image and the object. In the few-shot setting, we provide k-shot examples consisting of all the bounding boxes in k images sampled randomly from the dataset. For example, when k = 1, we show the labels for the objects in one of the images from the dataset (Appendix Table 12).
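The k-shot selection described above samples images rather than individual boxes, so every labeled object in a chosen image becomes part of the demonstration. A minimal sketch follows; the function name `sample_k_shot`, the fixed seed, and the toy annotation dictionary are assumptions for illustration.

```python
import random
from typing import Dict, List, Tuple

Annotations = Dict[str, List[Tuple[str, str]]]  # image id -> [(box id, role), ...]


def sample_k_shot(annotations: Annotations, k: int, seed: int = 0) -> Annotations:
    """Pick k images at random and keep every labeled box they contain."""
    rng = random.Random(seed)  # seeded so the demonstration set is reproducible
    chosen = rng.sample(sorted(annotations), k)
    return {image_id: annotations[image_id] for image_id in chosen}


# Toy annotations with made-up ids.
annotations = {
    "img_001": [("box_0", "Agent"), ("box_1", "Person")],
    "img_002": [("box_0", "Agent")],
    "img_003": [("box_0", "Person"), ("box_1", "Instrument"), ("box_2", "Agent")],
}
one_shot = sample_k_shot(annotations, k=1)
```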

Paradigm	Method	M ${}^{2}$ E ${}^{2}$ Accuracy (%)	SWiG Accuracy (%)
Training Free	CLIP-B/32 (0-shot)	28.1	22.6
	CLIP-L/14 (0-shot)	30	24.7
	GenEARL (0-shot)	39.4	38.9
	GenEARL (1-shot)	44.1	50.1
	GenEARL (3-shot)	45.8	51
Supervised	CLIP-B/32 (+ SWiG)	41.9	67.1

Table 1: Multimodal EARL accuracy (%) of GenEARL (zero-shot to three-shot), pretrained CLIP (zero-shot), and supervised CLIP on the M ${}^{2}$ E ${}^{2}$ and SWiG dataset. Zero-shot GenEARL framework outperforms the zero-shot CLIP models on both datasets. The performance of our framework improves from zero-shot to three-shot example without any training of the underlying models on the event-annotated data. A supervised baseline is only used as a reference as it accesses event-annotated data. We present precision, recall, and F1 scores for the same in Appendix §C.

4.2 Implementation Details

GenEARL is a two-stage framework that consists of a vision-language model and a large language model. In our experiments, we use LLaVA-7B Liu et al. (2023a), a state-of-the-art generative vision-language model. It is finetuned with 150K multimodal instructions that enable it to understand a wide breadth of concepts and generate helpful and coherent outputs. Traditionally, LLaVA is used to generate outputs for a single image and text input; however, we adapt it to reason over multiple images during inference, i.e., an image depicting the event and a bounding box containing the object to be labeled. Due to this design, the multimodal input prompt can flexibly incorporate the input instance details to generate the event-centric object descriptions without any training on the event-annotated data. In contrast, existing models Li et al. (2022b) cannot adapt to new event types or new domains without further finetuning on an event-annotated dataset.

For the second stage of our framework, where we predict the event argument role label based on the generated event-centric object description, we use ChatGPT OpenAI (2021) as our default LLM. It is a powerful LLM that can adapt to novel task descriptions and perform well on them. In addition, it is capable of learning about the target domains through few-shot examples in its context. In our work, we leverage this capability by prompting the LLM with a few solved examples. Since prompting ChatGPT comes with cost considerations (https://openai.com/pricing), we label the event argument roles in batches where all the participating objects $o_{i}\in\mathcal{O}$ in an event-depicting image $\mathcal{I}$ form a single batch. We access this model through OpenAI’s API (https://openai.com/blog/openai-api), where it corresponds to ‘gpt-3.5-turbo’. In §5.2, we replace ChatGPT with Alpaca Taori et al. (2023) in the second stage for multimodal EARL.
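The per-image batching described above amounts to grouping object descriptions by image before prompting. The sketch below shows one way to do this; the function name `batch_by_image` and the record layout are assumptions, and it presumes one request per batch, which the text does not state explicitly.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, str, str]  # (image id, object id, object role description)


def batch_by_image(records: List[Record]) -> Dict[str, List[Tuple[str, str]]]:
    """Group all object descriptions of the same image into a single batch."""
    batches: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for image_id, object_id, description in records:
        batches[image_id].append((object_id, description))
    return dict(batches)


# Made-up records: two objects in one image, one object in another.
records = [
    ("img_001", "red", "An officer holding a man's arm."),
    ("img_001", "blue", "A man being restrained."),
    ("img_002", "red", "A person carrying a banner."),
]
batches = batch_by_image(records)
num_requests = len(batches)  # one request per image instead of one per object
```

Under that one-request-per-batch assumption, the M ${}^{2}$ E ${}^{2}$ evaluation would need 275 requests for its 990 labeled boxes.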

4.3 Baselines

We compare the GenEARL framework against the open-source CLIP models Radford et al. (2019). We evaluate our framework against zero-shot CLIP to understand the generalization capability of our model without training on any event-annotated data. Subsequently, we train a supervised CLIP model to assess the gap between our training-free paradigm and a supervised model that learns about the events from event-annotated data.

Following Li et al. (2020), we compare against large-scale representation learning vision-language models like CLIP-B/32 and CLIP-L/14 Radford et al. (2019) since they can leverage any event knowledge acquired during their pretraining for multimodal EARL in a zero-shot setting.

Consider an event $E$ depicted in an image, a set of argument role labels $\mathcal{A}_{E}=\{a_{1},\ldots,a_{m}\}$ for the event, and the object $o_{i}$ that needs to be labeled. We calculate the similarity score $s(o_{i},T_{K}(a_{j}))$ between the object representation in the image embedding space and the text embedding of a set of text-based templates $T_{K}$ containing information about the event and event argument role. Specifically, our template is ‘An object playing $a_{j}$ role in the $E$ event.’ for every argument role label $a_{j}$ in $\mathcal{A}_{E}$ . The predicted argument role label $\tilde{a}_{i}$ is the one that maximizes the similarity score between the object representation and its corresponding text template, i.e., $\tilde{a}_{i}=\text{argmax}_{a_{j}\in\mathcal{A}_{E}}\,s(o_{i},T_{K}(a_{j}))$ .
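The zero-shot baseline above reduces to filling the template for every role and taking an argmax over similarity scores. The sketch below shows that selection step with toy vectors in place of real CLIP embeddings. Using cosine similarity for $s$ is an assumption (the text only says "similarity score"), and the function names and the three-dimensional vectors are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def build_templates(event: str, roles: Sequence[str]) -> List[str]:
    """Fill the template from the text once per argument role a_j."""
    return [f"An object playing {role} role in the {event} event." for role in roles]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (assumed form of s)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_role(object_embedding: Sequence[float],
                 template_embeddings: Sequence[Sequence[float]],
                 roles: Sequence[str]) -> Tuple[str, List[float]]:
    """Return the role whose template embedding is most similar to the object."""
    scores = [cosine(object_embedding, t) for t in template_embeddings]
    best = max(range(len(roles)), key=scores.__getitem__)
    return roles[best], scores


roles = ["Agent", "Person", "Instrument"]
templates = build_templates("Arrest", roles)
# Toy stand-ins for the image-side and text-side embeddings.
object_embedding = [0.9, 0.1, 0.0]
template_embeddings = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
predicted, scores = predict_role(object_embedding, template_embeddings, roles)
```

In an actual run, the object crop and the filled templates would be encoded by the CLIP image and text encoders, respectively, before this step.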

To impart the knowledge of the event semantics and structures to the pretrained baseline, we finetune a CLIP-B/32 model on the 75K images of the SWiG dataset. We create a text-based prompt with the event and argument role label for every example, as described above. Finally, the model is finetuned with the contrastive pretraining objective to bring the representations of the objects and their EARL together in the joint embedding space and push apart the representations of the unmatched objects and event roles. We provide the finetuning details in Appendix §B.1. Given the lack of training data for the M ${}^{2}$ E ${}^{2}$ domain, we evaluate CLIP (+SWiG) on it as a transfer learning setup.
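For intuition about the contrastive objective mentioned above, the sketch below computes a symmetric cross-entropy loss over an object-by-text similarity matrix, in the style commonly used for CLIP training. It is an assumption-laden illustration: the exact finetuning loss and hyperparameters are in Appendix §B.1 and may differ, the temperature of 0.07 is a placeholder, and the function names are invented.

```python
import math
from typing import List

Matrix = List[List[float]]


def diagonal_cross_entropy(logits: Matrix) -> float:
    """Mean negative log-probability of the matched (diagonal) entry in each row."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(similarity: Matrix, temperature: float = 0.07) -> float:
    """Average of the object-to-text and text-to-object cross-entropies."""
    logits = [[s / temperature for s in row] for row in similarity]
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


# Rows are objects, columns are role templates; entry (i, i) is the matched pair.
aligned = [[1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [1.0, 0.0]]
uninformative = [[0.0, 0.0], [0.0, 0.0]]
```

The loss is small when matched pairs have the highest similarity in their row and column, and it equals $\log 2$ for a two-by-two matrix that carries no information.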

5 Experimental Results

Method	M ${}^{2}$ E ${}^{2}$ Accuracy (%)	SWiG Accuracy (%)	Average (%)
GenEARL w/ ChatGPT (0-shot)	39.4	38.9	39.2
GenEARL w/ Alpaca-7B (0-shot)	50.4	36.5	43.5
GenEARL w/ ChatGPT (1-shot)	44.1	50.1	47.1
GenEARL w/ Alpaca-7B (1-shot)	54.6	37.5	46.1
LLaVA-7B (0-shot)	28.6	26	27.3

Table 2: Multimodal EARL accuracy (%) of the GenEARL framework with the choice of LLM (§5.2). The last row highlights the performance if we were to only use the GVLM for multimodal EARL (§5.4).

5.1 Main Results

We evaluate the accuracy of the GenEARL framework for multimodal EARL on the M ${}^{2}$ E ${}^{2}$ and SWiG datasets, in Table 1. We find that zero-shot GenEARL achieves a multimodal EARL accuracy of $39.4\%$ and $38.9\%$ on the M ${}^{2}$ E ${}^{2}$ and SWiG datasets, respectively, without any specialized training on the event-annotated data. We observe that zero-shot GenEARL framework outperforms the CLIP-L/14 baseline by $9.4\%$ on M ${}^{2}$ E ${}^{2}$ and $14.2\%$ on SWiG dataset. This highlights the strong generalization ability of the generative models to adapt to multimodal EARL task descriptions and perform well on them.

Prior works have shown that LLMs are capable of improving their reasoning capabilities by learning from the few-shot examples in their context Dong et al. (2022); Brown et al. (2020). Hence, we prompt the ChatGPT model in the GenEARL framework under the one-shot and three-shot settings. We observe that even a single example in the context greatly enhances the performance of the framework to 44.1% and 50.1% on the M ${}^{2}$ E ${}^{2}$ and SWiG datasets, respectively. In addition, we find that providing three examples slightly improves the EARL accuracy over the one-shot setting. Due to the context length limitations and cost considerations of the OpenAI API, we do not experiment with more than three-shot examples. Our results highlight the unique capability of our framework to improve its performance with a very small amount of event-annotated data.

We further aim to understand the performance improvements and generalization of the pretrained CLIP baseline when it is finetuned with the SWiG training data, which was originally acquired by expensive annotation from the human labelers. We observe that our framework reduces the gap between the supervised paradigm and training-free paradigm from $44.5\%$ to $16.1\%$ on the SWiG dataset. In addition, our framework outperforms the finetuned CLIP model on the M ${}^{2}$ E ${}^{2}$ dataset by $3\%$ accuracy with just three-shot examples. Thus, our framework provides a flexible, training-free, and generalizable method for accurate multimodal EARL that can easily adapt to new event types. We provide a few qualitative examples for model predictions in Appendix Figure 5.
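The gap and margin figures quoted in this section can be recomputed from Table 1. The snippet below does so as a quick check; the dictionary layout and variable names are ours, and pairing the 44.5 point gap with zero-shot CLIP-B/32 and the 16.1 point gap with three-shot GenEARL is inferred from the fact that those rows reproduce the reported numbers.

```python
# Accuracy (%) copied from Table 1.
table_1 = {
    "CLIP-B/32 (0-shot)": {"M2E2": 28.1, "SWiG": 22.6},
    "CLIP-L/14 (0-shot)": {"M2E2": 30.0, "SWiG": 24.7},
    "GenEARL (0-shot)": {"M2E2": 39.4, "SWiG": 38.9},
    "GenEARL (3-shot)": {"M2E2": 45.8, "SWiG": 51.0},
    "CLIP-B/32 (+ SWiG)": {"M2E2": 41.9, "SWiG": 67.1},
}


def difference(method_a: str, method_b: str, dataset: str) -> float:
    """Accuracy of method_a minus accuracy of method_b, in percentage points."""
    return round(table_1[method_a][dataset] - table_1[method_b][dataset], 1)


# Supervised vs. training-free gap on SWiG, before and after GenEARL.
gap_clip = difference("CLIP-B/32 (+ SWiG)", "CLIP-B/32 (0-shot)", "SWiG")
gap_genearl = difference("CLIP-B/32 (+ SWiG)", "GenEARL (3-shot)", "SWiG")

# Zero-shot margins of GenEARL over CLIP-L/14.
margin_m2e2 = difference("GenEARL (0-shot)", "CLIP-L/14 (0-shot)", "M2E2")
margin_swig = difference("GenEARL (0-shot)", "CLIP-L/14 (0-shot)", "SWiG")
```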

Comparison with CLIP-Event.

We propose GenEARL as a multimodal EARL framework where we assume access to the ground-truth event depicted in the image. This differs from prior work CLIP-Event Li et al. (2022b) that performs event detection followed by EARL. Hence, we use a combination of CLIP-B/32 for event type detection and the GenEARL framework for EARL to directly compare with the reported CLIP-Event results. (The code for CLIP-Event is publicly available; however, its pretraining dataset and pretrained checkpoint are unavailable.)

Method	M ${}^{2}$ E ${}^{2}$ Precision	M ${}^{2}$ E ${}^{2}$ Recall	M ${}^{2}$ E ${}^{2}$ F1
CLIP-Event	21.1%	13.1%	17%
CLIP + GenEARL (0-shot)	42.4%	24.1%	31.4%
CLIP + GenEARL (3-shot)	43%	29%	34.6%

Table 3: Comparison between the combination of CLIP (for event detection) and GenEARL (for EARL) with CLIP-Event that performs end-to-end event detection followed by EARL.

In Table 3, we find that the combination of CLIP and the GenEARL framework outperforms CLIP-Event by 22%, 15%, and 17% on precision, recall, and F1, respectively. With advancements in the event detection modules, the performance of the complete event detection and GenEARL pipeline will further improve.

5.2 The Impact of Large Language Models

Here, we aim to study whether ChatGPT can be replaced by a relatively smaller instruction-following LLM like Alpaca-7B Taori et al. (2023) in our GenEARL framework. We provide the details for predicting EARL using Alpaca in Appendix §E.

We compare the EARL accuracy with ChatGPT and Alpaca in the GenEARL framework under the zero-shot and one-shot settings in Table 2. We observe that Alpaca-7B outperforms ChatGPT in the zero-shot setting ( $43.5\%$ vs $39.2\%$ ) while lagging behind ChatGPT in the one-shot setting ( $46.1\%$ vs $47.1\%$ ) in the average EARL accuracy across the datasets. (We do not perform a three-shot evaluation with Alpaca due to the limitations of its context length and ability to capture long-range dependencies.) This suggests that ChatGPT is able to improve more with the few-shot examples, while the base performance of Alpaca is already high. Specifically, we find that Alpaca-7B outperforms ChatGPT by a large margin on the M ${}^{2}$ E ${}^{2}$ dataset, suggesting that it understands the events grounded in the news data better than ChatGPT. Conversely, ChatGPT outperforms Alpaca-7B by a large margin on the SWiG dataset, which suggests that ChatGPT is better at reasoning over the event types associated with the common action verbs. Such differences in the understanding of the event domains can be attributed to the difference in the base language models (InstructGPT Ouyang et al. (2022) for ChatGPT and LLaMA Touvron et al. (2023) for Alpaca) and their pretraining data.
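The averages compared above are the unweighted mean of the two dataset accuracies in Table 2. The snippet below recomputes that column. It uses exact decimal arithmetic with half-up rounding to one decimal place, which is an assumption about how the column was rounded; it happens to reproduce every reported value.

```python
from decimal import Decimal, ROUND_HALF_UP

# (M2E2 accuracy, SWiG accuracy) in %, copied from Table 2 as strings so that
# the decimal arithmetic is exact.
table_2 = {
    "GenEARL w/ ChatGPT (0-shot)": ("39.4", "38.9"),
    "GenEARL w/ Alpaca-7B (0-shot)": ("50.4", "36.5"),
    "GenEARL w/ ChatGPT (1-shot)": ("44.1", "50.1"),
    "GenEARL w/ Alpaca-7B (1-shot)": ("54.6", "37.5"),
    "LLaVA-7B (0-shot)": ("28.6", "26"),
}


def average(m2e2: str, swig: str) -> Decimal:
    """Unweighted mean of the two accuracies, rounded half-up to one decimal."""
    mean = (Decimal(m2e2) + Decimal(swig)) / 2
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


averages = {method: average(*scores) for method, scores in table_2.items()}
```

Because the mean is unweighted, it does not account for the different number of labeled boxes in the two datasets (990 versus 1600).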

5.3 Human Assessment

We hypothesize that GenEARL’s performance depends on the quality of the object role descriptions generated by the GVLM (§3.1). To verify this, we perform a human assessment for 100 instances of the generated object descriptions from the M ${}^{2}$ E ${}^{2}$ dataset. We ask the human labelers to choose whether the generated object role descriptions provide sufficient information for a human or LLM to correctly label the object’s event argument role. We conduct the experiment using 4 labelers from Amazon’s Mechanical Turk (MTurk) platform (https://www.mturk.com/). Prior to the annotation, each labeler passed a qualification test involving 10 sample instances, different from the actual 100 instances, to assess their understanding of the task. We provide the annotators with a detailed slide deck describing the task with a few solved examples. We attach a screenshot for one of the solved examples in Figure 14. We assign 3 labelers per instance to evaluate the annotator agreement and get majority votes on the description quality. The annotators were paid at a rate of 18 USD per hour (the total cost of human annotations was close to 100 USD).

We find that the agreement between any two annotators averaged over 100 instances was $75\%$ , which is high given the subjective complexity involved in assessing the dense object role descriptions. In Figure 3a, we provide the distribution of all the labeler responses for 300 annotations (100 instances $\times$ 3 labelers). We observe that the labelers find $47\%$ , $10\%$ , $43\%$ of the object role descriptions sufficient, ambiguous, and insufficient, respectively, to correctly label the object’s argument role. This suggests that there is scope for enhancing the quality of the generated object descriptions with advancements in GVLM design and prompting techniques. In Figure 3b, we assess the performance of the GenEARL framework on the instances that were marked as containing ‘sufficient’ and ‘insufficient’ information to accurately predict the object role labels. We find that the results verify our hypothesis since the performance of the framework is $\sim 80\%$ and $\sim 20\%$ on the high and low-quality examples, respectively. The result highlights that innovations in the generative VLM will improve the performance of the framework.
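The agreement and majority-vote statistics above can be computed from per-instance label triples. The sketch below shows one plausible formulation: agreement is taken as the fraction of annotator pairs that give the same label, averaged over instances, and an instance without a strict majority gets no aggregate label. Both choices are assumptions, since the exact formulas are not specified here, and the toy annotations are made up.

```python
from collections import Counter
from itertools import combinations
from typing import List, Optional, Sequence


def pairwise_agreement(annotations: Sequence[Sequence[str]]) -> float:
    """Fraction of agreeing annotator pairs, averaged over instances."""
    per_instance = []
    for labels in annotations:
        pairs = list(combinations(labels, 2))
        per_instance.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_instance) / len(per_instance)


def majority_vote(labels: Sequence[str]) -> Optional[str]:
    """Label chosen by more than half of the annotators, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


# Toy data: three labelers per instance, as in the study design.
toy_annotations: List[List[str]] = [
    ["sufficient", "sufficient", "insufficient"],
    ["sufficient", "sufficient", "sufficient"],
]
agreement = pairwise_agreement(toy_annotations)
votes = [majority_vote(labels) for labels in toy_annotations]
```

With three labelers and three possible labels, a strict majority can fail to exist, which is why the aggregate label is optional.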

5.4 Ablations

EARL via GVLM

In our framework, we utilize LLaVA-7B for generating the event-centric object role descriptions that subsequently prompt the LLM for multimodal EARL prediction. Here, we study whether we can perform multimodal EARL without GenEARL’s two-stage pipeline. To answer this, we use LLaVA-7B to label the event argument roles for the participating objects given the visual features of the image and object, and the event and its associated argument roles. We provide more details for predicting EARL using LLaVA-7B in Appendix §F.

We present our findings regarding multimodal EARL with LLaVA-7B under the zero-shot setting, as shown in the last row of Table 2. Our analysis reveals that LLaVA-7B exhibits subpar performance when compared to the GenEARL with LLMs on both datasets. The underperformance of LLaVA-7B in the novel event argument role labeling task can potentially be attributed to its fine-tuning data, which lacked classification-oriented instructions and instead emphasized providing detailed image descriptions following the input instruction. Consequently, we leverage LLaVA’s capability to generate event-centric object descriptions when prompted, which can be effectively utilized by a robust LLM within the GenEARL framework. This LLM is likely to have encountered a greater number of classification-style instructions during its finetuning process which leads to its labeling capabilities.

Effect of the Conditioning Variables

In the previous experiments, we show that the quality of the generated object role descriptions affects the predictions made by the LLM in the GenEARL framework. In this experiment, we aim to study the factors that affect this quality. Specifically, we compare the performance of our framework where object descriptions are generated (a) without the visual features of the event-depicting image $\mathcal{I}$ and without the event details, including the event $E$ and the argument role labels and their definitions $\mathcal{A}_{E}$ , and (b) without the visual features of the event-depicting image $\mathcal{I}$ . In both scenarios, the visual features of the object $o_{i}$ being labeled are present in the input prompt to the GVLM. The input templates for these settings are presented in Table 9 and Table 10.

GVLM prompting	M ${}^{2}$ E ${}^{2}$ (%)	SWiG (%)
w/ $\mathcal{I}$ , ( $E$ , $\mathcal{A}_{E}$ ) [Ours]	44.1	51
wo/ $\mathcal{I}$	42.1	44
wo/ $\mathcal{I}$ , ( $E$ , $\mathcal{A}_{E}$ ) [Object Caption]	39.9	43

Table 4: Multimodal EARL accuracy (%) on the M ${}^{2}$ E ${}^{2}$ and SWiG dataset. We compare three GVLM prompting strategies to understand the effect of various conditioning variables used in the multimodal prompt to GVLM for generating object descriptions. The experiment is performed in the one-shot setting.

We report the results in Table 4 by prompting ChatGPT under the one-shot setting. We find that the performance of the GenEARL framework with the object descriptions generated without any context of the event image and event details i.e., object caption performs the worst on both datasets (Row 3). We observe that providing the event details improves the performance on both datasets (Row 2 and Row 3). Finally, we observe that the best performance is achieved when the GVLM is provided with the event image and event information in its context where the accuracy gains are higher for the SWiG dataset than the M ${}^{2}$ E ${}^{2}$ dataset. Our experiment reveals that we need to provide event-oriented information in the input prompt to generate event-centric object descriptions that will eventually contribute to improved multimodal EARL. Finally, we observe that our results are robust to the ordering perturbations in the predefined template to the LLM in Appendix §D, which makes our approach flexible.

6 Related Work

Event Extraction: Most of the prior work has focused on extracting events and their structures grounded in the text modality such as text documents Ahn (2006); Nguyen et al. (2016); Nguyen and Grishman (2015); Nguyen and Nguyen (2019); Lin et al. (2020); Yang and Mitchell (2016); Paolini et al. (2021); Li et al. (2021). These works depend on training their models on large-scale datasets with complete event annotations Doddington et al. (2004); Song et al. (2015). Previous studies have shown that these models do not generalize to unseen event types and new domains Parekh et al. (2023). Other works in text-based event extraction have thus focused on training their models in a data-efficient manner Lu et al. (2021); Hsu et al. (2022). In this work, we utilize modern generative models that are capable of understanding the event task descriptions with image contexts and performing well on them, thus keeping our approach training-free.

Multimodal Event Extraction: While traditional event extraction has focused on text modality, multimodal event extraction is popularized by recent works Yatskar et al. (2016); Li et al. (2020, 2022b). These works focus on training weakly supervised representation models such as WASE and CLIP-Event on event annotated data Doddington et al. (2004); Yatskar et al. (2016); Li et al. (2020) to learn event semantics and structures. Similar to the previous works in text-based event extraction, these models will suffer from generalization issues with new event types. Additionally, there is no easy way to flexibly adapt them to new event types without further finetuning on expensive and hard-to-acquire event data. In this work, we show that GenEARL generalizes to the unseen events by reasoning over the visual features of the image and object, and the event and its argument roles in its context.

Generative Models: Recently, there has been a surge of large-scale pretrained language models Brown et al. (2020); OpenAI (2021); Touvron et al. (2023); Zhang et al. (2022); Chowdhery et al. (2022); Hoffmann et al. (2022); Lewis et al. (2019); Raffel et al. (2020) and multimodal generative models Liu et al. (2023b); Li et al. (2022a, 2023); Dai et al. (2023); Awadalla et al. (2023); Alayrac et al. (2022); Zhu et al. (2023); Gong et al. (2023). These models have a remarkable ability to generalize to new tasks based on task descriptions provided in their context. Wang et al. (2022) combine multimodal generative models with large language models to build strong baselines on few-shot video-language benchmarks. In our work, we leverage the flexibility and generalization capabilities of these models to perform well on multimodal EARL.

7 Conclusion

We propose GenEARL, a training-free framework for multimodal event argument role labeling. Our experiments reveal that GenEARL generalizes to various event types from the two datasets M ${}^{2}$ E ${}^{2}$ and SWiG. This generalization capability is desirable for event extraction models since the existing methods rely on event-annotated data, which is expensive and time-consuming to acquire. Our work relies on the input prompts that follow the predefined human-crafted templates. Future work can focus on the automation of such templates based on the event information. Existing text-only and multimodal generative models Bansal et al. (2022); Pratt et al. (2020); Nadeem et al. (2020) have been shown to encode harmful social biases. Future work should aim to contextualize social biases in multimodal EARL, and focus on reducing them.

8 Acknowledgement

Hritik Bansal and Po-Nien Kung are supported in part by AFOSR MURI grant FA9550-22-1-0380. In addition, this work was partially supported by Defense Advanced Research Projects Agency (DARPA) grant HR00112290103/HR0011260656, CISCO, and ONR grant N00014-23-1-2780.

Limitations

The GenEARL framework prompts modern vision-language generative models and large language models with event-specific task descriptions. These models have been shown to encode harmful social biases and stereotypes that may be reflected in their output generations. For example, people of a specific color in an image might get labeled as ‘attackers’ or ‘offenders’ without the model understanding their true role in the depicted event. Future work should focus on studying the generated object descriptions from the GVLM and the EARL predictions from the LLMs through human evaluation.

In our experiments, we perform the human assessment by employing Mechanical Turk workers. All the labelers were based in the US, so our assessment may be limited by their perceptual biases when presented with an image depicting an event and its participant objects. To obtain more reliable annotations, we can involve annotators from diverse regions.

Multimodal event extraction is a relatively new endeavor in comparison to traditional text-based event extraction methods. Thus, there are very few evaluation datasets to assess the multimodal models for their generalization capabilities to diverse domains and event types. It will be imperative to benchmark our approach on such diverse datasets for more comprehensive evaluation.

References

Ahn (2006) David Ahn. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, pages 1–8.
Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. 2022. Flamingo: a visual language model for few-shot learning. ArXiv, abs/2204.14198.
Awadalla et al. (2023) Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. 2023. Openflamingo.
Bansal et al. (2022) Hritik Bansal, Da Yin, Masoud Monajatipoor, and Kai-Wei Chang. 2022. How well can text-to-image generative models understand ethical natural language interventions? arXiv preprint arXiv:2210.15230.
Blandfort et al. (2019) Philipp Blandfort, Desmond U Patton, William R Frey, Svebor Karaman, Surabhi Bhargava, Fei-Tzin Lee, Siddharth Varia, Chris Kedzie, Michael B Gaskell, Rossano Schifanella, et al. 2019. Multimodal social media analysis for gang violence prevention. In Proceedings of the International AAAI conference on web and social media, volume 13, pages 114–124.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
Dai et al. (2023) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500.
Doddington et al. (2004) George R Doddington, Alexis Mitchell, Mark A Przybocki, Lance A Ramshaw, Stephanie M Strassel, and Ralph M Weischedel. 2004. The automatic content extraction (ace) program-tasks, data, and evaluation. In Lrec, volume 2, pages 837–840. Lisbon.
Dong et al. (2022) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey for in-context learning. arXiv preprint arXiv:2301.00234.
Giorgi et al. (2022) Salvatore Giorgi, Sharath Chandra Guntuku, McKenzie Himelein-Wachowiak, Amy Kwarteng, Sy Hwang, Muhammad Rahman, and Brenda Curtis. 2022. Twitter data of the #blacklivesmatter movement and counter protests: 2013 to 2021.
Gong et al. (2023) Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. 2023. Multimodal-gpt: A vision and language model for dialogue with humans.
Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
Hsu et al. (2022) I-Hung Hsu, Kuan-Hao Huang, Elizabeth Boschee, Scott Miller, Prem Natarajan, Kai-Wei Chang, and Nanyun Peng. 2022. Degree: A data-efficient generation-based event extraction model. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1890–1908.
Lewis et al. (2019) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
Li et al. (2023) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
Li et al. (2022a) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022a. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR.
Li et al. (2022b) Manling Li, Ruochen Xu, Shuohang Wang, Luowei Zhou, Xudong Lin, Chenguang Zhu, Michael Zeng, Heng Ji, and Shih-Fu Chang. 2022b. Clip-event: Connecting text and images with event structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16420–16429.
Li et al. (2020) Manling Li, Alireza Zareian, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji, and Shih-Fu Chang. 2020. Cross-media structured common space for multimedia event extraction. arXiv preprint arXiv:2005.02472.
Li et al. (2021) Sha Li, Heng Ji, and Jiawei Han. 2021. Document-level event argument extraction by conditional generation. arXiv preprint arXiv:2104.05919.
Lin et al. (2020) Ying Lin, Heng Ji, Fei Huang, and Lingfei Wu. 2020. A joint neural model for information extraction with global features. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7999–8009, Online. Association for Computational Linguistics.
Liu et al. (2023a) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023a. Visual instruction tuning. arXiv preprint arXiv:2304.08485.
Liu et al. (2023b) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023b. Visual instruction tuning.
Liu et al. (2023c) Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang, Yuanyuan Yang, Jiaming Tian, Hao He, Antong Li, Mengshen He, Zhengliang Liu, et al. 2023c. Summary of chatgpt/gpt-4 research and perspective towards the future of large language models. arXiv preprint arXiv:2304.01852.
Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Lu et al. (2021) Yaojie Lu, Hongyu Lin, Jin Xu, Xianpei Han, Jialong Tang, Annan Li, Le Sun, Meng Liao, and Shaoyi Chen. 2021. Text2event: Controllable sequence-to-structure generation for end-to-end event extraction. arXiv preprint arXiv:2106.09232.
Nadeem et al. (2020) Moin Nadeem, Anna Bethke, and Siva Reddy. 2020. Stereoset: Measuring stereotypical bias in pretrained language models. arXiv preprint arXiv:2004.09456.
Nguyen et al. (2016) Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 300–309, San Diego, California. Association for Computational Linguistics.
Nguyen and Grishman (2015) Thien Huu Nguyen and Ralph Grishman. 2015. Event detection and domain adaptation with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 365–371, Beijing, China. Association for Computational Linguistics.
Nguyen and Nguyen (2019) Trung Minh Nguyen and Thien Huu Nguyen. 2019. One for all: Neural joint modeling of entities and events. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 6851–6858.
OpenAI (2021) OpenAI. 2021. ChatGPT: Large-scale language model. Accessed: June 17, 2023.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
Paolini et al. (2021) Giovanni Paolini, Ben Athiwaratkun, Jason Krone, Jie Ma, Alessandro Achille, Rishita Anubhai, Cicero Nogueira dos Santos, Bing Xiang, and Stefano Soatto. 2021. Structured prediction as translation between augmented natural languages. arXiv preprint arXiv:2101.05779.
Parekh et al. (2023) Tanmay Parekh, I-Hung Hsu, Kuan-Hao Huang, Kai-Wei Chang, and Nanyun Peng. 2023. Geneva: Benchmarking generalizability for event argument extraction with hundreds of event types and argument roles. In Proceedings of the Conference of the 61st Annual Meeting of the Association for Computational Linguistics (ACL).
Pratt et al. (2020) Sarah Pratt, Mark Yatskar, Luca Weihs, Ali Farhadi, and Aniruddha Kembhavi. 2020. Grounded situation recognition. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 314–332. Springer.
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8).
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
Song et al. (2015) Zhiyi Song, Ann Bies, Stephanie Strassel, Tom Riese, Justin Mott, Joe Ellis, Jonathan Wright, Seth Kulick, Neville Ryant, and Xiaoyi Ma. 2015. From light to rich ere: annotation of entities, relations, and events. In Proceedings of the the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation, pages 89–98.
Stephens et al. (1998) Mitchell Stephens et al. 1998. The Rise of the Image, the Fall of the Word. Oxford University Press.
Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
Tay et al. (2020) Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2020. Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006.
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Wang et al. (2022) Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, et al. 2022. Language models with image descriptors are strong few-shot video-language learners. arXiv preprint arXiv:2205.10747.
Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
Yang and Mitchell (2016) Bishan Yang and Tom M. Mitchell. 2016. Joint extraction of events and entities within a document context. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 289–299, San Diego, California. Association for Computational Linguistics.
Yatskar et al. (2016) Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. 2016. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5534–5542.
Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
Zhu et al. (2023) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.

Appendix A Multimodal EARL

We make three assumptions. First, we assume that a given image has a single major event associated with it, even though in general an image may depict no event or multiple events. Second, we assume that every object (bounding box) participates in the event and hence has an associated event argument role label. However, we do allow the model to predict ‘Other’ if it does not find a good match among the candidate argument role labels. Third, we allow multiple objects in a given image to have the same event argument role label, as shown in Figure 1.

Existing methods Li et al. (2020, 2022b) train the models on high-quality unimodal or multimodal data annotated with the events, entities, and their relations. For example, ACE is a text-based dataset, imSitu Yatskar et al. (2016) is for actions (or events) in the images, and Li et al. (2022b) utilizes VOA news as a source for multimedia events. In practice, collecting and annotating these datasets is time-consuming and expensive. Furthermore, the inability of the existing models to adapt to new event types and new domains without specialized training data makes them less generalizable.

Appendix B Setup-Baseline

B.1 Supervised Setting

We train our model with a batch size of 32 for 5 epochs using the AdamW optimizer Loshchilov and Hutter (2017), with a linear warmup of 100 steps followed by cosine annealing. We perform a hyperparameter search over three learning rates, $\{1e-4,1e-5,1e-6\}$ . We find that the model trained with a learning rate of $1e-5$ performs the best on the test examples of the SWiG dataset during inference. Each finetuning experiment took 4 to 5 hours on a single NVIDIA A6000 GPU with 48GB of memory.
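The learning-rate schedule described above can be sketched as a plain function of the training step. This is only a minimal illustration: the total number of steps and the decay to zero at the end of training are assumptions, since the text specifies only the 100 warmup steps, the cosine annealing, and the three candidate learning rates.

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-5, warmup_steps=100):
    """Linear warmup for `warmup_steps`, then cosine annealing.

    Assumption: the rate decays to zero at `total_steps`; the paper does not
    state the final learning rate.
    """
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Learning rates searched in the supervised setting.
CANDIDATE_LRS = [1e-4, 1e-5, 1e-6]

# Hypothetical total step count, chosen only for illustration.
TOTAL_STEPS = 5000

schedules = {
    lr: [lr_at_step(s, TOTAL_STEPS, base_lr=lr) for s in range(TOTAL_STEPS + 1)]
    for lr in CANDIDATE_LRS
}
```

Each schedule peaks at its base learning rate once warmup ends (step 99 here) and decreases monotonically afterwards.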

Given the lack of available training data for the M ${}^{2}$ E ${}^{2}$ domain, we evaluate CLIP (+SWiG) in a transfer learning setup. We find that the finetuned CLIP (+SWiG) model trained with a learning rate of $1e-4$ performs the best on the M ${}^{2}$ E ${}^{2}$ dataset.

Appendix C Precision, Recall, and F1 Metrics

In Table 5, we report the precision, recall, and F1 of GenEARL, the zero-shot CLIP baseline, and the supervised (finetuned) CLIP model on the M ${}^{2}$ E ${}^{2}$ and SWiG datasets. We find that the trends (rankings) of the multimodal EARL methodologies are the same as reported in Table 1. Specifically, we observe that GenEARL (3-shot) outperforms Supervised CLIP under the transfer learning setting on M ${}^{2}$ E ${}^{2}$ . In addition, we find that GenEARL (0-shot and 3-shot) outperforms the zero-shot CLIP baseline on the SWiG dataset.

Method	M ${}^{2}$ E ${}^{2}$ Precision	M ${}^{2}$ E ${}^{2}$ Recall	M ${}^{2}$ E ${}^{2}$ F1	SWiG Precision	SWiG Recall	SWiG F1
Zero-shot CLIP-B/32	41.6%	28%	33.2%	22%	27.3%	24.4%
GenEARL (0-shot)	65.3%	38.4%	48.4%	36.5%	47%	41%
GenEARL (3-shot)	66.1%	44.7%	53.3%	47%	55%	50.7%
Supervised CLIP-B/32 (+SWiG)	66%	40.3%	50%	61.5%	72%	66%

Table 5: Precision, Recall, and F1 scores for Multimodal Event Argument Role Labeling on M ${}^{2}$ E ${}^{2}$ and SWiG datasets.
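As a quick sanity check on Table 5, the F1 column can be compared against the harmonic mean of the reported precision and recall. The sketch below assumes that F1 is computed this way; the averaging scheme is not stated here, and the table entries are rounded, so only approximate agreement should be expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from Table 5, in percent.
table5 = {
    ("Zero-shot CLIP-B/32", "M2E2"): (41.6, 28.0, 33.2),
    ("GenEARL (0-shot)", "M2E2"): (65.3, 38.4, 48.4),
    ("GenEARL (3-shot)", "M2E2"): (66.1, 44.7, 53.3),
    ("Supervised CLIP-B/32 (+SWiG)", "M2E2"): (66.0, 40.3, 50.0),
    ("Zero-shot CLIP-B/32", "SWiG"): (22.0, 27.3, 24.4),
    ("GenEARL (0-shot)", "SWiG"): (36.5, 47.0, 41.0),
    ("GenEARL (3-shot)", "SWiG"): (47.0, 55.0, 50.7),
    ("Supervised CLIP-B/32 (+SWiG)", "SWiG"): (61.5, 72.0, 66.0),
}

# Absolute gap between the recomputed and the reported F1 for every row.
gaps = {
    key: abs(f1_score(p, r) - reported)
    for key, (p, r, reported) in table5.items()
}
max_gap = max(gaps.values())
```

Under this assumption, every recomputed value stays within half a percentage point of the reported F1.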

Appendix D Effect of ordering perturbation in the predefined template

In §3.2, we describe a predefined template $\mathcal{T}_{\ell}$ that converts the input variables (object role description from the generative vision-language model, event type, image description, and possible argument roles with their definitions) into a detailed task description for the LLM. In our experiments, we find that the LLM (ChatGPT) predictions are not sensitive to the order in which the input variables are presented in the predefined template. Specifically, we observe only a slight change of $\pm 2\%$ for GenEARL (3-shot) on both datasets over different orderings of the input variables in the LLM template. This indicates that the predefined template is flexible and can easily generalize to new datasets.
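One way to set up such an ordering perturbation is to enumerate permutations of the template fields and render one prompt per ordering. The sketch below is illustrative only: the placeholder values are hypothetical, and the text does not say which or how many orderings were actually evaluated.

```python
from itertools import permutations

# Field names follow the LLM template; the values are hypothetical placeholders.
FIELDS = {
    "Event Image Description": "<image description>",
    "Event": "<event type>",
    "Event Argument Role Definition": "<argument roles and definitions>",
    "Role of Object 1": "<object role description>",
}

def render(order, fields):
    """Render the template body with the fields in the given order."""
    return "\n\n".join(f"{name}: {fields[name]}" for name in order)

# All 4! = 24 orderings of the four input variables.
orderings = list(permutations(FIELDS))
prompts = [render(order, FIELDS) for order in orderings]
```

Each rendered prompt would then be sent to the LLM, and the spread of the resulting accuracy across orderings would measure sensitivity to ordering.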

Appendix E Using Alpaca for EARL Prediction

We evaluate the probability scores from the Alpaca model $\mathcal{M}$ over the possible event argument roles $a_{j}\in\mathcal{A}_{E}$ , given by $p_{\mathcal{M}}(a_{j}|\mathcal{C})$ , where $p_{\mathcal{M}}(\cdot|\mathcal{C})$ is the learned probability distribution of the Alpaca model and $\mathcal{C}$ is the context described in §3.2. The predicted EARL $\tilde{a}_{i}$ for the object $o_{i}$ is the role that maximizes this probability, $\tilde{a}_{i}=\operatorname{argmax}_{a_{j}\in\mathcal{A}_{E}}p_{\mathcal{M}}(a_{j}|\mathcal{C})$ . We provide the EARL prediction template used for Alpaca experiments in the zero-shot settings in Table 13.
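The selection rule above amounts to scoring every candidate role under the model and keeping the highest-scoring one. The sketch below shows only that step; `toy_score` is a hypothetical stand-in for the model score $p_{\mathcal{M}}(a_{j}|\mathcal{C})$ and is not how Alpaca assigns probabilities.

```python
def predict_role(candidate_roles, score_fn, context):
    """Return the candidate role with the highest score given the context."""
    scores = {role: score_fn(role, context) for role in candidate_roles}
    best_role = max(scores, key=scores.get)
    return best_role, scores

def toy_score(role, context):
    # Hypothetical stand-in for p_M(a_j | C): counts mentions of the role
    # name in the context. A real setup would query the language model.
    return float(context.lower().count(role.lower()))

roles = ["Agent", "Person", "Instrument", "Other"]
context = "The object is a police officer acting as the agent of the arrest."
predicted, scores = predict_role(roles, toy_score, context)
```

With a real model, `score_fn` would return the (log-)probability of the role string conditioned on the prompt, and ties would need an explicit policy; here `max` simply keeps the first best candidate.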

Appendix F Using LLaVA for EARL Prediction

Input: Image $\mathcal{I}$ , Object $o$ , Event $E$ , Event argument role labels (and definitions) $\mathcal{A}$
Prompt:
Image 1: 256 $\times$ <$\mathcal{I}$ visual feature tokens>
Image 2: 256 $\times$ <$o$ visual feature tokens>
What is the role of the entity in the "Image 2" in the context of the $E$ event in the "Image 1"? The possible argument roles of the objects in the $E$ event include $\mathcal{A}$ . Choose only one of these options.

Figure 4: Multimodal input template used to prompt the GVLM for event argument role labeling. Here, the model is provided with the image, event and possible argument role labels (and definitions). We add ‘Other’ to the list of possible argument role labels in case the model does not prefer any of the existing event argument role labels. We get 256 visual tokens for the image or the object by projecting the raw input into the vision embedding space using the visual processing module of the GVLM.

For EARL prediction with LLaVA, we evaluate the probability scores from the LLaVA model $\mathcal{G}$ over the possible event argument roles $a_{j}\in\mathcal{A}_{E}$ , given by $p_{\mathcal{G}}(a_{j}|\mathcal{P})$ , where $p_{\mathcal{G}}(\cdot|\mathcal{P})$ is the learned probability distribution of the LLaVA model and $\mathcal{P}$ is the context described in §3.1. The predicted EARL $\tilde{a}_{i}$ for the object $o_{i}$ is the role that maximizes this probability, $\tilde{a}_{i}=\operatorname{argmax}_{a_{j}\in\mathcal{A}_{E}}p_{\mathcal{G}}(a_{j}|\mathcal{P})$ . We provide the EARL prediction template used for LLaVA in the zero-shot settings in Figure 4.

Appendix G Input Prompts for LLaVA

We provide the multimodal input templates used for prompting LLaVA in the GenEARL framework. By default, we use the template presented in Figure 8. We use the templates in Figure 9 and Figure 10 for the experiments in §5.4. Since LLaVA has a limited context length of 2048 tokens and, like most transformer architectures Tay et al. (2020), is prone to ignoring long-range dependencies, we create object role descriptions for one object $o_{i}$ at a time.

Appendix H Event Types

To maintain the argument role density, we filter out the event types with fewer than three argument roles from the M ${}^{2}$ E ${}^{2}$ dataset and the SWiG dataset. We provide the list of event types included in our multimodal EARL evaluation in Table 6 and Table 7.

Event Types	Event Argument Role Definition
Life.Die	[Victim] died at place, or was killed by [Agent] using [Instrument]
Movement.Transport	[Agent] transported [Artifact or Person] in [Instrument] from [Origin] to [Destination]
Conflict.Attack	[Attacker] attacked or assaulted [Target] [Instrument] at place
Conflict.Demonstrate	[Demonstrator] was in a demonstration at place holding [Instrument] and supervised by [Police]
Justice.Arrest Jail	[Agent] arrested or jailed [Person] using [Instrument] at place
Transaction.Transfer Money	[Giver] gave [Money] to [Recipient]

Table 6: The list of six event types from the M ${}^{2}$ E ${}^{2}$ dataset included in our experiments. The argument role definitions are taken from Li et al. (2020).
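The filtering of event types by argument role count can be sketched by counting the bracketed role slots in definitions written in the style of Table 6. This is an illustrative reading rather than the authors' preprocessing code: the second event type below is invented to show a rejected case, and a slot such as [Artifact or Person] is counted as one role.

```python
import re

def argument_roles(definition):
    """Extract bracketed role slots, e.g. '[Giver] gave [Money]' -> ['Giver', 'Money']."""
    return re.findall(r"\[([^\]]+)\]", definition)

def filter_event_types(definitions, min_roles=3):
    """Keep event types whose definition has at least `min_roles` distinct roles."""
    return {
        event: definition
        for event, definition in definitions.items()
        if len(set(argument_roles(definition))) >= min_roles
    }

definitions = {
    # Taken from Table 6.
    "Transaction.Transfer Money": "[Giver] gave [Money] to [Recipient]",
    # Hypothetical two-role event, included only to show the filter.
    "Hypothetical.TwoRoleEvent": "[Agent] acted on [Target]",
}

kept = filter_event_types(definitions)
```

Only the first event type survives the filter, since it has three distinct role slots.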

‘tattooing’, ‘splashing’, ‘emerging’, ‘pasting’, ‘inflating’, ‘displaying’, ‘feeding’, ‘shredding’, ‘flicking’, ‘shaving’, ‘flinging’, ‘yanking’, ‘putting’, ‘decorating’, ‘mashing’, ‘baking’, ‘moistening’, ‘inserting’, ‘chopping’, ‘parachuting’, ‘giving’, ‘cooking’, ‘falling’, ‘ramming’, ‘pushing’, ‘wrapping’, ‘leaping’, ‘packing’, ‘smearing’, ‘whisking’, ‘sharpening’, ‘sealing’, ‘examining’, ‘raking’, ‘punching’, ‘whipping’, ‘watering’, ‘steering’, ‘distributing’, ‘spying’, ‘uncorking’, ‘squeezing’, ‘pawing’, ‘dripping’, ‘carrying’, ‘pinching’, ‘tying’, ‘arranging’, ‘tearing’, ‘blocking’, ‘practicing’, ‘potting’, ‘installing’, ‘dousing’, ‘making’, ‘unpacking’, ‘mowing’, ‘stacking’, ‘stinging’, ‘heaving’, ‘pulling’, ‘breaking’, ‘docking’, ‘destroying’, ‘drying’, ‘poking’, ‘tickling’, ‘tilting’, ‘hoisting’, ‘nailing’, ‘lighting’, ‘shelving’, ‘wagging’, ‘lifting’, ‘surfing’, ‘gluing’, ‘hauling’, ‘erasing’, ‘bathing’, ‘serving’, ‘drinking’, ‘soaking’, ‘writing’, ‘coaching’, ‘flipping’, ‘slicing’, ‘educating’, ‘bothering’, ‘sweeping’, ‘washing’, ‘spinning’, ‘dyeing’, ‘filling’, ‘shaking’, ‘signing’, ‘catching’, ‘scratching’, ‘shearing’, ‘ejecting’, ‘peeling’, ‘hanging’, ‘sliding’, ‘igniting’, ‘pumping’, ‘clenching’, ‘fastening’, ‘repairing’, ‘jumping’, ‘striking’, ‘manicuring’, ‘farming’, ‘photographing’, ‘mining’, ‘selling’, ‘bouncing’, ‘leaking’, ‘autographing’, ‘scraping’, ‘providing’, ‘tipping’, ‘patting’, ‘interrogating’, ‘emptying’, ‘hurling’, ‘scooping’, ‘sewing’, ‘massaging’, ‘attacking’, ‘signaling’, ‘dissecting’, ‘spanking’, ‘hugging’, ‘crashing’, ‘coloring’, ‘scrubbing’, ‘rinsing’, ‘fueling’, ‘begging’, ‘spraying’, ‘frying’, ‘stitching’, ‘spilling’, ‘covering’, ‘dipping’, ‘wheeling’, ‘constructing’, ‘burying’, ‘throwing’, ‘drenching’, ‘injecting’, ‘pricking’, ‘clearing’, ‘plummeting’, ‘wetting’, ‘weighing’, ‘deflecting’, ‘opening’, ‘lapping’, ‘vacuuming’, ‘kicking’, ‘picking’, ‘folding’, ‘operating’, ‘smashing’, ‘stroking’, ‘pouring’, ‘tuning’, ‘dusting’, ‘attaching’, ‘dropping’, ‘resting’, ‘rocking’, ‘paying’, ‘guarding’, ‘leaning’, ‘kissing’, ‘slapping’, ‘harvesting’, ‘piloting’, ‘welding’, ‘stuffing’, ‘rubbing’, ‘buttering’, ‘pruning’, ‘pinning’, ‘dragging’, ‘fetching’, ‘locking’, ‘shoveling’, ‘cramming’, ‘disciplining’, ‘planting’, ‘cleaning’, ‘spitting’, ‘climbing’, ‘caressing’, ‘flexing’, ‘drawing’, ‘assembling’, ‘lathering’, ‘mending’, ‘loading’, ‘combing’, ‘unloading’, ‘trimming’, ‘molding’, ‘checking’, ‘descending’, ‘aiming’, ‘performing’, ‘fishing’, ‘taping’, ‘immersing’, ‘ailing’, ‘sketching’, ‘grinding’, ‘filming’, ‘measuring’, ‘eating’, ‘buckling’, ‘submerging’, ‘helping’, ‘crafting’, ‘drumming’, ‘plunging’, ‘carting’, ‘curling’, ‘spreading’, ‘clipping’, ‘carving’, ‘hitting’, ‘retrieving’, ‘placing’, ‘launching’, ‘crushing’, ‘pitching’, ‘offering’, ‘strapping’, ‘adjusting’, ‘moisturizing’, ‘tasting’, ‘wiping’, ‘sprinkling’, ‘buying’, ‘applying’, ‘stapling’, ‘stirring’, ‘floating’, ‘shooting’, ‘microwaving’, ‘fixing’, ‘milking’, ‘brushing’, ‘vaulting’, ‘unlocking’, ‘extinguishing’, ‘building’, ‘painting’, ‘waxing’, ‘stripping’, ‘prying’, ‘crowning’

Table 7: The list of 262 event types from the SWiG dataset included in our experiments.

Appendix I Qualitative Example

We provide three qualitative examples from the M ${}^{2}$ E ${}^{2}$ dataset in Figure 5.

Appendix J Human Assessment Screenshot

Here, we provide the screenshots for our human assessments. Figure 14 illustrates a solved example shown to the labelers before the qualification test to make them understand the task. Figure 15 illustrates an unsolved example assigned to 3 labelers during the actual annotation.

Appendix K Can two-stage generative prompting perform multimodal event detection?

In this work, we present GenEARL, a combination of prompting a generative vision-language model followed by a large language model for multimodal EARL. Here, we aim to study whether the same principles can be adapted to perform multimodal event detection i.e., classifying the given image into one of the event types from the dataset.

To this end, we first generate an image caption from LLaVA-7B. Then, we prompt ChatGPT to predict the event type based on the generated image description and the list of possible event types (and their definitions). We perform this experiment in a zero-shot setting. Further, we compare our model against the zero-shot pretrained CLIP Radford et al. (2019), and zero-shot and supervised CLIP-Event Li et al. (2022b). We report the results in Table 8.

We find that the two-stage generative prompting framework outperforms all the other methods, including the supervised baselines, for multimodal event detection on the M ${}^{2}$ E ${}^{2}$ dataset. This suggests that generative prompting is a flexible and generalizable technique that can be used for event detection without access to any event-annotated training data. Surprisingly, we find that the two-stage framework performs poorly on the SWiG dataset. We perform a manual inspection of several examples where the predicted action does not match the ground-truth action. Our investigation reveals that the two-stage generative prompting does provide reasonable predictions for the actions depicted in the images. For example, as shown in Figure 7, the ground-truth action annotated in the SWiG dataset is ‘dialing’. However, the two-stage setup predicts the ‘pressing’ and ‘navigating’ actions, which are legitimate on visual inspection of Figure 7. Although we do not conduct a large-scale human evaluation of our approach, we attribute the low performance of the two-stage generative prompting on the SWiG dataset to its restricted ground-truth action verb assignments (‘dialing’) despite the presence of other related actions in the list of event types (‘pressing’, ‘typing’). Although not shown here, we find that providing few-shot examples to ChatGPT does not improve the performance on the SWiG dataset.
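The two-stage event detection setup described above (caption the image, then ask an LLM to pick an event type) can be sketched with the two models passed in as plain callables. Everything model-specific here is an assumption: `caption_fn` and `llm_fn` are hypothetical stand-ins for LLaVA and ChatGPT, and the prompt wording is illustrative, not the prompt used in the experiments.

```python
def detect_event(image, event_types, caption_fn, llm_fn):
    """Two-stage zero-shot event detection: caption the image, then query an LLM.

    `caption_fn(image) -> str` and `llm_fn(prompt) -> str` are hypothetical
    callables standing in for the vision-language model and the LLM.
    """
    caption = caption_fn(image)
    options = "\n".join(
        f"- {name}: {definition}" for name, definition in event_types.items()
    )
    # Illustrative prompt wording (an assumption, not the paper's template).
    prompt = (
        f"Image description: {caption}\n\n"
        f"Possible event types:\n{options}\n\n"
        "Which event type does the image depict? Answer with one event type."
    )
    answer = llm_fn(prompt).strip().lower()
    # Map the free-form answer back to one of the known event types.
    for name in event_types:
        if name.lower() in answer:
            return name
    return None

# Event types and definitions taken from Table 6.
event_types = {
    "Justice.Arrest Jail": "[Agent] arrested or jailed [Person] using [Instrument] at place",
    "Conflict.Attack": "[Attacker] attacked or assaulted [Target] [Instrument] at place",
}

# Stub models so the sketch runs without any external service.
predicted = detect_event(
    "image.jpg",
    event_types,
    caption_fn=lambda image: "Police officers handcuff a man on the street.",
    llm_fn=lambda prompt: "Justice.Arrest Jail",
)
```

Matching the answer by substring is a simplification; it would need care if one event type name were a substring of another.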

Method	M ${}^{2}$ E ${}^{2}$ Acc. (%)	SWiG Acc. (%)
Zero-shot-LLaVA-ChatGPT	74.3	11
Zero-shot-CLIP (B/32)	65.7	28.3
Zero-shot-CLIP-Event	70.8	31.4
Supervised-CLIP-Event	72.8	45.6

Table 8: Event Detection Accuracy (%) of generative prompting (LLaVA followed by ChatGPT), pretrained CLIP, and CLIP-Event Li et al. (2022b) models on the M ${}^{2}$ E ${}^{2}$ and SWiG datasets.

Input: Image $\mathcal{I}$ , Object $o$ , Event $E$ , Event argument role labels (and definitions) $\mathcal{A}$
Prompt:
Image 1: 256 $\times$ <$\mathcal{I}$ visual feature tokens>
Image 2: 256 $\times$ <$o$ visual feature tokens>
Describe the role of the entity in “Image 2” in the context of the $E$ event in “Image 1”. The possible argument roles of the objects performing $E$ event include $\mathcal{A}$ . Please be concise with your answer.

Figure 8: Multimodal input template used to prompt the GVLM in the GenEARL framework. Here, the model is provided with the image, event and possible argument role labels (and definitions). We get 256 visual tokens for the image or the object by projecting the raw input into the vision embedding space using the visual processing module of the GVLM.

Input: Object $o$ , Event $E$ , Event argument roles (and definitions) $\mathcal{A}$
Prompt:
Image: 256 $\times$ <$o$ visual feature tokens>
Describe the role of the entity in “Image” in the context of the $E$ event. The possible argument roles of the objects performing $E$ event include $\mathcal{A}$ . Please be concise with your answer.

Figure 9: Multimodal input template used to prompt the GVLM in the GenEARL framework. Here, the model is provided with the event and possible argument role labels (and definitions). We get 256 visual tokens for the object by projecting the raw input into the vision embedding space using the visual processing module of the GVLM.

Input: Object $o$
Prompt:
Image: 256 $\times$ <$o$ visual feature tokens>
Describe the “Image” concisely.

Figure 10: Multimodal input template used to prompt the GVLM in the GenEARL framework. Here, the model is provided with just the event participant object. We get 256 visual tokens for the object by projecting the raw input into the vision embedding space using the visual processing module of the GVLM.

Input: Image Caption $\mathcal{I}$ , Generated object descriptions $\{g_{1},g_{2},\ldots,g_{k}\}$ from LLaVA for the $k$ participant objects in the image, Event $E$ , Event argument role labels (and definitions) $\mathcal{A}$

Setting: Zero-shot

Prompt:

You are a helpful AI assistant with extensive knowledge of event argument extraction. Worldwide events are documented in raw text on various online platforms, and it is crucial to extract useful and concise information about them for downstream applications.

In this case, you will be provided with an “Event Image Description”, the “Event” portrayed in the image, a generic “Event Argument Roles Definition” that helps you to understand the argument roles that are grounded in different objects in the image, and the “Object Role” descriptions to describe the role of specific objects in the context of the "Event Image". Based on the provided information, you need to tell the argument roles associated with different objects in the image.

Remember that: 1) The number of possible event argument roles can sometimes be equal to, more, or less than the number of objects detected for the "Event Image".

2) Multiple objects may get identical event argument roles, but not always.

3) It is completely possible that some of the event argument roles are not grounded in any of the objects detected for the "Event Image".

Please keep your answer concise. You can choose to assign a "Other" argument role if you are not sure about the argument role for a particular object.

Event Image Description: $\mathcal{I}$

Event: $E$

Event Argument Role Definition: $\mathcal{A}$

Role of Object 1: $g_{1}$

Role of Object k: $g_{k}$

Argument Role of Object 1:

Argument Role of Object k:

Output:

Argument Role of Object 1: $\hat{a}_{1}$

Argument Role of Object k: $\hat{a}_{k}$

Figure 11: LLM template used to prompt ChatGPT in the GenEARL framework. This template is used in the zero-shot setting. Due to the cost of prompting ChatGPT, we label all the event participant objects in an image with a single batched prompt.
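The batched layout of Figure 11 (one description per object, followed by one empty answer slot per object) can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, the instruction preamble is omitted, and falling back to "Other" when a slot is missing from the model output is an assumption, not something the paper specifies.

```python
import re


def build_query_block(caption, event, roles_text, descriptions):
    """Lay out the query fields of Figure 11 for k objects in one prompt."""
    lines = [
        f"Event Image Description: {caption}",
        f"Event: {event}",
        f"Event Argument Role Definition: {roles_text}",
    ]
    # One generated description per participant object.
    lines += [f"Role of Object {i}: {g}" for i, g in enumerate(descriptions, start=1)]
    # One empty slot per object for the LLM to complete.
    lines += [f"Argument Role of Object {i}:" for i in range(1, len(descriptions) + 1)]
    return "\n\n".join(lines)


def parse_argument_roles(output, num_objects, fallback="Other"):
    """Read 'Argument Role of Object i: <role>' lines from the LLM output.

    Slots that are missing or empty receive `fallback` (an assumption).
    """
    pattern = re.compile(r"Argument Role of Object (\d+):[ \t]*(\S.*)")
    found = {int(i): role.strip() for i, role in pattern.findall(output)}
    return [found.get(i, fallback) for i in range(1, num_objects + 1)]
```

Batching keeps the number of LLM calls at one per image rather than one per object, at the price of having to parse several answers out of a single completion.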

Solved Input: Image caption $\mathcal{I}_{s}$, generated object descriptions $\{g_{s,1},g_{s,2},\ldots,g_{s,l}\}$ from LLaVA for the $l$ participant objects in the image, event $E_{s}$, event argument role labels (and definitions) $\mathcal{A}_{s}$

Query Input: Image caption $\mathcal{I}$, generated object descriptions $\{g_{1},g_{2},\ldots,g_{k}\}$ from LLaVA for the $k$ participant objects in the image, event $E$, event argument role labels (and definitions) $\mathcal{A}$

Setting: One-shot

Prompt:

Remember that: 1) The number of possible event argument roles can sometimes be equal to, more, or less than the number of objects detected for the "Event Image".

2) Multiple objects may get identical event argument roles, but not always.

3) It is completely possible that some of the event argument roles are not grounded in any of the objects detected for the "Event Image".

Please keep your answer concise. You can choose to assign a "Other" argument role if you are not sure about the argument role for a particular object.

We will first show a single solved instance of the task, and then you will complete the task on a new query.

Solved Instance:

Event Image Description: $\mathcal{I}_{s}$

Event: $E_{s}$

Event Argument Role Description: $\mathcal{A}_{s}$

Role of Object 1: $g_{s,1}$

Role of Object l: $g_{s,l}$

Argument Role of Object 1: $a_{s,1}$

Argument Role of Object l: $a_{s,l}$

Query Instance:

Event Image Description: $\mathcal{I}$

Event: $E$

Event Argument Role Definition: $\mathcal{A}$

Role of Object 1: $g_{1}$

Role of Object k: $g_{k}$

Argument Role of Object 1:

Argument Role of Object k:

Output:

Argument Role of Object 1: $\hat{a}_{1}$

Argument Role of Object k: $\hat{a}_{k}$

Figure 12: LLM template used to prompt ChatGPT in the GenEARL framework. This template is used in the one-shot setting.
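The one-shot template of Figure 12 differs from the zero-shot one only in that a solved instance, with its answer slots filled in, precedes the query. A minimal sketch of that assembly is given below; the function names and the dictionary keys are hypothetical, and the instruction text is passed in rather than reproduced. The field labels follow the template as printed, including "Description" for the solved instance and "Definition" for the query.

```python
def instance_block(caption, event, roles_text, descriptions,
                   answers=None, roles_header="Event Argument Role Definition"):
    """Lay out one instance; fill the answer slots only when `answers` is given."""
    if answers is not None and len(answers) != len(descriptions):
        raise ValueError("need exactly one answer per object description")
    lines = [
        f"Event Image Description: {caption}",
        f"Event: {event}",
        f"{roles_header}: {roles_text}",
    ]
    lines += [f"Role of Object {i}: {g}" for i, g in enumerate(descriptions, start=1)]
    for i in range(1, len(descriptions) + 1):
        slot = f"Argument Role of Object {i}:"
        if answers is not None:
            slot += f" {answers[i - 1]}"
        lines.append(slot)
    return "\n\n".join(lines)


def build_one_shot_prompt(instructions, solved, query):
    """Concatenate instructions, the solved instance, and the open query."""
    parts = [
        instructions,
        "We will first show a single solved instance of the task, "
        "and then you will complete the task on a new query.",
        "Solved Instance:",
        instance_block(solved["caption"], solved["event"], solved["roles_text"],
                       solved["descriptions"], answers=solved["answers"],
                       roles_header="Event Argument Role Description"),
        "Query Instance:",
        instance_block(query["caption"], query["event"], query["roles_text"],
                       query["descriptions"]),
    ]
    return "\n\n".join(parts)
```

Because the solved instance carries its own caption, event, and role definitions, it need not share the event type or the number of objects with the query.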

Input: Image caption $\mathcal{I}$, generated object description $g$ from LLaVA for one of the participant objects in the image, event $E$, event argument role labels (and definitions) $\mathcal{A}$

Setting: Zero-shot

Prompt:

Given an “Event Image Description”, the “Event” portrayed in the image, a generic “Event Argument Roles Definition” that helps you to understand the argument roles that are grounded in different objects in the image, and the “Object Role” description to describe the role of the object in the context of the "Event Image". Based on the provided information, you need to tell the argument roles associated with different objects in the image. You can choose to assign an "Other" argument role if you are not sure about the argument role for a particular object.

Event Image Description: $\mathcal{I}$

Event: $E$

Event Argument Role Definition: $\mathcal{A}$

Role of Object: $g$

Argument Role of Object:

Figure 13: LLM template used to prompt Alpaca-7B. This template is used in the zero-shot setting. Due to context length limitations, we label one participant object at a time.
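Labeling one object at a time, as in Figure 13, amounts to issuing $k$ short prompts per image instead of one batched prompt. The sketch below illustrates this under assumptions: the function name is hypothetical, and the instruction paragraph is passed in as a string rather than reproduced.

```python
def build_single_object_prompts(instruction, caption, event, roles_text, descriptions):
    """Return one Figure 13 style prompt per participant object.

    The caption, event, and role definitions are repeated in every prompt;
    only the object description changes.
    """
    prompts = []
    for description in descriptions:
        fields = [
            instruction,
            f"Event Image Description: {caption}",
            f"Event: {event}",
            f"Event Argument Role Definition: {roles_text}",
            f"Role of Object: {description}",
            "Argument Role of Object:",
        ]
        prompts.append("\n\n".join(fields))
    return prompts
```

Each prompt stays short regardless of how many objects the image contains, which is the property that matters for a model with a limited context window, at the cost of repeating the shared fields $k$ times.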