GenEARL: A Training-Free Generative Framework for Multimodal Event Argument Role Labeling

Bansal, Hritik; Kung, Po-Nien; Brantingham, P. Jeffrey; Chang, Kai-Wei; Peng, Nanyun

arXiv:2404.04763 (cs)

[Submitted on 7 Apr 2024]

Abstract: Multimodal event argument role labeling (EARL), a task that assigns a role to each event participant (object) in an image, is a complex challenge. It requires reasoning over the entire image, the depicted event, and the interactions between the various objects participating in the event. Existing models rely heavily on high-quality event-annotated training data to understand event semantics and structures, and they fail to generalize to new event types and domains. In this paper, we propose GenEARL, a training-free generative framework that harnesses the power of modern generative models to understand event task descriptions given image contexts and perform the EARL task. Specifically, GenEARL comprises two stages of generative prompting with a frozen vision-language model (VLM) and a frozen large language model (LLM). First, a generative VLM learns the semantics of the event argument roles and generates event-centric object descriptions based on the image. Subsequently, an LLM is prompted with the generated object descriptions, using a predefined template for EARL (i.e., assigning an event argument role to an object). We show that GenEARL outperforms the contrastive pretraining (CLIP) baseline by 9.4% and 14.2% accuracy for zero-shot EARL on the M2E2 and SwiG datasets, respectively. In addition, we outperform CLIP-Event by 22% precision on the M2E2 dataset. The framework also allows flexible adaptation and generalization to unseen domains.

Comments:	20 pages, 15 Figures, 13 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2404.04763 [cs.CV]
	(or arXiv:2404.04763v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.04763

Submission history

From: Hritik Bansal [view email]
[v1] Sun, 7 Apr 2024 00:28:13 UTC (2,965 KB)
