Mitigating Open-Vocabulary Caption Hallucinations

Ben-Kish, Assaf; Yanuka, Moran; Alper, Morris; Giryes, Raja; Averbuch-Elor, Hadar

arXiv:2312.03631 (cs)

[Submitted on 6 Dec 2023 (v1), last revised 19 Apr 2024 (this version, v3)]

Abstract: While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, namely, the generation of spurious details that cannot be inferred from the given image. Existing methods largely use closed-vocabulary object lists to mitigate or evaluate hallucinations in image captioning, ignoring the long-tailed nature of hallucinations that occur in practice. To this end, we propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting. Our framework includes a new benchmark, OpenCHAIR, that leverages generative foundation models to evaluate open-vocabulary object hallucinations for image captioning, surpassing the popular and similarly-sized CHAIR benchmark in both diversity and accuracy. Furthermore, to mitigate open-vocabulary hallucinations without using a closed object list, we propose MOCHa, an approach harnessing advancements in reinforcement learning. Our multi-objective reward function explicitly targets the trade-off between fidelity and adequacy in generations without requiring any strong supervision. MOCHa improves a large variety of image captioning models, as captured by our OpenCHAIR benchmark and other existing metrics. We will release our code and models.
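As a rough illustration of the two ideas named in the abstract, the sketch below computes an object-level hallucination rate for a caption and combines a fidelity score with an adequacy score into one scalar reward. The abstract does not give the actual formulations, so everything here is an assumption made for illustration: the function names `hallucination_rate` and `combined_reward`, the weighted-sum form, and the `alpha` parameter are hypothetical and are not taken from OpenCHAIR or MOCHa.

```python
from typing import Iterable, Set


def hallucination_rate(mentioned: Iterable[str], present: Set[str]) -> float:
    """Fraction of objects mentioned in a caption that are not in the image.

    `mentioned` and `present` are assumed to be already-extracted object
    names; how they are obtained (parsing, a detector, a language model)
    is outside the scope of this sketch.
    """
    mentioned_norm = [m.strip().lower() for m in mentioned]
    present_norm = {p.strip().lower() for p in present}
    if not mentioned_norm:
        # A caption that names no objects cannot hallucinate any.
        return 0.0
    hallucinated = [m for m in mentioned_norm if m not in present_norm]
    return len(hallucinated) / len(mentioned_norm)


def combined_reward(fidelity: float, adequacy: float, alpha: float = 0.5) -> float:
    """Hypothetical weighted sum trading off fidelity against adequacy.

    alpha = 1.0 rewards only fidelity (few hallucinations);
    alpha = 0.0 rewards only adequacy (descriptive coverage).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * fidelity + (1.0 - alpha) * adequacy


if __name__ == "__main__":
    rate = hallucination_rate(["dog", "frisbee", "cat"], {"dog", "frisbee"})
    print(f"hallucination rate: {rate:.3f}")
    print(f"reward: {combined_reward(1.0 - rate, 0.8, alpha=0.5):.3f}")
```

A weighted sum is only the simplest way to expose such a trade-off; the paper itself should be consulted for the reward that MOCHa optimizes.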

Comments:	Website Link: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2312.03631 [cs.CV]
	(or arXiv:2312.03631v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2312.03631

Submission history

From: Assaf Ben-Kish
[v1] Wed, 6 Dec 2023 17:28:03 UTC (38,522 KB)
[v2] Wed, 21 Feb 2024 15:04:45 UTC (9,073 KB)
[v3] Fri, 19 Apr 2024 14:29:02 UTC (9,095 KB)
