Cross-Modal Similarity-Based Curriculum Learning for Image Captioning

Zhang, Hongkuan; Sugawara, Saku; Aizawa, Akiko; Zhou, Lei; Sasano, Ryohei; Takeda, Koichi

Computer Science > Computer Vision and Pattern Recognition

arXiv:2212.07075 (cs)

[Submitted on 14 Dec 2022]

Abstract: Image captioning models require high-level generalization ability to describe the contents of various images in words. Most existing approaches treat all image-caption pairs equally during training, without considering differences in their learning difficulty. Several image captioning approaches introduce curriculum learning methods that present training data in order of increasing difficulty. However, their difficulty measurements rely on either domain-specific features or prior model training. In this paper, we propose a simple yet efficient difficulty measurement for image captioning that uses cross-modal similarity calculated by a pretrained vision-language model. Experiments on the COCO and Flickr30k datasets show that our proposed approach achieves superior performance and competitive convergence speed compared with the baselines, without requiring heuristics or incurring additional training costs. Moreover, the higher model performance on difficult examples and unseen data also demonstrates its generalization ability.
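The idea of ranking training pairs by cross-modal similarity can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes image and caption embeddings have already been produced by some pretrained vision-language model (the abstract does not name one), that higher similarity means an easier example, and that training follows a simple linear pacing schedule. The function names `cosine_similarity`, `curriculum_order`, and `available_pool`, and the `start_frac` parameter, are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention used here for degenerate (all-zero) embeddings
    return dot / (norm_u * norm_v)


def curriculum_order(
    image_embs: Sequence[Sequence[float]],
    text_embs: Sequence[Sequence[float]],
) -> List[int]:
    """Return pair indices sorted from easiest to hardest.

    Assumption: a pair whose image and caption embeddings are more similar
    is treated as easier, so indices are sorted by descending similarity.
    """
    sims = [cosine_similarity(i, t) for i, t in zip(image_embs, text_embs)]
    return sorted(range(len(sims)), key=lambda k: -sims[k])


def available_pool(
    order: Sequence[int], step: int, total_steps: int, start_frac: float = 0.25
) -> List[int]:
    """Hypothetical linear pacing: the easiest fraction of pairs usable at `step`.

    The fraction grows linearly from `start_frac` to 1.0 over `total_steps`.
    """
    frac = min(1.0, start_frac + (1.0 - start_frac) * step / max(1, total_steps))
    size = max(1, math.ceil(frac * len(order)))
    return list(order[:size])


# Toy embeddings standing in for the output of a pretrained vision-language model.
image_embs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
text_embs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]

order = curriculum_order(image_embs, text_embs)
early_pool = available_pool(order, step=0, total_steps=10)
full_pool = available_pool(order, step=10, total_steps=10)
```

In this toy setup, training would start by sampling only from `early_pool` and gradually widen to `full_pool`. Because the similarity scores come from a frozen pretrained model, the ordering can be computed once before training, which matches the abstract's claim of no additional training cost.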

Comments:	EMNLP 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2212.07075 [cs.CV]
	(or arXiv:2212.07075v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2212.07075

Submission history

From: Hongkuan Zhang
[v1] Wed, 14 Dec 2022 07:52:36 UTC (9,021 KB)
