Stacked Cross-modal Feature Consolidation Attention Networks for Image Captioning

Pourkeshavarz, Mozhgan; Nabavi, Shahabedin; Moghaddam, Mohsen Ebrahimi; Shamsfard, Mehrnoush

doi:10.1007/s11042-023-15869-x

Computer Science > Computer Vision and Pattern Recognition

arXiv:2302.04676 (cs)

[Submitted on 8 Feb 2023]

Abstract: Recently, the attention-enriched encoder-decoder framework has attracted great interest in image captioning because of the substantial progress it has enabled. Many visual attention models directly leverage meaningful regions to generate image descriptions. However, a direct transition from visual space to text is not enough to generate fine-grained captions. This paper exploits a feature-compounding approach that brings together high-level semantic concepts and visual information about the contextual environment in a fully end-to-end manner. We therefore propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning, in which cross-modal features are consolidated simultaneously through a novel compounding function in a multi-step reasoning fashion. In addition, we jointly employ spatial information and context-aware attributes (CAA) as the principal components of the proposed compounding function, where the CAA provide a concise, context-sensitive semantic representation. To make better use of the consolidated features' potential, we further propose an SCFC-LSTM as the caption generator, which can leverage discriminative semantic information throughout the caption generation process. The experimental results indicate that the proposed SCFC can outperform various state-of-the-art image captioning methods in terms of popular metrics on the MSCOCO and Flickr30K datasets.
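
The multi-step consolidation idea in the abstract can be illustrated with a rough sketch. The abstract does not specify the compounding function, so the code below is not the paper's method: it assumes plain dot-product attention over two feature sets (spatial region features and attribute embeddings), an element-wise sum as a stand-in for the compounding step, and a residual update of the query between steps. All function names, shapes, and values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, features):
    """Dot-product attention: returns (context vector, attention weights)."""
    weights = softmax([dot(query, f) for f in features])
    dim = len(query)
    context = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return context, weights


def stacked_consolidation(query, regions, attributes, steps=2):
    """Refine `query` over several attention steps across both modalities.

    ASSUMPTION: the element-wise sum below is only a placeholder for the
    paper's compounding function, which the abstract does not describe.
    """
    for _ in range(steps):
        visual_ctx, _ = attend(query, regions)        # spatial information
        semantic_ctx, _ = attend(query, attributes)   # attribute information
        consolidated = [v + s for v, s in zip(visual_ctx, semantic_ctx)]
        # Residual update so the next step reasons from the refined query.
        query = [q + c for q, c in zip(query, consolidated)]
    return query


# Toy inputs: two region features and two attribute embeddings (made up).
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
attributes = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
hidden = [0.5, 0.1, 0.2]  # stands in for a decoder hidden state

refined = stacked_consolidation(hidden, regions, attributes, steps=2)
print([round(x, 3) for x in refined])
```

In a caption generator, a refined vector of this kind would typically be fed to the decoder at each time step. How the SCFC-LSTM actually consumes the consolidated features is described in the paper itself, not here.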

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Cite as:	arXiv:2302.04676 [cs.CV]
	(or arXiv:2302.04676v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2302.04676
Journal reference:	Multimedia Tools and Applications, Volume 83, pages 12209-12233, 2024
Related DOI:	https://doi.org/10.1007/s11042-023-15869-x

Submission history

From: Shahabedin Nabavi
[v1] Wed, 8 Feb 2023 09:15:09 UTC (3,589 KB)
