Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration

Cao, Haoyu; Bao, Changcun; Liu, Chaohu; Chen, Huang; Yin, Kun; Liu, Hao; Liu, Yinsong; Jiang, Deqiang; Sun, Xing

Computer Science > Computer Vision and Pattern Recognition

arXiv:2309.01131 (cs)

[Submitted on 3 Sep 2023]

Abstract: We propose SeRum (SElective Region Understanding Model), a novel end-to-end document understanding model for extracting meaningful information from document images, with applications including document analysis, retrieval, and office automation.
Unlike state-of-the-art approaches that rely on multi-stage technical schemes and are computationally expensive,
SeRum converts document image understanding and recognition tasks into a local decoding process of the visual tokens of interest, using a content-aware token merge module.
This mechanism enables the model to pay more attention to regions of interest generated by the query decoder, improving the model's effectiveness and speeding up decoding in the generative scheme.
We also designed several pre-training tasks to enhance the understanding and local awareness of the model.
Experimental results demonstrate that SeRum achieves state-of-the-art performance on document understanding tasks and competitive results on text spotting tasks.
SeRum represents a substantial advancement towards enabling efficient and effective end-to-end document understanding.
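
The abstract describes the content-aware token merge only at a high level. The sketch below is an illustrative guess at the general idea, not the authors' implementation: it keeps the visual tokens with the highest relevance scores and folds each remaining token into its most similar kept token. The function name `select_and_merge_tokens`, the top-k selection rule, the cosine-similarity assignment, and the mean-based merge are all assumptions made for this example.

```python
import numpy as np


def select_and_merge_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens and fold the rest into them.

    Hypothetical helper for illustration only.
    tokens: (n, d) array of visual token embeddings.
    scores: (n,) relevance scores, e.g. attention weights from a query.
    keep:   number of foreground tokens to retain.
    Returns (merged, kept_idx): a (keep, d) array and the kept indices.
    """
    tokens = np.asarray(tokens, dtype=float)
    scores = np.asarray(scores, dtype=float)
    n = tokens.shape[0]
    if not 0 < keep <= n:
        raise ValueError("keep must be between 1 and the number of tokens")

    # Rank tokens by descending score; the first `keep` are foreground.
    order = np.argsort(-scores, kind="stable")
    kept_idx = np.sort(order[:keep])
    rest_idx = np.sort(order[keep:])
    foreground = tokens[kept_idx]
    if rest_idx.size == 0:
        return foreground.copy(), kept_idx
    background = tokens[rest_idx]

    def unit(x):
        # Row-normalise so the dot product below is cosine similarity.
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)

    # Assign every background token to its most similar foreground token.
    similarity = unit(background) @ unit(foreground).T
    nearest = similarity.argmax(axis=1)

    # Average each foreground token with the background tokens assigned to it.
    sums = foreground.copy()
    counts = np.ones(keep)
    np.add.at(sums, nearest, background)
    np.add.at(counts, nearest, 1.0)
    return sums / counts[:, None], kept_idx


if __name__ == "__main__":
    toy_tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
    toy_scores = [0.9, 0.8, 0.1, 0.2]
    merged, kept = select_and_merge_tokens(toy_tokens, toy_scores, keep=2)
    print(kept, merged.shape)
```

Under these assumptions a decoder would attend to `keep` tokens instead of all `n`, which is one way a shorter visual sequence could reduce decoding cost while retaining some information from the discarded regions.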

Comments:	Accepted to ICCV 2023 main conference
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2309.01131 [cs.CV]
	(or arXiv:2309.01131v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2309.01131

Submission history

From: Haoyu Cao
[v1] Sun, 3 Sep 2023 10:14:34 UTC (11,072 KB)
