Enhancing Lip Reading with Multi-Scale Video and Multi-Encoder

Wang, He; Guo, Pengcheng; Wan, Xucheng; Zhou, Huan; Xie, Lei

arXiv:2404.05466 (cs)

[Submitted on 8 Apr 2024 (v1), last revised 30 Apr 2024 (this version, v2)]

Abstract: Automatic lip-reading (ALR) aims to automatically transcribe spoken content from a speaker's silent lip motion captured in video. Current mainstream lip-reading approaches only use a single visual encoder to model input videos of a single scale. In this paper, we propose to enhance lip-reading by incorporating multi-scale video data and multi-encoder. Specifically, we first propose a novel multi-scale lip motion extraction algorithm based on the size of the speaker's face and an Enhanced ResNet3D visual front-end (VFE) to extract lip features at different scales. For the multi-encoder, in addition to the mainstream Transformer and Conformer, we also incorporate the recently proposed Branchformer and E-Branchformer as visual encoders. In the experiments, we explore the influence of different video data scales and encoders on ALR system performance and fuse the texts transcribed by all ALR systems using recognizer output voting error reduction (ROVER). Finally, our proposed approach placed second in the ICME 2024 ChatCLR Challenge Task 2, with a 21.52% reduction in character error rate (CER) compared to the official baseline on the evaluation set.
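
The abstract reports its headline result as a reduction in character error rate (CER). As a reminder of how that metric is commonly defined (character-level edit distance divided by the reference length), here is a minimal Python sketch. The function names `edit_distance` and `cer` are illustrative and not taken from the paper, and the challenge's official scoring script may normalize text differently.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("abcd", "abxd")` counts one substitution against four reference characters. Because insertions are counted too, CER can exceed 1.0 when the hypothesis is much longer than the reference.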

Comments:	6 pages, 3 figures, Accepted at ICMEW 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2404.05466 [cs.CV]
	(or arXiv:2404.05466v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.05466

Submission history

From: He Wang
[v1] Mon, 8 Apr 2024 12:44:24 UTC (170 KB)
[v2] Tue, 30 Apr 2024 15:51:21 UTC (170 KB)
