Denoising-Diffusion Alignment for Continuous Sign Language Recognition

Guo, Leming; Xue, Wanli; Zhou, Yuxi; Kang, Ze; Yuan, Tiantian; Gao, Zan; Chen, Shengyong

arXiv:2305.03614v4 (cs)

[Submitted on 5 May 2023 (v1), last revised 3 May 2024 (this version, v4)]

Abstract: Continuous sign language recognition (CSLR) aims to promote active and accessible communication for the hearing impaired by sequentially recognizing the signs in untrimmed sign language videos as textual glosses. The key challenge of CSLR is achieving cross-modality alignment between videos and gloss sequences. However, current cross-modality paradigms for CSLR neglect to use the gloss context to guide video clips toward global temporal context alignment, which impairs the visual-to-gloss mapping and degrades recognition performance. To tackle this problem, we propose a novel Denoising-Diffusion global Alignment (DDA), which consists of a denoising-diffusion autoencoder and a DDA loss function. DDA leverages diffusion-based global alignment techniques to align a video with its gloss sequence, facilitating global temporal context alignment. Specifically, DDA first introduces an auxiliary condition diffusion to construct gloss-part noised bimodal representations of the video and gloss sequence. To address the problem that the recognition-oriented alignment knowledge represented in the diffusion denoising process cannot be fed back, DDA further proposes the Denoising-Diffusion Autoencoder, which adds a decoder to the auxiliary condition diffusion to denoise the partially noised bimodal representations via the designed DDA loss in a self-supervised manner. During denoising, the video clip representations are reliably guided to re-establish the global temporal context among them by denoising the gloss sequence representation. Experiments on three public benchmarks demonstrate that DDA achieves state-of-the-art performance and confirm the feasibility of DDA for video representation enhancement.
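
As a rough illustration of what a "gloss-part noised bimodal representation" could look like, the sketch below applies a standard DDPM-style forward noising step to the gloss features only and leaves the video clip features untouched. This is an assumption made for illustration, not the authors' implementation: the function names (`alpha_bar`, `noise_gloss_part`), the linear beta schedule, and the plain concatenation of the two modalities are all hypothetical choices that the abstract does not specify.

```python
import math
import random


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t under a linear beta schedule.

    The linear schedule is an assumption; the abstract does not name one.
    """
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / max(num_steps - 1, 1)
        prod *= 1.0 - beta
    return prod


def noise_gloss_part(video_feats, gloss_feats, t, num_steps=1000, rng=None):
    """Return [video features..., noised gloss features...] at diffusion step t.

    Only the gloss vectors receive Gaussian noise:
        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    The video vectors are copied unchanged so they can act as the condition.
    """
    rng = rng or random.Random(0)
    a_bar = alpha_bar(t, num_steps)
    scale_x = math.sqrt(a_bar)
    scale_eps = math.sqrt(1.0 - a_bar)
    noised_gloss = [
        [scale_x * x + scale_eps * rng.gauss(0.0, 1.0) for x in vec]
        for vec in gloss_feats
    ]
    return [list(v) for v in video_feats] + noised_gloss


# Hypothetical toy features: two video clips and one gloss, each 2-dimensional.
video = [[1.0, 2.0], [3.0, 4.0]]
gloss = [[0.5, -0.5]]
bimodal = noise_gloss_part(video, gloss, t=500)
```

In this sketch, a denoising decoder would then be trained to recover the clean gloss vectors from `bimodal`, which is one plausible reading of how denoising the gloss part could push alignment information into the video representations.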

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2305.03614 [cs.CV]
	(or arXiv:2305.03614v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.03614

Submission history

From: Leming Guo
[v1] Fri, 5 May 2023 15:20:27 UTC (996 KB)
[v2] Thu, 1 Jun 2023 02:23:02 UTC (6,269 KB)
[v3] Mon, 5 Feb 2024 17:15:26 UTC (1,096 KB)
[v4] Fri, 3 May 2024 04:11:55 UTC (1,522 KB)
