LAMP: Leveraging Language Prompts for Multi-person Pose Estimation

Hu, Shengnan; Zheng, Ce; Zhou, Zixiang; Chen, Chen; Sukthankar, Gita

arXiv:2307.11934 (cs)

[Submitted on 21 Jul 2023 (v1), last revised 26 Jul 2023 (this version, v2)]

Abstract: Human-centric visual understanding is an important desideratum for effective human-robot interaction. In order to navigate crowded public places, social robots must be able to interpret the activity of the surrounding humans. This paper addresses one key aspect of human-centric visual understanding, multi-person pose estimation. Achieving good performance on multi-person pose estimation in crowded scenes is difficult due to the challenges of occluded joints and instance separation. In order to tackle these challenges and overcome the limitations of image features in representing invisible body parts, we propose a novel prompt-based pose inference strategy called LAMP (Language Assisted Multi-person Pose estimation). By utilizing the text representations generated by a well-trained language model (CLIP), LAMP can facilitate the understanding of poses on the instance and joint levels, and learn more robust visual representations that are less susceptible to occlusion. This paper demonstrates that language-supervised training boosts the performance of single-stage multi-person pose estimation, and both instance-level and joint-level prompts are valuable for training. The code is available at this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2307.11934 [cs.CV]
	(or arXiv:2307.11934v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2307.11934

Submission history

From: Shengnan Hu
[v1] Fri, 21 Jul 2023 23:00:43 UTC (3,648 KB)
[v2] Wed, 26 Jul 2023 18:08:10 UTC (3,648 KB)
