Learning to Detect and Segment for Open Vocabulary Object Detection

Wang, Tao; Li, Nan

Computer Science > Computer Vision and Pattern Recognition

arXiv:2212.12130 (cs)

[Submitted on 23 Dec 2022 (v1), last revised 29 Apr 2023 (this version, v5)]

Title:Learning to Detect and Segment for Open Vocabulary Object Detection

Authors:Tao Wang, Nan Li

Abstract:Open vocabulary object detection has been greatly advanced by the recent development of vision-language pretrained models, which help recognize novel objects given only their semantic categories. Prior works mainly focus on transferring knowledge to object proposal classification and employ class-agnostic box and mask prediction. In this work, we propose CondHead, a principled dynamic network design that better generalizes box regression and mask segmentation to the open vocabulary setting. The core idea is to conditionally parameterize the network heads on the semantic embedding, so that the model is guided by class-specific knowledge to better detect novel categories. Specifically, CondHead is composed of two streams of network heads: the dynamically aggregated head and the dynamically generated head. The former is instantiated with a set of static heads that are conditionally aggregated; these heads are optimized as experts and are expected to learn sophisticated predictions. The latter is instantiated with dynamically generated parameters and encodes general class-specific information. With this conditional design, the semantic embedding serves as a bridge that lets the detection model offer strongly generalizable class-wise box and mask prediction. Our method brings significant improvement to state-of-the-art open vocabulary object detection methods with very minor overhead; e.g., it surpasses a RegionCLIP model by 3.0 detection AP on novel categories with only 1.1% more computation.
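The two-stream conditional head described in the abstract can be illustrated with a small, dependency-free sketch. The authors' code was not yet released, so everything below is hypothetical: the class and method names, the dimensions, the use of a single linear layer per head, the softmax router that turns the semantic embedding into expert weights, the linear generator that maps the embedding to head parameters, and the fusion of the two streams by summation are all assumptions chosen for clarity, not the paper's exact formulation. It shows only the box-regression branch; a mask branch would be conditioned the same way.

```python
import math
import random

# Hypothetical dimensions; none of these values come from the paper.
FEAT_DIM = 8      # region (RoI) feature size
EMBED_DIM = 6     # semantic (text) embedding size of a category
OUT_DIM = 4       # box-regression deltas (dx, dy, dw, dh)
NUM_EXPERTS = 3   # number of static expert heads in the aggregated stream


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    mx = max(z)
    exps = [math.exp(x - mx) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class CondHeadSketch:
    """Toy class-conditional box head (an assumption-laden sketch, not CondHead itself)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Stream 1: K static expert heads, each an OUT_DIM x FEAT_DIM weight matrix.
        self.experts = [rand_matrix(OUT_DIM, FEAT_DIM, rng) for _ in range(NUM_EXPERTS)]
        # Router: maps the semantic embedding to one logit per expert (assumed linear).
        self.router = rand_matrix(NUM_EXPERTS, EMBED_DIM, rng)
        # Stream 2: generator mapping the embedding to a flattened OUT_DIM x FEAT_DIM weight.
        self.generator = rand_matrix(OUT_DIM * FEAT_DIM, EMBED_DIM, rng)

    def expert_weights(self, embed):
        """Class-conditional mixing coefficients; they are positive and sum to 1."""
        return softmax(matvec(self.router, embed))

    def aggregated_head(self, feat, embed):
        """Dynamically aggregated head: convex combination of the static experts."""
        alphas = self.expert_weights(embed)
        w = [[sum(a * expert[r][c] for a, expert in zip(alphas, self.experts))
              for c in range(FEAT_DIM)]
             for r in range(OUT_DIM)]
        return matvec(w, feat)

    def generated_head(self, feat, embed):
        """Dynamically generated head: weights produced directly from the embedding."""
        flat = matvec(self.generator, embed)
        w = [flat[r * FEAT_DIM:(r + 1) * FEAT_DIM] for r in range(OUT_DIM)]
        return matvec(w, feat)

    def __call__(self, feat, embed):
        # Assumption: the two streams are fused by element-wise summation.
        agg = self.aggregated_head(feat, embed)
        gen = self.generated_head(feat, embed)
        return [a + g for a, g in zip(agg, gen)]


if __name__ == "__main__":
    rng = random.Random(1)
    feat = [rng.gauss(0, 1) for _ in range(FEAT_DIM)]
    # Stands in for the text embedding of a (possibly novel) category name.
    cat_embed = [rng.gauss(0, 1) for _ in range(EMBED_DIM)]
    head = CondHeadSketch()
    print(head(feat, cat_embed))
```

Because both streams take the category embedding as input, the same region feature yields different box deltas for different categories, which is the property the abstract contrasts with class-agnostic heads. The aggregated stream stays within the span of a few learned experts, while the generated stream can produce weights for any embedding, including those of categories unseen during training.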

Comments:	Accepted to CVPR 2023; code will be available later
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2212.12130 [cs.CV]
	(or arXiv:2212.12130v5 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2212.12130

Submission history

From: Tao Wang
[v1] Fri, 23 Dec 2022 03:54:59 UTC (3,314 KB)
[v2] Sat, 4 Feb 2023 04:12:03 UTC (3,452 KB)
[v3] Sat, 25 Mar 2023 02:10:59 UTC (3,455 KB)
[v4] Wed, 29 Mar 2023 01:05:39 UTC (3,455 KB)
[v5] Sat, 29 Apr 2023 01:29:39 UTC (3,455 KB)