
License: arXiv.org perpetual non-exclusive license
arXiv:2310.19351v2 [cs.CV] 15 Mar 2024

The University of Tokyo, Tokyo, Japan
{furuta,ysato}@iis.u-tokyo.ac.jp

Seeking Flat Minima with Mean Teacher on Semi- and Weakly-Supervised Domain Generalization for Object Detection

Ryosuke Furuta (ORCID 0000-0003-1441-889X), Yoichi Sato (ORCID 0000-0003-0097-4537)

Abstract

Object detectors do not work well when the training and testing domains differ substantially. To overcome this domain gap in object detection without requiring expensive annotations, we consider two problem settings: semi-supervised domain generalizable object detection (SS-DGOD) and weakly-supervised DGOD (WS-DGOD). In contrast to conventional domain generalization for object detection, which requires labeled data from multiple domains, SS-DGOD and WS-DGOD require labeled data only from one domain and unlabeled or weakly-labeled data from multiple domains for training. In this paper, we show that object detectors can be effectively trained on the two settings with the same Mean Teacher learning framework, where a student network is trained with pseudo-labels output from a teacher on the unlabeled or weakly-labeled data. We provide novel interpretations of why the Mean Teacher learning framework works well on the two settings in terms of the relationships between the generalization gap and flat minima in parameter space. On the basis of the interpretations, we also propose incorporating a simple regularization method into the Mean Teacher learning framework to find flatter minima. The experimental results demonstrate that the regularization leads to flatter minima and boosts the performance of the detectors trained with the Mean Teacher learning framework on the two settings. They also indicate that those detectors significantly outperform the state-of-the-art methods.

Keywords: Object detection · Domain generalization · Semi-supervised learning · Weakly-supervised learning

1 Introduction

Object detection has attracted much attention because of its practical applications, such as autonomous driving. Object detectors have performed tremendously well on commonly used benchmark datasets for object detection, such as MSCOCO [19] and PASCAL VOC [7]. However, their performance drops significantly when they are deployed on unseen domains, i.e., when the training and testing domains are different. For example, Inoue et al. [13] reported a performance drop caused by the difference in image styles, and Li et al. [17] showed one caused by differences in weather and time of day in images captured with car-mounted cameras.

To solve this problem, many researchers have been exploring unsupervised domain adaptive object detection (UDA-OD) [6, 17, 5]. On UDA-OD, we train object detectors using source domain data with ground-truth labels (bounding boxes and class labels) and unlabeled target domain data to adapt the detectors to the target domain. However, in the real world, target domain data cannot always be accessed in the training phase.

Domain generalizable object detection (DGOD) is another common problem setting for addressing the performance drop caused by domain gaps [18, 40]. On DGOD, we train object detectors using labeled data from multiple domains so that the detectors work well on unseen domains. However, it is labor-intensive to collect these data for object detection because both bounding boxes and class labels for all objects in the images must be annotated. Although single-DGOD [36, 8, 29, 35, 32], on which we train object detectors to generalize to unseen domains using labeled data from one single domain, has been investigated, the performance gain is still limited.

In this paper, we tackle two tasks as more realistic settings: i) semi-supervised DGOD (SS-DGOD) [21] and ii) weakly-supervised DGOD (WS-DGOD). The goal of SS-DGOD is to generalize object detectors to unseen domains using labeled data only from one single domain and unlabeled data from multiple domains. Note that the target domain data are not included in the training data. On WS-DGOD, we use weakly labeled data from multiple domains instead of the unlabeled data in SS-DGOD. “Weakly labeled” means that we have only image-level labels that show the existence of each class in each training image and do not have bounding box annotations. To the best of our knowledge, this is the first attempt to tackle WS-DGOD. We show that object detectors can be effectively trained on the two settings with the same Mean Teacher learning framework, where a student network is trained with pseudo-labels output from a teacher on the unlabeled or weakly labeled data, and the teacher network is updated as the exponential moving average (EMA) of the student.

We not only experimentally demonstrate the good performance of the Mean Teacher learning framework but also provide novel interpretations of why it works well on these two settings in terms of the relationship between generalization ability and flat minima in parameter space. In the research area of domain generalization, it has been shown both theoretically and empirically that neural networks with flatter minima in parameter space have better generalization ability to unseen domains [9, 4, 3, 14, 2, 34, 15, 39]. We show that the two key components of the Mean Teacher learning framework, i) EMA update and ii) learning from pseudo-labels, lead to flat minima during the training.

On the basis of the interpretations, we also propose incorporating a regularization method into the Mean Teacher learning framework to find flatter minima. Specifically, because the teacher and the student have similar loss values around the flat minima, we introduce an additional loss term so that the output from the student network becomes similar to that from the teacher network.

The experimental results demonstrate that the detectors trained with the Mean Teacher learning framework perform well for unseen test domains on the two settings. We show that the simple yet effective regularization leads to flatter minima and boosts the performance of those detectors. We also confirm these detectors significantly outperform the state-of-the-art methods on the two settings.

Our contributions are summarized as follows:

  • We show that object detectors can be effectively trained on the SS-DGOD and WS-DGOD settings with the same Mean Teacher learning framework.

  • We provide interpretations of why the detectors trained with the Mean Teacher learning framework achieve robustness to unseen test domains in terms of the flatness of minima in parameter space.

  • On the basis of the interpretations, we propose incorporating a simple regularization method into the Mean Teacher learning framework to achieve flatter minima.

  • We are the first to tackle the WS-DGOD setting.

2 Problem Settings

We formally describe the two problem settings of SS-DGOD and WS-DGOD. Their goal is to obtain object detectors that perform well on unseen target domain data $\mathcal{D}_{t}=\{X_{t}\}$, where $X_{t}$ is a set of images from the target domain.

On SS-DGOD, we have labeled data from a source domain $\mathcal{D}_{s_{1}}=\{(X_{s_{1}},B_{s_{1}},C_{s_{1}})\}$ and unlabeled data from multiple source domains $\mathcal{D}_{s_{i}}=\{X_{s_{i}}\}_{i=2}^{N_{D}}$ in the training phase. Here, $X_{s_{1}}=\{x^{j}_{s_{1}}\}_{j=1}^{N_{s_{1}}}$ is a set of $N_{s_{1}}$ images from domain $s_{1}$. $B_{s_{1}}=\{b_{s_{1}}^{j}\}_{j=1}^{N_{s_{1}}}$ and $C_{s_{1}}=\{c_{s_{1}}^{j}\}_{j=1}^{N_{s_{1}}}$ are the corresponding bounding boxes and object-class labels, respectively. $s_{i}$ is the $i$-th source domain, and $N_{D}$ is the number of source domains. We assume that the data distributions differ between the domains, i.e., $P(X_{s_{1}})\neq P(X_{s_{2}})\neq\cdots\neq P(X_{s_{N_{D}}})\neq P(X_{t})$.

On WS-DGOD, we use labeled data from a source domain $\mathcal{D}_{s_{1}}=\{(X_{s_{1}},B_{s_{1}},C_{s_{1}})\}$ and weakly labeled data from multiple domains $\mathcal{D}_{s_{i}}=\{(X_{s_{i}},C_{s_{i}})\}_{i=2}^{N_{D}}$ for training.
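To make the difference between the two training sets concrete, the following is a minimal sketch of how the data of Sec. 2 could be represented. The `Sample` class, its field names, and the helper functions are hypothetical names introduced only for illustration; the box format is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

# (x_min, y_min, x_max, y_max); an assumed convention, not taken from the paper.
Box = Tuple[float, float, float, float]


@dataclass
class Sample:
    """One training image with whatever annotation its domain provides."""
    image_id: str                               # stands in for the image x
    domain: int                                 # index i of the source domain s_i
    boxes: Optional[List[Box]] = None           # b: boxes (fully labeled data only)
    box_classes: Optional[List[int]] = None     # c: one class per box (fully labeled only)
    image_classes: Optional[List[int]] = None   # image-level classes (weak label)


def supervision_level(sample: Sample) -> str:
    """Return 'full', 'weak', or 'none' depending on the available labels."""
    if sample.boxes is not None and sample.box_classes is not None:
        return "full"
    if sample.image_classes is not None:
        return "weak"
    return "none"


def setting_of(samples: List[Sample]) -> str:
    """Classify a training set as SS-DGOD or WS-DGOD following Sec. 2."""
    levels: Dict[int, Set[str]] = {}
    for s in samples:
        levels.setdefault(s.domain, set()).add(supervision_level(s))
    if levels.get(1) != {"full"}:
        raise ValueError("domain s_1 must be fully labeled")
    others: Set[str] = set()
    for domain, found in levels.items():
        if domain != 1:
            others |= found
    if others == {"none"}:
        return "SS-DGOD"
    if others == {"weak"}:
        return "WS-DGOD"
    raise ValueError("domains s_2..s_N must be all unlabeled or all weakly labeled")
```

For example, one fully labeled image from $s_1$ plus images without any labels from $s_2$ and $s_3$ would be classified as SS-DGOD, whereas attaching only `image_classes` to the latter images would yield WS-DGOD.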

Table 1 compares SS-DGOD and WS-DGOD with related problem settings (Single-DGOD, DGOD, and UDA-OD). As discussed in Sec. 1, DGOD requires labeled data from multiple domains $\mathcal{D}_{s_{i}}=\{(X_{s_{i}},B_{s_{i}},C_{s_{i}})\}_{i=1}^{N_{D}}$, but those data are sometimes hard to prepare due to the high annotation cost. In contrast, SS-DGOD (or WS-DGOD) requires labeled data from one domain $\mathcal{D}_{s_{1}}=\{(X_{s_{1}},B_{s_{1}},C_{s_{1}})\}$ and unlabeled data $\mathcal{D}_{s_{i}}=\{X_{s_{i}}\}_{i=2}^{N_{D}}$ (or weakly labeled data $\mathcal{D}_{s_{i}}=\{(X_{s_{i}},C_{s_{i}})\}_{i=2}^{N_{D}}$), which are easier to obtain. Therefore, SS-DGOD and WS-DGOD are more practical settings than DGOD. By using those data, we aim to better generalize object detectors to the unseen target domain data $\mathcal{D}_{t}=\{X_{t}\}$ than on Single-DGOD. Unlike on UDA-OD, we can train the detectors even when the target domain data $\mathcal{D}_{t}=\{X_{t}\}$ are not accessible.

Table 1: Formal comparisons of SS-DGOD, WS-DGOD, and related problem settings. DGOD stands for domain generalizable object detection, and SS-DGOD and WS-DGOD are semi-supervised DGOD and weakly-supervised DGOD, respectively. UDA-OD is unsupervised domain adaptive object detection.
Task | Train data | Test data
Single-DGOD | $\mathcal{D}_{s_{1}}=\{(X_{s_{1}},B_{s_{1}},C_{s_{1}})\}$ | $\mathcal{D}_{t}=\{X_{t}\}$
SS-DGOD | $\mathcal{D}_{s_{1}}=\{(X_{s_{1}},B_{s_{1}},C_{s_{1}})\}$, $\mathcal{D}_{s_{i}}=\{X_{s_{i}}\}_{i=2}^{N_{D}}$ | $\mathcal{D}_{t}=\{X_{t}\}$
WS-DGOD | $\mathcal{D}_{s_{1}}=\{(X_{s_{1}},B_{s_{1}},C_{s_{1}})\}$, $\mathcal{D}_{s_{i}}=\{(X_{s_{i}},C_{s_{i}})\}_{i=2}^{N_{D}}$ | $\mathcal{D}_{t}=\{X_{t}\}$
DGOD | $\mathcal{D}_{s_{i}}=\{(X_{s_{i}},B_{s_{i}},C_{s_{i}})\}_{i=1}^{N_{D}}$ | $\mathcal{D}_{t}=\{X_{t}\}$
UDA-OD | $\mathcal{D}_{s_{i}}=\{(X_{s_{i}},B_{s_{i}},C_{s_{i}})\}_{i=1}^{N_{D}}$, $\mathcal{D}_{t}=\{X_{t}\}$ | $\mathcal{D}_{t}=\{X_{t}\}$

3 Related Work

3.1 Domain Generalization for Image Classification

Many methods have been proposed for domain generalization on image classification tasks, as summarized in recent survey papers [42, 30]. Among a variety of domain generalization methods, finding flat minima is one of the most common approaches [9, 4, 3, 14, 2, 34, 15, 39]. Those studies empirically and theoretically showed that finding flat minima in parameter space results in a better generalization ability. Izmailov et al. [14] and Cha et al. [3] demonstrated that empirical risk minimization (ERM) with stochastic gradient descent (SGD) converges to the vicinity of a flat minimum, and that averaging the parameter weights over a certain number of training steps/epochs results in reaching the flat minimum. Cha et al. [3] also derived the theoretical relationship between flat minima and the generalization gap. Inspired by these findings, we reveal that the Mean Teacher learning framework leads to flat minima, and thus can obtain good generalization ability.
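The two notions used above, weight averaging and flatness, can be illustrated with a small sketch. It assumes parameters stored as plain lists of floats, and it uses a crude axis-aligned proxy for flatness (the worst loss increase under a fixed-size shift of one coordinate). This proxy and the function names are illustrative assumptions, not the measure used in the cited studies or in this paper.

```python
from typing import Callable, List


def average_weights(checkpoints: List[List[float]]) -> List[float]:
    """Uniformly average parameter vectors collected along a training run."""
    n = len(checkpoints)
    return [sum(values) / n for values in zip(*checkpoints)]


def sharpness(loss: Callable[[List[float]], float],
              theta: List[float],
              radius: float) -> float:
    """Worst loss increase when one coordinate is shifted by +/- radius.

    Smaller values indicate a flatter neighborhood around theta.
    """
    base = loss(theta)
    worst = 0.0
    for i in range(len(theta)):
        for sign in (-1.0, 1.0):
            shifted = list(theta)
            shifted[i] += sign * radius
            worst = max(worst, loss(shifted) - base)
    return worst


def sharp_loss(theta: List[float]) -> float:
    """Toy loss with high curvature around its minimum at 0."""
    return 10.0 * theta[0] ** 2


def flat_loss(theta: List[float]) -> float:
    """Toy loss with low curvature around its minimum at 0."""
    return 0.1 * theta[0] ** 2
```

With `radius=0.5`, the proxy is much larger for `sharp_loss` than for `flat_loss` at their common minimum, which matches the intuition that a small parameter shift changes the loss far less in a flat minimum.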

3.2 Domain Generalization for Object Detection

Domain generalization for object detection has not been widely explored, compared with image classification. Lin et al. [18] proposed a method for disentangling domain-specific and domain-invariant features by adversarial learning on both image-level and instance-level features for DGOD. Liu et al. [20] investigated DGOD in underwater object detection and proposed DG-YOLO. For Single-DGOD, Wang et al. [35] proposed a self-training method that uses the temporal consistency of objects in videos. Wu et al. [36] proposed a method for disentangling domain-invariant features by contrastive learning and self-distillation. Fan et al. [8] proposed perturbing the channel statistics of feature maps, which can be interpreted as data augmentation of image styles to a variety of domains. Wang et al. [32] proposed a disentanglement method in frequency space for object detection from unmanned aerial vehicles. Vidit et al. [29] proposed an augmentation method using a pre-trained vision-language model (CLIP) with textual prompts.

Unlike the above methods, as discussed in Sects. 1 and 2, we tackle SS-DGOD and a new problem setting called WS-DGOD. SS-DGOD and WS-DGOD are more practical than DGOD because labeled data are required only from one domain. The work most closely related to ours is that of Malakouti and Kovashka [21], who tackled SS-DGOD and proposed a language-guided alignment method. However, a limitation of their method is that it requires a backbone network pre-trained on vision-and-language tasks. Our experiments show that the object detectors trained with the Mean Teacher learning framework and our regularization outperform their method.

3.3 Semi-supervised Domain Generalization

There are a few methods that use both labeled and unlabeled data for domain generalization (SSDG) on image classification [41, 43]. Zhang et al. [41] proposed an unsupervised pre-training method called DARLING, which performs contrastive learning on unlabeled images to obtain domain-irrelevant feature representation. Zhou et al. [43] extended a semi-supervised learning method called FixMatch [26] to SSDG.

In contrast to those studies, we tackle SSDG for object detection. We also tackle the “weakly-labeled” setting (i.e., WS-DGOD), which has not been explored even for image classification.

3.4 Mean Teacher Learning Framework

The Mean Teacher learning framework was originally proposed for semi-supervised image classification [28]. Several studies have investigated the use of the Mean Teacher learning framework for a variety of tasks such as domain generalization on image classification [37], (in-domain) weakly-supervised object detection [33], (in-domain) semi-supervised object detection [22], UDA-OD [6, 17], and UDA for semantic segmentation [1, 31, 12, 38]. Lee et al. [16] provided a theoretical analysis of the Mean Teacher learning framework on masked image modeling pretext tasks for semi-supervised image classification. We show that the Mean Teacher learning framework also works well on different settings (SS-DGOD and WS-DGOD), provide interpretations of why it does, and propose incorporating a regularization method.

Figure 1: Training framework.

4 Training Method

4.1 Overview and Key Idea

On both SS-DGOD and WS-DGOD, our goal is to obtain object detectors that work well on the unseen target domain data $\mathcal{D}_{t}=\{X_{t}\}$. Gulrajani and Lopez-Paz [10] reported that, if carefully implemented, empirical risk minimization (i.e., an image classifier simply trained with supervised learning on multiple domains) outperformed state-of-the-art domain generalization methods on several benchmark datasets for image classification. Following this important finding, we expect similar behavior on object detection and aim to train an object detector on multiple domains $\mathcal{D}_{s_{i}}\ (i=1,\cdots,N_{D})$. However, we have no ground-truth labels (or have only weak labels) for $\mathcal{D}_{s_{i}}\ (i=2,\cdots,N_{D})$, although ground-truth labels are available for $\mathcal{D}_{s_{1}}$. Therefore, the question is how to train a detector on those domains. Our solution is to use the Mean Teacher learning framework for object detection [17, 5] shown in Fig. 1, where we have two networks (teacher and student) with the same structure and train the student network using the pseudo-labels generated by the teacher network. Note that this Mean Teacher learning framework can be applied to any object detector, but we hereafter describe the loss functions of Faster R-CNN [23] as an example for ease of explanation.
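The interplay of the two networks can be sketched as follows: the student is updated by gradient steps, and the teacher follows it as an exponential moving average (EMA). This is a toy sketch under stated assumptions: parameters are a plain dictionary of floats, the quadratic loss merely stands in for the detection loss on pseudo-labels (no pseudo-labels are generated here), and the names `ema_update`, `toy_training`, and the decay values are hypothetical choices for illustration rather than settings from this paper.

```python
from typing import Dict, Tuple

Params = Dict[str, float]


def ema_update(teacher: Params, student: Params, alpha: float = 0.999) -> Params:
    """Return alpha * teacher + (1 - alpha) * student, parameter by parameter."""
    return {k: alpha * teacher[k] + (1.0 - alpha) * student[k] for k in teacher}


def toy_training(steps: int = 100,
                 lr: float = 0.1,
                 alpha: float = 0.9) -> Tuple[Params, Params]:
    """Run a toy teacher-student loop on the scalar loss w**2."""
    student: Params = {"w": 5.0}
    teacher: Params = dict(student)  # both networks start from the same parameters
    for _ in range(steps):
        # Gradient of the toy loss w**2; a stand-in for the student's training loss.
        grad = 2.0 * student["w"]
        student = {"w": student["w"] - lr * grad}
        # The teacher is never trained directly; it only tracks the student.
        teacher = ema_update(teacher, student, alpha)
    return teacher, student
```

Because the teacher averages the student's recent trajectory, it changes more smoothly than the student and lags slightly behind it, which is the property that the EMA-based interpretation in this paper builds on.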

4.2 Pre-training

If we start the Mean Teacher learning from randomly initialized parameters, the teacher network cannot output reliable pseudo-labels. Therefore, we first perform supervised learning with the labeled data of one source domain $\mathcal{D}_{s_{1}}=\{(X_{s_{1}},B_{s_{1}},C_{s_{1}})\}$.

\begin{align}
\theta^{*} &= \operatorname*{arg\,min}_{\theta}\mathcal{L}^{\mathrm{sup}}_{s_{1}}(\theta) \tag{1}\\
\mathcal{L}^{\mathrm{sup}}_{s_{1}}(\theta) &= \mathcal{L}^{\mathrm{cls}}_{\mathrm{RPN}}(\theta,X_{s_{1}},B_{s_{1}},C_{s_{1}})+\mathcal{L}^{\mathrm{reg}}_{\mathrm{RPN}}(\theta,X_{s_{1}},B_{s_{1}},C_{s_{1}}) \nonumber\\
&\quad +\mathcal{L}^{\mathrm{cls}}_{\mathrm{RoI}}(\theta,X_{s_{1}},B_{s_{1}},C_{s_{1}})+\mathcal{L}^{\mathrm{reg}}_{\mathrm{RoI}}(\theta,X_{s_{1}},B_{s_{1}},C_{s_{1}}), \tag{4}
\end{align}

where $\mathcal{L}^{\mathrm{cls}}_{\mathrm{RPN}}$ and $\mathcal{L}^{\mathrm{reg}}_{\mathrm{RPN}}$ are the classification and regression losses for the region proposal network (RPN), respectively, and $\mathcal{L}^{\mathrm{cls}}_{\mathrm{RoI}}$ and $\mathcal{L}^{\mathrm{reg}}_{\mathrm{RoI}}$ are those for the RoI head. We initialize both the teacher and student networks with the parameters $\theta^{*}$ obtained from this pre-training.

4.3 Mean Teacher Learning

4.3.1 Generate Pseudo-labels

Because we have no ground-truth labels (or have only weak labels) for the other source domains $\mathcal{D}_{s_i}$ $(i = 2, \cdots, N_D)$, we generate pseudo labels using the teacher network. Specifically, we apply weak data augmentation to the unlabeled (or weakly-labeled) image $x_{s_i}^{j}$ and input it into the teacher network. We denote the output from the teacher as $\{(\hat{b}_{s_i}^{jr}, \hat{p}_{s_i}^{jr})\}_{r=1}^{N_R}$, where $\hat{b}_{s_i}^{jr}$ and $\hat{p}_{s_i}^{jr}$ are the predicted bounding box and class probabilities for the $r$-th region of interest (RoI) in the $j$-th image, respectively, and $N_R$ is the number of output RoIs.

In the case of SS-DGOD, we simply apply post-processing $f_{\mathrm{post}}$ to $(\hat{b}_{s_i}^{jr}, \hat{p}_{s_i}^{jr})$ and obtain the pseudo label $(\bar{b}_{s_i}^{jr}, \bar{c}_{s_i}^{jr})$:

$$(\bar{b}_{s_i}^{jr}, \bar{c}_{s_i}^{jr}) = f_{\mathrm{post}}(\hat{b}_{s_i}^{jr}, \hat{p}_{s_i}^{jr}). \tag{5}$$

Post-processing $f_{\mathrm{post}}$ is a simple thresholding function when we use “hard” pseudo labels as in [17], and a sharpening function when we use “soft” pseudo labels as in [5].
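The two variants of post-processing can be sketched as follows. This is a minimal illustration, not the implementation of [17] or [5]: the function names, the confidence threshold of 0.7, the temperature of 0.5, and the power-based sharpening rule are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]


def hard_post_process(box: Box, probs: List[float],
                      threshold: float = 0.7) -> Optional[Tuple[Box, int]]:
    """Hard pseudo label: keep the box only if its top class score reaches the threshold.

    The threshold value is an assumption chosen for illustration.
    """
    best_class = max(range(len(probs)), key=lambda k: probs[k])
    if probs[best_class] < threshold:
        return None  # low-confidence RoI yields no pseudo label
    return box, best_class


def soft_post_process(box: Box, probs: List[float],
                      temperature: float = 0.5) -> Tuple[Box, List[float]]:
    """Soft pseudo label: sharpen the distribution as p_k^(1/T) / sum_l p_l^(1/T).

    This sharpening rule and the temperature are assumptions, not taken from [5].
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    if total == 0.0:
        return box, list(probs)  # nothing to sharpen
    return box, [p / total for p in powered]
```

With a temperature below 1, the sharpened distribution puts more mass on the most likely class while still summing to one, whereas the hard variant discards every RoI whose best score falls below the threshold.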

In the case of WS-DGOD, we refine the predicted class probabilities $\hat{p}_{s_i}^{jr}$ with the weak labels immediately before post-processing $f_{\mathrm{post}}$ to obtain more accurate pseudo labels:

$$\hat{p}_{s_i}^{jr}(k) = \begin{cases} \hat{p}_{s_i}^{jr}(k) & \text{if}\ k \in c_{s_i}^{j} \\ 0 & \text{otherwise} \end{cases} \tag{6}$$
$$(\bar{b}_{s_i}^{jr}, \bar{c}_{s_i}^{jr}) = f_{\mathrm{post}}(\hat{b}_{s_i}^{jr}, \hat{p}_{s_i}^{jr}), \tag{7}$$

where $\hat{p}_{s_i}^{jr}(k)$ is the predicted class probability for the $k$-th class. Using the weak label $c_{s_i}^{j}$, Eq. (6) makes the predicted probability zero if the $k$-th class does not exist in the $j$-th image.
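The masking in Eq. (6) can be written in a few lines. The sketch below assumes that the weak label is given as a set of class indices present in the image; the function name is hypothetical, and any background class (which this section does not discuss) is ignored.

```python
from typing import List, Set


def refine_with_weak_labels(probs: List[float], image_classes: Set[int]) -> List[float]:
    """Eq. (6): zero the probability of every class absent from the image-level label."""
    return [p if k in image_classes else 0.0 for k, p in enumerate(probs)]


# An RoI whose top prediction (class 0) is ruled out by the weak label {1, 2}.
refined = refine_with_weak_labels([0.5, 0.3, 0.2], {1, 2})
```

After refinement, class 0 can no longer be selected by the subsequent post-processing, so the pseudo label falls back to the best class that is actually present in the image.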

4.3.2 Update Student

Now we have the pseudo labels $\bar{B}_{s_i} = \{\bar{b}_{s_i}^{j}\}_{j=1}^{N_{s_i}}$ and $\bar{C}_{s_i} = \{\bar{c}_{s_i}^{j}\}_{j=1}^{N_{s_i}}$ and train the student network with them.

We apply strong data augmentation to the image $x_{s_i}^{j}$ and input it into the student network. In domain $s_1$, because the ground-truth labels are available, we update the student by backpropagating the loss $\mathcal{L}^{\mathrm{sup}}_{s_1}$ in Eq. (4). In the other domains $s_i$ $(i = 2, \cdots, N_D)$, we calculate the loss $\mathcal{L}^{\mathrm{unsup}}_{s_i}$ using the pseudo labels and backpropagate it to update the student. In summary, we update the student parameters $\theta^{\mathrm{student}}$ with the loss $\mathcal{L}^{\mathrm{student}}$ as follows:

$$\theta^{\mathrm{student}} \leftarrow \theta^{\mathrm{student}} - \nabla_{\theta} \mathcal{L}^{\mathrm{student}}(\theta) \tag{8}$$
$$\mathcal{L}^{\mathrm{student}}(\theta) = \mathcal{L}^{\mathrm{sup}}_{s_1}(\theta) + \sum_{i=2}^{N_D} \mathcal{L}^{\mathrm{unsup}}_{s_i}(\theta) \tag{9}$$
$$
\begin{aligned}
\mathcal{L}^{\mathrm{unsup}}_{s_i}(\theta) ={}& \mathcal{L}^{\mathrm{cls}}_{\mathrm{RPN}}(\theta, X_{s_i}, \bar{B}_{s_i}, \bar{C}_{s_i}) + \mathcal{L}^{\mathrm{reg}}_{\mathrm{RPN}}(\theta, X_{s_i}, \bar{B}_{s_i}, \bar{C}_{s_i}) \\
&+ \mathcal{L}^{\mathrm{cls}}_{\mathrm{RoI}}(\theta, X_{s_i}, \bar{B}_{s_i}, \bar{C}_{s_i}) + \mathcal{L}^{\mathrm{reg}}_{\mathrm{RoI}}(\theta, X_{s_i}, \bar{B}_{s_i}, \bar{C}_{s_i}).
\end{aligned}
\tag{12}
$$
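To make the structure of Eqs. (8) and (9) concrete, the toy sketch below performs one student update on a single scalar parameter. Everything in it is an assumption for illustration: the quadratic losses stand in for the detection losses, and the step size `lr` does not appear in Eq. (8) (setting `lr=1.0` recovers the update exactly as written).

```python
from typing import Callable, List

GradFn = Callable[[float], float]


def student_step(theta: float, sup_grad: GradFn, unsup_grads: List[GradFn],
                 lr: float = 1.0) -> float:
    """One update of Eq. (8) using the summed loss of Eq. (9) on a scalar parameter."""
    # The gradient of a sum of losses is the sum of the per-domain gradients.
    total_grad = sup_grad(theta) + sum(g(theta) for g in unsup_grads)
    return theta - lr * total_grad


def quadratic_grad(center: float) -> GradFn:
    """Gradient of the toy loss (theta - center)^2, a stand-in for a detection loss."""
    return lambda theta: 2.0 * (theta - center)


# Labeled domain s_1 plus two pseudo-labeled domains s_2 and s_3.
new_theta = student_step(0.0, quadratic_grad(1.0),
                         [quadratic_grad(2.0), quadratic_grad(3.0)], lr=0.1)
```

The labeled domain and each pseudo-labeled domain contribute one gradient term, so the student moves toward parameters that fit all source domains at once.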

4.3.3 Update Teacher

Similar to previous studies [5, 17], we do not update the teacher parameters $\theta^{\mathrm{teacher}}$ by backpropagation, so that the pseudo labels remain stable. Instead, we update them with the exponential moving average (EMA) of the student parameters:

$$\theta^{\mathrm{teacher}} \leftarrow \alpha \theta^{\mathrm{teacher}} + (1 - \alpha) \theta^{\mathrm{student}}, \tag{13}$$

where $\alpha$ is a hyperparameter that controls the update speed.
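A minimal sketch of the EMA update in Eq. (13), assuming the parameters are stored as flat lists of floats. The function name and the default `alpha = 0.999` are illustrative assumptions, not values stated in this section.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float],
               alpha: float = 0.999) -> List[float]:
    """Eq. (13): teacher <- alpha * teacher + (1 - alpha) * student, element-wise."""
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of parameters")
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


# With alpha = 0.9 the teacher moves 10% of the way toward the student.
updated = ema_update([1.0, 2.0], [3.0, 0.0], alpha=0.9)
```

Applied after every student step, this update makes the teacher a weighted average of past student parameters in which older iterates receive geometrically smaller weights. A value of $\alpha$ close to 1 therefore gives a slowly changing teacher.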

5 Why Does Mean Teacher Become Robust to Unseen Domains?

We provide novel interpretations of why the Mean Teacher learning framework works well on SS-DGOD and WS-DGOD settings in terms of the relationship between generalization ability and flat minima in parameter space. We show that the two key components of the Mean Teacher learning framework, i) EMA update and ii) learning from pseudo labels, lead to flat minima during the training.

5.1 Definition

We define an empirical risk as $\mathcal{E}_{\mathrm{ER}}(\theta) := \sum_{i=1}^{N_D} \mathcal{L}^{\mathrm{sup}}_{s_i}(\theta)$ when we assume that ground-truth labels are available on all the training domains. A risk at the target domain is defined as $\mathcal{E}_{t}(\theta) := \mathcal{L}^{\mathrm{sup}}_{t}(\theta)$. The goal is to minimize the test risk $\mathcal{E}_{t}(\theta)$ by only solving the empirical risk minimization (ERM), i.e., $\min_{\theta} \mathcal{E}_{\mathrm{ER}}(\theta)$. Hereafter, we use the terms risk and loss interchangeably.

Figure 2: Intuitive interpretation of flat minimum and its robustness to domain shift.
Figure 3: Empirical and robust risks.

5.2 Preliminary Knowledge

Previous studies on domain generalization demonstrated both theoretically and empirically that neural networks with flatter minima in parameter space exhibit superior generalization ability to unseen domains [9, 4, 3, 14, 2, 34, 15, 39]. Fig. 2 gives an intuitive interpretation: when there is a domain shift between training and testing, the flat minimum of the training loss results in a lower test loss than the sharp minimum.

Cha et al. [3] theoretically revealed the relationship between flat minima and the generalization gap (i.e., the performance drop caused by domain shift). We briefly describe the theorem for the subsequent explanation. We consider the worst-case loss within a neighboring region in parameter space, which is defined as a robust risk $\mathcal{E}_{\mathrm{RR}}^{\gamma}(\theta) := \max_{\|\Delta\| \leq \gamma} \mathcal{E}_{\mathrm{ER}}(\theta + \Delta)$. Here, $\gamma$ is the radius of the neighboring region. As shown in Fig. 3, when $\gamma$ is sufficiently large, sharp minima of the empirical risk are not minima of the robust risk. In contrast, the minima of the robust risk (i.e., $\operatorname*{arg\,min}_{\theta} \mathcal{E}_{\mathrm{RR}}^{\gamma}(\theta)$) are also minima in the flat regions of the empirical risk. The following theorem relates the optimal solution of robust risk minimization (RRM) to the generalization gap:

Theorem (from [3]).

Consider a set of $N$ covers $\{\Theta_k\}_{k=1}^{N}$ such that the parameter space $\Theta \subset \cup_{k}^{N} \Theta_k$, where $\mathrm{diam}(\Theta) := \sup_{\theta, \theta' \in \Theta} \|\theta - \theta'\|_2$, $N := \lceil (\mathrm{diam}(\Theta)/\gamma)^{d} \rceil$, and $d$ is the dimension of $\Theta$. Let $\theta^{\gamma}$ denote the optimal solution of the RRM, i.e., $\theta^{\gamma} := \operatorname*{arg\,min}_{\theta} \mathcal{E}_{\mathrm{RR}}^{\gamma}(\theta)$, and let $v_k$ and $v$ be the VC dimensions of each $\Theta_k$ and $\Theta$, respectively. Then, the gap between the optimal test loss, $\min_{\theta'} \mathcal{E}_{t}(\theta')$, and the test loss of $\theta^{\gamma}$, $\mathcal{E}_{t}(\theta^{\gamma})$, has the following bound with probability of at least $1 - \delta$.

$$
\begin{aligned}
\mathcal{E}_{t}(\theta^{\gamma}) - \min_{\theta'} \mathcal{E}_{t}(\theta') \leq{}& \mathcal{E}_{\mathrm{RR}}^{\gamma}(\theta^{\gamma}) - \min_{\theta''} \mathcal{E}_{\mathrm{ER}}(\theta'') + \frac{1}{N_D} \sum_{i=1}^{N_D} \mathrm{Div}(s_i, t) \\
&+ \max_{k \in [1, N]} \sqrt{\frac{v_k \ln(m / v_k) + \ln(2N / \delta)}{m}} + \sqrt{\frac{v \ln(m / v) + \ln(2 / \delta)}{m}},
\end{aligned}
\tag{14}
$$

where $m$ is the number of training samples and $\mathrm{Div}(s_i, t) := 2 \sup_{A} |\mathbb{P}_{s_i}(A) - \mathbb{P}_{t}(A)|$ is a divergence between two distributions.

For its proof, see [3]. From the theorem, the generalization gap in the test domain (i.e., $\mathcal{E}_{t}(\theta^{\gamma}) - \min_{\theta'} \mathcal{E}_{t}(\theta')$) is upper bounded by the gap between the RRM and ERM (i.e., $\mathcal{E}_{\mathrm{RR}}^{\gamma}(\theta^{\gamma}) - \min_{\theta''} \mathcal{E}_{\mathrm{ER}}(\theta'')$) plus terms that do not depend on the learned parameters $\theta^{\gamma}$. Intuitively, as shown in Fig. 3, the gap between the RRM and ERM narrows in flat regions of the empirical risk. Therefore, we can interpret that lowering this gap leads to flat minima of the ERM and results in better generalization performance on the target domain.
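The link between flatness and the RRM-ERM gap can be checked numerically on a toy problem. The sketch below is only an illustration under assumed values: the one-dimensional risk with a sharp minimum at $-1$ and a flat minimum at $+1$, the radius `gamma = 0.2`, and the grid-based maximization are hypothetical choices, not taken from the paper or from [3].

```python
from typing import Callable


def empirical_risk(theta: float) -> float:
    """Toy 1-D risk: sharp minimum at theta = -1, flat minimum at theta = +1, both 0."""
    sharp = 50.0 * (theta + 1.0) ** 2
    flat = 0.5 * (theta - 1.0) ** 2
    return min(sharp, flat)


def robust_risk(risk: Callable[[float], float], theta: float,
                gamma: float, num_points: int = 401) -> float:
    """Approximate max over |delta| <= gamma of risk(theta + delta) on a uniform grid."""
    deltas = [-gamma + 2.0 * gamma * i / (num_points - 1) for i in range(num_points)]
    return max(risk(theta + d) for d in deltas)


gamma = 0.2
# Gap between robust and empirical risk at each of the two minima.
gap_sharp = robust_risk(empirical_risk, -1.0, gamma) - empirical_risk(-1.0)
gap_flat = robust_risk(empirical_risk, 1.0, gamma) - empirical_risk(1.0)
```

In this toy setting both minima have an empirical risk of 0, yet the gap is about 2.0 at the sharp minimum and about 0.02 at the flat one, so the robust risk is minimized in the flat region.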

5.3 EMA Update

We explain why the EMA update in the Mean Teacher learning framework leads to flat minima. Mandt et al. [27] showed that optimizing with constant SGD (i.e., SGD with a fixed learning rate) converges to a Gaussian distribution centered on the optimum. On the basis of this finding, Izmailov et al. [14] and Cha et al. [3] showed that the ERM with SGD converges to the marginal of a flat minimum, and averaging the weights of the parameters over some training steps/epochs leads to the flat minima. Although they [14, 3] proposed sophisticated algorithms for averaging the weights to avoid overfitting, we found that a simple EMA leads to flat minima. The experiments presented in Sec. 7 show that the teacher network with only the EMA update of the student (i.e., without pseudo labeling) as shown in Eqs. (15-16) can reach flatter minima and perform better than the student.

\begin{align}
\theta^{\mathrm{student}} &\leftarrow \theta^{\mathrm{student}} - \nabla_{\theta}\mathcal{L}^{\mathrm{student}}(\theta), \quad \mathcal{L}^{\mathrm{student}}(\theta) = \mathcal{L}^{\mathrm{sup}}_{s_1}(\theta) \tag{15} \\
\theta^{\mathrm{teacher}} &\leftarrow \alpha\theta^{\mathrm{teacher}} + (1-\alpha)\theta^{\mathrm{student}}. \tag{16}
\end{align}
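The updates in Eqs. (15-16) can be sketched in a few lines of Python. This is a toy example on a one-parameter quadratic loss, not the detector training code; the learning rate `lr` (which Eq. (15) leaves implicit), the loss $L(w) = w^2$, and all function names are assumptions made for illustration.

```python
from typing import List, Tuple


def sgd_step(student: List[float], grads: List[float], lr: float) -> List[float]:
    """Eq. (15): gradient step on the student parameters (lr is an assumed step size)."""
    return [w - lr * g for w, g in zip(student, grads)]


def ema_update(teacher: List[float], student: List[float], alpha: float) -> List[float]:
    """Eq. (16): exponential moving average of the student parameters."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def run_toy(steps: int = 200, alpha: float = 0.9, lr: float = 1.0) -> Tuple[List[float], List[float]]:
    """Train on the toy loss L(w) = w^2 with a step size large enough to oscillate."""
    student = [1.0]
    teacher = list(student)  # the teacher starts as a copy of the student
    for _ in range(steps):
        grads = [2.0 * w for w in student]  # gradient of w^2
        student = sgd_step(student, grads, lr)
        teacher = ema_update(teacher, student, alpha)
    return student, teacher


if __name__ == "__main__":
    student, teacher = run_toy()
    print("student:", student[0], "loss:", student[0] ** 2)
    print("teacher:", teacher[0], "loss:", teacher[0] ** 2)
```

With these settings the student keeps jumping between the two walls of the valley, whereas the averaged teacher settles near the bottom, which is the behavior that the weight-averaging argument above relies on.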
Figure 4: Intuitive interpretation of difference between loss values of trajectory of student and their mean (teacher).
Figure 5: Overview of the regularization method.

5.4 Learning from Pseudo Labels

We explain why learning from pseudo-labels in the Mean Teacher learning framework leads to flat minima. Assuming that the pseudo-labels from the teacher are accurate enough (i.e., similar enough to the ground truth), $\mathcal{L}^{\mathrm{unsup}}_{s_i}$ in Eq. (9) can be approximated by $\mathcal{L}^{\mathrm{sup}}_{s_i}$, and we can regard the student network as the ERM in Sec. 5.1. On the other hand, as explained in Sec. 5.3 and shown in the experiments, because the teacher network updated with EMA has a better ability to reach flat minima than the student, we can regard the teacher network as the RRM. Therefore, from Eq. (14), the smaller the difference between the losses of the teacher and student, the smaller the generalization gap in the target domain. Fig. 4 shows an intuitive interpretation. In a flat region, the trajectory of the student over the training steps and its mean (the teacher) have similar loss values. In contrast, in a sharp valley, there is a large difference between the loss values of the student's trajectory and those of its mean.

Next, we show that learning from pseudo-labels in the Mean Teacher learning framework makes the losses of the student and teacher similar. Because the student is trained with the output from the teacher as pseudo-ground truth, the training encourages the outputs from the student to be similar to those from the teacher. When we use loss functions $\mathcal{E}$ that are monotonically increasing/decreasing with respect to the outputs (e.g., the cross-entropy loss $\mathcal{E}(p) = -p_{gt}\log(p)$), the more similar the outputs are, the more similar the loss values are, as shown below:

Proposition.

Assume $p_1 < p_2 < p_3 \in \mathbb{R}$, and $\mathcal{E}(p): \mathbb{R} \rightarrow \mathbb{R}$ is a strictly monotonically increasing/decreasing function of $p$. Then, $|\mathcal{E}(p_3) - \mathcal{E}(p_2)| < |\mathcal{E}(p_3) - \mathcal{E}(p_1)|$ holds.

Let us consider $p_3$ as the teacher's output, and $p_2$ and $p_1$ as the outputs of the student. Since $p_2$ is closer to $p_3$ than $p_1$ is, the loss of $p_2$ is more similar to the loss of $p_3$ than that of $p_1$ is. Therefore, we can interpret that learning from pseudo-labels aligns the outputs from the student with those from the teacher, thereby aligning the loss values and consequently leading to flat minima.
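The proposition follows directly from monotonicity. A short argument, assuming $\mathcal{E}$ is strictly monotonic (the strict inequality would not hold for a function that is constant on $[p_1, p_2]$), is:

```latex
\begin{align*}
&\text{If } \mathcal{E} \text{ is strictly increasing, then }
  \mathcal{E}(p_1) < \mathcal{E}(p_2) < \mathcal{E}(p_3), \text{ so} \\
&\quad |\mathcal{E}(p_3) - \mathcal{E}(p_2)|
  = \mathcal{E}(p_3) - \mathcal{E}(p_2)
  < \mathcal{E}(p_3) - \mathcal{E}(p_1)
  = |\mathcal{E}(p_3) - \mathcal{E}(p_1)|. \\
&\text{If } \mathcal{E} \text{ is strictly decreasing, then }
  \mathcal{E}(p_1) > \mathcal{E}(p_2) > \mathcal{E}(p_3), \text{ so} \\
&\quad |\mathcal{E}(p_3) - \mathcal{E}(p_2)|
  = \mathcal{E}(p_2) - \mathcal{E}(p_3)
  < \mathcal{E}(p_1) - \mathcal{E}(p_3)
  = |\mathcal{E}(p_3) - \mathcal{E}(p_1)|.
\end{align*}
```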

6 Regularization for Flatter Minima

6.1 Method

As discussed in Sec. 5, when the outputs from the student and teacher are similar, the networks tend to reach flat minima. We therefore propose incorporating a simple regularization method that makes the two networks’ outputs more similar by training the student on the raw outputs from the teacher.

Fig. 5 shows an overview of the method. The concept is to apply regularization so that the outputs from the two networks are similar for the same input image. Specifically, we apply weak data augmentation to the unlabeled (or weakly labeled) image $x_{s_i}^{j}$ and input the image into the teacher network. We then use the output from the teacher $\{(\hat{b}_{s_i}^{jr}, \hat{p}_{s_i}^{jr})\}_{r=1}^{N_R}$ directly as pseudo-ground truth, without the post-processing in Eq. (5). To update the student, we input the same weakly augmented image $x_{s_i}^{j}$ into the student and calculate the loss as follows:

\begin{align}
\theta^{\mathrm{student}} &\leftarrow \theta^{\mathrm{student}} - \nabla_{\theta}\mathcal{L}^{\mathrm{student}}(\theta) \tag{17} \\
\mathcal{L}^{\mathrm{student}}(\theta) &= \mathcal{L}^{\mathrm{sup}}_{s_1}(\theta) + \sum_{i=2}^{N_D}\left[\mathcal{L}^{\mathrm{unsup}}_{s_i}(\theta) + \beta\mathcal{L}^{\mathrm{regul.}}_{s_i}(\theta)\right] \tag{18} \\
\mathcal{L}^{\mathrm{regul.}}_{s_i}(\theta) &= \mathcal{L}^{\mathrm{cls}}_{\mathrm{RPN}}(\theta, X_{s_i}, \hat{B}_{s_i}, \hat{C}_{s_i}) + \mathcal{L}^{\mathrm{reg}}_{\mathrm{RPN}}(\theta, X_{s_i}, \hat{B}_{s_i}, \hat{C}_{s_i}) \notag \\
&\quad + \mathcal{L}^{\mathrm{cls}}_{\mathrm{RoI}}(\theta, X_{s_i}, \hat{B}_{s_i}, \hat{C}_{s_i}) + \mathcal{L}^{\mathrm{reg}}_{\mathrm{RoI}}(\theta, X_{s_i}, \hat{B}_{s_i}, \hat{C}_{s_i}), \tag{21}
\end{align}

where $\hat{B}_{s_i} = \{\hat{b}_{s_i}^{j}\}_{j=1}^{N_{s_i}}$ and $\hat{C}_{s_i} = \{\hat{c}_{s_i}^{j}\}_{j=1}^{N_{s_i}}$ are the raw pseudo-labels from the teacher, and $\beta$ is a hyperparameter to tune the strength of the regularization.
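The composition of the losses in Eqs. (18) and (21) can be sketched as follows, assuming the individual loss terms have already been computed as scalars by the detector. The function and argument names are hypothetical; only the weighting by $\beta$ and the sum over the unlabeled (or weakly labeled) domains follow the equations.

```python
from typing import Sequence


def regul_loss(rpn_cls: float, rpn_reg: float, roi_cls: float, roi_reg: float) -> float:
    """Eq. (21): sum of the four detector losses computed against raw teacher outputs."""
    return rpn_cls + rpn_reg + roi_cls + roi_reg


def student_loss(
    sup_loss: float,
    unsup_losses: Sequence[float],
    regul_losses: Sequence[float],
    beta: float = 0.5,
) -> float:
    """Eq. (18): supervised loss on s_1 plus, for each domain s_2..s_{N_D},
    the pseudo-label loss and the beta-weighted regularization loss."""
    if len(unsup_losses) != len(regul_losses):
        raise ValueError("expected one unsupervised and one regularization loss per domain")
    return sup_loss + sum(u + beta * r for u, r in zip(unsup_losses, regul_losses))


if __name__ == "__main__":
    # Two unlabeled domains (s_2, s_3) with made-up loss values.
    total = student_loss(1.0, unsup_losses=[0.5, 0.25], regul_losses=[2.0, 4.0], beta=0.5)
    print("student loss:", total)
```

Setting `beta=0.0` in this sketch recovers the Mean Teacher objective without the regularization term.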

6.2 Connection to Prior Art

We can regard the regularization method as a type of knowledge distillation because the student is trained to mimic the raw output from the teacher. Although the technical details are different, it has been empirically shown that knowledge distillation methods are effective on related tasks such as Single-DGOD [36] and domain adaptive semantic segmentation [38]. We believe that our interpretation revealed one of the reasons knowledge-distillation methods lead to better generalization ability.

7 Experiments

7.1 Dataset Details

We used the artistic style image dataset [13], which has four domains: natural image, clipart, comic, and watercolor. The natural image domain has 16,551 images from PASCAL VOC07&12, and the other domains have 1,000, 2,000, and 2,000 images, respectively. There are six object classes (bike, bird, car, cat, dog, and person), and we removed the images that do not contain these classes.

We conducted the experiments on two patterns of domains. In the first pattern, we set the natural image domain as the labeled domain $s_1$ and set clipart and comic as the unlabeled domains $s_2, s_3$. We set watercolor as the target domain $t$. Concretely, we used the trainval set of PASCAL VOC 2007&2012, the train set of clipart, and the train set of comic for training. We then used the test sets of clipart and comic for validation. For evaluation (testing), we used the test set of watercolor.

In the second pattern, we set $(s_1, s_2, s_3, t) = (\mathrm{natural}, \mathrm{watercolor}, \mathrm{comic}, \mathrm{clipart})$. The results on another dataset are shown in the supplementary material.

7.2 Implementation Details

We used soft pseudo labeling proposed in [5] for the Mean Teacher learning. We used Gaussian FasterRCNN [5] as the object detector, in which the regression output is modified to use the soft labels. We used ResNet101 [11] as the backbone. We applied the same hyperparameters as in a previous study [5] except for the number of iterations. All training (including baseline models) was done with four A100 GPUs. The parameters of the backbone network were initialized with the ResNet101 pre-trained on ImageNet. The hyperparameter $\beta$ in Eq. (18) was set to 0.5 throughout the experiments. During the inference (testing) phase, we used the teacher network. Other details are given in the supplementary material.

7.3 Baseline Methods

As the baseline, we trained the detector Gaussian FasterRCNN in the Single-DGOD setting (i.e., supervised learning on $s_1$ in Eqs. (1-4)). To show the effectiveness of the EMA update, we trained Gaussian FasterRCNN + EMA with Eqs. (15-16). Gaussian FasterRCNN + EMA + PL is a detector trained with the Mean Teacher learning framework in Eqs. (8-13). Gaussian FasterRCNN + EMA + PL + Regul. is a detector trained with the Mean Teacher learning framework and the regularization in Eqs. (17-21).

To confirm the upper-bound performance, we also trained Gaussian FasterRCNN on DGOD and Oracle settings. On DGOD, the detector was trained with supervised learning using the ground-truth labels on the domains $s_1, s_2$, and $s_3$. On Oracle, the detector was trained with supervised learning on $s_1, s_2, s_3$, and the target domain $t$.

Because there is only one existing method on SS-DGOD (i.e., CDDMSL [21]), we compared the above detectors with state-of-the-art methods on related task settings such as Single-DGOD and UDA-OD.

7.4 Comparisons with State-of-the-Art Methods

Table 2: Comparisons of mAP50 on the artistic style image dataset [13] when the target domain is watercolor. Values with * are from previous study [17].
| setting | method | backbone | mAP50 (watercolor) |
|---|---|---|---|
| Single-DGOD | CLIP-based augmentation [29] | Res101 | 46.6 |
| Single-DGOD | Gaussian FasterRCNN | Res101 | 50.5 |
| Single-DGOD | Gaussian FasterRCNN + EMA | Res101 | 55.5 |
| SS-DGOD | CDDMSL [21] | Res50 (RegionCLIP) | 46.1 |
| SS-DGOD | CDDMSL [21] | Res101 | 41.3 |
| SS-DGOD | Gaussian FasterRCNN + EMA + PL | Res101 | 56.6 |
| SS-DGOD | Gaussian FasterRCNN + EMA + PL + Regul. | Res101 | 58.2 |
| WS-DGOD | Gaussian FasterRCNN + EMA + PL | Res101 | 59.7 |
| WS-DGOD | Gaussian FasterRCNN + EMA + PL + Regul. | Res101 | 62.9 |
| DGOD | Gaussian FasterRCNN | Res101 | 62.6 |
| Oracle | Gaussian FasterRCNN | Res101 | 62.2 |
| UDA-OD | Gaussian FasterRCNN [5] | Res101 | 54.9 |
| UDA-OD | SCL* [25] | Res101 | 55.2 |
| UDA-OD | SWDA* [24] | Res101 | 53.3 |
| UDA-OD | UMT* [6] | Res101 | 58.1 |
| UDA-OD | AT* [17] | Res101 | 59.9 |
Table 3: Comparisons of mAP50 on the artistic style image dataset [13] when the target domain is clipart.
| setting | method | backbone | mAP50 (clipart) |
|---|---|---|---|
| Single-DGOD | CLIP-based augmentation [29] | Res101 | 27.2 |
| Single-DGOD | Gaussian FasterRCNN | Res101 | 34.5 |
| Single-DGOD | Gaussian FasterRCNN + EMA | Res101 | 38.0 |
| SS-DGOD | CDDMSL [21] | Res50 (RegionCLIP) | 39.1 |
| SS-DGOD | CDDMSL [21] | Res101 | 26.0 |
| SS-DGOD | Gaussian FasterRCNN + EMA + PL | Res101 | 39.8 |
| SS-DGOD | Gaussian FasterRCNN + EMA + PL + Regul. | Res101 | 43.3 |
| WS-DGOD | Gaussian FasterRCNN + EMA + PL | Res101 | 44.2 |
| WS-DGOD | Gaussian FasterRCNN + EMA + PL + Regul. | Res101 | 46.2 |
| DGOD | Gaussian FasterRCNN | Res101 | 47.1 |
| Oracle | Gaussian FasterRCNN | Res101 | 48.2 |
| UDA-OD | Gaussian FasterRCNN [5] | Res101 | 43.4 |

Table 2 shows the results on the artistic style image dataset when $(s_1, s_2, s_3, t) =$ (natural, clipart, comic, watercolor). We evaluated with the mean average precision at an IoU threshold of 0.5 (mAP50). EMA increased the mAP from 50.5 to 55.5, and this was further boosted to 56.6 with pseudo labeling (PL). We observed additional improvement to 58.2 with the regularization. The regularization improved the performance not only on SS-DGOD but also on WS-DGOD. Those results are comparable to those of the detectors trained on DGOD and Oracle. The detectors trained on SS-DGOD and WS-DGOD also performed comparably to or better than those on UDA-OD, although we did not use the target domain data during the training.
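For reference, the IoU criterion behind mAP50 can be sketched as below. This is a minimal illustration, assuming axis-aligned boxes in `(x1, y1, x2, y2)` format and an inclusive threshold; the actual evaluation toolkit may differ in such details, and the function names are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


def matches(pred: Box, gt: Box, threshold: float = 0.5) -> bool:
    """A same-class prediction counts as a match when IoU reaches the threshold.
    Whether the boundary is inclusive is an assumption of this sketch."""
    return iou(pred, gt) >= threshold


if __name__ == "__main__":
    gt = (0.0, 0.0, 2.0, 2.0)
    print(iou((1.0, 0.0, 3.0, 2.0), gt))      # half-shifted box
    print(matches((0.0, 0.0, 2.0, 1.8), gt))  # slightly shorter box
```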

For fair comparisons, we trained CDDMSL with Res101 backbone pre-trained on ImageNet. However, its performance significantly degraded, as reported in a previous study [21], because it requires language-guided training, and initializing the model with RegionCLIP is crucial to achieve good performance.

Table 3 shows the results when $(s_1, s_2, s_3, t) =$ (natural, watercolor, comic, clipart). Similarly to Table 2, performance improved with EMA, PL, and the regularization.

7.5 Analysis of Flatness

Figure 6: Left and right plots compare average training and test flatness, respectively.

To evaluate the flatness of the detectors in parameter space, following previous studies [14, 3], we computed the change in loss values when we perturb the parameters. Specifically, we sampled a random direction vector $d$ on a unit sphere, perturbed the parameters ($\theta' = \theta + d\gamma$) with a radius $\gamma$, and computed the average change over ten samples, i.e., $\mathcal{F}^{\gamma}(\theta) = \mathbb{E}_{\theta'}|\mathcal{E}(\theta') - \mathcal{E}(\theta)|$. The lower the change is, the flatter the loss surface around the parameters.
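The flatness measure can be sketched as follows. This is a minimal illustration in plain Python, assuming the parameters are a flat list of floats and `loss_fn` is any callable returning a scalar loss; the function names and the quadratic toy losses are hypothetical, and the measurements reported here use the detector's losses rather than these stand-ins.

```python
import math
import random
from typing import Callable, List, Sequence


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Sample a direction uniformly on the unit sphere by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def flatness(
    loss_fn: Callable[[Sequence[float]], float],
    theta: Sequence[float],
    gamma: float,
    num_samples: int = 10,
    seed: int = 0,
) -> float:
    """Average |loss(theta + gamma * d) - loss(theta)| over random unit directions d."""
    rng = random.Random(seed)
    base = loss_fn(theta)
    total = 0.0
    for _ in range(num_samples):
        d = random_unit_vector(len(theta), rng)
        perturbed = [t + gamma * di for t, di in zip(theta, d)]
        total += abs(loss_fn(perturbed) - base)
    return total / num_samples


def sharp_loss(theta: Sequence[float]) -> float:
    """Toy loss with high curvature."""
    return 25.0 * sum(x * x for x in theta)


def flat_loss(theta: Sequence[float]) -> float:
    """Toy loss with low curvature."""
    return 0.25 * sum(x * x for x in theta)


if __name__ == "__main__":
    theta = [0.0, 0.0, 0.0]
    print("sharp:", flatness(sharp_loss, theta, gamma=1.0))
    print("flat: ", flatness(flat_loss, theta, gamma=1.0))
```

A lower value indicates a flatter region, so the low-curvature toy loss yields the smaller number.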

Fig. 6 shows the $\mathcal{F}^{\gamma}(\theta)$ of the training loss $\mathcal{E}(\theta) = \sum_{i}\mathcal{L}^{\mathrm{sup}}_{s_i}(\theta)$ and the test loss $\mathcal{E}(\theta) = \mathcal{L}^{\mathrm{sup}}_{t}(\theta)$. The training domains were $(s_1, s_2, s_3) =$ (natural, watercolor, comic), and the test domain was clipart. We can see that EMA, PL, and the regularization lowered the changes in the losses on both the training domains and the test domain. In other words, each contributed to reaching flatter minima.

8 Conclusions

We tackled two problem settings called semi-supervised domain generalizable object detection (SS-DGOD) and weakly-supervised DGOD (WS-DGOD) to train object detectors that can generalize to unseen domains. We showed that the object detectors can be effectively trained on the two settings with the same Mean Teacher learning framework. We also provided the interpretations of why the detectors trained with the Mean Teacher framework become robust to the unseen domains in terms of the flatness in the parameter space. On the basis of the interpretations, we proposed incorporating a regularization method to lead to flatter minima, which makes the loss value of the student similar to that of the teacher. The experimental results showed that the detectors trained with the Mean Teacher learning framework and the regularization performed significantly better than the state-of-the-art methods.

In future work, we are planning to investigate the robustness of the Mean Teacher framework against unseen domains on different settings such as UDA, or different tasks such as semantic segmentation.

Supplementary Material

9 More Analysis

9.1 How Sensitive Is the Method to the Hyperparameter $\beta$?

Table 4 shows the performance when the hyperparameter $\beta$ in Eq. (18) (i.e., the strength of the regularization) was changed from 0 to 1. Adding the regularization consistently improved the performance over the detector without regularization (i.e., $\beta = 0$).

Table 4: mAP50 with various $\beta$ on the artistic style image dataset [13].
| setting | method | $\beta$ | mAP50 (clipart) |
|---|---|---|---|
| SS-DGOD | Gaussian FasterRCNN + EMA + PL | 0.0 | 39.8 |
| SS-DGOD | Gaussian FasterRCNN + EMA + PL + Regul. | 0.25 | 40.7 |
| SS-DGOD | Gaussian FasterRCNN + EMA + PL + Regul. | 0.5 | 43.3 |
| SS-DGOD | Gaussian FasterRCNN + EMA + PL + Regul. | 0.75 | 42.1 |
| SS-DGOD | Gaussian FasterRCNN + EMA + PL + Regul. | 1.0 | 42.5 |

9.2 Comparison of Regularization with and without Post-processing

In the regularization described in Sec. 6.1, we use the raw outputs from the teacher without post-processing to train the student so that the outputs from the two networks are similar. To validate the claim, we compare the performance with and without post-processing (i.e., sharpening function [5]) in the regularization in Eq. (21). Table 5 shows that the performance drops when we perform the post-processing. We observe that using raw outputs is important to obtain better performance.

Table 5: mAP50 with and without post-processing on the artistic style image dataset [13].
| setting | method | post-processing | mAP50 (clipart) |
|---|---|---|---|
| SS-DGOD | Gaussian FasterRCNN + EMA + PL + Regul. | no | 43.3 |
| SS-DGOD | Gaussian FasterRCNN + EMA + PL + Regul. | yes | 39.4 |

9.3 Class-wise Average Precision

Tables 6 and 7 show the average precision (AP50) for each class when the target domain is watercolor and clipart, respectively. We can see that the regularization improved the performance on many classes.
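The mAP column appears to be the unweighted mean of the six class-wise AP50 values, which can be checked against a row of Table 6. The sketch below uses the WS-DGOD row with regularization on watercolor; the function and variable names are hypothetical.

```python
from typing import Dict


def mean_ap(class_aps: Dict[str, float]) -> float:
    """Unweighted mean of class-wise average precision values."""
    return sum(class_aps.values()) / len(class_aps)


# Class-wise AP50 of "WS-DGOD Gaussian FasterRCNN + EMA + PL + Regul." on watercolor (Table 6).
WS_DGOD_REGUL_WATERCOLOR = {
    "bicycle": 95.8,
    "bird": 59.9,
    "cat": 51.5,
    "car": 53.3,
    "dog": 40.2,
    "person": 76.7,
}

if __name__ == "__main__":
    print("mAP50:", mean_ap(WS_DGOD_REGUL_WATERCOLOR))  # reported value in Table 6: 62.9
```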

Table 6: Comparisons of AP50 at each class on watercolor of the artistic style image dataset [13]. The values of * are from [17].
| setting | method | bicycle | bird | cat | car | dog | person | mAP |
|---|---|---|---|---|---|---|---|---|
| Single-DGOD | CLIP-based augmentation [29] | 74.8 | 37.3 | 36.8 | 40.7 | 29.2 | 59.9 | 46.4 |
| Single-DGOD | Gaussian FasterRCNN | 90.4 | 47.9 | 30.3 | 46.7 | 28.7 | 59.2 | 50.5 |
| Single-DGOD | Gaussian FasterRCNN + EMA | 86.2 | 54.3 | 35.3 | 53.5 | 34.5 | 69.0 | 55.5 |
| SS-DGOD | CDDMSL [43] (RegionCLIP) | 66.3 | 50.6 | 34.5 | 49.2 | 20.1 | 56.0 | 46.1 |
| SS-DGOD | CDDMSL [43] (Res101) | 75.5 | 36.1 | 23.9 | 40.7 | 19.7 | 52.0 | 41.3 |
| SS-DGOD | Gaussian FasterRCNN + EMA + PL | 87.4 | 54.6 | 40.0 | 51.9 | 32.4 | 73.1 | 56.6 |
| SS-DGOD | Gaussian FasterRCNN + EMA + PL + Regul. | 87.2 | 52.3 | 44.7 | 53.2 | 36.8 | 75.3 | 58.2 |
| WS-DGOD | Gaussian FasterRCNN + EMA + PL | 90.3 | 55.8 | 49.3 | 49.9 | 37.5 | 75.4 | 59.7 |
| WS-DGOD | Gaussian FasterRCNN + EMA + PL + Regul. | 95.8 | 59.9 | 51.5 | 53.3 | 40.2 | 76.7 | 62.9 |
| DGOD | Gaussian FasterRCNN | 84.8 | 57.8 | 51.0 | 50.8 | 51.8 | 79.3 | 62.6 |
| Oracle | Gaussian FasterRCNN | 90.9 | 59.9 | 44.2 | 53.1 | 46.7 | 78.3 | 62.2 |
| UDAOD | Gaussian FasterRCNN + EMA + PL [5] | 77.7 | 46.5 | 40.4 | 50.1 | 39.7 | 75.0 | 54.9 |
| UDAOD | SCL* [25] | 82.2 | 55.1 | 51.8 | 39.6 | 38.4 | 64.0 | 55.2 |
| UDAOD | SWDA* [24] | 82.3 | 55.9 | 46.5 | 32.7 | 35.5 | 66.7 | 53.3 |
| UDAOD | UMT* [6] | 88.2 | 55.3 | 51.7 | 39.8 | 43.6 | 69.9 | 58.1 |
| UDAOD | AT* [17] | 93.6 | 56.1 | 58.9 | 37.3 | 39.6 | 73.8 | 59.9 |

Table 7: Comparisons of AP50 at each class on clipart of the artistic style image dataset [13].
| setting | method | bicycle | bird | cat | car | dog | person | mAP |
|---|---|---|---|---|---|---|---|---|
| Single-DGOD | CLIP-based augmentation [29] | 36.5 | 22.5 | 20.1 | 25.0 | 8.8 | 50.4 | 27.2 |
| Single-DGOD | Gaussian FasterRCNN | 69.5 | 25.1 | 5.7 | 39.4 | 17.3 | 49.9 | 34.5 |
| Single-DGOD | Gaussian FasterRCNN + EMA | 87.6 | 29.3 | 5.5 | 30.1 | 18.3 | 57.2 | 38.0 |
| SS-DGOD | CDDMSL [43] (RegionCLIP) | 51.0 | 33.3 | 26.5 | 45.2 | 14.6 | 63.8 | 39.1 |
| SS-DGOD | CDDMSL [43] (Res101) | 41.6 | 19.2 | 5.5 | 26.7 | 12.3 | 50.9 | 26.0 |
| SS-DGOD | Gaussian FasterRCNN + EMA + PL | 75.8 | 31.2 | 9.4 | 33.1 | 20.4 | 69.1 | 39.8 |
| SS-DGOD | Gaussian FasterRCNN + EMA + PL + Regul. | 79.3 | 32.5 | 11.6 | 40.9 | 26.3 | 69.0 | 43.3 |
| WS-DGOD | Gaussian FasterRCNN + EMA + PL | 80.3 | 33.3 | 11.1 | 44.5 | 23.2 | 72.6 | 44.2 |
| WS-DGOD | Gaussian FasterRCNN + EMA + PL + Regul. | 84.8 | 33.2 | 23.8 | 43.0 | 22.1 | 70.1 | 46.2 |
| DGOD | Gaussian FasterRCNN | 76.0 | 34.8 | 18.8 | 38.3 | 36.9 | 77.6 | 47.1 |
| Oracle | Gaussian FasterRCNN | 70.4 | 38.8 | 26.1 | 52.9 | 27.5 | 73.4 | 48.2 |
| UDAOD | Gaussian FasterRCNN + EMA + PL [5] | 79.9 | 33.5 | 6.5 | 53.1 | 23.7 | 65.2 | 43.6 |

9.4 Qualitative Results

(a) Gaussian FasterRCNN trained on Single-DGOD setting (i.e., labeled data on PASCAL VOC07&12).
(b) Gaussian FasterRCNN + EMA + PL + Regul. trained on SS-DGOD setting (i.e., labeled data on PASCAL VOC07&12 and unlabeled data on clipart and comic).
Figure 7: Qualitative comparisons on watercolor.
(a) Gaussian FasterRCNN trained on Single-DGOD setting (i.e., labeled data on PASCAL VOC07&12).
(b) Gaussian FasterRCNN + EMA + PL + Regul. trained on SS-DGOD setting (i.e., labeled data on PASCAL VOC07&12 and unlabeled data on watercolor and comic).
Figure 8: Qualitative comparisons on clipart.

Figs. 7 and 8 show the qualitative comparisons on watercolor and clipart, respectively. We observe that the false negatives of the baseline model were drastically reduced.

10 Results on Car-mounted Camera Dataset [36]

10.1 Dataset Details

In this dataset, the domains are defined as different times and weather: daytime-sunny, night-sunny, daytime-foggy, dusk-rainy, and night-rainy. The number of images for each domain is 27,708, 18,310, 2,642, 3,501, and 2,494, respectively. We used daytime-sunny as the labeled domain $s_{1}$ and used night-sunny and daytime-foggy as the unlabeled (or weakly-labeled) domains $s_{2}, s_{3}$. We used each of the remaining domains (dusk-rainy and night-rainy) as the target domain. Because the train/val/test split is not publicly available for daytime-sunny, dusk-rainy, and night-rainy, we used all images of daytime-sunny, the trainval set of night-sunny, and the trainval set of daytime-foggy for training. We then used the test set of night-sunny and the test set of daytime-foggy for validation. We used all images of dusk-rainy and night-rainy for evaluation (testing). There are seven object classes: bus, bike, car, motor, person, rider, and truck.
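The domain roles described above can be written down as a small configuration. The following is only an illustrative sketch: the dictionary keys and the helper names (`SPLIT`, `images_in`, `check_split`) are assumptions made for this example and do not come from the authors' code. The counts are the per-domain totals stated in the text, not the sizes of the trainval/test subsets.

```python
# Image counts per domain, as stated in the text (whole-domain totals).
DOMAIN_SIZES = {
    "daytime-sunny": 27708,
    "night-sunny": 18310,
    "daytime-foggy": 2642,
    "dusk-rainy": 3501,
    "night-rainy": 2494,
}

# Role of each domain in the experiments (key names are hypothetical).
SPLIT = {
    "labeled_source": ["daytime-sunny"],                     # s_1
    "unlabeled_or_weak": ["night-sunny", "daytime-foggy"],   # s_2, s_3
    "target": ["dusk-rainy", "night-rainy"],                 # evaluation only
}

CLASSES = ["bus", "bike", "car", "motor", "person", "rider", "truck"]


def images_in(role: str) -> int:
    """Total number of images over all domains assigned to a role."""
    return sum(DOMAIN_SIZES[d] for d in SPLIT[role])


def check_split() -> None:
    """Every domain must be assigned to exactly one role."""
    assigned = [d for domains in SPLIT.values() for d in domains]
    assert sorted(assigned) == sorted(DOMAIN_SIZES)


check_split()
for role in SPLIT:
    print(role, images_in(role))
```

Keeping the target domains in a separate role makes it explicit that dusk-rainy and night-rainy images are never seen during training or validation.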

10.2 Comparisons with State-of-the-Art Methods

Table 8: Comparisons of mAP50 on the car-mounted camera dataset [36]. The values of * and ** were from [36] and [29], respectively.
setting method backbone mAP50 (dusk-rainy) mAP50 (night-rainy)
Single-DGOD FasterRCNN* Res101 26.6 14.5
Single-DGOD CDSD* [36] Res101 28.2 16.6
Single-DGOD CLIP-based augmentation**[29] Res101 32.3 18.7
Single-DGOD Gaussian FasterRCNN Res101 25.3 13.3
Single-DGOD Gaussian FasterRCNN + EMA Res101 36.0 19.0
SS-DGOD Gaussian FasterRCNN + EMA + PL Res101 30.3 21.3
SS-DGOD Gaussian FasterRCNN + EMA + PL + Regul. Res101 31.2 21.9
WS-DGOD Gaussian FasterRCNN + EMA + PL Res101 30.5 22.5
WS-DGOD Gaussian FasterRCNN + EMA + PL + Regul. Res101 32.5 23.1
DGOD Gaussian FasterRCNN Res101 28.4 21.2

Table 8 shows the results on the car-mounted camera dataset. Each of EMA, PL, and the regularization improved the performance on both target domains, except that PL degraded the performance on dusk-rainy (from 36.0 to 30.3 on SS-DGOD and to 30.5 on WS-DGOD). We will investigate the cause of this performance drop in future work.

With the regularization, the mAP50 of the detector is boosted to (32.5, 23.1) on WS-DGOD. This result exceeds (32.3, 18.7), which is the result of the state-of-the-art method on Single-DGOD [29]. It is also better than the result (28.4, 21.2) of the model trained with supervised learning on the three domains (DGOD).
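The contribution of each component can be read off Table 8 by subtracting consecutive rows. The sketch below does this arithmetic; the mAP50 values are copied from Table 8, while the dictionary layout and the `gain` helper are assumptions introduced only for illustration.

```python
# mAP50 values (dusk-rainy, night-rainy) copied from Table 8.
TABLE8 = {
    "Single / base": (25.3, 13.3),          # Gaussian FasterRCNN
    "Single / +EMA": (36.0, 19.0),
    "SS / +EMA+PL": (30.3, 21.3),
    "SS / +EMA+PL+Regul.": (31.2, 21.9),
    "WS / +EMA+PL": (30.5, 22.5),
    "WS / +EMA+PL+Regul.": (32.5, 23.1),
    "DGOD / base": (28.4, 21.2),
}


def gain(after: str, before: str) -> tuple:
    """Per-domain mAP50 difference between two rows, rounded to 0.1."""
    return tuple(round(a - b, 1) for a, b in zip(TABLE8[after], TABLE8[before]))


steps = [
    ("EMA", "Single / +EMA", "Single / base"),
    ("PL (SS)", "SS / +EMA+PL", "Single / +EMA"),
    ("Regul. (SS)", "SS / +EMA+PL+Regul.", "SS / +EMA+PL"),
    ("PL (WS)", "WS / +EMA+PL", "Single / +EMA"),
    ("Regul. (WS)", "WS / +EMA+PL+Regul.", "WS / +EMA+PL"),
]
for name, after, before in steps:
    print(f"{name:12s} dusk-rainy / night-rainy gain: {gain(after, before)}")
```

The only negative entries are the dusk-rainy gains of PL, which matches the exception noted in the text.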

10.3 Analysis of Flatness

Fig. 9 shows the average change of the training loss at each domain when the parameters are perturbed ($\mathcal{F}^{\gamma}(\theta)=\mathbb{E}_{\theta^{\prime}}|\mathcal{E}(\theta^{\prime})-\mathcal{E}(\theta)|$, described in Sec. 7.5), and Fig. 10 shows that of the test loss. Each of EMA, PL, and the regularization lowered the changes in the losses at every domain when the radius is 125 or smaller, although EMA lowered the changes the most when the radius is extremely large (>125). In other words, each contributed to falling into flatter minima with a sufficiently large radius.
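A flatness measure of this form can be estimated by Monte Carlo sampling. The following is a minimal sketch, not the authors' implementation: it assumes that the perturbed parameters $\theta^{\prime}$ are drawn uniformly from the sphere of radius $\gamma$ around $\theta$ (the actual sampling scheme is the one defined in Sec. 7.5), and it uses toy quadratic losses in place of the detector loss. The function and variable names are hypothetical.

```python
import numpy as np


def flatness(loss_fn, theta, gamma, n_samples=100, seed=0):
    """Monte Carlo estimate of E_{theta'} |loss(theta') - loss(theta)|.

    Assumption: theta' lies on the sphere of radius gamma centred at theta.
    """
    rng = np.random.default_rng(seed)
    base = loss_fn(theta)
    changes = []
    for _ in range(n_samples):
        direction = rng.standard_normal(theta.shape)
        direction /= np.linalg.norm(direction)   # random unit vector
        perturbed = theta + gamma * direction    # ||theta' - theta|| = gamma
        changes.append(abs(loss_fn(perturbed) - base))
    return float(np.mean(changes))


# Toy stand-ins for the detection loss (hypothetical, for illustration only).
def sharp_loss(theta):
    return 10.0 * float(np.sum(theta ** 2))


def flat_loss(theta):
    return 0.1 * float(np.sum(theta ** 2))


theta_star = np.zeros(5)  # both toy losses attain their minimum here
for gamma in (0.5, 1.0, 2.0):
    print(gamma,
          flatness(sharp_loss, theta_star, gamma),
          flatness(flat_loss, theta_star, gamma))
```

A smaller value at the same radius indicates a flatter minimum, which is how the curves in Figs. 9 and 10 are compared.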

Figure 9: Average training flatness at each training domain.
(a) Daytime-clear
(b) Daytime-foggy
(c) Night-sunny
Figure 10: Average test flatness at each target domain.
(a) Dusk-rainy
(b) Night-rainy

10.4 Class-wise Average Precision

Tables 9 and 10 show the class-wise average precision on the dusk-rainy and night-rainy domains, respectively. Each of EMA, PL, and the regularization contributes to improving the performance on many classes, except for the performance drop caused by PL on dusk-rainy.
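As a sanity check on how the mAP column relates to the per-class columns, the mAP is the unweighted mean of the class-wise AP50 values. The short sketch below reproduces it for one method; the AP50 values are copied from the WS-DGOD Gaussian FasterRCNN + EMA + PL + Regul. rows of Tables 9 and 10, and the helper name `mean_ap` is an assumption made for this example.

```python
CLASSES = ("bus", "bike", "car", "motor", "person", "rider", "truck")

# AP50 per class for WS-DGOD Gaussian FasterRCNN + EMA + PL + Regul.,
# copied from Tables 9 (dusk-rainy) and 10 (night-rainy).
AP50 = {
    "dusk-rainy": (41.7, 22.3, 62.1, 11.2, 25.3, 18.9, 45.9),
    "night-rainy": (38.3, 13.4, 46.2, 2.7, 15.1, 14.0, 32.0),
}


def mean_ap(per_class):
    """Unweighted mean over classes."""
    return sum(per_class) / len(per_class)


for domain, values in AP50.items():
    assert len(values) == len(CLASSES)
    print(f"{domain}: mAP50 = {mean_ap(values):.1f}")
```

The printed values agree with the mAP entries reported for this method (32.5 and 23.1).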

Table 9: Comparisons of AP50 at each class on dusk-rainy of the car-mounted camera dataset [36]. The values of * and ** are from [36] and [29], respectively.
setting method bus bike car motor person rider truck mAP
Single-DGOD FasterRCNN* 36.8 15.8 50.1 12.8 18.9 12.4 39.5 26.6
Single-DGOD CDSD* [36] 37.1 19.6 50.9 13.4 19.7 16.3 40.7 28.2
Single-DGOD CLIP-based augmentation** [29] 37.8 22.8 60.7 16.8 26.8 18.7 42.4 32.3
Single-DGOD Gaussian FasterRCNN 33.9 14.9 53.6 4.2 17.4 13.6 39.2 25.3
Single-DGOD Gaussian FasterRCNN + EMA 46.3 24.9 65.9 11.9 29.1 23.7 50.0 36.0
SS-DGOD Gaussian FasterRCNN + EMA + PL 40.0 17.3 61.0 8.0 23.6 17.1 45.1 30.3
SS-DGOD Gaussian FasterRCNN + EMA + PL + Regul. 40.8 20.1 61.8 7.8 23.6 18.3 46.2 31.2
WS-DGOD Gaussian FasterRCNN + EMA + PL 39.0 19.4 60.4 9.4 23.8 17.3 44.0 30.5
WS-DGOD Gaussian FasterRCNN + EMA + PL + Regul. 41.7 22.3 62.1 11.2 25.3 18.9 45.9 32.5
DGOD Gaussian FasterRCNN 36.2 18.2 61.3 7.3 18.4 15.9 41.9 28.4

Table 10: Comparisons of AP50 at each class on night-rainy of the car-mounted camera dataset [36]. The values of * and ** are from [36] and [29], respectively.
setting method bus bike car motor person rider truck mAP
Single-DGOD FasterRCNN* 22.6 11.5 27.7 0.4 10.0 10.5 19.0 14.5
Single-DGOD CDSD* [36] 24.4 11.6 29.5 9.8 10.5 11.4 19.2 16.6
Single-DGOD CLIP-based augmentation** [29] 28.6 12.1 36.1 9.2 12.3 9.6 22.9 18.7
Single-DGOD Gaussian FasterRCNN 20.4 7.7 31.0 0.5 6.8 5.6 21.3 13.3
Single-DGOD Gaussian FasterRCNN + EMA 33.9 11.1 38.5 0.8 10.5 8.8 29.2 19.0
SS-DGOD Gaussian FasterRCNN + EMA + PL 35.7 9.8 46.7 1.4 12.6 10.8 32.0 21.3
SS-DGOD Gaussian FasterRCNN + EMA + PL + Regul. 37.0 10.3 46.3 2.8 12.9 12.0 31.8 21.9
WS-DGOD Gaussian FasterRCNN + EMA + PL 38.6 11.3 47.9 2.9 13.4 11.2 32.1 22.5
WS-DGOD Gaussian FasterRCNN + EMA + PL + Regul. 38.3 13.4 46.2 2.7 15.1 14.0 32.0 23.1
DGOD Gaussian FasterRCNN 38.9 7.6 46.7 1.8 9.8 11.3 32.1 21.2

10.5 Qualitative Results

Figure 11: Qualitative comparisons on dusk-rainy.
(a) Gaussian FasterRCNN trained on Single-DGOD setting (i.e., labeled data on daytime-sunny).
(b) Gaussian FasterRCNN + EMA + PL + Regul. trained on SS-DGOD setting (i.e., labeled data on daytime-sunny and unlabeled data on night-sunny and daytime-foggy).
Figure 12: Qualitative comparisons on night-rainy.
(a) Gaussian FasterRCNN trained on Single-DGOD setting (i.e., labeled data on daytime-sunny).
(b) Gaussian FasterRCNN + EMA + PL + Regul. trained on SS-DGOD setting (i.e., labeled data on daytime-sunny and unlabeled data on night-sunny and daytime-foggy).

Figs. 11 and 12 show the qualitative comparisons on dusk-rainy and night-rainy, respectively. As on the artistic style image dataset, the baseline model produced false negatives, which were reduced by EMA, PL, and the regularization.

11 Training Details

On the artistic style image dataset, the detectors were trained for 10,000 and 20,000 iterations in the pretraining and the student-teacher learning of SS-DGOD (or WS-DGOD), respectively. During the training, we saved the models and evaluated their performance on the validation set every 2,000 iterations, and the best model was used for the evaluation. The whole training took about one day. For fair comparisons, the compared models on Single-DGOD and DGOD were trained for 30,000 iterations, and the best models among those validated every 2,000 iterations were used for evaluation.

On the car-mounted camera dataset, we followed the same procedure for training, validation, and evaluation, but the numbers of iterations for the pretraining and the student-teacher learning were set to 20,000 and 40,000, respectively, and the validation was conducted every 4,000 iterations. The whole training took about two days. For fair comparisons, the compared models on Single-DGOD and DGOD were trained for 60,000 iterations, and the best models among those validated every 4,000 iterations were used for evaluation.
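The two schedules and the checkpoint selection rule can be summarized as follows. This is a hedged sketch rather than the actual training script: the `Schedule` class, `select_best`, and the example validation scores are hypothetical, and the sketch assumes that validation runs over the whole training (pretraining plus student-teacher learning), which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    pretrain_iters: int          # supervised pretraining on the labeled domain
    student_teacher_iters: int   # Mean Teacher phase (SS-DGOD or WS-DGOD)
    val_every: int               # validation / checkpoint interval

    @property
    def total_iters(self) -> int:
        return self.pretrain_iters + self.student_teacher_iters

    def checkpoints(self) -> list[int]:
        """Iterations at which a model is saved and validated (assumed)."""
        return list(range(self.val_every, self.total_iters + 1, self.val_every))


ARTISTIC = Schedule(10_000, 20_000, 2_000)
CAR_MOUNTED = Schedule(20_000, 40_000, 4_000)


def select_best(val_scores: dict[int, float]) -> int:
    """Return the iteration whose checkpoint has the highest validation score."""
    return max(val_scores, key=val_scores.get)


# Made-up validation scores, only to show the selection rule.
example_scores = {2_000: 0.30, 4_000: 0.50, 6_000: 0.40}
print(ARTISTIC.total_iters, CAR_MOUNTED.total_iters)
print(select_best(example_scores))
```

The totals (30,000 and 60,000 iterations) equal the budgets given to the Single-DGOD and DGOD baselines, which is what makes the comparison fair in terms of training length.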

References

  • [1] Araslanov, N., Roth, S.: Self-supervised augmentation consistency for adapting semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15384–15394 (2021)
  • [2] Caldarola, D., Caputo, B., Ciccone, M.: Improving generalization in federated learning by seeking flat minima. In: Proceedings of the European Conference on Computer Vision. pp. 654–672 (2022)
  • [3] Cha, J., Chun, S., Lee, K., Cho, H.C., Park, S., Lee, Y., Park, S.: Swad: Domain generalization by seeking flat minima. Advances in Neural Information Processing Systems 34, 22405–22418 (2021)
  • [4] Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., Zecchina, R.: Entropy-sgd: Biasing gradient descent into wide valleys. In: Proceedings of the International Conference on Learning Representations (2017)
  • [5] Chen, M., Chen, W., Yang, S., Song, J., Wang, X., Zhang, L., Yan, Y., Qi, D., Zhuang, Y., Xie, D., Pu, S.: Learning domain adaptive object detection with probabilistic teacher. In: Proceedings of the International Conference on Machine Learning. vol. 162, pp. 3040–3055 (2022)
  • [6] Deng, J., Li, W., Chen, Y., Duan, L.: Unbiased mean teacher for cross-domain object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4091–4101 (2021)
  • [7] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International journal of computer vision 88, 303–338 (2010)
  • [8] Fan, Q., Segu, M., Tai, Y.W., Yu, F., Tang, C.K., Schiele, B., Dai, D.: Towards robust object detection invariant to real-world domain shifts. In: Proceedings of the International Conference on Learning Representations (2023)
  • [9] Foret, P., Kleiner, A., Mobahi, H., Neyshabur, B.: Sharpness-aware minimization for efficiently improving generalization. In: Proceedings of the International Conference on Learning Representations (2021)
  • [10] Gulrajani, I., Lopez-Paz, D.: In search of lost domain generalization. In: Proceedings of the International Conference on Learning Representations (2021)
  • [11] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [12] Hoyer, L., Dai, D., Van Gool, L.: Daformer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9924–9935 (2022)
  • [13] Inoue, N., Furuta, R., Yamasaki, T., Aizawa, K.: Cross-domain weakly-supervised object detection through progressive domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5001–5009 (2018)
  • [14] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., Wilson, A.G.: Averaging weights leads to wider optima and better generalization. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence. pp. 876–885 (2018)
  • [15] Kaddour, J., Liu, L., Silva, R., Kusner, M.: When do flat minima optimizers work? (2022)
  • [16] Lee, Y., Willette, J.R., Kim, J., Lee, J., Hwang, S.J.: Exploring the role of mean teachers in self-supervised masked auto-encoders. In: Proceedings of the International Conference on Learning Representations (2023)
  • [17] Li, Y.J., Dai, X., Ma, C.Y., Liu, Y.C., Chen, K., Wu, B., He, Z., Kitani, K., Vajda, P.: Cross-domain adaptive teacher for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7581–7590 (2022)
  • [18] Lin, C., Yuan, Z., Zhao, S., Sun, P., Wang, C., Cai, J.: Domain-invariant disentangled network for generalizable object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8771–8780 (2021)
  • [19] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proceedings of the European Conference on Computer Vision. pp. 740–755 (2014)
  • [20] Liu, H., Song, P., Ding, R.: Towards domain generalization in underwater object detection. In: 2020 IEEE International Conference on Image Processing (ICIP). pp. 1971–1975 (2020)
  • [21] Malakouti, S., Kovashka, A.: Semi-supervised domain generalization for object detection via language-guided feature alignment. In: Proceedings of the British Machine Vision Conference (2023)
  • [22] Mi, P., Lin, J., Zhou, Y., Shen, Y., Luo, G., Sun, X., Cao, L., Fu, R., Xu, Q., Ji, R.: Active teacher for semi-supervised object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14482–14491 (2022)
  • [23] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015)
  • [24] Saito, K., Ushiku, Y., Harada, T., Saenko, K.: Strong-weak distribution alignment for adaptive object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6956–6965 (2019)
  • [25] Shen, Z., Maheshwari, H., Yao, W., Savvides, M.: Scl: Towards accurate domain adaptive object detection via gradient detach based stacked complementary losses. arXiv preprint arXiv:1911.02559 (2019)
  • [26] Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C.A., Cubuk, E.D., Kurakin, A., Li, C.L.: Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems 33, 596–608 (2020)
  • [27] Stephan, M., Hoffman, M.D., Blei, D.M., et al.: Stochastic gradient descent as approximate bayesian inference. Journal of Machine Learning Research 18(134), 1–35 (2017)
  • [28] Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems 30 (2017)
  • [29] Vidit, V., Engilberge, M., Salzmann, M.: Clip the gap: A single domain generalization approach for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
  • [30] Wang, J., Lan, C., Liu, C., Ouyang, Y., Qin, T., Lu, W., Chen, Y., Zeng, W., Yu, P.: Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering (2022)
  • [31] Wang, K., Yang, C., Betke, M.: Consistency regularization with high-dimensional non-adversarial source-guided perturbation for unsupervised domain adaptation in segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 10138–10146 (2021)
  • [32] Wang, K., Fu, X., Huang, Y., Cao, C., Shi, G., Zha, Z.J.: Generalized uav object detection via frequency domain disentanglement. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1064–1073 (2023)
  • [33] Wang, P., Cai, Z., Yang, H., Swaminathan, G., Vasconcelos, N., Schiele, B., Soatto, S.: Omni-detr: Omni-supervised object detection with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9367–9376 (2022)
  • [34] Wang, P., Zhang, Z., Lei, Z., Zhang, L.: Sharpness-aware gradient matching for domain generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3769–3778 (2023)
  • [35] Wang, X., Huang, T.E., Liu, B., Yu, F., Wang, X., Gonzalez, J.E., Darrell, T.: Robust object detection via instance-level temporal cycle confusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9143–9152 (2021)
  • [36] Wu, A., Deng, C.: Single-domain generalized object detection in urban scene via cyclic-disentangled self-distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 847–856 (2022)
  • [37] Yang, F.E., Cheng, Y.C., Shiau, Z.Y., Wang, Y.C.F.: Adversarial teacher-student representation learning for domain generalization. Advances in Neural Information Processing Systems 34, 19448–19460 (2021)
  • [38] Zhang, P., Zhang, B., Zhang, T., Chen, D., Wang, Y., Wen, F.: Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12414–12424 (2021)
  • [39] Zhang, X., Xu, R., Yu, H., Dong, Y., Tian, P., Cui, P.: Flatness-aware minimization for domain generalization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5189–5202 (2023)
  • [40] Zhang, X., Xu, Z., Xu, R., Liu, J., Cui, P., Wan, W., Sun, C., Li, C.: Towards domain generalization in object detection. arXiv preprint arXiv:2203.14387 (2022)
  • [41] Zhang, X., Zhou, L., Xu, R., Cui, P., Shen, Z., Liu, H.: Towards unsupervised domain generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4910–4920 (2022)
  • [42] Zhou, K., Liu, Z., Qiao, Y., Xiang, T., Loy, C.C.: Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)
  • [43] Zhou, K., Loy, C.C., Liu, Z.: Semi-supervised domain generalization with stochastic stylematch. International Journal of Computer Vision pp. 1–11 (2023)