
The University of Tokyo, Tokyo, Japan
{furuta,ysato}@iis.u-tokyo.ac.jp

Seeking Flat Minima with Mean Teacher on Semi- and Weakly-Supervised Domain Generalization for Object Detection

Ryosuke Furuta (ORCID 0000-0003-1441-889X) and Yoichi Sato (ORCID 0000-0003-0097-4537)

Abstract

Object detectors do not work well when domains largely differ between training and testing data. To overcome this domain gap in object detection without requiring expensive annotations, we consider two problem settings: semi-supervised domain generalizable object detection (SS-DGOD) and weakly-supervised DGOD (WS-DGOD). In contrast to the conventional domain generalization for object detection that requires labeled data from multiple domains, SS-DGOD and WS-DGOD require labeled data only from one domain and unlabeled or weakly-labeled data from multiple domains for training. In this paper, we show that object detectors can be effectively trained on the two settings with the same Mean Teacher learning framework, where a student network is trained with pseudo-labels output from a teacher on the unlabeled or weakly-labeled data. We provide novel interpretations of why the Mean Teacher learning framework works well on the two settings in terms of the relationships between the generalization gap and flat minima in parameter space. On the basis of the interpretations, we also propose incorporating a simple regularization method into the Mean Teacher learning framework to find flatter minima. The experimental results demonstrate that the regularization leads to flatter minima and boosts the performance of the detectors trained with the Mean Teacher learning framework on the two settings. They also indicate that those detectors significantly outperform the state-of-the-art methods.

Keywords: Object detection · Domain generalization · Semi-supervised learning · Weakly-supervised learning

1 Introduction

Object detection has been attracting much attention because it has practically useful applications such as in autonomous driving. Object detectors have performed tremendously well on commonly used benchmark datasets for object detection, such as MSCOCO [19] and PASCAL VOC [7]. However, such performance significantly drops when they are deployed on unseen domains, i.e., when the training and testing domains are different. For example, Inoue et al. [13] reported a performance drop caused by the difference in image styles, and Li et al. [17] showed one caused by the weather and time difference in the images captured with car-mounted cameras.

To solve this problem, many researchers have been exploring unsupervised domain adaptive object detection (UDA-OD) [6, 17, 5]. On UDA-OD, we train object detectors using source domain data with ground-truth labels (bounding boxes and class labels) and unlabeled target domain data to adapt the detectors to the target domain. However, in the real world, target domain data cannot always be accessed in the training phase.

Domain generalizable object detection (DGOD) is another common problem setting for addressing the performance drop caused by domain gaps [18, 40]. On DGOD, we train object detectors using labeled data from multiple domains so that the detectors work well on unseen domains. However, it is labor-intensive to collect these data for object detection because both bounding boxes and class labels for all objects in the images must be annotated. Although single-DGOD [36, 8, 29, 35, 32], on which we train object detectors to generalize to unseen domains using labeled data from one single domain, has been investigated, the performance gain is still limited.

In this paper, we tackle two tasks as more realistic settings: i) semi-supervised DGOD (SS-DGOD) [21] and ii) weakly-supervised DGOD (WS-DGOD). The goal of SS-DGOD is to generalize object detectors to unseen domains using labeled data only from one single domain and unlabeled data from multiple domains. Note that the target domain data are not included in the training data. On WS-DGOD, we use weakly labeled data from multiple domains instead of the unlabeled data in SS-DGOD. “Weakly labeled” means that we have only image-level labels that show the existence of each class in each training image and do not have bounding box annotations. To the best of our knowledge, this is the first attempt to tackle WS-DGOD. We show that object detectors can be effectively trained on the two settings with the same Mean Teacher learning framework, where a student network is trained with pseudo-labels output from a teacher on the unlabeled or weakly labeled data, and the teacher network is updated as the exponential moving average (EMA) of the student.

We not only experimentally demonstrate the good performance of the Mean Teacher learning framework but also provide novel interpretations of why it works well on these two settings in terms of the relationship between generalization ability and flat minima in parameter space. In the research area of domain generalization, it has been shown both theoretically and empirically that neural networks with flatter minima in parameter space have better generalization ability to unseen domains [9, 4, 3, 14, 2, 34, 15, 39]. We show that the two key components of the Mean Teacher learning framework, i) EMA update and ii) learning from pseudo-labels, lead to flat minima during the training.

On the basis of the interpretations, we also propose incorporating a regularization method into the Mean Teacher learning framework to find flatter minima. Specifically, because the teacher and the student have similar loss values around the flat minima, we introduce an additional loss term so that the output from the student network becomes similar to that from the teacher network.

The experimental results demonstrate that the detectors trained with the Mean Teacher learning framework perform well for unseen test domains on the two settings. We show that the simple yet effective regularization leads to flatter minima and boosts the performance of those detectors. We also confirm these detectors significantly outperform the state-of-the-art methods on the two settings.

Our contributions are summarized as follows:

• We show that object detectors can be effectively trained on the SS-DGOD and WS-DGOD settings with the same Mean Teacher learning framework.

• We provide interpretations of why the detectors trained with the Mean Teacher learning framework achieve robustness to unseen test domains in terms of the flatness of minima in parameter space.

• On the basis of the interpretations, we propose incorporating a simple regularization method into the Mean Teacher learning framework to achieve flatter minima.

• We are the first to tackle the WS-DGOD setting.

2 Problem Settings

We formally describe the two problem settings of SS-DGOD and WS-DGOD. Their goal is to obtain object detectors that perform well on unseen target domain data $\mathcal{D}_{t}=\{X_{t}\}$ , where $X_{t}$ is a set of images from the target domain.

On SS-DGOD, we have labeled data from a source domain $\mathcal{D}_{s_{1}}=\{(X_{s_{1}},B_{s_{1}},C_{s_{1}})\}$ and unlabeled data from multiple source domains $\mathcal{D}_{s_{i}}=\{X_{s_{i}}\}_{i=2}^{N_{D}}$ in the training phase. Here, $X_{s_{1}}=\{x^{j}_{s_{1}}\}_{j=1}^{N_{s_{1}}}$ is a set of $N_{s_{1}}$ images from domain $s_{1}$ . $B_{s_{1}}=\{b_{s_{1}}^{j}\}_{j=1}^{N_{s_{1}}}$ and $C_{s_{1}}=\{c_{s_{1}}^{j}\}_{j=1}^{N_{s_{1}}}$ are the corresponding bounding boxes and object-class labels, respectively. ${s_{i}}$ is the $i$-th source domain, and $N_{D}$ is the number of source domains. We assume that the data distributions differ between the domains, i.e., $P(X_{s_{1}})\neq P(X_{s_{2}})\neq\cdots\neq P(X_{s_{N_{D}}})\neq P(X_{t})$ .

On WS-DGOD, we use labeled data from a source domain $\mathcal{D}_{s_{1}}=\{(X_{s_{1}},B_{s_{1}},C_{s_{1}})\}$ and weakly labeled data from multiple domains $\mathcal{D}_{s_{i}}=\{(X_{s_{i}},C_{s_{i}})\}_{i=2}^{N_{D}}$ for training.

Table 1 compares SS-DGOD and WS-DGOD with related problem settings (Single-DGOD, DGOD, and UDA-OD). As discussed in Sec. 1, DGOD requires labeled data from multiple domains $\mathcal{D}_{s_{i}}=\{(X_{s_{i}},B_{s_{i}},C_{s_{i}})\}_{i=1}^{N_{D}}$ , but those data are sometimes hard to prepare due to the high annotation cost. In contrast, SS-DGOD (or WS-DGOD) requires labeled data from one domain $\mathcal{D}_{s_{1}}=\{(X_{s_{1}},B_{s_{1}},C_{s_{1}})\}$ and unlabeled data $\mathcal{D}_{s_{i}}=\{X_{s_{i}}\}_{i=2}^{N_{D}}$ (or weakly labeled data $\mathcal{D}_{s_{i}}=\{(X_{s_{i}},C_{s_{i}})\}_{i=2}^{N_{D}}$ ), which are easier to obtain. Therefore, SS-DGOD and WS-DGOD are more practical settings than DGOD. By using those data, we aim to better generalize object detectors to the unseen target domain data $\mathcal{D}_{t}=\{X_{t}\}$ than on Single-DGOD. Unlike on UDA-OD, we can train the detectors even when the target domain data $\mathcal{D}_{t}=\{X_{t}\}$ are not accessible.

Table 1: Formal comparisons of SS-DGOD, WS-DGOD, and related problem settings. DGOD stands for domain generalizable object detection, and SS-DGOD and WS-DGOD are semi-supervised DGOD and weakly-supervised DGOD, respectively. UDA-OD is unsupervised domain adaptive object detection.

task	train data	test data
Single-DGOD	$\mathcal{D}_{s_{1}}=\{(X_{s_{1}},B_{s_{1}},C_{s_{1}})\}$	$\mathcal{D}_{t}=\{X_{t}\}$
SS-DGOD	$\mathcal{D}_{s_{1}}=\{(X_{s_{1}},B_{s_{1}},C_{s_{1}})\}$ , $\mathcal{D}_{s_{i}}=\{X_{s_{i}}\}_{i=2}^{N_{D}}$	$\mathcal{D}_{t}=\{X_{t}\}$
WS-DGOD	$\mathcal{D}_{s_{1}}=\{(X_{s_{1}},B_{s_{1}},C_{s_{1}})\}$ , $\mathcal{D}_{s_{i}}=\{(X_{s_{i}},C_{s_{i}})\}_{i=2}^{N_{D}}$	$\mathcal{D}_{t}=\{X_{t}\}$
DGOD	$\mathcal{D}_{s_{i}}=\{(X_{s_{i}},B_{s_{i}},C_{s_{i}})\}_{i=1}^{N_{D}}$	$\mathcal{D}_{t}=\{X_{t}\}$
UDA-OD	$\mathcal{D}_{s_{i}}=\{(X_{s_{i}},B_{s_{i}},C_{s_{i}})\}_{i=1}^{N_{D}}$ , $\mathcal{D}_{t}=\{X_{t}\}$	$\mathcal{D}_{t}=\{X_{t}\}$
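
As a reading aid for Table 1, the settings can be encoded as plain data. The sketch below is only an illustration: the class name `Setting`, its field names, and the annotation-level strings are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setting:
    """Annotations available at training time in one problem setting (hypothetical encoding)."""
    first_source: str         # annotation level of s_1
    other_sources: str        # annotation level of s_2..s_{N_D}
    target_in_training: bool  # whether target images X_t are seen during training


# Levels: "full" = boxes + class labels, "weak" = image-level class labels,
# "unlabeled" = images only, "none" = the domains are not used at all.
SETTINGS = {
    "Single-DGOD": Setting("full", "none", False),
    "SS-DGOD": Setting("full", "unlabeled", False),
    "WS-DGOD": Setting("full", "weak", False),
    "DGOD": Setting("full", "full", False),
    "UDA-OD": Setting("full", "full", True),
}


def box_annotated_domains(name: str, n_domains: int) -> int:
    """Number of training domains that need bounding-box annotation."""
    setting = SETTINGS[name]
    return 1 + (n_domains - 1 if setting.other_sources == "full" else 0)
```

With three source domains, `box_annotated_domains("SS-DGOD", 3)` and `box_annotated_domains("WS-DGOD", 3)` both count a single box-annotated domain, whereas `box_annotated_domains("DGOD", 3)` counts all three.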

3 Related Work

3.1 Domain Generalization for Image Classification

Many methods have been proposed for domain generalization on image classification tasks, as summarized in recent survey papers [42, 30]. Among a variety of domain generalization methods, finding flat minima is one of the most common approaches [9, 4, 3, 14, 2, 34, 15, 39]. Those studies empirically and theoretically showed that finding flat minima in parameter space results in a better generalization ability. Izmailov et al. [14] and Cha et al. [3] demonstrated that empirical risk minimization (ERM) with stochastic gradient descent (SGD) converges to the vicinity of a flat minimum, and averaging the parameter weights over a certain number of training steps/epochs results in reaching the flat minimum. Cha et al. [3] also derived the theoretical relationship between flat minima and the generalization gap. Inspired by these findings, we reveal that the Mean Teacher learning framework leads to flat minima, and thus can obtain good generalization ability.

3.2 Domain Generalization for Object Detection

Domain generalization for object detection has not been explored as widely as that for image classification. Lin et al. [18] proposed a method for disentangling domain-specific and domain-invariant features by adversarial learning on both image-level and instance-level features for DGOD. Liu et al. [20] investigated DGOD in underwater object detection and proposed DG-YOLO. For Single-DGOD, Wang et al. [35] proposed a self-training method that uses the temporal consistency of objects in videos. Wu et al. [36] proposed a method for disentangling domain-invariant features by contrastive learning and self-distillation. Fan et al. [8] proposed perturbing the channel statistics of feature maps, which can be interpreted as data augmentation of image styles to a variety of domains. Wang et al. [32] proposed a disentanglement method in frequency space for object detection from unmanned aerial vehicles. Vidit et al. [29] proposed an augmentation method using a pre-trained vision-language model (CLIP) with textual prompts.

Unlike the above methods, as discussed in Sects. 1 and 2, we tackle SS-DGOD and a new problem setting called WS-DGOD. SS-DGOD and WS-DGOD are more practical than DGOD because labeled data are required only from one domain. The work most closely related to ours is that of Malakouti and Kovashka [21]. They tackled SS-DGOD and proposed a language-guided alignment method. However, a limitation of their method is that it requires a backbone network pre-trained on vision-and-language tasks. Our experiments show that the object detectors trained with the Mean Teacher learning framework and our regularization outperform their method.

3.3 Semi-supervised Domain Generalization

There are a few methods that use both labeled and unlabeled data for domain generalization (SSDG) on image classification [41, 43]. Zhang et al. [41] proposed an unsupervised pre-training method called DARLING, which performs contrastive learning on unlabeled images to obtain domain-irrelevant feature representation. Zhou et al. [43] extended a semi-supervised learning method called FixMatch [26] to SSDG.

In contrast to those studies, we tackle SSDG for object detection. We also tackle the “weakly-labeled” setting (i.e., WS-DGOD), which has not been explored even for image classification.

3.4 Mean Teacher Learning Framework

The Mean Teacher learning framework was originally proposed for semi-supervised image classification [28]. Several studies have investigated the use of the Mean Teacher learning framework for a variety of tasks such as domain generalization on image classification [37], (in-domain) weakly-supervised object detection [33], (in-domain) semi-supervised object detection [22], UDA-OD [6, 17], and UDA for semantic segmentation [1, 31, 12, 38]. Lee et al. [16] provided a theoretical analysis of the Mean Teacher learning framework on masked image modeling pretext tasks for semi-supervised image classification. We show that the Mean Teacher learning framework also works well on different settings (SS-DGOD and WS-DGOD), provide interpretations of why it does, and propose incorporating a regularization method.

Figure 1: Training framework.

4 Training Method

4.1 Overview and Key Idea

On both SS-DGOD and WS-DGOD, our goal is to obtain object detectors that work well on the unseen target domain data $\mathcal{D}_{t}=\{X_{t}\}$ . Gulrajani and Lopez-Paz [10] reported that, if carefully implemented, empirical risk minimization (i.e., an image classifier simply trained with supervised learning on multiple domains) outperformed state-of-the-art domain generalization methods on several benchmark datasets for image classification. Following this important finding, we expect similar behavior on object detection and aim to train an object detector on multiple domains $\mathcal{D}_{s_{i}}\ (i=1,\cdots,N_{D})$ . However, we have no ground-truth labels (or have only weak labels) for $\mathcal{D}_{s_{i}}\ (i=2,\cdots,N_{D})$ although ground-truth labels are available for $\mathcal{D}_{s_{1}}$ . Therefore, the question is how to train a detector on those domains. Our solution is to use the Mean Teacher learning framework for object detection [17, 5] shown in Fig. 1, where we have two networks (teacher and student) with the same structure and train the student network using the pseudo-labels generated by the teacher network. Note that this Mean Teacher learning framework can be applied to any object detector, but we hereafter describe the loss functions of Faster R-CNN [23] as an example for ease of explanation.

4.2 Pre-training

If we start the Mean Teacher learning from randomly initialized parameters, the teacher network cannot output reliable pseudo labels. Therefore, we first perform supervised learning with the labeled data of one source domain $\mathcal{D}_{s_{1}}=\{(X_{s_{1}},B_{s_{1}},C_{s_{1}})\}$ :

	$\displaystyle\theta^{*}=\operatorname*{arg\,min}_{\theta}\mathcal{L}^{\mathrm{sup}}_{s_{1}}(\theta)$		(1)
	$\displaystyle\begin{aligned}\mathcal{L}^{\mathrm{sup}}_{s_{1}}(\theta)={}&\mathcal{L}^{\mathrm{cls}}_{\mathrm{RPN}}(\theta,X_{s_{1}},B_{s_{1}},C_{s_{1}})+\mathcal{L}^{\mathrm{reg}}_{\mathrm{RPN}}(\theta,X_{s_{1}},B_{s_{1}},C_{s_{1}})\\&+\mathcal{L}^{\mathrm{cls}}_{\mathrm{RoI}}(\theta,X_{s_{1}},B_{s_{1}},C_{s_{1}})+\mathcal{L}^{\mathrm{reg}}_{\mathrm{RoI}}(\theta,X_{s_{1}},B_{s_{1}},C_{s_{1}}),\end{aligned}$		(4)

where $\mathcal{L}^{\mathrm{cls}}_{\mathrm{RPN}}$ and $\mathcal{L}^{\mathrm{reg}}_{\mathrm{RPN}}$ are the classification and regression losses for the region proposal network (RPN), respectively. $\mathcal{L}^{\mathrm{cls}}_{\mathrm{RoI}}$ and $\mathcal{L}^{\mathrm{reg}}_{\mathrm{RoI}}$ are those for the RoI head. We initialize both the teacher and student networks with the parameters $\theta^{*}$ obtained from this pre-training.

4.3 Mean Teacher Learning

4.3.1 Generate Pseudo-labels

Because we have no ground-truth labels (or have only weak labels) for the other source domains $\mathcal{D}_{s_{i}}\ (i=2,\cdots,N_{D})$ , we generate pseudo labels using the teacher network. Specifically, we apply weak data augmentation to the unlabeled (or weakly-labeled) image $x_{s_{i}}^{j}$ and input it into the teacher network. We denote the output from the teacher as $\{(\hat{b}_{s_{i}}^{jr},\hat{p}_{s_{i}}^{jr})\}_{r=1}^{N_{R}}$ , where $\hat{b}_{s_{i}}^{jr}$ and $\hat{p}_{s_{i}}^{jr}$ are the predicted bounding box and class probabilities for the $r$-th region of interest (RoI) in the $j$-th image, respectively, and $N_{R}$ is the number of output RoIs.

In the case of SS-DGOD, we simply apply post-processing $f_{\mathrm{post}}$ to $(\hat{b}_{s_{i}}^{jr},\hat{p}_{s_{i}}^{jr})$ and obtain the pseudo label $(\bar{b}_{s_{i}}^{jr},\bar{c}_{s_{i}}^{jr})$ :

	$\displaystyle(\bar{b}_{s_{i}}^{jr},\bar{c}_{s_{i}}^{jr})=f_{\mathrm{post}}(\hat{b}_{s_{i}}^{jr},\hat{p}_{s_{i}}^{jr}).$		(5)

Post-processing $f_{\mathrm{post}}$ is a simple thresholding function if we use “hard” pseudo labels as in [17] and a sharpening function if we use “soft” pseudo labels as in [5].
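
The two variants of post-processing can be sketched as follows. This is a minimal illustration, not the exact procedure of [17] or [5]: the function names, the threshold of 0.8, and the temperature of 0.5 are assumptions made for the example, and temperature sharpening is only one common way to sharpen a distribution.

```python
def hard_post(boxes, probs, threshold=0.8):
    """Hard pseudo labels: keep an RoI only if its top class probability
    reaches the threshold, and label it with that class index.
    The threshold value is a hypothetical choice."""
    pseudo = []
    for box, p in zip(boxes, probs):
        k = max(range(len(p)), key=p.__getitem__)  # index of the most likely class
        if p[k] >= threshold:
            pseudo.append((box, k))
    return pseudo


def sharpen(p, temperature=0.5):
    """Soft pseudo labels: raise each probability to the power 1/T and
    renormalize, which pushes mass toward the most likely class when T < 1.
    The temperature is a hypothetical choice."""
    powered = [q ** (1.0 / temperature) for q in p]
    total = sum(powered)
    return [q / total for q in powered]
```

For example, with class probabilities `[0.9, 0.1]` and `[0.6, 0.4]` for two RoIs, `hard_post` keeps only the first RoI, while `sharpen([0.6, 0.4])` keeps both classes but increases the weight of the first one.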

In the case of WS-DGOD, we perform the refinement process of applying the weak labels to the predicted class probabilities $\hat{p}_{s_{i}}^{jr}$ immediately before post-processing $f_{\mathrm{post}}$ to obtain more accurate pseudo labels as follows:

	$\displaystyle\hat{p}_{s_{i}}^{jr}(k)=\begin{cases}\hat{p}_{s_{i}}^{jr}(k)&\text{if}\ k\in c_{s_{i}}^{j}\\ 0&\text{otherwise}\end{cases}$		(6)
	$\displaystyle(\bar{b}_{s_{i}}^{jr},\bar{c}_{s_{i}}^{jr})=f_{\mathrm{post}}(\hat{b}_{s_{i}}^{jr},\hat{p}_{s_{i}}^{jr}),$		(7)

where $\hat{p}_{s_{i}}^{jr}(k)$ is the predicted class probability for the $k$ -th class. Using the weak label $c_{s_{i}}^{j}$ , Eq. (6) makes the predicted probability zero if the $k$ -th class does not exist in the $j$ -th image.
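
Eq. (6) amounts to a per-class mask. The following minimal sketch assumes that the weak label of an image is given as a set of class indices; the function name is hypothetical.

```python
def refine_with_weak_labels(probs, present_classes):
    """Eq. (6): keep the predicted probability of class k if the image-level
    label says class k is present, and set it to zero otherwise.
    No renormalization is applied, matching Eq. (6) as written."""
    return [p if k in present_classes else 0.0 for k, p in enumerate(probs)]
```

For instance, if the teacher predicts `[0.5, 0.3, 0.2]` for an RoI but the weak label says that only classes 1 and 2 appear in the image, the refined probabilities are `[0.0, 0.3, 0.2]`, so the subsequent post-processing can no longer assign class 0 to that RoI.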

4.3.2 Update Student

Now we have the pseudo labels $\bar{B}_{s_{i}}=\{\bar{b}_{s_{i}}^{j}\}_{j=1}^{N_{s_{i}}}$ and $\bar{C}_{s_{i}}=\{\bar{c}_{s_{i}}^{j}\}_{j=1}^{N_{s_{i}}}$ and train the student network with them.

We perform strong data augmentations to the image $x_{s_{i}}^{j}$ and input it into the student network. In domain $s_{1}$ , because the ground-truth labels are available, we update the student by backpropagating loss $\mathcal{L}^{\mathrm{sup}}_{s_{1}}$ in Eq. (4). In the other domains $s_{i}\ (i=2,\cdots,N_{D})$ , we calculate loss $\mathcal{L}^{\mathrm{unsup}}_{s_{i}}$ using the pseudo labels and backpropagate it to update the student. In summary, we update the parameters of student $\theta^{\mathrm{student}}$ with loss $\mathcal{L}^{\mathrm{student}}$ as follows:

	$\displaystyle\theta^{\mathrm{student}}\leftarrow\theta^{\mathrm{student}}-\nabla_{\theta}\mathcal{L}^{\mathrm{student}}(\theta)$		(8)
	$\displaystyle\mathcal{L}^{\mathrm{student}}(\theta)=\mathcal{L}^{\mathrm{sup}}_{s_{1}}(\theta)+\sum_{i=2}^{N_{D}}\mathcal{L}^{\mathrm{unsup}}_{s_{i}}(\theta)$		(9)
	$\displaystyle\begin{aligned}\mathcal{L}^{\mathrm{unsup}}_{s_{i}}(\theta)={}&\mathcal{L}^{\mathrm{cls}}_{\mathrm{RPN}}(\theta,X_{s_{i}},\bar{B}_{s_{i}},\bar{C}_{s_{i}})+\mathcal{L}^{\mathrm{reg}}_{\mathrm{RPN}}(\theta,X_{s_{i}},\bar{B}_{s_{i}},\bar{C}_{s_{i}})\\&+\mathcal{L}^{\mathrm{cls}}_{\mathrm{RoI}}(\theta,X_{s_{i}},\bar{B}_{s_{i}},\bar{C}_{s_{i}})+\mathcal{L}^{\mathrm{reg}}_{\mathrm{RoI}}(\theta,X_{s_{i}},\bar{B}_{s_{i}},\bar{C}_{s_{i}}).\end{aligned}$		(12)

4.3.3 Update Teacher

Similar to previous studies [5, 17], we do not update the parameters of the teacher $\theta^{\mathrm{teacher}}$ by backpropagation to obtain stable pseudo labels. Instead, we update them by the exponential moving average (EMA) of the parameters of the student network:

	$\displaystyle\theta^{\mathrm{teacher}}\leftarrow\alpha\theta^{\mathrm{teacher}}+(1-\alpha)\theta^{\mathrm{student}},$		(13)

where $\alpha$ is a hyperparameter to control the update speed.
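
The EMA update of Eq. (13) can be written in a few lines. The sketch below treats the parameters as flat lists of floats for simplicity; the function name and the default value of `alpha` are assumptions made for the example.

```python
def ema_update(teacher, student, alpha=0.99):
    """Eq. (13): teacher <- alpha * teacher + (1 - alpha) * student,
    applied independently to every parameter. A larger alpha makes the
    teacher follow the student more slowly."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]
```

If the student stayed fixed, the distance between the teacher and the student would shrink by a factor of $\alpha$ per update, so after $n$ updates only a fraction $\alpha^{n}$ of the initial difference would remain.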

5 Why Does Mean Teacher Become Robust to Unseen Domains?

We provide novel interpretations of why the Mean Teacher learning framework works well on SS-DGOD and WS-DGOD settings in terms of the relationship between generalization ability and flat minima in parameter space. We show that the two key components of the Mean Teacher learning framework, i) EMA update and ii) learning from pseudo labels, lead to flat minima during the training.

5.1 Definition

We define an empirical risk as $\mathcal{E}_{\mathrm{ER}}(\theta)\vcentcolon=\sum_{i=1}^{N_{D}}\mathcal{L}^{% \mathrm{sup}}_{s_{i}}(\theta)$ when we assume that ground-truth labels are available on all the training domains. A risk at the target domain is defined as $\mathcal{E}_{t}(\theta)\vcentcolon=\mathcal{L}^{sup}_{t}(\theta)$ . The goal is to minimize the test risk $\mathcal{E}_{t}(\theta)$ by only solving the empirical risk minimization (ERM), i.e., $\min_{\theta}\mathcal{E}_{\mathrm{ER}}(\theta)$ . Hereafter, we use the terms risk and loss interchangeably.

5.2 Preliminary Knowledge

Previous studies for domain generalization demonstrated both theoretically and empirically that neural networks with flatter minima in parameter space exhibit superior generalization ability to unseen domains [9, 4, 3, 14, 2, 34, 15, 39]. Fig. 3 shows its intuitive interpretation. We can see that when there is a domain shift between training and testing, the flat minimum of the training loss results in a lower test loss than the sharp minimum.

Cha et al. [3] theoretically revealed the relationship between flat minima and the generalization gap (i.e., the performance drop caused by domain shift). We briefly describe the theorem for the subsequent explanation. We consider the worst-case loss within a neighboring region in parameter space, which is defined as a robust risk $\mathcal{E}_{RR}^{\gamma}(\theta)\vcentcolon=\max_{\|\Delta\|\leq\gamma}\mathcal{E}_{ER}(\theta+\Delta)$ . Here, $\gamma$ is the radius of the neighboring region. As shown in Fig. 3, when $\gamma$ is sufficiently large, sharp minima of the empirical risk are not minima of the robust risk. In contrast, the minima of the robust risk (i.e., $\operatorname*{arg\,min}_{\theta}\mathcal{E}_{RR}^{\gamma}(\theta)$ ) are also minima in the flat regions of the empirical risk. The following theorem relates the optimal solution of robust risk minimization (RRM) to the test loss on the target domain:

Theorem (from [3]).

Consider a set of $N$ covers $\{\Theta_{k}\}_{k=1}^{N}$ such that the parameter space $\Theta\subset\cup_{k=1}^{N}\Theta_{k}$ , where $\mathrm{diam}(\Theta)\vcentcolon=\mathrm{sup}_{\theta,\theta^{\prime}\in\Theta}\|\theta-\theta^{\prime}\|_{2}$ , $N\vcentcolon=\lceil(\mathrm{diam}(\Theta)/\gamma)^{d}\rceil$ , and $d$ is the dimension of $\Theta$ . Let $\theta^{\gamma}$ denote the optimal solution of the RRM, i.e., $\theta^{\gamma}\vcentcolon=\operatorname*{arg\,min}_{\theta}\mathcal{E}_{RR}^{\gamma}(\theta)$ , and let $v_{k}$ and $v$ be the VC dimensions of each $\Theta_{k}$ and $\Theta$ , respectively. Then, the gap between the optimal test loss, $\min_{\theta^{\prime}}\mathcal{E}_{t}(\theta^{\prime})$ , and the test loss of $\theta^{\gamma}$ , $\mathcal{E}_{t}(\theta^{\gamma})$ , has the following bound with probability of at least $1-\delta$ :

	$\displaystyle\begin{aligned}\mathcal{E}_{t}(\theta^{\gamma})-\min_{\theta^{\prime}}\mathcal{E}_{t}(\theta^{\prime})\leq{}&\mathcal{E}_{RR}^{\gamma}(\theta^{\gamma})-\min_{\theta^{\prime\prime}}\mathcal{E}_{ER}(\theta^{\prime\prime})+\frac{1}{N_{D}}\sum_{i=1}^{N_{D}}\mathrm{Div}(s_{i},t)\\&+\max_{k\in[1,N]}\sqrt{\frac{v_{k}\ln(m/v_{k})+\ln(2N/\delta)}{m}}+\sqrt{\frac{v\ln(m/v)+\ln(2/\delta)}{m}},\end{aligned}$		(14)

where $m$ is the number of training samples and $\mathrm{Div}(s_{i},t)\vcentcolon=2\mathrm{sup}_{A}|\mathbb{P}_{s_{i}}(A)-\mathbb{P}_{t}(A)|$ is a divergence between two distributions.

For its proof, see [3]. From the theorem, we can interpret that the gap between the RRM and ERM (i.e., $\mathcal{E}_{RR}^{\gamma}(\theta^{\gamma})-\min_{\theta^{\prime\prime}}\mathcal{E}_{ER}(\theta^{\prime\prime})$ ), together with the divergence and confidence terms in Eq. (14), upper bounds the generalization gap in the test domain (i.e., $\mathcal{E}_{t}(\theta^{\gamma})-\min_{\theta^{\prime}}\mathcal{E}_{t}(\theta^{\prime})$ ). Intuitively, as shown in Fig. 3, the gap between the RRM and ERM narrows at flat regions of the empirical risk. Therefore, we can interpret that lowering the gap leads to flat minima of the empirical risk and results in better generalization performance on the target domain.
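
The difference between the empirical risk and the robust risk can be made concrete with a one-dimensional toy example. The loss landscape below (a sharp well at $\theta=-2$ and a flat well at $\theta=2$, both with minimum value zero) and the grid-based maximization are assumptions made purely for illustration; they are not taken from [3].

```python
def empirical_risk(theta):
    """Toy 1-D landscape: a sharp well centered at -2 and a flat well
    centered at 2, both reaching zero at their centers."""
    sharp = 50.0 * (theta + 2.0) ** 2
    flat = 0.5 * (theta - 2.0) ** 2
    return min(sharp, flat)


def robust_risk(theta, gamma, steps=200):
    """Worst-case empirical risk within distance gamma of theta,
    approximated by evaluating an evenly spaced grid of perturbations."""
    deltas = [-gamma + 2.0 * gamma * i / steps for i in range(steps + 1)]
    return max(empirical_risk(theta + d) for d in deltas)
```

Both wells have the same empirical risk at their centers, but with `gamma = 0.5` the robust risk at the flat center stays close to zero while the robust risk at the sharp center is far larger. In this toy landscape, the gap between the robust and empirical risks is therefore small only at the flat minimum.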

5.3 EMA Update

We explain why the EMA update in the Mean Teacher learning framework leads to flat minima. Mandt et al. [27] showed that optimizing with constant SGD (i.e., SGD with a fixed learning rate) converges to a Gaussian distribution centered on the optimum. On the basis of this finding, Izmailov et al. [14] and Cha et al. [3] showed that the ERM with SGD converges to the marginal of a flat minimum, and averaging the weights of the parameters over some training steps/epochs leads to the flat minima. Although these studies [14, 3] proposed sophisticated algorithms for averaging the weights to avoid overfitting, we found that a simple EMA leads to flat minima. The experiments presented in Sec. 7 show that the teacher network with only the EMA update of the student (i.e., without pseudo labeling), as shown in Eqs. (15) and (16), can reach flatter minima and perform better than the student.

	$\displaystyle\theta^{\mathrm{student}}\leftarrow\theta^{\mathrm{student}}-\nabla_{\theta}\mathcal{L}^{\mathrm{student}}(\theta),\quad\mathcal{L}^{\mathrm{student}}(\theta)=\mathcal{L}^{\mathrm{sup}}_{s_{1}}(\theta)$		(15)
	$\displaystyle\theta^{\mathrm{teacher}}\leftarrow\alpha\theta^{\mathrm{teacher}}+(1-\alpha)\theta^{\mathrm{student}},$		(16)
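
The averaging effect behind Eqs. (15) and (16) can be illustrated with a small simulation. The sketch below is a toy under strong assumptions that are not part of the paper: a one-dimensional quadratic loss $0.5\,\theta^{2}$, Gaussian gradient noise, an explicit learning rate `lr`, and hypothetical default values. It only shows that the EMA of noisy constant-learning-rate SGD iterates sits closer to the optimum than the iterates themselves; it does not measure flatness.

```python
import random


def simulate_sgd_with_ema(steps=6000, burn_in=1000, lr=0.1, alpha=0.99,
                          noise_std=1.0, seed=0):
    """Run noisy SGD on the toy loss 0.5 * theta**2 (student) together with
    an EMA of the iterates (teacher). Returns the average loss of the
    student and of the teacher after the burn-in period."""
    rng = random.Random(seed)
    student = teacher = 1.0
    student_sum = teacher_sum = 0.0
    for step in range(steps):
        grad = student + rng.gauss(0.0, noise_std)  # exact gradient plus noise
        student -= lr * grad                         # student step, cf. Eq. (15)
        teacher = alpha * teacher + (1.0 - alpha) * student  # Eq. (16)
        if step >= burn_in:
            student_sum += 0.5 * student ** 2
            teacher_sum += 0.5 * teacher ** 2
    n = steps - burn_in
    return student_sum / n, teacher_sum / n
```

Calling `simulate_sgd_with_ema()` returns the pair of average losses; in this toy setting the teacher's average loss is expected to be clearly lower than the student's, because the EMA averages out the noise of the individual iterates.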

5.4 Learning from Pseudo Labels

We explain why learning from pseudo-labels in the Mean Teacher learning framework leads to flat minima. Assuming that the pseudo-labels from the teacher are accurate enough (i.e., similar enough to ground truth), $\mathcal{L}_{s_{i}}^{\mathrm{unsup}}$ in Eq. (9) can be approximated by $\mathcal{L}_{s_{i}}^{\mathrm{sup}}$ , and we can regard the student network as the ERM in Sec. 5.1. On the other hand, as explained in Sec. 5.3 and shown in the experiments, because the teacher network updated with EMA has a better ability to reach flat minima than the student, we can regard the teacher network as the RRM. Therefore, from Eq. (14), the smaller the difference between the losses of the teacher and student, the smaller the generalization gap in the target domain. Fig. 5 shows an intuitive interpretation. In a flat region, the trajectory of the student over the training steps and its mean (the teacher) have similar loss values. In contrast, in a sharp valley, there is a large difference between the loss values of the student trajectory and those of its mean.
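
This intuition can be checked numerically in a toy setting. The sketch below assumes a one-dimensional quadratic loss whose curvature controls the sharpness and a hand-picked set of student iterates scattered symmetrically around the minimum; both are illustrative assumptions, and the plain mean of the iterates stands in for the teacher.

```python
def loss(theta, curvature, center=0.0):
    """Quadratic toy loss; a larger curvature means a sharper valley."""
    return curvature * (theta - center) ** 2


def student_teacher_gap(curvature, iterates):
    """Average loss of the student iterates minus the loss at their mean
    (the mean plays the role of the teacher in this toy example)."""
    mean_theta = sum(iterates) / len(iterates)
    mean_student_loss = sum(loss(t, curvature) for t in iterates) / len(iterates)
    return mean_student_loss - loss(mean_theta, curvature)
```

With the same iterates, e.g. `[-0.3, 0.3, -0.1, 0.1]`, the gap returned for a flat valley (`curvature=0.5`) is one hundred times smaller than for a sharp valley (`curvature=50.0`), because for a quadratic loss the gap equals the curvature times the variance of the iterates.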

Next, we show that learning from pseudo-labels in the Mean Teacher learning framework makes the losses of the student and teacher similar. Because the student is trained with the output from the teacher as pseudo-ground truth, the training encourages the outputs of the student to be similar to those of the teacher. When we use loss functions $\mathcal{E}$ that are monotonically increasing/decreasing with respect to the outputs (e.g., cross-entropy loss $\mathcal{E}(p)=-p_{gt}\log(p)$ ), the more similar the outputs are, the more similar the loss values are, as shown below:

Proposition.

Assume $p_{1}<p_{2}<p_{3}\in\mathbb{R}$ , and $\mathcal{E}(p):\mathbb{R}\rightarrow\mathbb{R}$ is a strictly monotonically increasing/decreasing function of $p$ . Then, $|\mathcal{E}(p_{3})-\mathcal{E}(p_{2})|<|\mathcal{E}(p_{3})-\mathcal{E}(p_{1})|$ holds.

Let us consider $p_{3}$ as the teacher’s output, and $p_{2}$ and $p_{1}$ as the outputs of the student. Since $p_{2}$ is closer to $p_{3}$ than $p_{1}$ is, the loss of $p_{2}$ is more similar to the loss of $p_{3}$ than that of $p_{1}$ is. Therefore, we can interpret that learning from pseudo-labels aligns the outputs of the student with those of the teacher, thereby aligning the loss values and consequently leading to flat minima.
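
The proposition follows directly from monotonicity: for a strictly increasing $\mathcal{E}$ we have $\mathcal{E}(p_{1})<\mathcal{E}(p_{2})<\mathcal{E}(p_{3})$ , so $\mathcal{E}(p_{3})-\mathcal{E}(p_{2})<\mathcal{E}(p_{3})-\mathcal{E}(p_{1})$ , and the strictly decreasing case is symmetric. A quick numerical check is sketched below, assuming a cross-entropy loss with $p_{gt}=1$ and arbitrarily chosen example probabilities; the function names are hypothetical.

```python
import math


def cross_entropy(p):
    """Cross-entropy for a ground-truth probability of 1; strictly
    decreasing in the predicted probability p, for 0 < p <= 1."""
    return -math.log(p)


def loss_gap(p_teacher, p_student):
    """Absolute difference between the teacher's and the student's loss."""
    return abs(cross_entropy(p_teacher) - cross_entropy(p_student))
```

With a teacher output of 0.9, a student output of 0.6 gives a smaller loss gap than a student output of 0.2, in line with the proposition.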

6 Regularization for Flatter Minima

6.1 Method

As discussed in Sec. 5, when the outputs from the student and teacher are similar, the networks tend to reach flat minima. We therefore propose incorporating a simple regularization method that makes the two networks’ outputs more similar by training the student using raw outputs from the teacher.

Fig. 5 shows an overview of the method. The concept is to apply regularization so that the outputs from the two networks are similar for the same input image. Specifically, we perform weak data augmentations to the unlabeled (or weakly labeled) image $x_{s_{i}}^{j}$ and input the image into the teacher network. We then use the output from the teacher $\{(\hat{b}_{s_{i}}^{jr},\hat{p}_{s_{i}}^{jr})\}_{r=1}^{N_{R}}$ directly as pseudo-ground truth without post-processing in Eq. (5). To update the student, we input the same weakly augmented image $x_{s_{i}}^{j}$ into the student and calculate the loss as follows:

	$\displaystyle\theta^{\mathrm{student}}\leftarrow\theta^{\mathrm{student}}-\nabla_{\theta}\mathcal{L}^{\mathrm{student}}(\theta)$		(17)
	$\displaystyle\mathcal{L}^{\mathrm{student}}(\theta)=\mathcal{L}^{\mathrm{sup}}_{s_{1}}(\theta)+\sum_{i=2}^{N_{D}}[\mathcal{L}^{\mathrm{unsup}}_{s_{i}}(\theta)+\beta\mathcal{L}^{\mathrm{regul.}}_{s_{i}}(\theta)]$		(18)
	$\displaystyle\begin{aligned}\mathcal{L}^{\mathrm{regul.}}_{s_{i}}(\theta)={}&\mathcal{L}^{\mathrm{cls}}_{\mathrm{RPN}}(\theta,X_{s_{i}},\hat{B}_{s_{i}},\hat{C}_{s_{i}})+\mathcal{L}^{\mathrm{reg}}_{\mathrm{RPN}}(\theta,X_{s_{i}},\hat{B}_{s_{i}},\hat{C}_{s_{i}})\\&+\mathcal{L}^{\mathrm{cls}}_{\mathrm{RoI}}(\theta,X_{s_{i}},\hat{B}_{s_{i}},\hat{C}_{s_{i}})+\mathcal{L}^{\mathrm{reg}}_{\mathrm{RoI}}(\theta,X_{s_{i}},\hat{B}_{s_{i}},\hat{C}_{s_{i}}),\end{aligned}$		(21)

where $\hat{B}_{s_{i}}=\{\hat{b}_{s_{i}}^{j}\}_{j=1}^{N_{s_{i}}}$ and $\hat{C}_{s_{i}}=\{\hat{c}_{s_{i}}^{j}\}_{j=1}^{N_{s_{i}}}$ are the raw pseudo-labels from the teacher, and $\beta$ is a hyperparameter to tune the strength of the regularization.
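The way the terms of Eqs. (18) and (21) combine can be sketched as follows. This is a minimal illustration that operates on already-computed scalar loss values; the function names and all numeric values are hypothetical, and the actual losses are computed by the detector on image batches.

```python
def regul_loss(rpn_cls, rpn_reg, roi_cls, roi_reg):
    """Eq. (21): sum of the four detection losses computed against the raw teacher outputs."""
    return rpn_cls + rpn_reg + roi_cls + roi_reg


def student_loss(sup_loss_s1, unsup_losses, regul_losses, beta=0.5):
    """Eq. (18): supervised loss on s_1 plus, for each other domain s_i (i >= 2),
    the pseudo-label loss and the beta-weighted regularization loss."""
    if len(unsup_losses) != len(regul_losses):
        raise ValueError("one unsupervised and one regularization loss per domain is expected")
    total = sup_loss_s1
    for l_unsup, l_regul in zip(unsup_losses, regul_losses):
        total += l_unsup + beta * l_regul
    return total


# Hypothetical loss values for one labeled domain and two unlabeled domains.
regul_s2 = regul_loss(0.1, 0.1, 0.2, 0.2)
regul_s3 = regul_loss(0.2, 0.1, 0.1, 0.2)
total_loss = student_loss(1.0, [0.4, 0.6], [regul_s2, regul_s3], beta=0.5)
print(round(total_loss, 6))  # prints 2.6
```

Setting `beta=0` recovers the Mean Teacher objective without the regularization term.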

6.2 Connection to Prior Arts

We can regard the regularization method as a type of knowledge distillation because the student is trained to mimic the raw output from the teacher. Although the technical details are different, it has been empirically shown that knowledge distillation methods are effective on related tasks such as Single-DGOD [36] and domain adaptive semantic segmentation [38]. We believe that our interpretation reveals one of the reasons why knowledge-distillation methods lead to better generalization ability.

7 Experiments

7.1 Dataset Details

We used the artistic style image dataset [13], which has four domains: natural image, clipart, comic, and watercolor. The natural image domain has 16,551 images from PASCAL VOC07&12, and the other domains have 1,000, 2,000, and 2,000 images, respectively. There are six object classes (bike, bird, car, cat, dog, and person), and we removed the images that do not contain these classes.

We conducted the experiments on two patterns of domains. In the first pattern, we set the natural image domain as the labeled domain $s_{1}$ and set clipart and comic as the unlabeled domains $s_{2},s_{3}$ . We set watercolor as the target domain $t$ . Concretely, we used the trainval set of PASCAL VOC 2007&2012, the train set of clipart, and the train set of comic for training. We then used the test sets of clipart and comic for validation. For evaluation (testing), we used the test set of watercolor.

In the second pattern, we set $(s_{1},s_{2},s_{3},t)=(\mathrm{natural,watercolor,comic,clipart})$ . The results on another dataset are shown in the supplementary material.

7.2 Implementation Details

We used soft pseudo labeling proposed in [5] for the Mean Teacher learning. We used Gaussian FasterRCNN [5] as the object detector, in which the regression output is modified to use the soft labels. We used ResNet101 [11] as the backbone. We applied the same hyperparameters as in a previous study [5] except for the number of iterations. All training (including baseline models) was done with four A100 GPUs. The parameters of the backbone network were initialized with the ResNet101 pre-trained on ImageNet. The hyperparameter $\beta$ in Eq. (18) was set to 0.5 throughout the experiments. During the inference (testing) phase, we used the teacher network. Other details are given in the supplementary material.

7.3 Baseline Methods

As the baseline, we trained the detector Gaussian FasterRCNN in the Single-DGOD setting (i.e., supervised learning on $s_{1}$ in Eqs. (1-4)). To show the effectiveness of the EMA update, we trained Gaussian FasterRCNN + EMA with Eqs. (15-16). Gaussian FasterRCNN + EMA + PL is a detector trained with the Mean Teacher learning framework in Eqs. (8-13). Gaussian FasterRCNN + EMA + PL + Regul. is a detector trained with the Mean Teacher learning framework and the regularization in Eqs. (17-21).
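For readers unfamiliar with the EMA update, the following minimal sketch shows the standard Mean Teacher form [28], in which the teacher parameters track an exponential moving average of the student parameters. The function name `ema_update`, the flat list representation of the parameters, and the decay value `alpha=0.99` are assumptions made for illustration and are not taken from Eqs. (15-16).

```python
def ema_update(teacher_params, student_params, alpha=0.99):
    """Move each teacher parameter toward the corresponding student parameter.

    alpha is the hypothetical decay rate: values close to 1 make the teacher change slowly.
    """
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher_params, student_params)]


# Hypothetical parameters: the teacher drifts slowly toward the student.
teacher = [1.0, -2.0]
student = [0.0, 0.0]
for _ in range(3):
    teacher = ema_update(teacher, student, alpha=0.9)
# After three steps each teacher parameter has been scaled by 0.9 ** 3.
```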

To confirm the upper-bound performance, we also trained Gaussian FasterRCNN on DGOD and Oracle settings. On DGOD, the detector was trained with supervised learning using the ground-truth labels on the domains $s_{1},s_{2}$ , and $s_{3}$ . On Oracle, the detector was trained with supervised learning on $s_{1},s_{2},s_{3}$ , and the target domain $t$ .

Because there is only one existing method on SS-DGOD (i.e., CDDMSL [21]), we compared the above detectors with state-of-the-art methods on related task settings such as Single-DGOD and UDA-OD.

7.4 Comparisons with State-of-the-Art Methods

Table 2: Comparisons of mAP50 on the artistic style image dataset [13] when the target domain is watercolor. Values marked with * are taken from a previous study [17].

setting	method	backbone	mAP50 (watercolor)
Single-DGOD	CLIP-based augmentation [29]	Res101	46.6
Single-DGOD	Gaussian FasterRCNN	Res101	50.5
Single-DGOD	Gaussian FasterRCNN + EMA	Res101	55.5
SS-DGOD	CDDMSL [21]	Res50 (RegionCLIP)	46.1
SS-DGOD	CDDMSL [21]	Res101	41.3
SS-DGOD	Gaussian FasterRCNN + EMA + PL	Res101	56.6
SS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	Res101	58.2
WS-DGOD	Gaussian FasterRCNN + EMA + PL	Res101	59.7
WS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	Res101	62.9
DGOD	Gaussian FasterRCNN	Res101	62.6
Oracle	Gaussian FasterRCNN	Res101	62.2
UDA-OD	Gaussian FasterRCNN [5]	Res101	54.9
UDA-OD	SCL* [25]	Res101	55.2
UDA-OD	SWDA* [24]	Res101	53.3
UDA-OD	UMT* [6]	Res101	58.1
UDA-OD	AT* [17]	Res101	59.9

Table 3: Comparisons of mAP50 on the artistic style image dataset [13] when the target domain is clipart.

setting	method	backbone	mAP50 (clipart)
Single-DGOD	CLIP-based augmentation [29]	Res101	27.2
Single-DGOD	Gaussian FasterRCNN	Res101	34.5
Single-DGOD	Gaussian FasterRCNN + EMA	Res101	38.0
SS-DGOD	CDDMSL [21]	Res50 (RegionCLIP)	39.1
SS-DGOD	CDDMSL [21]	Res101	26.0
SS-DGOD	Gaussian FasterRCNN + EMA + PL	Res101	39.8
SS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	Res101	43.3
WS-DGOD	Gaussian FasterRCNN + EMA + PL	Res101	44.2
WS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	Res101	46.2
DGOD	Gaussian FasterRCNN	Res101	47.1
Oracle	Gaussian FasterRCNN	Res101	48.2
UDA-OD	Gaussian FasterRCNN [5]	Res101	43.4

Table 2 shows the results on the artistic style image dataset when $(s_{1},s_{2},s_{3},t)=$ (natural, clipart, comic, watercolor). We evaluated with the mean average precision at an IoU threshold of 0.5 (mAP50). EMA increased the mAP from 50.5 to 55.5, and this was further boosted to 56.6 with pseudo labeling (PL). We observed additional improvement to 58.2 with the regularization. The regularization improved the performance not only on SS-DGOD but also on WS-DGOD (from 59.7 to 62.9). Those results are comparable to those of the detectors trained on DGOD (62.6) and Oracle (62.2). The detectors trained on SS-DGOD and WS-DGOD also performed comparably to or better than those on UDA-OD, although we did not use the target domain data during the training.
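To make the IoU threshold behind mAP50 concrete, the following minimal sketch computes the intersection over union of two axis-aligned boxes and applies the 0.5 threshold to decide whether a detection matches a ground-truth box. The function names and the `(x1, y1, x2, y2)` box format are assumptions for illustration; whether an IoU of exactly 0.5 counts as a match depends on the evaluation toolkit, and `>=` is assumed here.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, pred_class, gt_box, gt_class, iou_threshold=0.5):
    """A detection matches a ground-truth box if the classes agree and the IoU reaches the threshold."""
    return pred_class == gt_class and box_iou(pred_box, gt_box) >= iou_threshold


# Hypothetical boxes: the first pair overlaps by one third, the second by one half.
iou_low = box_iou((0, 0, 2, 2), (1, 0, 3, 2))
iou_half = box_iou((0, 0, 2, 2), (0, 0, 2, 1))
```

The average precision of each class is then computed from the ranked list of matched and unmatched detections, and mAP50 is the mean over classes.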

For fair comparisons, we also trained CDDMSL with a Res101 backbone pre-trained on ImageNet. However, its performance degraded significantly (from 46.1 to 41.3), as also reported in a previous study [21], because it requires language-guided training, and initializing the model with RegionCLIP is crucial to achieve good performance.

Table 3 shows the results when $(s_{1},s_{2},s_{3},t)=$ (natural, watercolor, comic, clipart). Similarly to Table 2, performance improved with EMA, PL, and the regularization.

7.5 Analysis of Flatness

To evaluate the flatness of the detectors in parameter space, following previous studies [14] and [3], we computed the change in loss values when we perturb the parameters. Specifically, we sampled a random direction vector $d$ on a unit sphere, perturbed the parameters ($\theta^{\prime}=\theta+d\gamma$) with a radius $\gamma$, and computed the average change over ten samples, i.e., $\mathcal{F}^{\gamma}(\theta)=\mathbb{E}_{\theta^{\prime}}|\mathcal{E}(\theta^{\prime})-\mathcal{E}(\theta)|$. The smaller the change, the flatter the loss landscape around the parameters.
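The flatness measure $\mathcal{F}^{\gamma}(\theta)$ can be sketched on a toy problem as follows. This is a minimal illustration, not our evaluation code: the quadratic losses, the parameter vector, and all function names are hypothetical, and a real evaluation would perturb the detector parameters and recompute the detection loss on a dataset.

```python
import math
import random


def random_unit_direction(dim, rng):
    """Sample a direction uniformly on the unit sphere by normalizing a Gaussian vector."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def flatness(loss_fn, theta, gamma, num_samples=10, seed=0):
    """Average |E(theta + gamma * d) - E(theta)| over random unit directions d."""
    rng = random.Random(seed)
    base = loss_fn(theta)
    total = 0.0
    for _ in range(num_samples):
        d = random_unit_direction(len(theta), rng)
        perturbed = [t + gamma * di for t, di in zip(theta, d)]
        total += abs(loss_fn(perturbed) - base)
    return total / num_samples


def make_quadratic(curvature):
    """Toy loss E(theta) = curvature * ||theta||^2 with its minimum at the origin."""
    return lambda theta: curvature * sum(t * t for t in theta)


theta_min = [0.0, 0.0, 0.0]
sharp = flatness(make_quadratic(10.0), theta_min, gamma=0.5)  # high curvature
flat = flatness(make_quadratic(1.0), theta_min, gamma=0.5)    # low curvature
print(flat < sharp)  # prints True
```

At the minimum of the toy quadratic, the measure equals the curvature times $\gamma^{2}$, so the low-curvature loss gives the smaller value, matching the reading that a smaller change indicates a flatter minimum.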

Fig. 6 shows the $\mathcal{F}^{\gamma}(\theta)$ of the training loss $\mathcal{E}(\theta)=\sum_{i}\mathcal{L}_{s_{i}}^{sup}(\theta)$ and the test loss $\mathcal{E}(\theta)=\mathcal{L}_{t}^{sup}(\theta)$ . The training domains were $(s_{1},s_{2},s_{3})$ =(natural, watercolor, comic), and the test domain was clipart. We can see that EMA, PL, and the regularization lowered the changes in the losses on both the training domains and test domain. In other words, each contributed to falling into flatter minima.

8 Conclusions

We tackled two problem settings called semi-supervised domain generalizable object detection (SS-DGOD) and weakly-supervised DGOD (WS-DGOD) to train object detectors that can generalize to unseen domains. We showed that the object detectors can be effectively trained on the two settings with the same Mean Teacher learning framework. We also provided the interpretations of why the detectors trained with the Mean Teacher framework become robust to the unseen domains in terms of the flatness in the parameter space. On the basis of the interpretations, we proposed incorporating a regularization method to lead to flatter minima, which makes the loss value of the student similar to that of the teacher. The experimental results showed that the detectors trained with the Mean Teacher learning framework and the regularization performed significantly better than the state-of-the-art methods.

In future work, we are planning to investigate the robustness of the Mean Teacher framework against unseen domains on different settings such as UDA, or different tasks such as semantic segmentation.

Supplementary Material

9 More Analysis

9.1 How Sensitive to Hyperparameter $\beta$ ?

Table 4 shows the performance when the hyperparameter $\beta$ in Eq. (18) (i.e., the strength of the regularization) was varied from 0 to 1. Adding the regularization consistently improved the performance over the detector without regularization (i.e., $\beta=0$) for every tested value, with the best result at $\beta=0.5$.

Table 4: mAP50 with various $\beta$ on the artistic style image dataset [13].

setting	method	$\beta$	mAP50 (clipart)
SS-DGOD	Gaussian FasterRCNN + EMA + PL	0.0	39.8
SS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	0.25	40.7
SS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	0.5	43.3
SS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	0.75	42.1
SS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	1.0	42.5

9.2 Comparison of Regularization with and without Post-processing

In the regularization described in Sec. 6.1, we use the raw outputs from the teacher without post-processing to train the student so that the outputs from the two networks are similar. To validate this design choice, we compare the performance with and without post-processing (i.e., sharpening function [5]) in the regularization in Eq. (21). Table 5 shows that the mAP50 drops from 43.3 to 39.4 when we perform the post-processing. We conclude that using raw outputs is important to obtain better performance.

Table 5: mAP50 with and without post-processing on the artistic style image dataset [13].

setting	method	post-processing	mAP50 (clipart)
SS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.		43.3
SS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	✓	39.4

9.3 Class-wise Average Precision

Tables 6 and 7 show the average precision (AP50) for each class when the target domain is watercolor and clipart, respectively. We can see that the regularization improved the performance on many classes.

Table 6: Comparisons of AP50 at each class on watercolor of the artistic style image dataset [13]. Values marked with * are taken from [17].

setting	method	bicycle	bird	cat	car	dog	person	mAP
Single-DGOD	CLIP-based augmentation [29]	74.8	37.3	36.8	40.7	29.2	59.9	46.4
Single-DGOD	Gaussian FasterRCNN	90.4	47.9	30.3	46.7	28.7	59.2	50.5
Single-DGOD	Gaussian FasterRCNN + EMA	86.2	54.3	35.3	53.5	34.5	69.0	55.5
SS-DGOD	CDDMSL [21] (RegionCLIP)	66.3	50.6	34.5	49.2	20.1	56.0	46.1
SS-DGOD	CDDMSL [21] (Res101)	75.5	36.1	23.9	40.7	19.7	52.0	41.3
SS-DGOD	Gaussian FasterRCNN + EMA + PL	87.4	54.6	40.0	51.9	32.4	73.1	56.6
SS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	87.2	52.3	44.7	53.2	36.8	75.3	58.2
WS-DGOD	Gaussian FasterRCNN + EMA + PL	90.3	55.8	49.3	49.9	37.5	75.4	59.7
WS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	95.8	59.9	51.5	53.3	40.2	76.7	62.9
DGOD	Gaussian FasterRCNN	84.8	57.8	51.0	50.8	51.8	79.3	62.6
Oracle	Gaussian FasterRCNN	90.9	59.9	44.2	53.1	46.7	78.3	62.2
UDA-OD	Gaussian FasterRCNN + EMA + PL [5]	77.7	46.5	40.4	50.1	39.7	75.0	54.9
UDA-OD	SCL* [25]	82.2	55.1	51.8	39.6	38.4	64.0	55.2
UDA-OD	SWDA* [24]	82.3	55.9	46.5	32.7	35.5	66.7	53.3
UDA-OD	UMT* [6]	88.2	55.3	51.7	39.8	43.6	69.9	58.1
UDA-OD	AT* [17]	93.6	56.1	58.9	37.3	39.6	73.8	59.9

Table 7: Comparisons of AP50 at each class on clipart of the artistic style image dataset [13].

setting	method	bicycle	bird	cat	car	dog	person	mAP
Single-DGOD	CLIP-based augmentation [29]	36.5	22.5	20.1	25.0	8.8	50.4	27.2
Single-DGOD	Gaussian FasterRCNN	69.5	25.1	5.7	39.4	17.3	49.9	34.5
Single-DGOD	Gaussian FasterRCNN + EMA	87.6	29.3	5.5	30.1	18.3	57.2	38.0
SS-DGOD	CDDMSL [21] (RegionCLIP)	51.0	33.3	26.5	45.2	14.6	63.8	39.1
SS-DGOD	CDDMSL [21] (Res101)	41.6	19.2	5.5	26.7	12.3	50.9	26.0
SS-DGOD	Gaussian FasterRCNN + EMA + PL	75.8	31.2	9.4	33.1	20.4	69.1	39.8
SS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	79.3	32.5	11.6	40.9	26.3	69.0	43.3
WS-DGOD	Gaussian FasterRCNN + EMA + PL	80.3	33.3	11.1	44.5	23.2	72.6	44.2
WS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	84.8	33.2	23.8	43.0	22.1	70.1	46.2
DGOD	Gaussian FasterRCNN	76.0	34.8	18.8	38.3	36.9	77.6	47.1
Oracle	Gaussian FasterRCNN	70.4	38.8	26.1	52.9	27.5	73.4	48.2
UDA-OD	Gaussian FasterRCNN + EMA + PL [5]	79.9	33.5	6.5	53.1	23.7	65.2	43.6

9.4 Qualitative Results

Figs. 7 and 8 show the qualitative comparison on watercolor and clipart, respectively. We observe that the false negatives of the baseline model were drastically reduced.

10 Results on Car-mounted Camera Dataset [36]

10.1 Dataset Details

In this dataset, the domains are defined as different times and weather: daytime-sunny, night-sunny, daytime-foggy, dusk-rainy, and night-rainy. The number of images for each domain is 27,708, 18,310, 2,642, 3,501, and 2,494, respectively. We used daytime-sunny as the labeled domain $s_{1}$ and used night-sunny and daytime-foggy as the unlabeled (or weakly-labeled) domains $s_{2},s_{3}$ . We used each of the remaining domains (dusk-rainy and night-rainy) as the target domain. Because the train/val/test split is not publicly available for daytime-sunny, dusk-rainy, and night-rainy, we used all images of daytime-sunny, the trainval set of night-sunny, and the trainval set of daytime-foggy for training. We then used the test set of night-sunny and the test set of daytime-foggy for validation. We used all images of dusk-rainy and night-rainy for evaluation (testing). There are seven object classes: bus, bike, car, motor, person, rider, and truck.

10.2 Comparisons with State-of-the-Art Methods

Table 8: Comparisons of mAP50 on the car-mounted camera dataset [36]. Values marked with * and ** are taken from [36] and [29], respectively.

setting	method	backbone	mAP50 (dusk-rainy)	mAP50 (night-rainy)
Single-DGOD	FasterRCNN*	Res101	26.6	14.5
Single-DGOD	CDSD* [36]	Res101	28.2	16.6
Single-DGOD	CLIP-based augmentation**[29]	Res101	32.3	18.7
Single-DGOD	Gaussian FasterRCNN	Res101	25.3	13.3
Single-DGOD	Gaussian FasterRCNN + EMA	Res101	36.0	19.0
SS-DGOD	Gaussian FasterRCNN + EMA + PL	Res101	30.3	21.3
SS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	Res101	31.2	21.9
WS-DGOD	Gaussian FasterRCNN + EMA + PL	Res101	30.5	22.5
WS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	Res101	32.5	23.1
DGOD	Gaussian FasterRCNN	Res101	28.4	21.2

Table 8 shows the results on the car-mounted camera dataset. Each of EMA, PL, and the regularization improved the performance on both target domains, except that PL degraded the performance on dusk-rainy (from 36.0 with EMA alone to 30.3 on SS-DGOD and 30.5 on WS-DGOD). We will investigate the cause of this performance drop in our future work.

The mAP50 of the detector with the regularization is boosted to (32.5, 23.1) on WS-DGOD. This result exceeds (32.3, 18.7), which is the result of the state-of-the-art method on Single-DGOD [29]. Also, this result is better than those of the models trained with supervised learning on the three domains (DGOD).

10.3 Analysis of Flatness

Fig. 9 shows the average change of the training loss at each domain when perturbing the parameters ($\mathcal{F}^{\gamma}(\theta)=\mathbb{E}_{\theta^{\prime}}|\mathcal{E}(\theta^{\prime})-\mathcal{E}(\theta)|$ described in Sec. 7.5), and Fig. 10 shows those of the test loss. Each of EMA, PL, and the regularization lowered the changes in the losses at every domain when the radius is 125 or smaller, although EMA lowered the changes the most when the radius is extremely large (>125). In other words, each contributed to falling into flatter minima with a sufficiently large radius.

10.4 Class-wise Average Precision

Tables 9 and 10 show the class-wise average precision on the dusk-rainy and night-rainy domains, respectively. We can see that each of EMA, PL, and the regularization contributes to improving the performance on many classes, except for the performance drop caused by PL on dusk-rainy.

Table 9: Comparisons of AP50 at each class on dusk-rainy of the car-mounted camera dataset [36]. Values marked with * and ** are taken from [36] and [29], respectively.

setting	method	bus	bike	car	motor	person	rider	truck	mAP
Single-DGOD	FasterRCNN*	36.8	15.8	50.1	12.8	18.9	12.4	39.5	26.6
Single-DGOD	CDSD* [36]	37.1	19.6	50.9	13.4	19.7	16.3	40.7	28.2
Single-DGOD	CLIP-based augmentation** [29]	37.8	22.8	60.7	16.8	26.8	18.7	42.4	32.3
Single-DGOD	Gaussian FasterRCNN	33.9	14.9	53.6	4.2	17.4	13.6	39.2	25.3
Single-DGOD	Gaussian FasterRCNN + EMA	46.3	24.9	65.9	11.9	29.1	23.7	50.0	36.0
SS-DGOD	Gaussian FasterRCNN + EMA + PL	40.0	17.3	61.0	8.0	23.6	17.1	45.1	30.3
SS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	40.8	20.1	61.8	7.8	23.6	18.3	46.2	31.2
WS-DGOD	Gaussian FasterRCNN + EMA + PL	39.0	19.4	60.4	9.4	23.8	17.3	44.0	30.5
WS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	41.7	22.3	62.1	11.2	25.3	18.9	45.9	32.5
DGOD	Gaussian FasterRCNN	36.2	18.2	61.3	7.3	18.4	15.9	41.9	28.4

Table 10: Comparisons of AP50 at each class on night-rainy of the car-mounted camera dataset [36]. Values marked with * and ** are taken from [36] and [29], respectively.

setting	method	bus	bike	car	motor	person	rider	truck	mAP
Single-DGOD	FasterRCNN*	22.6	11.5	27.7	0.4	10.0	10.5	19.0	14.5
Single-DGOD	CDSD* [36]	24.4	11.6	29.5	9.8	10.5	11.4	19.2	16.6
Single-DGOD	CLIP-based augmentation** [29]	28.6	12.1	36.1	9.2	12.3	9.6	22.9	18.7
Single-DGOD	Gaussian FasterRCNN	20.4	7.7	31.0	0.5	6.8	5.6	21.3	13.3
Single-DGOD	Gaussian FasterRCNN + EMA	33.9	11.1	38.5	0.8	10.5	8.8	29.2	19.0
SS-DGOD	Gaussian FasterRCNN + EMA + PL	35.7	9.8	46.7	1.4	12.6	10.8	32.0	21.3
SS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	37.0	10.3	46.3	2.8	12.9	12.0	31.8	21.9
WS-DGOD	Gaussian FasterRCNN + EMA + PL	38.6	11.3	47.9	2.9	13.4	11.2	32.1	22.5
WS-DGOD	Gaussian FasterRCNN + EMA + PL + Regul.	38.3	13.4	46.2	2.7	15.1	14.0	32.0	23.1
DGOD	Gaussian FasterRCNN	38.9	7.6	46.7	1.8	9.8	11.3	32.1	21.2

10.5 Qualitative Results

Figs. 11 and 12 show the qualitative comparison on dusk-rainy and night-rainy, respectively. As on the artistic style image dataset, the baseline model produced false negatives, which were reduced by EMA, PL, and the regularization.

11 Training Details

On the artistic style image dataset, the detectors were trained for 10,000 and 20,000 iterations in the pretraining and the student-teacher learning of SS-DGOD (or WS-DGOD), respectively. During the training, we saved the models and evaluated them on the validation data every 2,000 iterations, and the best model was used for the evaluation. The whole training took about one day. For fair comparisons, the compared models on Single-DGOD and DGOD were trained for 30,000 iterations, and the best models among the checkpoints validated every 2,000 iterations were used for evaluation.

On the car-mounted camera dataset, we performed the same procedure for training, validation, and evaluation, but the numbers of iterations for the pretraining and the student-teacher learning were set to 20,000 and 40,000, respectively, and the validation was conducted every 4,000 iterations. The whole training took about two days. For fair comparisons, the compared models on Single-DGOD and DGOD were trained for 60,000 iterations, and the best models among the checkpoints validated every 4,000 iterations were used for evaluation.
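The checkpoint selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the validation scores are made-up placeholders rather than measured values, and a single validation score per checkpoint is assumed, whereas how the scores of the two validation domains are combined is not specified here.

```python
def checkpoint_iterations(total_iterations, interval):
    """Iterations at which a model is saved and validated."""
    return list(range(interval, total_iterations + 1, interval))


def select_best_checkpoint(val_scores):
    """Return the iteration whose validation mAP50 is highest.

    val_scores maps an iteration number to a validation score.
    """
    return max(val_scores, key=val_scores.get)


# Student-teacher learning on the artistic style image dataset:
# 20,000 iterations, validated every 2,000 iterations.
iters = checkpoint_iterations(20000, 2000)

# Placeholder validation scores (hypothetical, for illustration only).
val_map50 = {it: 40.0 for it in iters}
val_map50[14000] = 45.0

best_iteration = select_best_checkpoint(val_map50)
print(len(iters), best_iteration)  # prints 10 14000
```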

References

[1] Araslanov, N., Roth, S.: Self-supervised augmentation consistency for adapting semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15384–15394 (2021)
[2] Caldarola, D., Caputo, B., Ciccone, M.: Improving generalization in federated learning by seeking flat minima. In: Proceedings of the European Conference on Computer Vision. pp. 654–672 (2022)
[3] Cha, J., Chun, S., Lee, K., Cho, H.C., Park, S., Lee, Y., Park, S.: Swad: Domain generalization by seeking flat minima. Advances in Neural Information Processing Systems 34, 22405–22418 (2021)
[4] Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., Zecchina, R.: Entropy-sgd: Biasing gradient descent into wide valleys. In: Proceedings of the International Conference on Learning Representations (2017)
[5] Chen, M., Chen, W., Yang, S., Song, J., Wang, X., Zhang, L., Yan, Y., Qi, D., Zhuang, Y., Xie, D., Pu, S.: Learning domain adaptive object detection with probabilistic teacher. In: Proceedings of the International Conference on Machine Learning. vol. 162, pp. 3040–3055 (2022)
[6] Deng, J., Li, W., Chen, Y., Duan, L.: Unbiased mean teacher for cross-domain object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4091–4101 (2021)
[7] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International journal of computer vision 88, 303–338 (2010)
[8] Fan, Q., Segu, M., Tai, Y.W., Yu, F., Tang, C.K., Schiele, B., Dai, D.: Towards robust object detection invariant to real-world domain shifts. In: Proceedings of the International Conference on Learning Representations (2023)
[9] Foret, P., Kleiner, A., Mobahi, H., Neyshabur, B.: Sharpness-aware minimization for efficiently improving generalization. In: Proceedings of the International Conference on Learning Representations (2021)
[10] Gulrajani, I., Lopez-Paz, D.: In search of lost domain generalization. In: Proceedings of the International Conference on Learning Representations (2021)
[11] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
[12] Hoyer, L., Dai, D., Van Gool, L.: Daformer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9924–9935 (2022)
[13] Inoue, N., Furuta, R., Yamasaki, T., Aizawa, K.: Cross-domain weakly-supervised object detection through progressive domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5001–5009 (2018)
[14] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., Wilson, A.G.: Averaging weights leads to wider optima and better generalization. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence. pp. 876–885 (2018)
[15] Kaddour, J., Liu, L., Silva, R., Kusner, M.: When do flat minima optimizers work? (2022)
[16] Lee, Y., Willette, J.R., Kim, J., Lee, J., Hwang, S.J.: Exploring the role of mean teachers in self-supervised masked auto-encoders. In: Proceedings of the International Conference on Learning Representations (2023)
[17] Li, Y.J., Dai, X., Ma, C.Y., Liu, Y.C., Chen, K., Wu, B., He, Z., Kitani, K., Vajda, P.: Cross-domain adaptive teacher for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7581–7590 (2022)
[18] Lin, C., Yuan, Z., Zhao, S., Sun, P., Wang, C., Cai, J.: Domain-invariant disentangled network for generalizable object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8771–8780 (2021)
[19] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proceedings of the European Conference on Computer Vision. pp. 740–755 (2014)
[20] Liu, H., Song, P., Ding, R.: Towards domain generalization in underwater object detection. In: 2020 IEEE International Conference on Image Processing (ICIP). pp. 1971–1975 (2020)
[21] Malakouti, S., Kovashka, A.: Semi-supervised domain generalization for object detection via language-guided feature alignment. In: Proceedings of the British Machine Vision Conference (2023)
[22] Mi, P., Lin, J., Zhou, Y., Shen, Y., Luo, G., Sun, X., Cao, L., Fu, R., Xu, Q., Ji, R.: Active teacher for semi-supervised object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14482–14491 (2022)
[23] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015)
[24] Saito, K., Ushiku, Y., Harada, T., Saenko, K.: Strong-weak distribution alignment for adaptive object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6956–6965 (2019)
[25] Shen, Z., Maheshwari, H., Yao, W., Savvides, M.: Scl: Towards accurate domain adaptive object detection via gradient detach based stacked complementary losses. arXiv preprint arXiv:1911.02559 (2019)
[26] Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C.A., Cubuk, E.D., Kurakin, A., Li, C.L.: Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems 33, 596–608 (2020)
[27] Mandt, S., Hoffman, M.D., Blei, D.M.: Stochastic gradient descent as approximate bayesian inference. Journal of Machine Learning Research 18(134), 1–35 (2017)
[28] Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems 30 (2017)
[29] Vidit, V., Engilberge, M., Salzmann, M.: Clip the gap: A single domain generalization approach for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
[30] Wang, J., Lan, C., Liu, C., Ouyang, Y., Qin, T., Lu, W., Chen, Y., Zeng, W., Yu, P.: Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering (2022)
[31] Wang, K., Yang, C., Betke, M.: Consistency regularization with high-dimensional non-adversarial source-guided perturbation for unsupervised domain adaptation in segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 10138–10146 (2021)
[32] Wang, K., Fu, X., Huang, Y., Cao, C., Shi, G., Zha, Z.J.: Generalized uav object detection via frequency domain disentanglement. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1064–1073 (2023)
[33] Wang, P., Cai, Z., Yang, H., Swaminathan, G., Vasconcelos, N., Schiele, B., Soatto, S.: Omni-detr: Omni-supervised object detection with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9367–9376 (2022)
[34] Wang, P., Zhang, Z., Lei, Z., Zhang, L.: Sharpness-aware gradient matching for domain generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3769–3778 (2023)
[35] Wang, X., Huang, T.E., Liu, B., Yu, F., Wang, X., Gonzalez, J.E., Darrell, T.: Robust object detection via instance-level temporal cycle confusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9143–9152 (2021)
[36] Wu, A., Deng, C.: Single-domain generalized object detection in urban scene via cyclic-disentangled self-distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 847–856 (2022)
[37] Yang, F.E., Cheng, Y.C., Shiau, Z.Y., Wang, Y.C.F.: Adversarial teacher-student representation learning for domain generalization. Advances in Neural Information Processing Systems 34, 19448–19460 (2021)
[38] Zhang, P., Zhang, B., Zhang, T., Chen, D., Wang, Y., Wen, F.: Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12414–12424 (2021)
[39] Zhang, X., Xu, R., Yu, H., Dong, Y., Tian, P., Cui, P.: Flatness-aware minimization for domain generalization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5189–5202 (2023)
[40] Zhang, X., Xu, Z., Xu, R., Liu, J., Cui, P., Wan, W., Sun, C., Li, C.: Towards domain generalization in object detection. arXiv preprint arXiv:2203.14387 (2022)
[41] Zhang, X., Zhou, L., Xu, R., Cui, P., Shen, Z., Liu, H.: Towards unsupervised domain generalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4910–4920 (2022)
[42] Zhou, K., Liu, Z., Qiao, Y., Xiang, T., Loy, C.C.: Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)
[43] Zhou, K., Loy, C.C., Liu, Z.: Semi-supervised domain generalization with stochastic stylematch. International Journal of Computer Vision pp. 1–11 (2023)
