Exploring Test-Time Adaptation for Object Detection in Continually Changing Environments

Shilei Cao (Sun Yat-sen University, Zhuhai, China), Yan Liu (Sun Yat-sen University, Zhuhai, China), Juepeng Zheng (Sun Yat-sen University, Zhuhai, China), Weijia Li (Sun Yat-sen University, Zhuhai, China), Runmin Dong (Tsinghua University, Beijing, China), and Haohuan Fu (Tsinghua University, Beijing, China)
Abstract.

For real-world applications, neural network models are commonly deployed in dynamic environments, where the distribution of the target domain undergoes temporal changes. Continual Test-Time Adaptation (CTTA) has recently emerged as a promising technique to gradually adapt a source-trained model to test data drawn from a continually changing target domain. Despite recent advancements in addressing CTTA, two critical issues remain: 1) The use of a fixed threshold for pseudo-labeling in existing methodologies leads to the generation of low-quality pseudo-labels, as model confidence varies across categories and domains; 2) While current solutions utilize stochastic parameter restoration to mitigate catastrophic forgetting, their capacity to preserve critical information is undermined by its intrinsic randomness. To tackle these challenges, we present CTAOD, aiming to enhance the performance of detection models in CTTA scenarios. Inspired by prior CTTA works for effective adaptation, CTAOD is founded on the mean-teacher framework, characterized by three core components. Firstly, the object-level contrastive learning module tailored for object detection extracts object-level features using the teacher’s region of interest features and optimizes them through contrastive learning. Secondly, the dynamic threshold strategy updates the category-specific threshold based on predicted confidence scores to improve the quality of pseudo-labels. Lastly, we design a data-driven stochastic restoration mechanism to selectively reset inactive parameters using the gradients as weights for a random mask matrix, thereby ensuring the retention of essential knowledge. We demonstrate the effectiveness of our approach on four CTTA tasks for object detection, where CTAOD outperforms existing methods, especially achieving a 3.0 mAP improvement on the Cityscapes-to-Cityscapes-C CTTA task. The code of this work will be released soon.

1. Introduction

Figure 1. Motivation of our proposed method. The source-trained model adapts to dynamic environments without source data. (1) DT: The bar chart at the top shows the mean model scores for different categories across four domains of the ACDC dataset (Sakaridis et al., 2021). While existing methods produce pseudo-labels via a fixed threshold, CTAOD utilizes DT to update the category-specific threshold based on the mean predicted scores. (2) DSR: The bottom panel depicts the neuron reset process, where neurons with deeper shades of purple indicate increased activity levels and blue neurons represent the corresponding neurons from the source model. While existing methods restore parameters randomly, CTAOD resets inactive parameters with higher probability.

Deep learning models have demonstrated immense potential across various data modalities such as image (He et al., 2016; Dosovitskiy et al., 2020; Ren et al., 2015), audio (Aytar et al., 2016; Amodei et al., 2016), language (Devlin et al., 2018; Vaswani et al., 2017), and video (Tran et al., 2015; Carreira and Zisserman, 2017). However, these models experience pronounced degradation in performance when confronted with training data (i.e., source domain) and testing data (i.e., target domain) originating from disparate distributions. This phenomenon, commonly referred to as distribution shifts, poses a significant challenge (Geirhos et al., 2018; Hendrycks and Dietterich, 2019; Luo et al., 2020; Deng et al., 2014). In such a scenario, unsupervised domain adaptation (UDA) becomes crucial, which typically involves aligning the source and target data distributions, thereby mitigating the impact of distribution shifts (Ganin et al., 2016; Tzeng et al., 2017; Saito et al., 2018). Still, UDA falls short by necessitating access to source data, which is often inaccessible due to privacy constraints or data transmission barriers (VS et al., 2023b; Huang et al., 2021).

This limitation catalyzes the exploration of Test-Time Adaptation (TTA) where the source-trained model directly adapts toward unlabeled test samples encountered during evaluation in an online manner, without the reliance on the source data (Mummadi et al., 2021; Sinha et al., 2023; You et al., 2021; Wang et al., 2020). Nonetheless, the aforementioned methods, which assume a static target domain, face a more challenging and realistic problem, as real-world machine learning systems work in non-stationary and continually evolving environments (Wang et al., 2022; Döbler et al., 2023). Existing TTA methods are vulnerable to catastrophic forgetting of previously learned source knowledge and error accumulation when adaptation faces more than one distribution shift (Wang et al., 2022; Döbler et al., 2023; Niloy et al., 2024). For example, a vehicle may encounter various continuous environmental changes such as fog, night, rain, and snow during its journey.

Recently, Wang et al. (2022) introduced CoTTA, which applies stochastic parameter restoration to mitigate catastrophic forgetting in Continual Test-Time Adaptation (CTTA) scenarios (see Table 1), where the model is continually adapted to sequences of target domains. Although the strategy of randomly selecting parameters for restoration partially helps alleviate the forgetting of source knowledge (Wang et al., 2022; Zhu et al., 2023), its randomness may also contribute to the loss of crucial knowledge specific to the current domain (refer to Figure 1). Additionally, existing CTTA methods normally rely on self-training, using pseudo-labels selected through a fixed threshold to supervise the training of the model (Wang et al., 2022; Döbler et al., 2023; Sójka et al., 2023). Given that model confidence could vary across categories and domains (refer to Figure 1), employing a uniformly fixed threshold could result in excluding high-quality pseudo-labels while incorporating incorrect ones. Erroneous or missing pseudo-labels lead to error accumulation as they provide negative feedback to the model.

Therefore, despite recent efforts to handle CTTA, two significant challenges persist: Current self-training-based methods suffer from noisy and low-quality pseudo-labels, leading to error accumulation (Challenge 1); Continual adaptation to dynamic environments in existing methods struggles to effectively retain valuable knowledge about the current domain while mitigating noise specific to the previous domain (Challenge 2).

To tackle the identified challenges in the CTTA setting for object detection, we introduce CTAOD (Continual Test-time Adaption for Object Detection). Aligning with previous CTTA works for robust adaptation (Wang et al., 2022; Gong et al., 2022; Döbler et al., 2023), CTAOD is constructed upon the mean-teacher framework (Tarvainen and Valpola, 2017) where the teacher model is an exponential moving average of the student model. Particularly, CTAOD comprises Object-level Contrastive Learning (OCL), Dynamic Threshold (DT), and Data-driven Stochastic Restoration (DSR).

More specifically, we propose OCL, which is tailored for object detection, to acquire fine-grained and localized feature representation, motivated by the effectiveness of contrastive learning (CL) in self-supervised learning (Chen et al., 2020; He et al., 2020; Grill et al., 2020; Chen and He, 2021). OCL extracts features based on proposals generated by the Region Proposal Network (RPN), which provides multiple cropped views around the object instance at different locations and scales. Subsequently, a CL loss is applied to the RPN cropped views to encourage similar object instances to remain close while pushing dissimilar ones apart. The OCL is well integrated into the mean-teacher paradigm as a drop-in enhancement for feature adaptation. Furthermore, to improve the quality of pseudo-labels across various categories, i.e., addressing Challenge 1, we devise a DT strategy to dynamically adjust the threshold for each category individually based on the predicted confidence scores. The dynamic nature of the DT method makes it better suited than fixed thresholds to address the effects of continuously changing distributions. Finally, the DSR mechanism is utilized to reset inactive parameters with a higher probability than active ones by employing the gradients of parameters as weights for a random mask matrix. On the one hand, it helps alleviate the impact of noise from the previous domain while preserving important information, i.e., addressing Challenge 2. On the other hand, its randomness also allows potentially falsely activated parameters to be reset, thereby making the model more stable.

We demonstrate the effectiveness of our proposed approach on four continual test-time adaptation benchmark tasks for object detection. These tasks are associated with synthetic and real-world distribution shifts in short-term and long-term adaptation (i.e., Cityscapes (Cordts et al., 2016) → Cityscapes-C (Hendrycks and Dietterich, 2019), SHIFT (Sun et al., 2022), and Cityscapes (Cordts et al., 2016) → ACDC (Sakaridis et al., 2021) adaptation). Experimental results indicate that our method significantly improves performance over existing state-of-the-art methods, with gains of up to 3.0 mAP on these tasks.

In summary, the main contributions of this paper are as follows:

  • This study introduces a novel method named CTAOD, which pioneers exploring CTTA for detection models. Specifically, we propose to leverage object-level features for contrastive learning to refine feature representation for object detection.

  • To address two challenges in CTTA, we dynamically update the category-specific threshold based on predicted scores to improve the quality of pseudo-labels, and we reset inactive parameters with higher probability to mitigate forgetting.

  • Empirical experiments demonstrate that our proposed method surpasses existing methods and effectively facilitates short-term and long-term adaptation in dynamic environments.

2. RELATED WORK

2.1. Source-Free Domain Adaptation

UDA tackles the inter-domain divergence by aligning the distributions of source and target data (Ganin et al., 2016; Li et al., 2019; Luo et al., 2020; Jiang et al., 2020). Despite its effectiveness, the limitation of UDA lies in its requirement for access to the source domain data, which often raises concerns regarding data privacy, data portability, and data transmission efficiency. As a result, Source-Free Domain Adaptation (SFDA) received extensive research attention, especially in object detection, where the source-trained detector is re-trained on the target training data (VS et al., 2023b, a; Li et al., 2022b, 2021a, 2021b) before evaluation. For instance, MemCLR (VS et al., 2023a) employs a cross-attention transformer-based memory bank for contrastive learning, while IRG (VS et al., 2023a) utilizes the object relations with instance relation graph network to explore the SFDA setting for object detection without the source data. However, the standard SFDA setting requires prior knowledge of the target domain, which is impractical in most real-world applications.

2.2. Test-Time Adaptation

TTA adapts the source-trained model to the target test data during inference time without access to the source data. Since both TTA and SFDA involve adapting the source-trained model to the unlabeled target data without utilizing source data, some works also refer to TTA as SFDA (Wang et al., 2022; Brahma and Rai, 2023). In this paper, we distinguish between TTA and SFDA based on evaluation protocol, although they can be transformed into each other in experiments. Furthermore, TTA methods improve the model performance under distribution shift commonly through techniques like pseudo-labeling (Sinha et al., 2023; Iwasawa and Matsuo, 2021; Sun et al., 2020; Zeng et al., 2023), batchnorm statistics updating (Hu et al., 2021; You et al., 2021), or entropy regularization (Wang et al., 2020; Iwasawa and Matsuo, 2021; Niu et al., 2022) during testing. For example, Tent (Wang et al., 2020) updates the batchnorm parameters with entropy minimization and demands a large batch size for optimization during test-time adaptation, which is unsuitable for real-time detection model deployment where images are processed sequentially. The above approaches assume a specific and static target domain where all the test inputs come from a single domain. However, in numerous practical scenarios, the distribution of test input may exhibit a continual shift over time.

2.3. Continual Test-time Adaptation

Early works consider adaptation to evolving and continually changing domains by aligning the source and target data (Hoffman et al., 2014; Wulfmeier et al., 2018). These methods rely on source data during inference, which limits their applicability. A significant advancement in this area is the proposal of CoTTA (Wang et al., 2022), marking the first work tailored to the demands of CTTA by adapting a pre-trained model to sequences of domains without using any source data. Subsequently, research efforts have been dedicated to exploring CTTA, primarily focused on classification (Döbler et al., 2023; Niloy et al., 2024; Brahma and Rai, 2023; Yu et al., 2023) and segmentation (Niloy et al., 2024; Song et al., 2023; Zhu et al., 2023) tasks. For instance, similar to CoTTA (Wang et al., 2022), Zhu et al. (2023) apply a stochastic reset mechanism for CTTA in the segmentation task of medical images to prevent forgetting. In contrast, PETAL (Brahma and Rai, 2023) utilizes the Fisher Information Matrix (FIM) as a metric of parameter importance to reset only the most irrelevant parameters. Nevertheless, these two methods are sub-optimal, as the randomness might lead to losing essential information, and pure data-driven restoration may retain false active parameters resulting from noise.

Moreover, the mean-teacher framework (Tarvainen and Valpola, 2017) serves as a base architecture for most CTTA works (Wang et al., 2022; Gong et al., 2022; Döbler et al., 2023; Niloy et al., 2024), where the teacher commonly generates pseudo-labels via a fixed threshold to supervise the student. Nonetheless, these methods suffer from low-quality pseudo-labels with a uniform threshold since the model confidence varies across categories and domains. Furthermore, Wang et al. (2024) design a dynamic thresholding technique to update the threshold based on the maximum predicted score in a batch for classification tasks, which requires a relatively large batch size to set an appropriate threshold. However, it is not suitable for object detection, where a smaller batch size is often preferred for computational efficiency. Alternatively, Gan et al. (2023) present a cloud-device collaborative continual adaptation paradigm to accommodate dynamic environments in object detection. Nevertheless, its dependency on source data constrains its practical utility. Therefore, a gap still exists in exploring the CTTA setting in object detection to improve the quality of pseudo-labels (Challenge 1) and to effectively reset noisy neurons (Challenge 2), without reliance on the source data.

2.4. Continual Learning

Continual learning, also known as incremental learning or life-long learning, typically involves enabling the model to retain previously acquired knowledge while learning from a sequential series of tasks, i.e., preventing catastrophic forgetting (De Lange et al., 2021; Parisi et al., 2019). It is commonly categorized into replay methods (Rebuffi et al., 2017; Tiwari et al., 2022; Rolnick et al., 2019), parameter isolation methods (Aljundi et al., 2017; Xu and Zhu, 2018), and regularization-based methods (Kirkpatrick et al., 2017; Li and Hoiem, 2017; Zenke et al., 2017). As an illustration, Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) is a regularization-based technique that penalizes changes to parameters with a significant impact on prediction, based on the Fisher Information Matrix (FIM). In this paper, motivated by (Kirkpatrick et al., 2017; Brahma and Rai, 2023), we utilize gradients to approximate FIM as a metric of parameter importance for resetting noisy parameters. Furthermore, we introduce randomness to enhance model robustness during continual adaptation. Additionally, while continual learning approaches aim to tackle catastrophic forgetting in sequences of new tasks, our work focuses on learning from different domains for a single task.

3. METHOD

3.1. Preliminary

3.1.1. Problem Statement

Table 1. Comparisons between different problem settings. ‘Online’ indicates whether the model has access to the target data or only predicts the incoming test samples immediately.
| Setting | Source Data | Target Training Data | Target Distribution | Train Loss | Test Loss | Online |
|---|---|---|---|---|---|---|
| Continual Learning | × | $(x^t, y^t)$ | Dynamic | $\mathcal{L}(x^t, y^t)$ | × | × |
| Unsupervised Domain Adaptation | $(x^s, y^s)$ | $(x^t)$ | Static | $\mathcal{L}(x^s, y^s) + \mathcal{L}(x^s, x^t)$ | × | × |
| Source-free Domain Adaptation | × | $(x^t)$ | Static | $\mathcal{L}(x^t)$ | × | × |
| Test-time Adaptation | × | × | Static | × | $\mathcal{L}(x^t)$ | ✓ |
| Continual Test-Time Adaptation | × | × | Dynamic | × | $\mathcal{L}(x^t)$ | ✓ |

Given a sequence of domains $D=\{D^{0},D^{1},\dots,D^{n}\}$, we define the first domain $D^{0}$ as the source domain and the subsequent domains as the target domains. The objective of CTTA is to enhance the performance of the model $f_{\theta_{0}}(x)$, where the parameters $\theta_{0}$ are pre-trained on source data $(x^{D^{0}},y^{D^{0}})$ from $D^{0}$, in a continually changing target domain during inference time without using source data. For simplicity, we omit the domain superscript from the data in the following discussion. At time step $t$, unlabeled target data $x_{t}$ is provided sequentially, following the domain group order. The model is required to make a prediction $\hat{y}_{t}=f_{\theta_{t-1}}(x_{t})$ using the parameters $\theta_{t-1}$, which have been updated based on previous target data $x_{1},\dots,x_{t-1}$. Subsequently, $\hat{y}_{t}$ serves as the evaluation output at time step $t$, and the model adapts itself toward $x_{t}$ as $\theta_{t-1}\rightarrow\theta_{t}$, which will only influence future inputs $x_{t+n}$.
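
The predict-then-adapt protocol described above can be illustrated with a minimal Python sketch. The function `run_ctta` and the toy `toy_predict` and `toy_adapt` stand-ins are hypothetical names introduced only for illustration; they use a scalar "model" rather than a detector, and the only point is the ordering: each sample is scored with the parameters from the previous step before any update is made.

```python
from typing import Callable, Iterable, List, Tuple


def run_ctta(
    stream: Iterable[float],
    theta0: float,
    predict: Callable[[float, float], float],
    adapt: Callable[[float, float], float],
) -> Tuple[List[float], float]:
    """Online protocol: predict x_t with theta_{t-1}, then update to theta_t."""
    theta = theta0
    outputs: List[float] = []
    for x_t in stream:
        # The evaluated output for x_t uses the parameters from step t-1.
        outputs.append(predict(theta, x_t))
        # Adaptation happens afterwards and only affects future inputs.
        theta = adapt(theta, x_t)
    return outputs, theta


# Toy stand-ins (assumptions, not the paper's detector): the "model" subtracts
# a running estimate from the input, and adaptation moves that estimate
# halfway toward each new sample.
def toy_predict(theta: float, x: float) -> float:
    return x - theta


def toy_adapt(theta: float, x: float, rate: float = 0.5) -> float:
    return theta + rate * (x - theta)


outputs, final_theta = run_ctta([2.0, 2.0, 4.0], 0.0, toy_predict, toy_adapt)
print(outputs, final_theta)  # prints [2.0, 1.0, 2.5] 2.75
```

Note that the first prediction is made with the unadapted source parameters, so no target sample ever benefits from an update computed on itself.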

Moreover, we compare CTTA with other adaptation settings, as detailed in Table 1. These settings are developed to meet the diverse prerequisites and requirements of real-world applications.

3.1.2. Mean-Teacher Framework

In alignment with previous CTTA works (Wang et al., 2022; Gong et al., 2022; Döbler et al., 2023; Niloy et al., 2024), we build our method based on the mean-teacher framework (Tarvainen and Valpola, 2017). This framework is characterized by the interplay between a teacher model and a student model. Both networks are initialized with the source-trained model at the beginning of adaptation. The teacher model produces the pseudo-labels for the unlabeled target data, which serve as labels of the unlabeled data to supervise the student model. While the parameters of the student model are optimized via gradient descent, the parameters of the teacher model are updated following an Exponential Moving Average (EMA) strategy based on the student model. Formally, this process can be expressed as follows:

$$\mathcal{L}_{pl}(x_{t})=\mathcal{L}_{rpn}(x_{t},\hat{y}_{t})+\mathcal{L}_{rcnn}(x_{t},\hat{y}_{t}), \tag{1}$$
$$\theta_{t}^{std}\leftarrow\theta_{t-1}^{std}-\gamma\frac{\partial\mathcal{L}_{pl}(x_{t})}{\partial\theta_{t-1}^{std}}, \tag{2}$$
$$\theta_{t}^{tch}\leftarrow\alpha\theta_{t-1}^{tch}+(1-\alpha)\theta_{t}^{std}, \tag{3}$$

where $x_{t}$ and $\hat{y}_{t}$ denote the unlabeled target data and the corresponding pseudo-labels generated by the teacher network. $\theta^{std}_{t}$ and $\theta^{tch}_{t}$ denote the parameters of the student and teacher networks at time step $t$, respectively. Moreover, the supervision loss $\mathcal{L}_{pl}$, which consists of the RPN loss $\mathcal{L}_{rpn}$ and the R-CNN loss $\mathcal{L}_{rcnn}$ in Faster R-CNN (Ren et al., 2015), is utilized for pseudo-labeling. Additionally, $\gamma$ represents the student’s learning rate, and the EMA rate is denoted by $\alpha$. Hence, the teacher can be regarded as an ensemble of historical students, providing stable supervision. Despite the effectiveness of the mean-teacher framework under the static distribution shift, this framework encounters error accumulation and catastrophic forgetting in non-stationary environments. Therefore, we design CTAOD to improve the robustness of feature adaptation in the CTTA setting.
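
As a rough illustration of the student and teacher updates in Eqs. (2) and (3), the following Python sketch applies one step of each to a flat list of weights. The helper names `student_step` and `ema_update`, the gradient values, and the settings `lr=0.1` and `alpha=0.9` are assumptions chosen for readability, not values from this paper. The student step follows the usual gradient-descent convention of moving against the gradient.

```python
from typing import List


def student_step(theta_std: List[float], grad: List[float], lr: float) -> List[float]:
    """One gradient-descent step on the student weights (learning rate gamma = lr)."""
    return [w - lr * g for w, g in zip(theta_std, grad)]


def ema_update(theta_tch: List[float], theta_std: List[float], alpha: float) -> List[float]:
    """Teacher update: exponential moving average of the student weights."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(theta_tch, theta_std)]


# Both networks start from the same source-trained weights.
source = [1.0, -2.0]
student = list(source)
teacher = list(source)

# Hypothetical gradient of the pseudo-label loss w.r.t. the student weights.
grad = [0.5, -1.0]

student = student_step(student, grad, lr=0.1)      # roughly [0.95, -1.9]
teacher = ema_update(teacher, student, alpha=0.9)  # roughly [0.995, -1.99]
```

With `alpha` close to 1 the teacher moves only a small fraction of the way toward the student at each step, which is why it behaves like a smoothed ensemble of past students.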

Figure 2. Overview of the proposed CTAOD, which follows the mean-teacher framework. The teacher generates proposals and predictions for the weakly augmented image, whereas the student receives the strongly augmented image. CTAOD includes: (1) the Object-level Contrastive Learning (OCL) module, which compares region-of-interest features extracted from the feature maps of both networks based on teacher proposals for contrastive learning; (2) the Dynamic Threshold (DT) strategy, which adjusts category-specific thresholds based on prediction scores, with the pseudo-labels that pass the threshold used to supervise the student; and (3) the Data-driven Stochastic Restoration (DSR) mechanism, which resets inactive parameters based on the gradient.

3.2. Proposed Method

3.2.1. Overview

As indicated in Figure 2, this subsection presents a concise overview of our proposed CTAOD, consisting of Object-level Contrastive Learning (OCL), Dynamic Threshold (DT), and Data-driven Stochastic Restoration (DSR). Inspired by UDA methods for object detection (Li et al., 2022a; Liu et al., 2021), we adopt the Weak-Strong augmentation to enable the teacher model to generate reliable pseudo-labels without being affected by heavy augmentation. Specifically, the teacher model receives weakly augmented inputs, while the student network receives strongly augmented images. The overall loss of the student model for gradient descent is defined as:

$$\mathcal{L}_{all}=\mathcal{L}_{pl}(x_{t})+\lambda\mathcal{L}_{cl}(x_{t})+\mu\mathcal{L}_{kl}(x_{t}), \tag{4}$$

where $\mathcal{L}_{cl}(x_{t})$ denotes the contrastive learning (CL) loss, elaborated in Section 3.2.2. $\mathcal{L}_{kl}$ represents the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) loss, used to quantify the difference between two probability distributions. The parameters $\lambda$ and $\mu$ are the corresponding loss weights. The KL loss is defined as:

$$\mathcal{L}_{kl}\left(P\parallel Q\right)=\sum_{x\in\mathcal{X}}P(x)\log\left(\frac{P(x)}{Q(x)}\right), \tag{5}$$

where $P(x)$ and $Q(x)$ represent two probability distributions. We employ the KL divergence loss to encourage the student model to approximate the teacher model closely.
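
The KL term and the weighted sum of the three losses can be sketched in a few lines of Python. The helper names `kl_divergence` and `total_loss`, the example class probabilities, and the choice of the teacher output as P and the student output as Q are assumptions made for illustration; the text above does not specify which network plays which role.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("P and Q must have the same number of entries")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x > 0.0:  # terms with P(x) = 0 contribute nothing
            # eps guards against division by zero when Q(x) = 0
            total += p_x * math.log(p_x / max(q_x, eps))
    return total


def total_loss(l_pl: float, l_cl: float, l_kl: float, lam: float, mu: float) -> float:
    """Weighted sum of the pseudo-label, contrastive, and KL losses."""
    return l_pl + lam * l_cl + mu * l_kl


# Hypothetical class probabilities for one region (teacher as P, student as Q).
teacher_probs = [0.7, 0.2, 0.1]
student_probs = [0.5, 0.3, 0.2]
l_kl = kl_divergence(teacher_probs, student_probs)
```

The divergence is zero only when the two distributions agree and is not symmetric in its arguments, so swapping the teacher and student roles would give a different penalty.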

3.2.2. Object-Level Contrastive Learning

In this subsection, we first briefly introduce SimCLR, a widely used CL approach for self-supervised visual representation learning (Chen et al., 2020). SimCLR aims to learn high-quality feature representation across differently augmented views of the same image. For the $i^{th}$ image $x_{i}$ in a batch, the CL loss in SimCLR is formulated as follows:

$$\mathcal{L}_{cl}(x_{i})=-\log\left(\frac{\exp(\mathrm{sim}(z_{i},z_{j})/\tau)}{\sum_{k=1,\,k\neq i}^{2N}\exp(\mathrm{sim}(z_{i},z_{k})/\tau)}\right), \tag{6}$$

where $\tau > 0$ is a temperature hyper-parameter. $z_i$ and $z_j$ denote the features of two different augmentations of the same image, serving as the positive pair, while $z_k$ denotes the feature of any other augmented sample among the $2N$ views produced from a batch of $N$ images. Here, $sim(\cdot)$ indicates the similarity function, i.e., cosine similarity. Notably, SimCLR was originally designed for classification tasks: it assumes that each image pertains to a single category and requires large batch sizes to ensure sufficient positive and negative pairs for feature representation learning. Consequently, the original SimCLR is not well-suited for object detection, where images typically contain multiple instances and accommodating large batch sizes requires significant computational resources.
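To make Eq. (6) concrete, the following minimal sketch computes the loss for one anchor view in plain Python. The feature vectors, the temperature value `tau=0.5`, and the helper names are hypothetical and serve only to illustrate the formula.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nt_xent_loss(features, i, j, tau=0.5):
    """Eq. (6) for anchor i with positive j; `features` holds all 2N views."""
    # Similarities to every view except the anchor itself (k != i).
    logits = {
        k: cosine_similarity(features[i], features[k]) / tau
        for k in range(len(features))
        if k != i
    }
    log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
    return log_denominator - logits[j]


# Hypothetical batch with N = 2 images, hence 2N = 4 augmented views:
# views 0 and 1 come from one image, views 2 and 3 from another.
views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_true_pair = nt_xent_loss(views, 0, 1)
loss_wrong_pair = nt_xent_loss(views, 0, 2)
print(loss_true_pair < loss_wrong_pair)
```

Because each anchor has only one positive among the $2N - 1$ candidates, the quality of the signal depends on how many negatives the batch supplies, which is why the image-level formulation favors large batches.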

Motivated by SimCLR (Chen et al., 2020), we present an OCL module that extracts teacher and student features for CL based on proposals generated by the region proposal network (RPN). Since proposals provide multiple cropped views around the object instances, this strategy does not require a large batch size and is thus computationally efficient for online detection model updates. Specifically, given a weakly augmented image $f_{weak}(x_t)$ at time step $t$, the teacher produces ROI proposals $p_t = \{p_1, ..., p_l\}$ via the RPN. We then apply RoIAlign (He et al., 2017), a pooling operation for ROIs, to extract the corresponding teacher and student object-level features $F_t^T = \{f_i^T \in \mathbb{R}^{1 \times C}\}_{i=1}^{l}$ and $F_t^S = \{f_i^S \in \mathbb{R}^{1 \times C}\}_{i=1}^{l}$ from the backbone feature maps of the teacher and the student, respectively. Subsequently, the features are projected into the final object-level features with the same encoder using two fully connected layers. For simplicity, we still denote these two final features as $F_t^T$ and $F_t^S$. Features associated with the same proposal are considered positive pairs, and all others negative pairs. We apply the CL loss to the features $F_t^T$ and $F_t^S$, derived from the weakly and strongly augmented images respectively, by minimizing:

(7) $\mathcal{L}_{cl}(x_t) = \frac{1}{l} \sum_{i=1}^{l} -\log \frac{\exp(f_i^T \cdot f_i^S / \tau)}{\sum_{j=1}^{l} \exp(f_i^T \cdot f_j^S / \tau)},$

where $l$ indicates the number of the features. This strategy encourages the model to learn fine-grained and localized representations on the target domain, without relying on accurate pseudo-labels. Moreover, the OCL is well integrated into the mean-teacher self-training paradigm as a drop-in enhancement for feature adaptation.
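Eq. (7) can be sketched in a few lines of plain Python. The sketch assumes that the projected teacher and student features are already available as lists of vectors indexed by proposal; the toy features, the value `tau=0.1`, and the function names are illustrative assumptions rather than the authors' implementation.

```python
import math


def dot(a, b):
    """Inner product of two feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def object_contrastive_loss(teacher_feats, student_feats, tau=0.1):
    """Eq. (7): average over proposals of -log softmax of the matching pair.

    Index i in both lists refers to the same proposal, so (i, i) is the
    positive pair and (i, j != i) are the negatives.
    """
    l = len(teacher_feats)
    if l == 0 or l != len(student_feats):
        raise ValueError("need the same non-zero number of proposals")
    total = 0.0
    for i in range(l):
        logits = [dot(teacher_feats[i], student_feats[j]) / tau for j in range(l)]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_denominator - logits[i]
    return total / l


# Two hypothetical proposals with 2-dimensional projected features.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student_aligned = [[1.0, 0.0], [0.0, 1.0]]
student_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = object_contrastive_loss(teacher, student_aligned)
loss_swapped = object_contrastive_loss(teacher, student_swapped)
print(loss_aligned < loss_swapped)
```

The negatives here come from other proposals of the same image rather than from other images, so the loss remains informative even with a batch size of 1.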

3.2.3. Dynamic Threshold

Due to the presence of domain shifts, pseudo-labels inevitably suffer from label noise. The common practice is to use a fixed threshold to remove suspected noisy labels. However, model confidence can differ across categories and domains, so a fixed threshold hurts performance. Taking inspiration from the threshold strategies of (Wang et al., 2024; Zhao et al., 2023), we design the DT strategy, which dynamically updates the category-specific threshold based on the predicted scores, to minimize noise and select a suitable threshold for each category across different domains at test time. As shown in Figure 2, the teacher network first makes the prediction $\hat{y}_t = f_{\theta_{t-1}}(f_{weak}(x_t))$ for the weakly augmented image $f_{weak}(x_t)$. The thresholds for all categories are initialized with the same value $\delta_0$ at the start of adaptation and are updated at every iteration by:

(8) $\delta_t^c \leftarrow \beta \cdot \delta_{t-1}^c + (1 - \beta) \cdot \epsilon \cdot (\overline{l_t^c})^{\frac{1}{2}},$

where $\delta_t^c$ denotes the threshold of category $c$ at time step $t$, $\overline{l_t^c}$ is the mean predicted score of category $c$ at time step $t$, $\beta$ represents the update rate, and $\epsilon$ provides a linear projection. Furthermore, the threshold $\delta_t^c$ does not change if class $c$ does not appear in the prediction $\hat{y}_t$, and we set fixed upper and lower bounds $\delta_{max}$ and $\delta_{min}$. For unlabeled data from dynamic environments, this mechanism effectively prevents the threshold from being too high or too low while generating an appropriate threshold for each category.
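The update in Eq. (8), together with the rule for absent classes and the upper and lower bounds, can be sketched as follows. All hyper-parameter values (`beta`, `epsilon`, `delta_min`, `delta_max`, and the initial threshold of 0.5) are hypothetical placeholders, since this section does not state the values used in the experiments.

```python
def update_thresholds(thresholds, scores_by_class, beta=0.9, epsilon=1.0,
                      delta_min=0.25, delta_max=0.75):
    """One step of Eq. (8), applied independently to every category.

    thresholds: dict mapping class name to the previous threshold.
    scores_by_class: dict mapping class name to the teacher's confidence
        scores for that class in the current prediction.
    """
    updated = dict(thresholds)
    for c, scores in scores_by_class.items():
        if not scores:
            continue  # class absent from the prediction: threshold unchanged
        mean_score = sum(scores) / len(scores)
        new_value = beta * thresholds[c] + (1.0 - beta) * epsilon * mean_score ** 0.5
        # Clamp to the fixed lower and upper bounds.
        updated[c] = min(max(new_value, delta_min), delta_max)
    return updated


# Every category starts from the same (hypothetical) initial threshold.
thresholds = {"car": 0.5, "person": 0.5}
# Only "car" is detected in this frame, with two confident boxes.
thresholds_next = update_thresholds(thresholds, {"car": [0.81, 0.81]})
print(thresholds_next)
```

With these placeholder values the threshold for "car" moves toward the square root of its mean score, while "person" keeps its previous threshold because no box of that class was predicted.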

3.2.4. Data-Driven Stochastic Restoration

Existing methods address catastrophic forgetting through stochastic reset mechanisms (Wang et al., 2022; Zhu et al., 2023) or pure data-driven restoration (Brahma and Rai, 2023). Nevertheless, the former may reset valuable parameters, potentially erasing essential knowledge relevant to the current domain. On the other hand, the latter restores only the least important parameters, which may retain noisy parameters and thereby lead to error accumulation. Taking inspiration from (Wang et al., 2022; Brahma and Rai, 2023), we propose the DSR strategy, which resets irrelevant parameters with higher probability, enabling robust adaptation. Following (Brahma and Rai, 2023), we adopt the gradient-based diagonal approximation of the Fisher Information Matrix (FIM) (Kirkpatrick et al., 2017) as a metric of parameter importance. The FIM approximates parameter importance by considering the sensitivity of the loss function to parameter changes (Kirkpatrick et al., 2017). Parameters with higher values in the FIM are those for which small changes result in significant increases in loss, indicating their importance for the domain at hand.

Specifically, consider a convolution layer of the student model with weight $W_{t-1}$ after gradient descent and its corresponding gradient matrix $G_t$, which has the same shape as $W_{t-1}$, at time step $t$. First, we draw a random matrix $R_t \sim \operatorname{Uniform}(0, 1)$ with the same shape as the weight matrix, where each element follows a uniform distribution between 0 and 1. We then use the element-wise product of the squared gradient and the random matrix to approximate the FIM (Kirkpatrick et al., 2017):

(9) $F_t = G_t \odot G_t \odot R_t,$

where $\odot$ represents the element-wise multiplication. Furthermore, the proposed DSR updates the parameters based on the FIM values, which is formulated as:

(10) $M_t = F_t < \eta,$
(11) $W_t = M_t \odot W_0 + (1 - M_t) \odot W_{t-1},$

where $M_t$ indicates the mask matrix of the same shape as $W_t$, $<$ represents the element-wise less-than operation, and $\eta$ is the threshold value given by the $q$-quantile of $F_t$, i.e., $\eta = quantile(F_t, q)$. Consequently, an element of $M_t$ is set to 1 when the corresponding FIM value is less than $\eta$, indicating that the corresponding parameter should be reset to the source weight $W_0$. This strategy enables the model to retain essential knowledge while introducing randomness to enhance robustness, thereby mitigating the forgetting issue in CTTA.
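Eqs. (9) to (11) can be sketched on a flattened layer as follows. The quantile helper uses linear interpolation, which is one common convention and an assumption here; the value `q=0.2`, the toy weights, and the function names are likewise illustrative and not the authors' settings.

```python
import random


def quantile(values, q):
    """q-quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] * (1.0 - fraction) + ordered[upper] * fraction


def data_driven_stochastic_restore(current, source, grads, q=0.01, rng=random):
    """Apply Eqs. (9)-(11) to one flattened layer; returns (weights, mask)."""
    # Eq. (9): squared gradient scaled element-wise by Uniform(0, 1) noise.
    importance = [g * g * rng.random() for g in grads]
    # Eq. (10): mark entries whose noisy importance is below the q-quantile.
    eta = quantile(importance, q)
    mask = [1 if f < eta else 0 for f in importance]
    # Eq. (11): marked entries return to the source weights, others are kept.
    restored = [w0 if m else w for m, w0, w in zip(mask, source, current)]
    return restored, mask


# Toy layer: the first 50 parameters have tiny gradients, the last 50 large ones.
rng = random.Random(0)
source = [0.0] * 100
current = [1.0] * 100
grads = [1e-4] * 50 + [1.0] * 50
restored, mask = data_driven_stochastic_restore(current, source, grads, q=0.2, rng=rng)
print(sum(mask), "parameters reset to their source values")
```

In this toy setting the reset parameters all come from the low-gradient half, yet which of them are reset depends on the random draw. A high-gradient parameter can still be reset when its random factor happens to be very small, which is the stochastic component of the mechanism.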

4. EXPERIMENTS

Table 2. Experimental results (mAP@0.5) of Cityscapes-to-Cityscapes-C short-term CTTA task. We evaluate the short-term adaptation performance by continually adapting the source-trained model to the twelve corruptions with the largest corruption severity level 5. Columns follow the adaptation order over time t, from left to right.

| Method | Defocus | Glass | Motion | Zoom | Snow | Frost | Fog | Brightness | Contrast | Elastic | Pixelate | Jpeg | Mean (All) | Gain (All) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source (Ren et al., 2015) | 6.8 | 8.1 | 8.0 | 1.5 | 0.2 | 6.8 | 34.6 | 30.7 | 3.0 | 50.2 | 17.6 | 13.5 | 15.1 | / |
| Tent (Wang et al., 2020) | 6.8 | 7.8 | 7.7 | 1.3 | 0.2 | 6.1 | 33.1 | 28.0 | 2.2 | 51.1 | 14.8 | 11.0 | 14.2 | -0.9 |
| CoTTA (Wang et al., 2022) | 7.8 | 9.0 | 8.9 | 1.8 | 0.3 | 7.1 | 38.4 | 31.1 | 8.6 | 49.6 | 16.2 | 13.1 | 16.0 | +0.9 |
| SVDP (Yang et al., 2024) | 7.7 | 10.1 | 9.7 | 2.3 | 0.7 | 13.0 | 42.4 | 45.2 | 15.4 | 47.2 | 21.2 | 14.8 | 19.1 | +4.0 |
| IRG (VS et al., 2023a) | 8.0 | 11.0 | 9.3 | 3.4 | 1.2 | 13.0 | 37.9 | 41.3 | 15.9 | 38.9 | 16.9 | 13.4 | 17.5 | +2.4 |
| MemCLR (VS et al., 2023b) | 8.5 | 10.4 | 10.6 | 2.7 | 1.1 | 12.2 | 41.4 | 41.6 | 16.4 | 43.1 | 15.4 | 12.7 | 18.0 | +2.9 |
| Ours | 8.6 | 12.1 | 11.7 | 3.6 | 1.5 | 16.7 | 44.7 | 48.1 | 16.7 | 47.4 | 22.5 | 13.9 | 20.6 | +5.5 |
Table 3. Experimental results (mAP@0.5) of Cityscapes-to-Cityscapes-C long-term CTTA task. We evaluate the long-term adaptation performance by continually adapting the source model to the five corruptions ten times with the largest corruption severity level 5. To save space, we display selected rounds of results, and full results are provided in the supplementary material. Columns follow the adaptation order over time t, from left to right; R1, R5, and R10 denote rounds 1, 5, and 10.

| Method | R1 Fog | R1 Motion | R1 Snow | R1 Brightness | R1 Defocus | R5 Fog | R5 Motion | R5 Snow | R5 Brightness | R5 Defocus | R10 Fog | R10 Motion | R10 Snow | R10 Brightness | R10 Defocus | Mean (All) | Gain (All) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source (Ren et al., 2015) | 36.1 | 8.1 | 0.2 | 31.0 | 6.7 | 36.1 | 8.1 | 0.2 | 31.0 | 6.7 | 36.1 | 8.1 | 0.2 | 31.0 | 6.7 | 16.4 | / |
| Tent (Wang et al., 2020) | 35.8 | 8.1 | 0.2 | 29.2 | 6.2 | 30.7 | 10.2 | 0.8 | 27.9 | 12.4 | 20.4 | 5.0 | 0.1 | 11.5 | 2.4 | 12.0 | -4.4 |
| CoTTA (Wang et al., 2022) | 38.2 | 10.6 | 0.4 | 33.8 | 9.3 | 40.8 | 10.4 | 0.6 | 36.5 | 9.2 | 40.8 | 10.4 | 0.5 | 36.3 | 9.6 | 19.1 | +2.7 |
| SVDP (Yang et al., 2024) | 36.8 | 8.8 | 0.6 | 43.5 | 11.2 | 45.0 | 15.4 | 3.8 | 47.8 | 19.9 | 41.5 | 16.2 | 4.6 | 44.6 | 20.2 | 25.3 | +8.9 |
| IRG (VS et al., 2023a) | 37.9 | 9.0 | 0.7 | 44.3 | 12.2 | 45.6 | 17.7 | 6.0 | 46.7 | 21.9 | 37.0 | 16.7 | 7.1 | 38.5 | 20.5 | 25.6 | +9.2 |
| MemCLR (VS et al., 2023b) | 37.7 | 8.9 | 0.8 | 45.5 | 13.5 | 45.0 | 18.1 | 5.1 | 46.8 | 22.4 | 36.8 | 18.5 | 7.5 | 38.4 | 21.8 | 26.0 | +9.6 |
| Ours | 39.0 | 10.4 | 0.8 | 48.0 | 13.8 | 49.4 | 18.6 | 7.7 | 51.7 | 25.0 | 45.7 | 20.0 | 11.8 | 46.4 | 25.8 | 29.0 | +12.6 |

In this study, we evaluate our methodology on four benchmark tasks tailored for CTTA in object detection. The four tasks encompass continual adaptation to synthetic and real-world distribution shifts, evaluated over short-term and long-term periods. Following the foundational work CoTTA (Wang et al., 2022), the short-term CTTA task entails adapting sequentially to each target domain once. In contrast, the long-term CTTA task involves continually adapting the model to a group of target domains cyclically.

4.1. Datasets

Cityscapes. The Cityscapes dataset (Cordts et al., 2016) is collected for urban scene understanding, encompassing 2,975 training images and 500 validation images with eight object types, i.e. person, rider, car, truck, bus, train, motorcycle, and bicycle. We utilize the model pre-trained on the training set of Cityscapes as the source model, and the source data is discarded during adaptation.

Cityscapes-C. Cityscapes-C (Hendrycks and Dietterich, 2019) was originally designed to assess robustness against various corruptions, introducing 15 types of corruption with 5 levels of severity. We create the dataset by applying these corruptions at the maximum severity level 5 to the clean Cityscapes validation set, treating each corruption type as an individual target domain comprising 500 images.

For Cityscapes-C, both short-term and long-term CTTA tasks are explored. Our short-term CTTA task selectively focuses on the latter 12 corruptions, i.e. Defocus Blur, Frosted Glass Blur, Motion Blur, Zoom Blur, Snow, Frost, Fog, Brightness, Contrast, Elastic, Pixelate, and JPEG. For the long-term CTTA task, we prioritize the five corruptions with direct relevance to autonomous driving scenarios as the target domain groups following (Gan et al., 2023), namely Fog, Motion, Snow, Brightness, and Defocus. To mimic the scenario in real life where similar environments might be revisited, and to evaluate the forgetting effect of our methods in the long term, we repeat the adaptation to the target domain group 10 times.
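The two evaluation protocols differ only in the order and number of domain visits, which the following sketch makes explicit. The constant and function names are illustrative; the domain lists, the 10 rounds, and the 500 images per domain come from the text above.

```python
SHORT_TERM_DOMAINS = [
    "Defocus Blur", "Frosted Glass Blur", "Motion Blur", "Zoom Blur",
    "Snow", "Frost", "Fog", "Brightness", "Contrast", "Elastic",
    "Pixelate", "JPEG",
]
LONG_TERM_DOMAINS = ["Fog", "Motion", "Snow", "Brightness", "Defocus"]
IMAGES_PER_DOMAIN = 500  # each corruption is applied to the 500 validation images


def domain_schedule(domains, rounds=1):
    """Yield (round_index, domain) pairs in the order the model meets them."""
    for round_index in range(1, rounds + 1):
        for domain in domains:
            yield round_index, domain


short_term = list(domain_schedule(SHORT_TERM_DOMAINS))            # one pass
long_term = list(domain_schedule(LONG_TERM_DOMAINS, rounds=10))   # ten cycles

print(len(short_term), "domain visits in the short-term task")
print(len(long_term), "domain visits in the long-term task")
print(len(long_term) * IMAGES_PER_DOMAIN, "test images seen in the long-term task")
```

The model is never reset between visits, so in the long-term task every revisit of a domain also measures how much of the earlier adaptation has been retained.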

SHIFT. The SHIFT (Sun et al., 2022) is a synthetic dataset for autonomous driving, featuring real-world environmental changes. SHIFT can be categorized as clear, cloudy, overcast, rainy, and foggy, where each condition contains images taken at various times ranging from daytime to night. For SHIFT, short-term CTTA tasks are considered. We designate the clear condition as the source domain, with the remaining four conditions as the target domain groups including nearly 20k images in total.

ACDC. The ACDC dataset (Sakaridis et al., 2021), namely the adverse conditions dataset, shares the same class types as Cityscapes and is collected in four different adverse visual conditions, including Fog, Night, Rain, and Snow. Following (Wang et al., 2022; Gan et al., 2023), we use these four conditions as the target domain group for the long-term CTTA task, with 400 unlabeled images from each condition. Similarly, we continually adapt the source-trained model to the target domain group for 10 cycles in the long-term CTTA task.

4.2. Implementation Details

We adopt the Faster R-CNN (Ren et al., 2015) with ResNet50 (He et al., 2016) pre-trained on ImageNet (Krizhevsky et al., 2012) as the backbone. Aligning with (VS et al., 2023b; Wang et al., 2022), we maintain a batch size of 1, thereby emulating a real-world application scenario where the object detection model is required to adapt toward a continuous influx of images. The source and student models are trained using an SGD optimizer with a learning rate of 0.001 and a momentum of 0.9. Algorithms are implemented leveraging the Detectron2 platform (Wu et al., 2019). The metric of mAP at an IoU threshold of 0.5 (mAP@0.5) is employed for evaluation. Each experiment is conducted on 1 NVIDIA A800 GPU. More implementation details are provided in the supplementary material.

4.3. Baselines and Compared Approaches

To establish the efficacy of our proposed CTAOD, we compare it with six baselines across various setting types for a fair comparison, including Source (Ren et al., 2015), IRG (VS et al., 2023a), MemCLR (VS et al., 2023b), Tent (Wang et al., 2020), CoTTA (Wang et al., 2022), and SVDP (Yang et al., 2024). Specifically, "Source" represents the source model, i.e., Faster R-CNN (Ren et al., 2015), pre-trained solely on the source domain. MemCLR (VS et al., 2023b) and IRG (VS et al., 2023a) are SFDA object detection methods: MemCLR integrates cross-attention with contrastive learning (CL), while IRG incorporates an instance relation graph and CL. Tent (Wang et al., 2020) updates the affine parameters through entropy minimization in TTA. CoTTA (Wang et al., 2022) utilizes weight-averaged predictions and stochastic neuron restoration to tackle CTTA, while SVDP (Yang et al., 2024) explores sparse visual prompts for CTTA dense prediction.

Table 4. Experimental results (mAP@0.5) of SHIFT short-term CTTA task. We evaluate the adaptation performance by continually adapting the source model to the four conditions. Columns follow the adaptation order over time t, from left to right.

| Method | Cloudy | Overcast | Rainy | Foggy | Mean (All) | Gain (All) |
|---|---|---|---|---|---|---|
| Source | 51.8 | 41.5 | 43.8 | 33.9 | 42.7 | / |
| Tent | 50.9 | 39.5 | 36.3 | 23.1 | 37.5 | -5.2 |
| CoTTA | 51.1 | 40.1 | 40.7 | 29.9 | 40.5 | -2.2 |
| SVDP | 52.0 | 41.0 | 43.8 | 35.9 | 43.2 | +0.5 |
| IRG | 51.9 | 40.6 | 42.7 | 34.3 | 42.4 | -0.3 |
| MemCLR | 51.8 | 40.4 | 42.6 | 35.0 | 42.4 | -0.3 |
| Ours | 52.3 | 41.3 | 44.1 | 36.8 | 43.6 | +0.9 |

4.4. Results and Analysis

Table 5. Experimental results (mAP@0.5) of Cityscapes-to-ACDC long-term CTTA task. We evaluate the long-term adaptation performance by continually adapting the source model to the four conditions ten times. To save space, we display selected rounds of results, and full results are provided in the supplementary material. Columns follow the adaptation order over time t, from left to right; R1, R4, R7, and R10 denote rounds 1, 4, 7, and 10.

| Method | R1 Fog | R1 Night | R1 Rain | R1 Snow | R4 Fog | R4 Night | R4 Rain | R4 Snow | R7 Fog | R7 Night | R7 Rain | R7 Snow | R10 Fog | R10 Night | R10 Rain | R10 Snow | Mean (All) | Gain (All) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source (Ren et al., 2015) | 52.3 | 18.7 | 33.5 | 39.6 | 52.3 | 18.7 | 33.5 | 39.6 | 52.3 | 18.7 | 33.5 | 39.6 | 52.3 | 18.7 | 33.5 | 39.6 | 36.0 | / |
| Tent (Wang et al., 2020) | 52.4 | 18.6 | 33.4 | 38.9 | 51.7 | 17.4 | 31.4 | 36.0 | 45.8 | 14.6 | 28.1 | 28.5 | 35.0 | 9.5 | 21.5 | 18.3 | 30.5 | -5.5 |
| CoTTA (Wang et al., 2022) | 53.7 | 19.7 | 38.0 | 42.4 | 53.1 | 19.7 | 37.7 | 42.9 | 51.3 | 19.0 | 36.5 | 41.8 | 50.9 | 19.1 | 36.0 | 42.4 | 37.8 | +1.8 |
| SVDP (Yang et al., 2024) | 52.8 | 20.0 | 35.6 | 42.0 | 54.6 | 23.5 | 38.7 | 43.8 | 52.8 | 24.0 | 38.6 | 43.9 | 51.8 | 23.6 | 38.2 | 43.0 | 39.5 | +3.5 |
| IRG (VS et al., 2023a) | 52.7 | 20.6 | 36.0 | 42.9 | 53.6 | 23.2 | 38.1 | 45.6 | 51.0 | 23.1 | 37.1 | 42.8 | 49.7 | 22.5 | 36.5 | 40.7 | 38.9 | +2.9 |
| MemCLR (VS et al., 2023b) | 52.9 | 21.2 | 35.2 | 42.8 | 52.8 | 23.2 | 38.2 | 44.4 | 51.4 | 23.2 | 37.1 | 42.9 | 49.9 | 22.9 | 36.6 | 41.0 | 38.7 | +2.7 |
| Ours | 52.9 | 20.4 | 34.5 | 42.9 | 54.5 | 23.8 | 39.6 | 45.8 | 53.5 | 25.1 | 39.0 | 45.3 | 52.5 | 24.3 | 38.8 | 44.7 | 40.3 | +4.3 |

4.4.1. Synthetic Continual Distribution Shift

We first evaluate the effectiveness of the proposed method on the synthetic continual shift dataset (i.e., Cityscapes-to-Cityscapes-C adaptation tasks) in the short term. As shown in Table 2, Tent (Wang et al., 2020) undergoes a slight decline in performance, dropping from 15.1 to 14.2 mAP relative to the source model. This downturn may be attributed to its dependency on a large batch size to update the parameters of the batchnorm layers, making it suboptimal for online adaptation. In contrast, CoTTA (Wang et al., 2022), IRG (VS et al., 2023a), and MemCLR (VS et al., 2023b) improve performance, achieving 16.0, 17.5, and 18.0 mAP, respectively. Although SFDA methods such as IRG and MemCLR assume a static target domain, employing a large momentum update rate $\alpha$ for the teacher enables relatively stable adaptation in the short term. Consequently, CL leads to better performance than CoTTA, which employs weight-averaged predictions. SVDP, which utilizes visual prompts, achieves the second-best performance of 19.1 mAP. Our proposed method improves the performance to 20.6 mAP, the highest mean among all compared approaches. With high-quality pseudo-labels and feature representations, CTAOD ensures a more reliable short-term adaptation to target domains characterized by intense and continual changes.

As presented in Table 3, the long-term results reveal the source model's relatively poor performance, with an average of 16.4 mAP. Tent's performance deteriorates markedly over time. Although MemCLR and IRG improve over the source model, their performance also begins to decline in the later rounds, failing to remain stable over long-term adaptation. We believe this is because these methods do not account for continual distribution shifts, resulting in error accumulation and catastrophic forgetting. CoTTA utilizes a stochastic restoration mechanism to mitigate forgetting, but its randomness may cause crucial information to be lost, limiting its performance to 19.1 mAP. In contrast, SVDP, which employs sparse visual prompts, raises the performance to 25.3 mAP. CTAOD, which employs the DSR mechanism to conserve valuable knowledge while eliminating noise from prior domains, yields a 12.6 mAP improvement over the Source model and surpasses all compared baselines. These findings empirically validate the effectiveness of our proposed method in ensuring stable adaptation under synthetic continual distribution shifts, in both short-term and long-term adaptation scenarios.

Figure 3. Qualitative results. We compare the detection results of (a) the Source model, (b) MemCLR, and (c) CTAOD (ours) in the $10^{th}$ round of adaptation to brightness (row 1) and defocus (row 2) corruption on the long-term Cityscapes-to-Cityscapes-C task.

4.4.2. Real-World Continual Distribution Shift

We also evaluate our method on the real-world continual distribution shift datasets (i.e., SHIFT and Cityscapes-to-ACDC adaptation tasks). For the SHIFT short-term CTTA task, where each domain includes more complex images ranging from daytime to night, the results are shown in Table 4. While the other baselines suffer performance declines, SVDP, which employs a fixed threshold for pseudo-labeling together with prompt learning, achieves a 0.5 mAP improvement. CTAOD, which introduces the DT strategy, achieves the best performance of 43.6 mAP.

The experimental results of the ACDC long-term CTTA task are summarized in Table 5. Tent, CoTTA, IRG, and MemCLR potentially suffer from error accumulation and catastrophic forgetting, manifesting in a decline in performance during the later stages. SVDP achieves the second-best performance of 39.5 mAP. Notably, CTAOD achieves 40.3 mAP, an absolute improvement of 0.8 mAP over the strongest baseline. While CTAOD may not display a significant advantage over other methods in the initial round, its performance improves in the subsequent rounds and remains stable throughout the long-term adaptation. This outcome underscores the capability of our method to foster robust adaptation to real-world continual distribution shifts in the long term.

4.4.3. Qualitative Results

As shown in Figure 3, we provide some examples of detection results. We compare the source model, MemCLR, and CTAOD in the final round of adaptation to the brightness and defocus corruptions on the long-term Cityscapes-to-Cityscapes-C task. Compared with Source and MemCLR, CTAOD guides the model to learn better feature representations while effectively mitigating forgetting in the long-term adaptation. Therefore, CTAOD assists the detector in distinguishing more foreground object categories and localizing them more accurately. More visualization analysis is presented in the supplementary material.

Table 6. Ablation experiment. “Mean-Teacher” represents the base mean teacher framework with weak-strong augmentation and KL divergence distillation. All experiments are done on long-term Cityscapes-to-Cityscapes-C tasks.

| # | Mean-Teacher | OCL | DT | DSR | Mean | Gain |
|---|---|---|---|---|---|---|
| 1 | ✓ | | | | 25.9 | / |
| 2 | ✓ | ✓ | | | 27.4 | +1.5 |
| 3 | ✓ | | ✓ | | 27.1 | +1.2 |
| 4 | ✓ | | | ✓ | 26.7 | +0.8 |
| 5 | ✓ | ✓ | ✓ | | 28.5 | +12.1 |
| 6 | ✓ | ✓ | ✓ | ✓ | 29.0 | +12.6 |

4.4.4. Ablation Study

We conduct an ablation study to empirically assess the impact of the proposed OCL, DT, and DSR. As shown in Table 6, the mean-teacher architecture, which incorporates weak-strong augmentation and KL divergence distillation, elevates the performance to 25.9 mAP. Building upon the mean-teacher foundation, the proposed OCL, DT, and DSR individually achieve additional improvements of 1.5, 1.2, and 0.8 mAP, respectively. Moreover, the combination of the OCL and DT modules raises the performance to 28.5 mAP. Finally, adding the DSR mechanism brings a further 0.5 mAP boost, yielding a full-model performance of 29.0 mAP.

5. CONCLUSION

In this work, we proposed CTAOD, aimed at addressing the challenges of low-quality pseudo-labels and random errors resulting from stochastic recovery in continual test-time adaptation. Firstly, object-level contrastive learning leverages teacher ROI features for contrastive learning to refine the feature representation, tailored for object detection. Secondly, the dynamic threshold strategy selects higher-quality pseudo-labels by dynamically updating the category-specific threshold based on the predicted confidence scores. Lastly, the data-driven stochastic restoration mechanism selectively resets inactive parameters to mitigate forgetting while retaining valuable knowledge. The empirical results on four continual test-time adaptation tasks for object detection demonstrate the efficacy of CTAOD in both short-term and long-term adaptation scenarios.

References

  • Aljundi et al. (2017) Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. 2017. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3366–3375.
  • Amodei et al. (2016) Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. 2016. Deep speech 2: End-to-end speech recognition in english and mandarin. In International Conference on Machine Learning. PMLR, 173–182.
  • Aytar et al. (2016) Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. Soundnet: Learning sound representations from unlabeled video. Advances in Neural Information Processing Systems 29 (2016).
  • Brahma and Rai (2023) Dhanajit Brahma and Piyush Rai. 2023. A probabilistic framework for lifelong test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3582–3591.
  • Carreira and Zisserman (2017) Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning. PMLR, 1597–1607.
  • Chen and He (2021) Xinlei Chen and Kaiming He. 2021. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15750–15758.
  • Cordts et al. (2016) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3213–3223.
  • De Lange et al. (2021) Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. 2021. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 7 (2021), 3366–3385.
  • Deng et al. (2014) Jun Deng, Zixing Zhang, Florian Eyben, and Björn Schuller. 2014. Autoencoder-based unsupervised domain adaptation for speech emotion recognition. IEEE Signal Processing Letters 21, 9 (2014), 1068–1072.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Döbler et al. (2023) Mario Döbler, Robert A Marsden, and Bin Yang. 2023. Robust mean teacher for continual and gradual test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7704–7714.
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  • Gan et al. (2023) Yulu Gan, Mingjie Pan, Rongyu Zhang, Zijian Ling, Lingran Zhao, Jiaming Liu, and Shanghang Zhang. 2023. Cloud-device collaborative adaptation to continual changing environments in the real-world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12157–12166.
  • Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research 17, 59 (2016), 1–35.
  • Geirhos et al. (2018) Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Schütt, Matthias Bethge, and Felix A Wichmann. 2018. Generalisation in humans and deep neural networks. Advances in Neural Information Processing Systems 31 (2018).
  • Gong et al. (2022) Taesik Gong, Jongheon Jeong, Taewon Kim, Yewon Kim, Jinwoo Shin, and Sung-Ju Lee. 2022. Note: Robust continual test-time adaptation against temporal correlation. Advances in Neural Information Processing Systems 35 (2022), 27253–27266.
  • Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems 33 (2020), 21271–21284.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9729–9738.
  • He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision. 2961–2969.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
  • Hendrycks and Dietterich (2019) Dan Hendrycks and Thomas Dietterich. 2019. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261 (2019).
  • Hoffman et al. (2014) Judy Hoffman, Trevor Darrell, and Kate Saenko. 2014. Continuous manifold based adaptation for evolving visual domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 867–874.
  • Hu et al. (2021) Xuefeng Hu, Gokhan Uzunbas, Sirius Chen, Rui Wang, Ashish Shah, Ram Nevatia, and Ser-Nam Lim. 2021. Mixnorm: Test-time adaptation through online normalization estimation. arXiv preprint arXiv:2110.11478 (2021).
  • Huang et al. (2021) Jiaxing Huang, Dayan Guan, Aoran Xiao, and Shijian Lu. 2021. Model adaptation: Historical contrastive learning for unsupervised domain adaptation without source data. Advances in Neural Information Processing Systems 34 (2021), 3635–3649.
  • Iwasawa and Matsuo (2021) Yusuke Iwasawa and Yutaka Matsuo. 2021. Test-time classifier adjustment module for model-agnostic domain generalization. Advances in Neural Information Processing Systems 34 (2021), 2427–2440.
  • Jiang et al. (2020) Junguang Jiang, Ximei Wang, Mingsheng Long, and Jianmin Wang. 2020. Resource efficient domain adaptation. In Proceedings of the 28th ACM International Conference on Multimedia. 2220–2228.
  • Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 13 (2017), 3521–3526.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012).
  • Kullback and Leibler (1951) Solomon Kullback and Richard A Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22, 1 (1951), 79–86.
  • Li et al. (2019) Shuang Li, Chi Harold Liu, Binhui Xie, Limin Su, Zhengming Ding, and Gao Huang. 2019. Joint adversarial domain adaptation. In Proceedings of the 27th ACM International Conference on Multimedia. 729–737.
  • Li et al. (2022b) Shuaifeng Li, Mao Ye, Xiatian Zhu, Lihua Zhou, and Lin Xiong. 2022b. Source-free object detection by learning to overlook domain style. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8014–8023.
  • Li et al. (2021a) Xianfeng Li, Weijie Chen, Di Xie, Shicai Yang, Peng Yuan, Shiliang Pu, and Yueting Zhuang. 2021a. A free lunch for unsupervised domain adaptive object detection without source data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 8474–8481.
  • Li et al. (2021b) Xinhao Li, Jingjing Li, Lei Zhu, Guoqing Wang, and Zi Huang. 2021b. Imbalanced source-free domain adaptation. In Proceedings of the 29th ACM International Conference on Multimedia. 3330–3339.
  • Li et al. (2022a) Yu-Jhe Li, Xiaoliang Dai, Chih-Yao Ma, Yen-Cheng Liu, Kan Chen, Bichen Wu, Zijian He, Kris Kitani, and Peter Vajda. 2022a. Cross-domain adaptive teacher for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7581–7590.
  • Li and Hoiem (2017) Zhizhong Li and Derek Hoiem. 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 12 (2017), 2935–2947.
  • Liu et al. (2021) Yen-Cheng Liu, Chih-Yao Ma, Zijian He, Chia-Wen Kuo, Kan Chen, Peizhao Zhang, Bichen Wu, Zsolt Kira, and Peter Vajda. 2021. Unbiased teacher for semi-supervised object detection. arXiv preprint arXiv:2102.09480 (2021).
  • Luo et al. (2020) Yadan Luo, Zi Huang, Zijian Wang, Zheng Zhang, and Mahsa Baktashmotlagh. 2020. Adversarial bipartite graph learning for video domain adaptation. In Proceedings of the 28th ACM International Conference on Multimedia. 19–27.
  • Mummadi et al. (2021) Chaithanya Kumar Mummadi, Robin Hutmacher, Kilian Rambach, Evgeny Levinkov, Thomas Brox, and Jan Hendrik Metzen. 2021. Test-time adaptation to distribution shift by confidence maximization and input transformation. arXiv preprint arXiv:2106.14999 (2021).
  • Niloy et al. (2024) Fahim Faisal Niloy, Sk Miraj Ahmed, Dripta S Raychaudhuri, Samet Oymak, and Amit K Roy-Chowdhury. 2024. Effective restoration of source knowledge in continual test time adaptation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2091–2100.
  • Niu et al. (2022) Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Shijian Zheng, Peilin Zhao, and Mingkui Tan. 2022. Efficient test-time model adaptation without forgetting. In International Conference on Machine Learning. PMLR, 16888–16905.
  • Parisi et al. (2019) German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. 2019. Continual lifelong learning with neural networks: A review. Neural networks 113 (2019), 54–71.
  • Rebuffi et al. (2017) Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. 2017. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2001–2010.
  • Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28 (2015).
  • Rolnick et al. (2019) David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. 2019. Experience replay for continual learning. Advances in Neural Information Processing Systems 32 (2019).
  • Saito et al. (2018) Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. 2018. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3723–3732.
  • Sakaridis et al. (2021) Christos Sakaridis, Dengxin Dai, and Luc Van Gool. 2021. ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10765–10775.
  • Sinha et al. (2023) Samarth Sinha, Peter Gehler, Francesco Locatello, and Bernt Schiele. 2023. Test: Test-time self-training under distribution shift. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2759–2769.
  • Sójka et al. (2023) Damian Sójka, Sebastian Cygert, Bartłomiej Twardowski, and Tomasz Trzciński. 2023. AR-TTA: A Simple Method for Real-World Continual Test-Time Adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3491–3495.
  • Song et al. (2023) Junha Song, Jungsoo Lee, In So Kweon, and Sungha Choi. 2023. Ecotta: Memory-efficient continual test-time adaptation via self-distilled regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11920–11929.
  • Sun et al. (2022) Tao Sun, Mattia Segu, Janis Postels, Yuxuan Wang, Luc Van Gool, Bernt Schiele, Federico Tombari, and Fisher Yu. 2022. SHIFT: a synthetic driving dataset for continuous multi-task domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21371–21382.
  • Sun et al. (2020) Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. 2020. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning. PMLR, 9229–9248.
  • Tarvainen and Valpola (2017) Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems 30 (2017).
  • Tiwari et al. (2022) Rishabh Tiwari, Krishnateja Killamsetty, Rishabh Iyer, and Pradeep Shenoy. 2022. Gcr: Gradient coreset based replay buffer selection for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 99–108.
  • Tran et al. (2015) Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning Spatiotemporal Features With 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Tzeng et al. (2017) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. 2017. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7167–7176.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
  • VS et al. (2023a) Vibashan VS, Poojan Oza, and Vishal M Patel. 2023a. Instance relation graph guided source-free domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3520–3530.
  • VS et al. (2023b) Vibashan VS, Poojan Oza, and Vishal M Patel. 2023b. Towards online domain adaptive object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 478–488.
  • Wang et al. (2020) Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. 2020. Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726 (2020).
  • Wang et al. (2022) Qin Wang, Olga Fink, Luc Van Gool, and Dengxin Dai. 2022. Continual test-time domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7201–7211.
  • Wang et al. (2024) Yanshuo Wang, Jie Hong, Ali Cheraghian, Shafin Rahman, David Ahmedt-Aristizabal, Lars Petersson, and Mehrtash Harandi. 2024. Continual test-time domain adaptation via dynamic sample selection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1701–1710.
  • Wu et al. (2019) Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. 2019. Detectron2. https://github.com/facebookresearch/detectron2.
  • Wulfmeier et al. (2018) Markus Wulfmeier, Alex Bewley, and Ingmar Posner. 2018. Incremental adversarial domain adaptation for continually changing environments. In 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 4489–4495.
  • Xu and Zhu (2018) Ju Xu and Zhanxing Zhu. 2018. Reinforced continual learning. Advances in Neural Information Processing Systems 31 (2018).
  • Yang et al. (2024) Senqiao Yang, Jiarui Wu, Jiaming Liu, Xiaoqi Li, Qizhe Zhang, Mingjie Pan, Yulu Gan, Zehui Chen, and Shanghang Zhang. 2024. Exploring sparse visual prompt for domain adaptive dense prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 16334–16342.
  • You et al. (2021) Fuming You, Jingjing Li, and Zhou Zhao. 2021. Test-time batch statistics calibration for covariate shift. arXiv preprint arXiv:2110.04065 (2021).
  • Yu et al. (2023) Zhiqi Yu, Jingjing Li, Zhekai Du, Fengling Li, Lei Zhu, and Yang Yang. 2023. Noise-robust continual test-time domain adaptation. In Proceedings of the 31st ACM International Conference on Multimedia. 2654–2662.
  • Zeng et al. (2023) Runhao Zeng, Qi Deng, Huixuan Xu, Shuaicheng Niu, and Jian Chen. 2023. Exploring Motion Cues for Video Test-Time Adaptation. In Proceedings of the 31st ACM International Conference on Multimedia. 1840–1850.
  • Zenke et al. (2017) Friedemann Zenke, Ben Poole, and Surya Ganguli. 2017. Continual learning through synaptic intelligence. In International Conference on Machine Learning. PMLR, 3987–3995.
  • Zhao et al. (2023) Zijing Zhao, Sitong Wei, Qingchao Chen, Dehui Li, Yifan Yang, Yuxin Peng, and Yang Liu. 2023. Masked retraining teacher-student framework for domain adaptive object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 19039–19049.
  • Zhu et al. (2023) Jiayi Zhu, Bart Bolsterlee, Brian VY Chow, Yang Song, and Erik Meijering. 2023. Uncertainty and shape-aware continual test-time adaptation for cross-domain segmentation of medical images. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 659–669.

Appendix A Appendix Overview

In this supplementary material, we provide additional details and analysis. In Section B, we provide implementation details on CTAOD and other baseline methodologies. In Section C, we furnish complete results for the quantitative research and additional visualization examples for the qualitative research. In Section D, we discuss the limitations and prospective directions of our research.

Appendix B Additional Implementation Details

Following (VS et al., 2023b), we trained a Faster R-CNN (Ren et al., 2015) with a ResNet50 (He et al., 2016) backbone pre-trained on ImageNet (Krizhevsky et al., 2012) on the training set of Cityscapes (Cordts et al., 2016) as the source-trained model. We used an SGD optimizer with a learning rate of 0.001, a momentum of 0.9, and a batch size of 1, training for 70k iterations. The implementation details of CTAOD and the baseline methods are specified in the following subsections.

B.1. Implementation of CTAOD

Detailed hyper-parameters of CTAOD for each benchmark are listed in Table 8. Furthermore, we present a pseudo-code outline of our CTAOD framework in Algorithm 1. The three key components of CTAOD are highlighted, i.e., Object-level Contrastive Learning, Dynamic Threshold, and Data-driven Stochastic Restoration.
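The object-level contrastive loss listed in Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `object_contrastive_loss`, the list-of-lists feature format, and the assumption that the RoI features are already L2-normalized are all choices made here for the example. The default temperature of 0.07 is the value of the temperature in Table 8.

```python
import math

def object_contrastive_loss(teacher_feats, student_feats, tau=0.07):
    """InfoNCE-style loss: teacher RoI feature i should match student RoI feature i.

    Assumption: both inputs hold l feature vectors (lists of floats) that are
    already L2-normalized and extracted from the same l proposals.
    """
    l = len(teacher_feats)
    total = 0.0
    for i in range(l):
        # Similarity of teacher feature i to every student feature j, scaled by tau.
        logits = [
            sum(a * b for a, b in zip(teacher_feats[i], student_feats[j])) / tau
            for j in range(l)
        ]
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]  # equals -log(softmax of the positive pair)
    return total / l

# Matching pairs give a loss close to zero; swapped pairs are penalized.
aligned = object_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = object_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With a small temperature, the loss is dominated by whether the positive pair has the highest similarity, which is why the swapped case is penalized so heavily.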

B.2. Implementation of Baseline Methods

Tent (Wang et al., 2020), CoTTA (Wang et al., 2022), and SVDP (Yang et al., 2024) were originally designed for classification or segmentation tasks. We implement them based on their public code, with some modifications to fit object detection. Specifically, Tent is implemented by updating the batch normalization parameters of Faster R-CNN through entropy minimization. We implement CoTTA by constructing a mean-teacher framework based on Faster R-CNN that incorporates a stochastic restoration mechanism. Similarly, we implement SVDP within the mean-teacher framework, applying visual prompts to the bounding-box regions of the pseudo-labels. MemCLR (VS et al., 2023b) and IRG (VS et al., 2023a), in turn, are source-free domain adaptation methods for object detection. We therefore implement them based on their public code, with minor modifications to follow the CTTA setting (i.e., updating the model parameters during evaluation). For a fair comparison, we set the momentum $\alpha$ of the Exponential Moving Average (EMA) for CoTTA, SVDP, MemCLR, and IRG, all of which apply the mean-teacher framework (Tarvainen and Valpola, 2017), to the same value as in CTAOD. Other hyper-parameters are retained as specified in their public code.
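As a reminder of what the shared EMA momentum controls, the standard mean-teacher update moves each teacher parameter a small step toward the corresponding student parameter. The sketch below is illustrative only: the function name `ema_update` and the dictionary of scalar parameters are assumptions made for the example, and a real detector would apply the same rule to every tensor in the model.

```python
def ema_update(teacher, student, alpha):
    """Standard EMA rule: teacher <- alpha * teacher + (1 - alpha) * student."""
    return {
        name: alpha * teacher[name] + (1.0 - alpha) * student[name]
        for name in teacher
    }

# Toy parameters (hypothetical names and values) with an EMA momentum of 0.9998.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, alpha=0.9998)
```

A momentum this close to 1 means the teacher changes very slowly, so using the same value across methods keeps the teachers comparably stable.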

Appendix C Additional Experiment Results

C.1. Quantitative Results

We present the complete experimental results on the two long-term CTTA tasks, i.e., the Cityscapes-to-Cityscapes-C and Cityscapes-to-ACDC adaptation tasks, in Tables 9 and 10. The experiments show that the proposed CTAOD largely maintains strong performance in the long term.

C.2. Qualitative Results

As shown in Figures 4–8, we present more visualization results comparing CTAOD with the baseline methods, based on the final round of adaptation on the long-term Cityscapes-to-Cityscapes-C task. The results further validate the effectiveness of CTAOD in distinguishing more foreground object categories and localizing them better than the baseline methods.

C.3. Additional Ablation Study

Table 7. Additional Ablation Study. The “FT” represents replacing dynamic threshold in CTAOD with a fixed threshold. The “SR” denotes the stochastic restoration (Wang et al., 2022), and the “DR” indicates the data-driven restoration (Brahma and Rai, 2023). All experiments are done on long-term Cityscapes-to-Cityscapes-C tasks.

| # | CTAOD | FT of 0.9 | FT of 0.8 | FT of 0.7 | SR | DR | Mean |
|---|---|---|---|---|---|---|---|
| 1 | ✓ | ✓ |  |  |  |  | 27.7 |
| 2 | ✓ |  | ✓ |  |  |  | 28.2 |
| 3 | ✓ |  |  | ✓ |  |  | 28.5 |
| 4 | ✓ |  |  |  | ✓ |  | 27.1 |
| 5 | ✓ |  |  |  |  | ✓ | 28.3 |
| 6 | ✓ |  |  |  |  |  | 29.0 |

As presented in Table 7, we perform an additional ablation study to evaluate the effectiveness of the proposed Dynamic Threshold (DT) and Data-driven Stochastic Restoration (DSR). The findings indicate that the DT approach results in improvements of 1.3, 0.8, and 0.5 mAP over fixed thresholds set at 0.9, 0.8, and 0.7, respectively. While the stochastic restoration and data-driven restoration limit the performance to 27.1 and 28.3 mAP, the DSR mechanism raises the performance to 29.0 mAP.

Appendix D Limitation

Firstly, our experiments are primarily based on a single detector backbone, i.e., Faster R-CNN (Ren et al., 2015). Future work will extend to other advanced detector backbone networks to assess the universality and generalization capabilities of CTAOD. Moreover, because the proposed DSR mechanism resets the model at every iteration, it increases the computational cost. Future work could explore more efficient mechanisms for restoration, such as restoring the model only when significant changes in the distribution of the target domain are detected. Finally, the evaluation tasks in our work are designed to simulate real-world adaptation scenarios by incorporating datasets affected by corruption and adverse conditions. However, real-world data distributions are inherently more complex. Consequently, a promising avenue for future research would be to apply our methodology in practical, real-world systems to further validate its effectiveness.

Table 8. Detailed hyper-parameters for each benchmark. “City2City-C-Short” denotes the short-term Cityscapes-to-Cityscapes-C task, “City2City-C-Long” denotes long-term Cityscapes-to-Cityscapes-C task, and “City2ACDC-Long” denotes long-term Cityscapes-to-ACDC task.

| Hyper-parameter | Description | City2City-C-Short | City2City-C-Long | SHIFT-Short | City2ACDC-Long |
|---|---|---|---|---|---|
| $\gamma$ | Learning rate | 0.001 | 0.001 | 0.0001 | 0.001 |
| $\alpha$ | Momentum in EMA | 0.9996 | 0.9998 | 0.9999 | 0.9998 |
| $\tau$ | Temperature in CL | 0.07 | 0.07 | 0.07 | 0.07 |
| $\delta_0$ | Initial value in DT | 0.8 | 0.8 | 0.8 | 0.8 |
| $\beta$ | Hyper-parameter in DT | 0.95 | 0.95 | 0.95 | 0.95 |
| $\epsilon$ | Hyper-parameter in DT | 1.3 | 1.3 | 1.3 | 1.3 |
| $\delta_{max}$ | The upper-bound in DT | 0.9 | 0.9 | 0.9 | 0.9 |
| $\delta_{mini}$ | The lower-bound in DT | 0.7 | 0.7 | 0.7 | 0.7 |
| $q$ | Quantile value | 0.05 | 0.001 | 0.001 | 0.01 |
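To make the dynamic-threshold hyper-parameters in Table 8 concrete, the following sketch applies the per-category update rule from Algorithm 1 with the listed values (initial value 0.8, smoothing factor 0.95, scale 1.3, bounds 0.7 and 0.9). Several details are assumptions made for illustration: the function name `update_threshold`, the category names, the mean confidence scores, and in particular the clipping step, since Table 8 lists the bounds but Algorithm 1 does not show how they are applied.

```python
import math

def update_threshold(prev, mean_score, beta=0.95, eps=1.3, lower=0.7, upper=0.9):
    """EMA-style update of one category threshold, then clipping to [lower, upper].

    Assumption: clipping to the bounds is how the upper and lower bounds are used.
    """
    new = beta * prev + (1.0 - beta) * eps * math.sqrt(mean_score)
    return min(max(new, lower), upper)

# Hypothetical categories and mean predicted scores for one iteration.
thresholds = {"car": 0.8, "rider": 0.8}
mean_scores = {"car": 0.81, "rider": 0.25}
for category in thresholds:
    thresholds[category] = update_threshold(thresholds[category], mean_scores[category])
```

In this toy case the threshold of the confidently predicted category rises to about 0.8185, while that of the weakly predicted category falls to about 0.7925, so lower-confidence categories keep more pseudo-labels.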
Algorithm 1 Pseudo-code for the CTAOD Method
1: Initialization: Object detectors: the student and teacher model parameters $\theta_0^{std}$ and $\theta_0^{tch}$, initialized from the source-trained model $\theta_0$. Hyper-parameters: learning rate $\gamma$, momentum $\alpha$ in the exponential moving average (EMA), temperature $\tau$, thresholds $\delta_0^c$, $\beta$ and $\epsilon$ in the dynamic threshold, and quantile value $q$.
2: Input: Unlabeled test data $x_t$ from the target domain
3: for each iteration $t$ do
4:       Generate weakly and strongly augmented images $f_{weak}(x_t)$ and $f_{strong}(x_t)$
5:       Generate predictions $y_t = f_{\theta_{t-1}^{tch}}(f_{weak}(x_t))$ and proposals $p_t$ using the teacher model
6:       // 1. Object-level Contrastive Learning
7:       Extract teacher features $F_t^T$ and student features $F_t^S$ based on $p_t$ using RoIAlign.
8:       Compute contrastive learning loss $\mathcal{L}_{cl}$:
9:             $\mathcal{L}_{cl}(x_t) = \frac{1}{l}\sum_{i=1}^{l} -\log \frac{\exp(f_i^T \cdot f_i^S / \tau)}{\sum_{j=1}^{l} \exp(f_i^T \cdot f_j^S / \tau)}$
10:       // 2. Dynamic Threshold
11:       for each category $c$ do
12:             Compute the mean predicted score $\overline{l_t^c}$ of category $c$
13:             Update dynamic thresholds based on predictions:
14:                   $\delta_t^c \leftarrow \beta \cdot \delta_{t-1}^c + (1 - \beta) \cdot \epsilon \cdot (\overline{l_t^c})^{\frac{1}{2}}$
15:       end for
16:       Generate the pseudo-label $\hat{y}_t$ through the category-specific threshold $\delta_t^c$
17:       Compute the supervised loss $\mathcal{L}_{pl}(x_t) = \mathcal{L}_{rpn}(x_t, \hat{y}_t) + \mathcal{L}_{rcnn}(x_t, \hat{y}_t)$
18:       Perform gradient descent to update the student model $\theta_t^{std}$ and get the gradient matrix $G_t$
19:       Update teacher model $\theta_t^{tch}$ via EMA strategy
20:       // 3. Data-driven Stochastic Restoration
21:       Generate random matrix $R_t \sim \operatorname{Uniform}(0, 1)$
22:       Generate the Fisher information matrix: $F_t = G_t \odot G_t \odot R_t$
23:       Find the $q$-quantile of $F_t$: $\eta = \operatorname{quantile}(F_t, q)$
24:       Generate the mask matrix: $M_t = F_t < \eta$
25:       Reset the updated student model: $\theta_t^{std} = M_t \odot \theta_0^{std} + (1 - M_t) \odot \theta_t^{std}$
26: end for
27: Output: Teacher predictions $y_t$
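The data-driven stochastic restoration step at the end of Algorithm 1 can be sketched in plain Python as below. This is a simplified illustration under stated assumptions, not the authors' code: parameters and gradients are flattened into lists of floats, the helper names `quantile` and `data_driven_restore` are hypothetical, and the quantile uses linear interpolation between order statistics, which the paper does not specify.

```python
import random

def quantile(values, q):
    """q-quantile with linear interpolation between order statistics (an assumption)."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def data_driven_restore(theta_t, theta_0, grads, q, rng):
    """Reset the parameters whose randomly weighted squared gradient falls below the q-quantile."""
    r = [rng.random() for _ in grads]                  # random matrix R_t in [0, 1)
    f = [g * g * ri for g, ri in zip(grads, r)]        # F_t = G_t * G_t * R_t (element-wise)
    eta = quantile(f, q)                               # restoration cut-off
    mask = [fi < eta for fi in f]                      # M_t = F_t < eta
    restored = [t0 if m else tt for tt, t0, m in zip(theta_t, theta_0, mask)]
    return restored, mask

# Toy example: the parameter with zero gradient is the one restored to its source value.
rng = random.Random(0)
theta_0 = [0.0, 0.0, 0.0, 0.0, 0.0]
theta_t = [0.5, 1.0, 1.5, 2.0, 2.5]
grads = [0.0, 1.0, 2.0, 3.0, 4.0]
restored, mask = data_driven_restore(theta_t, theta_0, grads, q=0.25, rng=rng)
```

Because the random factor multiplies the squared gradient, parameters with small gradients are the most likely to be reset, while parameters with large gradients are almost always kept.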
Table 9. Complete experimental results ([email protected]) of Cityscapes-to-Cityscapes-C long-term CTTA task. We evaluate the long-term adaptation performance by continually adapting the source model to the five corruption ten times with the largest corruption severity level 5. “F.”, “M.”, “S.”, “B.”, and “D.” represent Fog, Motion, Snow, Brightness, and Defocus, respectively.
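Time $t$ →

| Round | 1 | | | | | 2 | | | | | 3 | | | | | 4 | | | | | 5 | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Condition | F. | M. | S. | B. | D. | F. | M. | S. | B. | D. | F. | M. | S. | B. | D. | F. | M. | S. | B. | D. | F. | M. | S. | B. | D. |
| Source | 36.1 | 8.1 | 0.2 | 31.0 | 6.7 | 36.1 | 8.1 | 0.2 | 31.0 | 6.7 | 36.1 | 8.1 | 0.2 | 31.0 | 6.7 | 36.1 | 8.1 | 0.2 | 31.0 | 6.7 | 36.1 | 8.1 | 0.2 | 31.0 | 6.7 |
| Tent | 35.8 | 8.1 | 0.2 | 29.2 | 6.2 | 34.6 | 7.7 | 0.1 | 27.5 | 5.5 | 33.5 | 7.8 | 0.1 | 25.5 | 5.1 | 32.0 | 7.3 | 0.1 | 23.2 | 4.6 | 30.5 | 6.8 | 0.1 | 21.0 | 3.9 |
| CoTTA | 38.2 | 10.6 | 0.4 | 33.8 | 9.3 | 39.9 | 10.7 | 0.5 | 34.0 | 9.9 | 40.6 | 10.4 | 0.6 | 34.4 | 9.8 | 40.9 | 10.4 | 0.6 | 35.4 | 9.1 | 40.8 | 10.4 | 0.6 | 36.5 | 9.2 |
| SVDP | 36.8 | 8.8 | 0.6 | 43.5 | 11.2 | 43.9 | 11.8 | 1.7 | 47.9 | 14.9 | 46.2 | 13.9 | 2.5 | 50.0 | 17.8 | 46.2 | 14.4 | 3.2 | 49.0 | 19.6 | 45.0 | 15.4 | 3.8 | 47.8 | 19.9 |
| IRG | 37.9 | 9.0 | 0.7 | 44.3 | 12.2 | 46.1 | 12.9 | 2.0 | 49.2 | 16.6 | 47.6 | 15.2 | 3.0 | 49.5 | 19.5 | 45.2 | 17.3 | 4.6 | 48.6 | 21.4 | 45.6 | 17.7 | 6.0 | 46.7 | 21.9 |
| MemCLR | 37.7 | 8.9 | 0.8 | 45.5 | 13.5 | 45.8 | 13.6 | 2.3 | 49.8 | 17.7 | 47.3 | 16.4 | 4.1 | 50.2 | 20.3 | 46.4 | 17.4 | 4.8 | 47.8 | 21.9 | 45.0 | 18.1 | 5.1 | 46.8 | 22.4 |
| Ours | 39.0 | 10.4 | 0.8 | 48.0 | 13.8 | 48.2 | 13.7 | 3.1 | 52.9 | 18.1 | 49.3 | 15.9 | 5.3 | 53.3 | 21.3 | 48.5 | 17.1 | 7.1 | 52.2 | 23.5 | 49.4 | 18.6 | 7.7 | 51.7 | 25.0 |

| Round | 6 | | | | | 7 | | | | | 8 | | | | | 9 | | | | | 10 | | | | | Mean | Gain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Condition | F. | M. | S. | B. | D. | F. | M. | S. | B. | D. | F. | M. | S. | B. | D. | F. | M. | S. | B. | D. | F. | M. | S. | B. | D. | | |
| Source | 36.1 | 8.1 | 0.2 | 31.0 | 6.7 | 36.1 | 8.1 | 0.2 | 31.0 | 6.7 | 36.1 | 8.1 | 0.2 | 31.0 | 6.7 | 36.1 | 8.1 | 0.2 | 31.0 | 6.7 | 36.1 | 8.1 | 0.2 | 31.0 | 6.7 | 16.4 | / |
| Tent | 28.6 | 6.4 | 0.1 | 19.0 | 3.5 | 26.6 | 6.3 | 0.1 | 16.9 | 3.0 | 24.8 | 5.9 | 0.1 | 15.2 | 2.7 | 22.9 | 5.6 | 0.1 | 13.7 | 2.7 | 20.4 | 5.0 | 0.1 | 11.5 | 2.4 | 12.0 | -4.4 |
| CoTTA | 39.6 | 10.3 | 0.6 | 34.9 | 9.4 | 40.2 | 10.8 | 0.5 | 35.3 | 10.0 | 40.0 | 10.8 | 0.5 | 35.0 | 9.7 | 40.1 | 10.5 | 0.5 | 34.1 | 9.5 | 40.8 | 10.4 | 0.5 | 36.3 | 9.6 | 19.1 | +2.7 |
| SVDP | 44.4 | 15.6 | 4.2 | 47.6 | 20.5 | 44.1 | 16.1 | 4.3 | 46.8 | 20.4 | 43.0 | 15.9 | 4.6 | 45.7 | 20.7 | 42.6 | 16.0 | 4.8 | 45.0 | 20.2 | 41.5 | 16.2 | 4.6 | 44.6 | 20.2 | 25.3 | +8.9 |
| IRG | 43.9 | 18.0 | 5.9 | 44.8 | 22.1 | 42.3 | 17.5 | 6.6 | 42.9 | 22.2 | 41.7 | 16.9 | 7.2 | 42.1 | 21.8 | 38.8 | 17.0 | 7.2 | 40.1 | 20.6 | 37.0 | 16.7 | 7.1 | 38.5 | 20.5 | 25.6 | +9.2 |
| MemCLR | 43.4 | 19.3 | 5.9 | 44.6 | 23.3 | 41.1 | 19.5 | 6.6 | 43.1 | 23.9 | 39.6 | 19.1 | 7.0 | 40.9 | 24.3 | 38.6 | 18.9 | 7.2 | 39.1 | 22.4 | 36.8 | 18.5 | 7.5 | 38.4 | 21.8 | 26.0 | +9.6 |
| Ours | 48.5 | 19.7 | 8.9 | 50.7 | 26.3 | 47.8 | 19.6 | 9.8 | 49.0 | 26.1 | 45.7 | 19.3 | 10.5 | 49.4 | 26.9 | 45.2 | 19.7 | 11.2 | 47.4 | 26.9 | 45.7 | 20.0 | 11.8 | 46.4 | 25.8 | 29.0 | +12.6 |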
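The Mean and Gain columns are not defined in the caption. A reading that reproduces the reported numbers, offered here as an assumption, is that Mean averages all 50 per-condition scores of a method and Gain is that mean minus the Source mean. The short check below applies this reading to the Tent row of Table 9; the variable names are illustrative only.

```python
# Per-condition scores copied from the Tent row of Table 9 (rounds 1 to 10).
tent = [
    35.8, 8.1, 0.2, 29.2, 6.2, 34.6, 7.7, 0.1, 27.5, 5.5,
    33.5, 7.8, 0.1, 25.5, 5.1, 32.0, 7.3, 0.1, 23.2, 4.6,
    30.5, 6.8, 0.1, 21.0, 3.9, 28.6, 6.4, 0.1, 19.0, 3.5,
    26.6, 6.3, 0.1, 16.9, 3.0, 24.8, 5.9, 0.1, 15.2, 2.7,
    22.9, 5.6, 0.1, 13.7, 2.7, 20.4, 5.0, 0.1, 11.5, 2.4,
]
# The Source model is never updated, so its five scores repeat in every round.
source_round = [36.1, 8.1, 0.2, 31.0, 6.7]


def mean(values):
    """Arithmetic mean of a non-empty list of scores."""
    return sum(values) / len(values)


source_mean = mean(source_round)
tent_mean = mean(tent)
# Assumed definition: Gain is the method mean minus the Source mean.
tent_gain = tent_mean - source_mean

print(round(source_mean, 1), round(tent_mean, 1), round(tent_gain, 1))
# prints: 16.4 12.0 -4.4
```

Under this reading, the negative gain for Tent reflects its steady degradation across rounds, whereas the Source row stays constant because no adaptation takes place.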
Table 10. Complete experimental results (mAP@0.5) of the Cityscapes-to-ACDC long-term CTTA task. We evaluate long-term adaptation performance by continually adapting the source model to the four conditions ten times with the largest corruption severity level 5.
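Time $t$ →

| Round | 1 | | | | 2 | | | | 3 | | | | 4 | | | | 5 | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Condition | Fog | Night | Rain | Snow | Fog | Night | Rain | Snow | Fog | Night | Rain | Snow | Fog | Night | Rain | Snow | Fog | Night | Rain | Snow |
| Source | 52.3 | 18.7 | 33.5 | 39.6 | 52.3 | 18.7 | 33.5 | 39.6 | 52.3 | 18.7 | 33.5 | 39.6 | 52.3 | 18.7 | 33.5 | 39.6 | 52.3 | 18.7 | 33.5 | 39.6 |
| Tent | 52.4 | 18.6 | 33.4 | 38.9 | 52.4 | 18.1 | 32.5 | 37.7 | 52.2 | 17.8 | 32.7 | 36.7 | 51.7 | 17.4 | 31.4 | 36.0 | 50.4 | 16.5 | 31.7 | 32.9 |
| CoTTA | 53.7 | 19.7 | 38.0 | 42.4 | 53.6 | 20.3 | 37.7 | 42.2 | 53.7 | 20.1 | 38.1 | 43.3 | 53.1 | 19.7 | 37.7 | 42.9 | 52.6 | 19.5 | 37.0 | 42.2 |
| SVDP | 52.8 | 20.0 | 35.6 | 42.0 | 53.4 | 22.0 | 37.4 | 43.8 | 54.0 | 23.0 | 38.8 | 44.0 | 54.6 | 23.5 | 38.7 | 43.8 | 53.4 | 24.0 | 38.0 | 44.4 |
| IRG | 52.7 | 20.6 | 36.0 | 42.9 | 53.2 | 22.4 | 37.4 | 44.6 | 53.8 | 23.2 | 38.1 | 45.5 | 53.6 | 23.2 | 38.1 | 45.6 | 53.5 | 23.2 | 38.1 | 45.3 |
| MemCLR | 52.9 | 21.2 | 35.2 | 42.8 | 52.5 | 23.1 | 37.3 | 44.3 | 53.9 | 23.2 | 38.4 | 44.8 | 52.8 | 23.2 | 38.2 | 44.4 | 51.5 | 23.5 | 37.2 | 43.8 |
| Ours | 52.9 | 20.4 | 34.5 | 42.9 | 53.0 | 22.9 | 38.6 | 44.6 | 54.2 | 24.0 | 39.5 | 45.3 | 54.5 | 23.8 | 39.6 | 45.8 | 54.4 | 24.4 | 39.2 | 45.4 |

| Round | 6 | | | | 7 | | | | 8 | | | | 9 | | | | 10 | | | | Mean | Gain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Condition | Fog | Night | Rain | Snow | Fog | Night | Rain | Snow | Fog | Night | Rain | Snow | Fog | Night | Rain | Snow | Fog | Night | Rain | Snow | | |
| Source | 52.3 | 18.7 | 33.5 | 39.6 | 52.3 | 18.7 | 33.5 | 39.6 | 52.3 | 18.7 | 33.5 | 39.6 | 52.3 | 18.7 | 33.5 | 39.6 | 52.3 | 18.7 | 33.5 | 39.6 | 36.0 | / |
| Tent | 49.0 | 15.8 | 29.5 | 30.6 | 45.8 | 14.6 | 28.1 | 28.5 | 42.4 | 12.6 | 25.8 | 24.9 | 38.9 | 11.0 | 23.6 | 21.6 | 35.0 | 9.5 | 21.5 | 18.3 | 30.5 | -5.5 |
| CoTTA | 52.3 | 19.1 | 36.3 | 42.0 | 51.3 | 19.0 | 36.5 | 41.8 | 51.2 | 18.9 | 36.1 | 41.4 | 51.5 | 19.2 | 36.4 | 41.3 | 50.9 | 19.1 | 36.0 | 42.4 | 37.8 | +1.8 |
| SVDP | 53.3 | 24.0 | 38.9 | 43.8 | 52.8 | 24.0 | 38.6 | 43.9 | 52.0 | 23.8 | 38.7 | 43.4 | 51.8 | 23.8 | 38.6 | 43.6 | 51.8 | 23.6 | 38.2 | 43.0 | 39.5 | +3.6 |
| IRG | 52.6 | 23.5 | 37.5 | 43.7 | 51.0 | 23.1 | 37.1 | 42.8 | 50.2 | 23.1 | 37.0 | 42.2 | 49.8 | 23.2 | 36.5 | 41.1 | 49.7 | 22.5 | 36.5 | 40.7 | 38.9 | +2.9 |
| MemCLR | 50.2 | 23.3 | 36.6 | 43.0 | 51.4 | 23.2 | 37.1 | 42.9 | 50.5 | 23.1 | 36.6 | 42.6 | 49.9 | 22.8 | 36.8 | 42.0 | 49.9 | 22.9 | 36.6 | 41.0 | 38.7 | +2.7 |
| Ours | 53.9 | 24.9 | 39.2 | 45.1 | 53.5 | 25.1 | 39.0 | 45.3 | 53.3 | 24.6 | 39.3 | 45.3 | 53.2 | 24.5 | 39.3 | 44.9 | 52.5 | 24.3 | 38.8 | 44.7 | 40.3 | +4.3 |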
Figure 4. Complete qualitative results. We compare the detection results of CTAOD and other baseline methods in the $10^{th}$ round of adaptation to Brightness corruption on the long-term Cityscapes-to-Cityscapes-C task. Panels: (a) Source, (b) Tent, (c) CoTTA, (d) IRG, (e) MemCLR, (f) CTAOD (ours).
Figure 5. Complete qualitative results. We compare the detection results of CTAOD and other baseline methods in the $10^{th}$ round of adaptation to Defocus corruption on the long-term Cityscapes-to-Cityscapes-C task. Panels: (a) Source, (b) Tent, (c) CoTTA, (d) IRG, (e) MemCLR, (f) CTAOD (ours).
Figure 6. Complete qualitative results. We compare the detection results of CTAOD and other baseline methods in the $10^{th}$ round of adaptation to Fog corruption on the long-term Cityscapes-to-Cityscapes-C task. Panels: (a) Source, (b) Tent, (c) CoTTA, (d) IRG, (e) MemCLR, (f) CTAOD (ours).
Figure 7. Complete qualitative results. We compare the detection results of CTAOD and other baseline methods in the $10^{th}$ round of adaptation to Motion corruption on the long-term Cityscapes-to-Cityscapes-C task. Panels: (a) Source, (b) Tent, (c) CoTTA, (d) IRG, (e) MemCLR, (f) CTAOD (ours).
Figure 8. Complete qualitative results. We compare the detection results of CTAOD and other baseline methods in the $10^{th}$ round of adaptation to Snow corruption on the long-term Cityscapes-to-Cityscapes-C task. Panels: (a) Source, (b) Tent, (c) CoTTA, (d) IRG, (e) MemCLR, (f) CTAOD (ours).